Git Workflow: Branch - Rebase - Squash - Merge

So you want to make a change to your git repo while other people may be working on the same repo at the same time. The longer your changes take, the greater the chance that your local repo is already out of date because others have pushed their changes. In this setting you don't want to make your changes directly in master, because you can end up creating large merge commits that make your repo's history convoluted and hard to follow.

Here's a better git workflow to use in any team of size > 1.

Before you make changes, create a branch.

git checkout -b MyFeature

Next, make your changes and do commits as usual.

If you don't want to rely only on your hard drive, you can also keep pushing your branch to the server every once in a while,

git push -u origin MyFeature

Once you are done with all your changes, you first want to rebase your branch onto master. If master has no new changes since you created your branch, this is essentially a no-op. Otherwise, git takes all your commits and replays them on top of master. This way your commits look as if they happened on the latest version of master instead of the version you branched from. This keeps your repo's commit history clean and easy to reason about. If you were the only developer this might not matter much, but on a team it makes it much easier for everyone to see the changes each person is making.

To rebase, first get the latest master.

git checkout master
git pull origin master

Then go back to your branch and rebase, i.e.,

git checkout MyFeature
git rebase master

If you are lucky, you won't see the word "conflict" in git's messages; otherwise there is more work for you! If someone has already changed sections of a file that you also changed, you will see a list of conflicts. If you get lost in too many messages, use this command to see the files with pending conflicts:

git diff --name-only --diff-filter=U

Now about resolving conflicts... there are lots of tools out there and unfortunately most are a hassle to install or use. If you absolutely want a GUI tool, install DiffMerge, make sure it's in your PATH and invoke it like,

git mergetool -t diffmerge .

However, my preferred method is to simply open the conflicted file in an editor, search for ">>>" and review the sections that look like:

<<<<<<< HEAD
This is change in master
=======
This is change in your branch
>>>>>>> branch-a

Now keep the change you want, delete the markers and you are done with that conflict. Another shortcut is to just tell git to take master's version ("ours") or your branch's version ("theirs"). For example, to resolve all conflicts by taking your branch's changes:

git checkout --theirs .

Another tricky conflict is when a file gets deleted by someone else and simultaneously changed by you, or vice versa. In this case git will put the deleted file back in your working tree and you have to decide whether to keep it, remove it, or update your version. You won't have conflict markers like the ones above this time. I tend to use a tool like Beyond Compare to compare the two versions and make edits as needed.

To tell git that you have resolved all conflicts,

git add .

Now you can continue with your rebase,

git rebase --continue

If you don't want to continue for whatever reason,

git rebase --abort

Sometimes git might error out on continue because there is nothing to commit (maybe it detected that the change already exists upstream). In that case you can do,

git rebase --skip

At this point, your changes are on top of the latest master. You can verify this by looking at a quick history of the latest 10 commits,

git log --pretty=oneline -n 10

Note that everything still resides in your own branch. If you are not yet ready to push to master, keep working in your branch and doing more commits as you go. If you want to save your rebased branch on the server, you must push with --force because you have rewritten its history.

git push --force origin MyFeature

This is perfectly fine as long as you are the only one working on the branch.

Once you are ready to push, first merge your branch into master,

git checkout master
git merge --squash MyFeature

This shouldn't give any errors or conflict messages because your branch is already synced up with the latest master. The --squash option tells git to combine all your commits into a single commit. This is a good idea most of the time if you have made lots of commits like "added forgotten file", "fixed minor typo" and so on. That's too much noise, and it's not nice to make other people scroll through tons of minor commits to figure out your higher-level goals. However, it's also OK if you don't want the --squash option.

Finally do the commit after the merge,

git commit -m "MyFeature does X"

If you used --squash above then you will see only one new commit at the top of your history, with the message above.

At this point, you can decide to push your changes to master OR move your changes to a new branch and keep working. To move to a new branch and revert master to its original state,

git checkout -b MyFeature2
git checkout master
git reset --hard origin/master

OR if you are happy, go ahead and

git push

In either case you can delete the old branch,

git push origin --delete MyFeature
git branch -d MyFeature

And you are done!

As usual, there are many ways to do things in git. There is another, quicker and simpler way to achieve the goal of a clean history, but it's a bit limited.

Make your changes in master and do commits as usual - but don't push. Once in a while you will want to sync up with the remote master. To do this use,

git pull --rebase

This will get all changes from the remote master and then replay your unpushed commits on top of them. This may generate conflicts as described above, so resolve them in the same way. Once you are done with your changes, you can push your commits and they should appear on top without extra merge commits. The obvious problem here is that you can't push until you are really done with your changes, so this is only OK for quick, short changes. If you want to "save" your commits on the server or work from multiple machines for multiple days without pushing to master, then the branch workflow above works better.

How to Enable and Use GCC Strict Mode Compilation

One of the great features that many C++ programmers rarely use is GCC strict mode compilation. Enabling it lets the compiler warn you about potential issues that often go unnoticed in build noise. Unfortunately there is little documentation, let alone a quick tutorial, on this subject, so I thought I would write this up.

First, let's clear this up: there is no official GCC mode called "strict". I just made that term up. Fortunately there are enough compiler options that you can rig up the kind of "strict" mode that is available in many other languages.

To get the "strict" mode, I use the following command line options for gcc/g++. They are written below in a format consumable from CMakeLists.txt, but you can use the same options from pretty much anywhere.

set(CMAKE_CXX_FLAGS "-std=c++11 -Wall -Wextra  -Wstrict-aliasing -pedantic -fmax-errors=5 -Werror -Wunreachable-code -Wcast-align -Wcast-qual -Wctor-dtor-privacy -Wdisabled-optimization -Wformat=2 -Winit-self -Wlogical-op -Wmissing-include-dirs -Wnoexcept -Wold-style-cast -Woverloaded-virtual -Wredundant-decls -Wshadow -Wsign-promo -Wstrict-null-sentinel -Wstrict-overflow=5 -Wswitch-default -Wundef -Wno-unused -Wno-variadic-macros -Wno-parentheses -fdiagnostics-show-option ${CMAKE_CXX_FLAGS}")

That's a looong list of compiler options, so now I hope you can agree that we really mean "strict" business here :). In essence it enables extra warnings, treats all warnings as errors, points out coding issues that border on the pedantic, and then enables some more warnings on top of that. Rest assured, the above is not overkill. You are going to thank the compiler for catching this stuff as your code base becomes larger and more complex.

Unfortunately, the road from here has lots of twists and turns. The first thing that will probably happen is that you get tons of errors, most likely not from your own code but from included headers that you don't own! Because of the way C++ works, other people's bad code in their headers becomes your liability. Except for Boost and the standard library, I haven't found many packages that can get through strict mode compilation. Even for relatively nicely written packages such as ROS you will get tons of compiler errors, and for badly written packages such as the DJI SDK, forget about it. Right... so now what?

Here's the fix I have used with a fair amount of success. First, declare these two macros in some common utility header in your project:


#define STRICT_MODE_OFF                                                   \
    _Pragma("GCC diagnostic push")                                            \
    _Pragma("GCC diagnostic ignored \"-Wreturn-type\"")             \
    _Pragma("GCC diagnostic ignored \"-Wdelete-non-virtual-dtor\"") \
    _Pragma("GCC diagnostic ignored \"-Wunused-parameter\"")        \
    _Pragma("GCC diagnostic ignored \"-pedantic\"")                 \
    _Pragma("GCC diagnostic ignored \"-Wshadow\"")                  \
    _Pragma("GCC diagnostic ignored \"-Wold-style-cast\"")          \
    _Pragma("GCC diagnostic ignored \"-Wswitch-default\"")

/* Additional options that can be enabled
    _Pragma("GCC diagnostic ignored \"-Wpedantic\"")                \
    _Pragma("GCC diagnostic ignored \"-Wformat=\"")                 \
    _Pragma("GCC diagnostic ignored \"-Werror\"")                   \
    _Pragma("GCC diagnostic ignored \"-Werror=\"")                  \
    _Pragma("GCC diagnostic ignored \"-Wunused-variable\"")         \
*/
              
#define STRICT_MODE_ON                                                                  \
    _Pragma("GCC diagnostic pop")          

Here we have two macros: the first tells GCC to turn off selected warnings before some chunk of code, and the second tells GCC to re-enable them. Why can't we just turn off all strict mode warnings at once? Because GCC currently doesn't have that option; you must list every individual warning :(. The above list is something I put together while dealing with ROS and the DJI SDK and is obviously incomplete. Your project might run into more warnings, in which case you will need to keep adding to the list. Another issue you might encounter is that GCC currently doesn't support suppressing every possible warning! Yes, a big oops there. One that I recently encountered in the DJI SDK was this:

warning: ISO C99 requires rest arguments to be used

The only way out for me in this case was to modify DJI's source code and submit the issue to them, so hopefully they will fix it in the next release.

Once you have these macros, you can place them around the problematic headers. For example,

#include <string>
#include <vector>

STRICT_MODE_OFF
#include <ros/ros.h>
#include <actionlib/server/simple_action_server.h>
#include <dji_sdk/dji_drone.h>
STRICT_MODE_ON

#include "mystuff.hpp"

We are not out of the woods yet, because the above trick only works for some header files. The reason is that GCC doesn't always compile all of a header's code at the point of the #include; templates in a header, for example, are only instantiated later from your own code, outside the region covered by the macros, so putting the macros around those #include statements is pointless. Solving this requires some more work, and in some cases a lot more work. The trick I used was to create wrappers around the things you use from bad headers, so that only those wrappers need the #include <BadStuff.h> statements and the rest of your code doesn't need those headers at all. Then you can disable strict mode just for the wrappers and the rest of your code remains clean. To do this, you need to implement the pimpl pattern in your wrapper classes so that all objects from BadStuff.h sit behind an opaque member. Notice that the #include <BadStuff.h> statement then goes in your wrapper.cpp file, not your wrapper.hpp file.

Even though this might require significant work in a big project, it's often worth it because you are clearly separating the interface and the dependency for the external stuff. Your own code then remains free of #include <BadStuff.h>. This also enables you to do even more things, like static code analysis on just your code. In either case, consider contributing to those projects with the bad stuff and making them pass strict compilation!

So as it happens, a working strict mode requires buy-in from the C++ community. If everyone isn't doing it, then it becomes hard for everyone else. So tell everyone, and start using it yourself today!

Downloading All of Hacker News Posts and Comments

Introduction

There are two files that contain all stories and comments posted to Hacker News from its start in 2006 to May 29, 2014 (exact dates are below). They were downloaded using a simple program I wrote, Hacker News Downloader, which makes REST API calls to HN's official APIs. The program uses API parameters to paginate through items by created date and retrieve every post and comment. Each file contains the entire sequence of JSON responses, exactly as returned by the API calls, in a JSON array.
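To give a concrete idea of the pagination approach, here is a rough C# sketch. It is not the actual downloader: the endpoint and parameters shown (the hn.algolia.com search_by_date API with tags, hitsPerPage and a created_at_i numericFilter) are my assumptions about how such a walk can be done, and a real downloader would also handle errors, retries and ties in created_at_i.

// Rough sketch only, not the actual Hacker News Downloader: walk items newest-to-oldest
// by repeatedly lowering the created_at_i upper bound to the oldest item seen so far.
using System;
using System.Net.Http;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

class HnDownloadSketch
{
    static async Task Main()
    {
        var http = new HttpClient();
        long upperBound = long.MaxValue;   // created_at_i cursor, moves backwards in time

        while (true)
        {
            string url = "http://hn.algolia.com/api/v1/search_by_date?tags=story&hitsPerPage=1000" +
                         "&numericFilters=" + Uri.EscapeDataString("created_at_i<" + upperBound);
            string json = await http.GetStringAsync(url);

            var response = JObject.Parse(json);
            var hits = (JArray)response["hits"];
            if (hits.Count == 0)
                break;                     // reached the oldest item

            // ...append the raw JSON response to the output array/file here...

            upperBound = (long)hits[hits.Count - 1]["created_at_i"];
            await Task.Delay(TimeSpan.FromSeconds(4));   // stay well under the rate limit
        }
    }
}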

HNStoriesAll.json

Contains all the stories posted on HN from Mon, 09 Oct 2006 18:21:51 GMT to Thu, 29 May 2014 08:25:40 GMT.

Total count

1,333,789

File size

1.2GB uncompressed, 115MB compressed

How was this created

I wrote a small program, Hacker News Downloader, to create these files; it's available on GitHub.

Format

The entire file is a JSON-compliant array. Each element in the array is a JSON object that is exactly the response returned by the HN Algolia REST API. The property named `hits` contains the actual list of stories. As this file is very large, we recommend a JSON parser that can work on file streams instead of reading the entire data into memory.

{
	"hits": [{
		"created_at": "2014-05-31T00:05:54.000Z",
		"title": "Publishers withdraw more than 120 gibberish papers",
		"url": "http://www.nature.com/news/publishers-withdraw-more-than-120-gibberish-papers-1.14763?WT.mc_id=TWT_NatureNews",
		"author": "danso",
		"points": 1,
		"story_text": "",
		"comment_text": null,
		"num_comments": 0,
		"story_id": null,
		"story_title": null,
		"story_url": null,
		"parent_id": null,
		"created_at_i": 1401494754,
		"_tags": ["story",
		"author_danso",
		"story_7824727"],
		"objectID": "7824727",
		"_highlightResult": {
			"title": {
				"value": "Publishers withdraw more than 120 gibberish papers",
				"matchLevel": "none",
				"matchedWords": []
			},
			"url": {
				"value": "http://www.nature.com/news/publishers-withdraw-more-than-120-gibberish-papers-1.14763?WT.mc_id=TWT_NatureNews",
				"matchLevel": "none",
				"matchedWords": []
			},
			"author": {
				"value": "danso",
				"matchLevel": "none",
				"matchedWords": []
			},
			"story_text": {
				"value": "",
				"matchLevel": "none",
				"matchedWords": []
			}
		}
	}],
	"nbHits": 636094,
	"page": 0,
	"nbPages": 1000,
	"hitsPerPage": 1,
	"processingTimeMS": 5,
	"query": "",
	"params": "advancedSyntax=true\u0026analytics=false\u0026hitsPerPage=1\u0026tags=story"
}
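To make the streaming recommendation concrete, here is a minimal C# sketch (assuming the JSON.NET package mentioned later; the file name and field access are based on the layout above) that reads the array one response object at a time instead of loading the whole file:

// Minimal streaming sketch: walk the outer JSON array with JsonTextReader and
// load only one API response object into memory at a time.
using System;
using System.IO;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

class StreamingReadSketch
{
    static void Main()
    {
        using (var file = File.OpenText("HNStoriesAll.json"))
        using (var reader = new JsonTextReader(file))
        {
            while (reader.Read())
            {
                // Each object directly under the outer array is one API response.
                if (reader.TokenType == JsonToken.StartObject)
                {
                    JObject response = JObject.Load(reader);   // consumes just this object
                    foreach (var hit in response["hits"])
                        Console.WriteLine((string)hit["title"]);
                }
            }
        }
    }
}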

HNCommentsAll.json

Contains all the comments posted on HN from Mon, 09 Oct 2006 19:51:01 GMT to Fri, 30 May 2014 08:19:34 GMT.

Total count

5,845,908

File size

9.5GB uncompressed, 862MB compressed

How was this created

I wrote a small program, Hacker News Downloader, to create these files; it's available on GitHub.

Format

The entire file is a JSON-compliant array. Each element in the array is a JSON object that is exactly the response returned by the HN Algolia REST API. The property named `hits` contains the actual list of comments. As this file is very large, we recommend a JSON parser that can work on file streams instead of reading the entire data into memory.


{
	"hits": [{
		"created_at": "2014-05-31T00:22:01.000Z",
		"title": null,
		"url": null,
		"author": "rikacomet",
		"points": 1,
		"story_text": null,
		"comment_text": "Isn\u0026#x27;t the word dyes the right one to use here? Instead of dies?",
		"num_comments": null,
		"story_id": null,
		"story_title": null,
		"story_url": null,
		"parent_id": 7821954,
		"created_at_i": 1401495721,
		"_tags": ["comment",
		"author_rikacomet",
		"story_7824763"],
		"objectID": "7824763",
		"_highlightResult": {
			"author": {
				"value": "rikacomet",
				"matchLevel": "none",
				"matchedWords": []
			},
			"comment_text": {
				"value": "Isn\u0026#x27;t the word dyes the right one to use here? Instead of dies?",
				"matchLevel": "none",
				"matchedWords": []
			}
		}
	}],
	"nbHits": 1371364,
	"page": 0,
	"nbPages": 1000,
	"hitsPerPage": 1,
	"processingTimeMS": 8,
	"query": "",
	"params": "advancedSyntax=true\u0026analytics=false\u0026hitsPerPage=1\u0026tags=comment"
}

Where to download

As GitHub restricts each file to 100MB and also has policies against data warehousing, these files are currently hosted at FileDropper.com. Unfortunately FileDropper currently shows ads with misleading download links, so be careful which link you click. Below is the screenshot FileDropper shows; currently the button marked in red downloads the actual file.

[Screenshot: FileDropper download page with the real download button marked in red]

HN Stories Download URL

Using Browser: http://www.filedropper.com/hnstoriesall

Using Torrent Client: magnet link (thanks to @saturation)

Archived at: Internet Archive (thanks to Bertrand Fan)

HN Comments Download URL

Using Browser: http://www.filedropper.com/hncommentsall

Using Torrent Client: magnet link (thanks to @saturation)

Archived at: Internet Archive (thanks to Bertrand Fan)

A few points of interest

  • The API rate limit is 10,000 requests per hour, beyond which you get blacklisted. I tried to be even more conservative by putting a 4-second sleep between calls.
  • I like to keep the entire response from each call as-is, so the responses are streamed to the file as a serialized array of JSON response objects.
  • As the output files are giant JSON files, you will need a JSON parser that can work on streams. I used JSON.NET, which worked out pretty well. You can find the sample code in my GitHub repo.
  • In total, 1.3M stories and 5.8M comments were downloaded, and each download took about 10 hours.
  • It's amazing that all of HN's stories and comments so far fit into under 1GB compressed!

Issues and Suggestions

Please let me know about any issues and suggestions in the comments. You can also file an issue at the "shell" GitHub repo I'd created for this data.

BadImageFormatException - This assembly is built by a runtime newer than the currently loaded

A strange thing happened today. I upgraded one of our internal tools to .NET 4.0 without any issues, but as soon as I attempted to debug/run the binary, I saw this exception:

System.BadImageFormatException was unhandled Message: Could not load file or assembly 'SomeTool.exe' or one of its dependencies. This assembly is built by a runtime newer than the currently loaded runtime and cannot be loaded.

Normally you see this exception when the machine doesn’t have the right runtime installed, but that was obviously not the case here. Changing the build to x86 or x64 didn’t make any difference either. Next I ran peverify.exe, which happily reported that there was nothing wrong with the binary image. Finally I pulled out the big guns and asked fuslogvw, which would show me if any dependent assembly binding was failing. But that also didn’t produce any boom sounds. So the last resort was to just meditate over the issue for a few minutes. And that worked. In a flash of enlightenment I saw an app.config buried among a bunch of files, and it had these lines:

    <?xml version="1.0"?> 
    <configuration> 
    <startup><supportedRuntime version="v2.0.50727"/></startup></configuration>

Aha! Apparently the app.config didn’t get updated (maybe because it was in TFS?) when VS did the 4.0 upgrade. As the app.config didn’t have anything else in it, just deleting the file solved the issue. I do wonder how many people come across this gotcha.

A Day in SQL Tuning

Today I got into some heavyweight T-SQL tuning. This time the target was a legendary sproc that was taking 3 minutes; now, as I'm about to call it a day, this giant SP is eating only 16 seconds. Not excellent, but not bad at this point. Here are some notes…

  • Apparently resetting the identity using DBCC CHECKIDENT can be an expensive operation if you are also deleting all rows from the table. One way to reduce the time is to use TRUNCATE TABLE before DBCC CHECKIDENT.
  • I prefer table variables over temp tables, but one place temp tables are required is when you have lots of data in them and want an index!
  • The Database Engine Tuning Advisor comes up with zero suggestions on many occasions, but that does not mean no significant optimizations are possible. The best way to “guess” a candidate index is to create an index on all the columns used in joins and add an INCLUDE clause with the columns accessed in the SELECT. That last bit attracts even the least interested query plans to use the index :).
  • If you use temp tables, it’s usually better to create the index after you have inserted the data instead of before.
  • If you have a complex sproc, SQL Server Profiler will spit out thousands of trace lines during the run. A quick way to pinpoint the SQL statements needing performance tuning is to use File > Save As to put the trace results into a table, and then run the following query, which immediately surfaces the culprits. Notice that statements which run in a WHILE loop might take little time individually, but collectively their total duration may be high; the query below reveals these culprits immediately.
        select SUBSTRING(TextData, 1, 4000), SUM(duration), COUNT(1)
        from [SavedTrace]
        group by SUBSTRING(TextData, 1, 4000)
        order by 2 desc

One of the big performance hits occurs when you must process rows one at a time instead of as a set. For instance, let’s say you have a table with a column that has comma-delimited values. Now you want to split the values in each cell and create a new table which has N rows for each row in the original table – one for each split value. The Internet is littered with a dozen ways to split strings in T-SQL, some even using CTEs (not a good idea because there are lots of gotchas, like the max recursion limit). So far the best way is to use SQL CLR with code like the one below; it's as fast as any native T-SQL juggling, if not faster. However, the most important thing here is not SQL CLR but how you use this table-valued function, and here’s the secret: the best bang for the buck comes from using CROSS APPLY (or OUTER APPLY) with the table-valued UDF.

    public partial class UtilityFunctions
    {
        [Microsoft.SqlServer.Server.SqlFunction(FillRowMethodName = "FillRow", TableDefinition="StringPart nvarchar(max)", 
            IsDeterministic=true, IsPrecise=true, SystemDataAccess=SystemDataAccessKind.None)]
        public static IEnumerable ClrSplitString(SqlString sqlStringToSplit, SqlChars delimiter, SqlBoolean removeEmptyEntries)
        {
            if (!string.IsNullOrEmpty(sqlStringToSplit.Value))
            {
                return sqlStringToSplit.Value.Split(delimiter.Value
                    , (StringSplitOptions)(removeEmptyEntries ?
                            StringSplitOptions.RemoveEmptyEntries : StringSplitOptions.None));
            }
            else
            {
                return null;
            }
        }

        public static void FillRow(object obj, out SqlString splittedString)
        {
            if (obj != null)
                splittedString = new SqlString((string)obj);
            else
                splittedString = SqlString.Null;
        }
    }

Test Results Windows and Exception has been thrown by the target of an invocation

Just a note… if you are getting the "Exception has been thrown by the target of an invocation" message when opening the Test Results window in Visual Studio 2008, it’s most likely because you have a solution open in offline mode that was bound to some TFS instance. A bug that can waste a lot of your time if you didn’t know about it!

What to do on dreaded error CS0003: Out of memory

I’ve spent way too much time today (again!) on this error, so here’s a blog post as a future reminder to myself!

If you are generating or working with very large proxies, you might see the following error:

Exception: InvalidOperationException
Message: Unable to generate a temporary class (result=1). error CS0001: Internal compiler error (0xc00000fd) error CS0003: Out of memory

This error typically occurs when the .NET infrastructure attempts to generate the *XmlSerializers.dll on the fly. To do this it spawns csc.exe, and if your proxy is too large it might error out with a message like the one above. This seems to be a bug in csc, and reportedly it might get fixed in .NET 4.0.

Meanwhile, here’s how you can work around it:

First, make sure all your classes that derive from SoapHttpClientProtocol (i.e. the proxy classes) are decorated with WebServiceBindingAttribute. If you have a whole class hierarchy that derives from SoapHttpClientProtocol, then every class in that hierarchy must be decorated.
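As a rough illustration (the class names, binding name and namespace below are made up, not taken from any real proxy), the attribute goes on every class in the hierarchy, not just the leaf:

using System.Web.Services;
using System.Web.Services.Protocols;

// Base proxy class: decorated even though it is not used directly.
[WebServiceBinding(Name = "SomeServiceSoap", Namespace = "http://example.com/someservice")]
public class SomeServiceProxyBase : SoapHttpClientProtocol
{
}

// Derived proxy class: decorated as well.
[WebServiceBinding(Name = "SomeServiceSoap", Namespace = "http://example.com/someservice")]
public class SomeServiceProxy : SomeServiceProxyBase
{
    // generated web method wrappers would go here
}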

Next, for all the projects that contain classes derived from SoapHttpClientProtocol, set the “Generate serialization assembly” option on the Build page to On. Remember that you will need to do this for Debug as well as Release mode, or your code will fail in production.

[Screenshot: the “Generate serialization assembly” option on the Visual Studio project Build page]

Now you are set. The *XmlSerializers.dll will be generated and signed+versioned automatically (if your project is signed and versioned) when you build, and csc.exe won’t get spawned to cause the error above.

A few more things to keep in mind:

  • One of the “popular” workarounds in some forums is to switch the IIS app pool for WCF to 32-bit. I wouldn’t advise this because you lose all the advantages of 64-bit, primarily access to all the memory available on the server.
  • The above error often occurs with Microsoft Dynamics CRM proxies if you have tons of entities and attributes.
  • If your code runs as a plugin, you might have the plugin DLLs hosted at a different location than the main app exe. An example of this is Microsoft Dynamics CRM plugins hosted by the CRM Async service. In this case, you need to copy the *XmlSerializers.dll generated by the build to the same location as the host exe, otherwise the .NET infrastructure won’t find it!
  • If you are using Visual Studio’s integrated debugging for WCF services, you must run Visual Studio as Administrator or the above error will rear its ugly head while you are debugging.

If the above steps don’t solve your problem, you might have to dig deeper using the techniques described here, watch for binding errors with fuslogvw, or attempt to generate the XML serializer DLL manually using sgen.

Silencing Exceptions in a Little Better Way

Some of the most disastrous code usually takes the following form:

try
{
    //some code
}
catch
{ }

Silencing exceptions is almost never good, but sometimes the problem is minor and you don’t want to blow up and call for an exit. However, wouldn’t it be better if exceptions didn’t remain silent, screamed for your attention when you are debugging, and behaved less aggressively otherwise?

How about if we could replace the above code with the following:

IgnoreExceptionButNotIfDebugging(() =>
{
    //some code
});

Better.

The mysterious IgnoreExceptionButNotIfDebugging is a simple method that takes a lambda and looks like this:

public static void IgnoreExceptionButNotIfDebugging(Action codeBlockToExecute)
{
    try
    {
        codeBlockToExecute();
    }
    catch (Exception ex)
    {
        if (Debugger.IsAttached)
            Debugger.Break();

        Trace.Write("Exception occurred: " + ex.Message);

        EventLog.WriteEntry("MyApp", ex.Message, EventLogEntryType.Warning);

#if DEBUG
            throw;
#endif
    }
}

Now you can wrap any of your code in IgnoreExceptionButNotIfDebugging and make sure things don’t remain silent when you are debugging!

Lightweight DataTable Serialization

We all know that untyped data structures like DataTable and DataSet should not be passed around, but sometimes - just sometimes - you've got to do it because it makes sense and because it’s the most cost-effective way to meet your goals. However, passing things like DataTable over WCF can kill performance because of the huge serialization overhead in both space and time.

So if you really have to go ahead with this crazy idea of sending a DataTable over WCF, here’s a somewhat more efficient serialization technique you can use. The basic idea is to binary-serialize the DataTable’s row data and pass it as a byte array along with the schema information, so the client can reconstruct the table on the other end. Needless to say, doing this restricts your WCF clients to .NET, so you might also want to include another web method for other clients.

public static void LightWeightSerialize(DataTable myDataTable, out byte[] serializedTableData, out string tableSchema)
{
    //Get all row values as jagged object array
    object[][] tableItems = new object[myDataTable.Rows.Count][];
    for (int rowIndex = 0; rowIndex < myDataTable.Rows.Count; rowIndex++)
        tableItems[rowIndex] = myDataTable.Rows[rowIndex].ItemArray;

    //binary serialize jagged object array
    BinaryFormatter serializationFormatter = new BinaryFormatter();
    MemoryStream buffer = new MemoryStream();
    serializationFormatter.Serialize(buffer, tableItems);
    serializedTableData = buffer.ToArray();


    //Get table schema
    StringBuilder tableSchemaBuilder = new StringBuilder();
    myDataTable.WriteXmlSchema(new StringWriter(tableSchemaBuilder));
    tableSchema = tableSchemaBuilder.ToString();
}

And here’s the deserializer to go with it:

public static DataTable LightWeightDeserialize(byte[] serializedTableData, string tableSchema)
{
    DataTable table = new DataTable();
    table.ReadXmlSchema(new StringReader(tableSchema));

    BinaryFormatter serializationFormatter = new BinaryFormatter();
    MemoryStream buffer = new MemoryStream(serializedTableData);
    object[][] itemArrayForRows = (object[][]) serializationFormatter.Deserialize(buffer);

    table.MinimumCapacity = itemArrayForRows.Length;
    table.BeginLoadData();
    for (int rowIndex = 0; rowIndex < itemArrayForRows.Length; rowIndex++)
        table.Rows.Add(itemArrayForRows[rowIndex]);
    table.EndLoadData();

    return table;
}
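As a quick round-trip sketch of how the two methods are used (the sample table below is made up, and in real code the byte array and schema string would travel over your WCF operation):

using System;
using System.Data;

class LightWeightSerializationDemo
{
    // Assumes LightWeightSerialize/LightWeightDeserialize from above are in scope
    // (e.g. defined as static methods on this class or a shared utility class).
    static void Main()
    {
        var source = new DataTable("People");
        source.Columns.Add("Id", typeof(int));
        source.Columns.Add("Name", typeof(string));
        source.Rows.Add(1, "Alice");
        source.Rows.Add(2, "Bob");

        byte[] payload;
        string schema;
        LightWeightSerialize(source, out payload, out schema);    // service side

        DataTable copy = LightWeightDeserialize(payload, schema); // client side
        Console.WriteLine("{0} rows came back, payload was {1} bytes", copy.Rows.Count, payload.Length);
    }
}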

How efficient is this? It really depends on your data. For instance, with some of my test data of 10K rows I got about a 6X smaller payload and 30% faster serialization. But as the number of rows increases, the speed advantage diminishes compared to the built-in XML serialization that you can access via ReadXml/WriteXml. For example, for a million rows the above method still gives me a 4X smaller payload, but serialization is actually 3X slower than the built-in XML serializer. So experiment before you go either way!

The Best Culture Invariant Format for DateTime

If you are looking to display a DateTime as text without causing confusion to users in different countries, the good choices are either "o" or "r". The "o" format is in general preferable as it can also include the timezone offset.

long t = DateTime.Now.Ticks;
Console.WriteLine((new DateTime(t)).ToString("o"));
Console.WriteLine((new DateTime(t, DateTimeKind.Local)).ToString("o"));
Console.WriteLine((new DateTime(t, DateTimeKind.Unspecified)).ToString("o"));
Console.WriteLine((new DateTime(t, DateTimeKind.Utc)).ToString("o"));

This prints the following when the actual date/time is 2009-11-08T17:16:13.7791953 PST:

2009-11-08T17:16:13.7791953
2009-11-08T17:16:13.7791953-08:00
2009-11-08T17:16:13.7791953
2009-11-08T17:16:13.7791953Z

If you use "r" instead, it prints the following:

Sun, 08 Nov 2009 17:26:02 GMT
Sun, 08 Nov 2009 17:26:02 GMT
Sun, 08 Nov 2009 17:26:02 GMT
Sun, 08 Nov 2009 17:26:02 GMT