Downloading All of Hacker News Posts and Comments

Introduction

Two files contain all stories and comments posted on Hacker News from its start in 2006 to May 29, 2014 (exact dates are below). They were downloaded with a simple program I wrote, Hacker News Downloader, which makes REST API calls to HN's official API. The program used API parameters to paginate through the created date of items to retrieve all posts and comments. Each file contains the entire sequence of JSON responses, exactly as returned by the API calls, in a JSON array.

HNStoriesAll.json

Contains all the stories posted on HN from Mon, 09 Oct 2006 18:21:51 GMT to Thu, 29 May 2014 08:25:40 GMT.

Total count

1,333,789

File size

1.2GB uncompressed, 115MB compressed

How was this created

I wrote a small program, Hacker News Downloader, to create these files. It is available on GitHub.

Format

The entire file is a JSON-compliant array. Each element in the array is a JSON object that is exactly the response returned by the HN Algolia REST API. The property named `hits` contains the actual list of stories. As this file is very large, we recommend JSON parsers that can work on file streams instead of reading the entire data into memory.

{
	"hits": [{
		"created_at": "2014-05-31T00:05:54.000Z",
		"title": "Publishers withdraw more than 120 gibberish papers",
		"url": "http://www.nature.com/news/publishers-withdraw-more-than-120-gibberish-papers-1.14763?WT.mc_id=TWT_NatureNews",
		"author": "danso",
		"points": 1,
		"story_text": "",
		"comment_text": null,
		"num_comments": 0,
		"story_id": null,
		"story_title": null,
		"story_url": null,
		"parent_id": null,
		"created_at_i": 1401494754,
		"_tags": ["story",
		"author_danso",
		"story_7824727"],
		"objectID": "7824727",
		"_highlightResult": {
			"title": {
				"value": "Publishers withdraw more than 120 gibberish papers",
				"matchLevel": "none",
				"matchedWords": []
			},
			"url": {
				"value": "http://www.nature.com/news/publishers-withdraw-more-than-120-gibberish-papers-1.14763?WT.mc_id=TWT_NatureNews",
				"matchLevel": "none",
				"matchedWords": []
			},
			"author": {
				"value": "danso",
				"matchLevel": "none",
				"matchedWords": []
			},
			"story_text": {
				"value": "",
				"matchLevel": "none",
				"matchedWords": []
			}
		}
	}],
	"nbHits": 636094,
	"page": 0,
	"nbPages": 1000,
	"hitsPerPage": 1,
	"processingTimeMS": 5,
	"query": "",
	"params": "advancedSyntax=true\u0026analytics=false\u0026hitsPerPage=1\u0026tags=story"
}
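The streaming advice above can be sketched with nothing but the Python standard library. This is a minimal sketch, not the JSON.NET sample from the repo: the helper names `iter_responses` and `iter_hits` are made up for illustration, and it assumes every element of the top-level array is a JSON object, as described in the format above.

```python
import json


def iter_responses(fp, chunk_size=65536):
    """Yield each element of a top-level JSON array, reading fp in chunks.

    Assumes the elements are JSON objects (true for these files), so a
    partially read element never parses successfully by accident.
    """
    decoder = json.JSONDecoder()
    buf = fp.read(chunk_size).lstrip()
    if not buf.startswith("["):
        raise ValueError("expected a top-level JSON array")
    buf = buf[1:]
    while True:
        # Skip whitespace and the commas that separate array elements.
        buf = buf.lstrip(" \t\r\n,")
        if buf.startswith("]"):
            return
        try:
            obj, end = decoder.raw_decode(buf)
        except json.JSONDecodeError:
            # The element is incomplete: pull in more text and retry.
            more = fp.read(chunk_size)
            if not more:
                raise
            buf += more
            continue
        yield obj
        buf = buf[end:]


def iter_hits(fp, chunk_size=65536):
    """Yield every item found under the `hits` property of each response."""
    for response in iter_responses(fp, chunk_size):
        yield from response.get("hits", [])
```

For example, `with open("HNStoriesAll.json", encoding="utf-8") as fp:` followed by a loop over `iter_hits(fp)` visits the stories one at a time while keeping only the current response in memory.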

HNCommentsAll.json

Contains all the comments posted on HN from Mon, 09 Oct 2006 19:51:01 GMT to Fri, 30 May 2014 08:19:34 GMT.

Total count

5,845,908

File size

9.5GB uncompressed, 862MB compressed

How was this created

I wrote a small program, Hacker News Downloader, to create these files. It is available on GitHub.

Format

The entire file is a JSON-compliant array. Each element in the array is a JSON object that is exactly the response returned by the HN Algolia REST API. The property named `hits` contains the actual list of comments. As this file is very large, we recommend JSON parsers that can work on file streams instead of reading the entire data into memory.


{
	"hits": [{
		"created_at": "2014-05-31T00:22:01.000Z",
		"title": null,
		"url": null,
		"author": "rikacomet",
		"points": 1,
		"story_text": null,
		"comment_text": "Isn\u0026#x27;t the word dyes the right one to use here? Instead of dies?",
		"num_comments": null,
		"story_id": null,
		"story_title": null,
		"story_url": null,
		"parent_id": 7821954,
		"created_at_i": 1401495721,
		"_tags": ["comment",
		"author_rikacomet",
		"story_7824763"],
		"objectID": "7824763",
		"_highlightResult": {
			"author": {
				"value": "rikacomet",
				"matchLevel": "none",
				"matchedWords": []
			},
			"comment_text": {
				"value": "Isn\u0026#x27;t the word dyes the right one to use here? Instead of dies?",
				"matchLevel": "none",
				"matchedWords": []
			}
		}
	}],
	"nbHits": 1371364,
	"page": 0,
	"nbPages": 1000,
	"hitsPerPage": 1,
	"processingTimeMS": 8,
	"query": "",
	"params": "advancedSyntax=true\u0026analytics=false\u0026hitsPerPage=1\u0026tags=comment"
}

Where to download

As GitHub restricts each file to 100MB and also has policies against data warehousing, these files are currently hosted at FileDropper.com. Unfortunately, FileDropper currently shows ads with misleading download links, so be careful which link you click. Below is the screen FileDropper shows; currently the button marked in red downloads the actual file.

FileDropperDownloadScreen

HN Stories Download URL

Using Browser: http://www.filedropper.com/hnstoriesall

Using Torrent Client: magnet link (thanks to @saturation)

Archived at: Internet Archive (thanks to Bertrand Fan)

HN Comments Download URL

Using Browser: http://www.filedropper.com/hncommentsall

Using Torrent Client: magnet link (thanks to @saturation)

Archived at: Internet Archive (thanks to Bertrand Fan)

A few points of interest

  • The API rate limit is 10,000 requests per hour; exceed it and you get blacklisted. I tried to be even more conservative by putting 4 seconds of sleep between calls.
  • I like to keep the entire response from each call as-is, so the return value of this function is used to stream a serialized array of JSON response objects to a file.
  • As the output files are giant JSON files, you will need a JSON parser that can use streams. I used JSON.NET, which worked out pretty well. You can find the sample code in my GitHub repo.
  • In total, 1.3M stories and 5.8M comments were downloaded, and each set took about 10 hours.
  • It's amazing that all HN stories and comments so far fit in just under 1GB compressed!
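The points above (paginate by created date, sleep between calls, keep each response as-is, stream them out as one JSON array) can be sketched as follows. This is not the actual Hacker News Downloader code. The endpoint URL and the `numericFilters` parameter are assumptions based on the public HN Search API, and the function names are hypothetical.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumed endpoint of the Algolia-backed HN Search API (not taken from the post).
API_URL = "https://hn.algolia.com/api/v1/search_by_date"


def build_url(tag, before_ts=None, hits_per_page=1000):
    """Build a request URL for items with the given tag created before before_ts."""
    params = {"tags": tag, "hitsPerPage": hits_per_page}
    if before_ts is not None:
        # Assumed filter syntax: only items older than the given Unix timestamp.
        params["numericFilters"] = f"created_at_i<{before_ts}"
    return API_URL + "?" + urllib.parse.urlencode(params)


def fetch_json(url):
    """Fetch one page from the API and decode it."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def download_all(tag, fetch=fetch_json, pause=4.0, sleep=time.sleep):
    """Yield raw API responses, newest first, walking back through created_at_i.

    Caveat of this sketch: items that share the oldest timestamp of a page
    but fall on the next page are skipped by the strict '<' filter.
    """
    before_ts = None
    while True:
        response = fetch(build_url(tag, before_ts))
        hits = response.get("hits", [])
        if not hits:
            return
        yield response
        before_ts = min(hit["created_at_i"] for hit in hits)
        sleep(pause)  # stay well under the hourly rate limit


def write_array(responses, out):
    """Stream responses to `out` as one JSON array, without buffering them all."""
    out.write("[")
    for i, response in enumerate(responses):
        if i:
            out.write(",\n")
        json.dump(response, out)
    out.write("]")
```

With a 4-second pause the loop makes at most 900 requests per hour, far below the 10,000-per-hour limit. Passing `fetch` and `sleep` as parameters keeps the loop easy to exercise without touching the network.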

Issues and Suggestions

Please let me know about any issues and suggestions in the comments. You can also file an issue at the "shell" GitHub repo I created for this data.

A quicker way to Twitter

screenshot_quicktwit
This past weekend, I finally thought about giving Twitter a try and started looking for a client app that would let me update my status very quickly with a global keyboard shortcut. I'm not into following anyone or replying to anyone; I just wanted a very simple app with one text box. Apparently no such app existed in the Twitter Fan Wiki, which actually turned out to be a good thing because I immediately started looking at Twitter's API and any C# wrappers. About 90 minutes later I had my app ready. On the way I also added functionality to break big updates into multiple tweets. This little (literally) app is now open sourced on CodePlex and ready for you to try out!

Introducing DSS

There are tons and tons of things to blog but here is a quick one.

Last Thanksgiving (a 4-day holiday in the USA) I wanted to work on something really cool that was absolutely worth doing and that I could spend my entire 4 days on continuously. I looked over my list of pending projects to find something extraordinarily cool, kept thinking about the new ideas flowing around, looked over other idea websites and realized that my mind just kept going blank all the while.

So when people asked what my plans for Thanksgiving were, I'd reply "I'll be doing Project Blank" :).

It just so happened that at the very start of Thanksgiving I was casually reading the SSE spec that had just been announced by Ray Ozzie, and I immediately realized what was missing in it and the huge possibilities for massive human collaboration that it could make happen. The rest is history. I ended up spending about 16 hours a day designing what I now call the Data Syndication Services specification and writing a reference application for it. While my efforts were inspired by SSE and Groove, the DSS design enables data sharing on a massive scale on much more realistic grounds.

And guess what, I still call the project binaries Blank :).

Want to take a look? Go ahead and collaborate on my GitHub repo!

Groove Hacks

About a year and a half ago, the new version of Groove had come out and it still didn't have the ability to export IMs. It drove me nuts, so I started to write my own Groove tool that would do it, as an excuse to explore its infamous internals. Ah! What a ride that was! Groove APIs have an extremely large surface area (which means there are thousands and thousands of them sprinkled all over in hard-to-find places). Tons of them have confusing names and misleading functionality, and are put in the wrong place. The fun part? There is almost no documentation! And yeah, did I forget to mention that they are heavily C++ oriented, frequently late bound and mostly proprietary stuff (they even have their own proprietary definition for rich text and APIs!)?

If your brain needs a challenge, that's the place to dig into. After sacrificing 3 of my weekends I finally had a working tool that exports Groove IMs and puts them into Outlook without losing formatting or attachments! I consider this a feat equivalent to removing the nag dialog of WinZip by changing an x86 jump instruction in its disassembled binary using only the Visual Studio debugger and absolutely nothing else ;).

This tool had been sitting on my hard drive crying to get out for months and months. In between, I did some polishing up: adding wizards and support for Word and Excel, creating a help file and even creating a website for it. So now I think it's pretty much ready, and I have decided to give it away free for personal use (similar tools cost $50 or so, I guess). Check it out if you use Groove and want to save your invaluable messages! Call it laziness or ignorance or whatever, but I really do feel guilty for not putting this out earlier, when 3.0 came out and a lot of people SO needed it!

My New CodeProject Article On Equation Rendering

I just finished my new article on CodeProject. The mission on MimeTeX started a couple of months ago when, one weekend, I just got attracted to MimeTeX's C code like a magnet ;). Now I've built an ASP.Net handler, caching, admin etc. on top of it and it's looking great! Enabling scientific content on the web seems to be my new obsession. So if you take pride in delighting your users with every new release, here's your brand new feature! Go ahead, download it, use it! If you run into any problem, I'll be glad to offer my help.

Some Cool .Net Nuggets

  • If a type's constructor (i.e. the static constructor) throws an exception, the entire type becomes unusable. Any attempt to call any member of that type results in a TypeInitializationException.
  • Operator overloading should never be the only way to use a piece of functionality if your code targets 1.x versions of the framework, because VB.Net can't access it without resorting to ugly calls such as `op_Addition`.
  • There is a universal symbol for money (a generic version of $, £, ¥ etc.) and it's ¤ (U+00A4). If you format a number as currency in a culture-invariant way, .Net attaches this symbol to your number. I just think it's cool to have a universal symbol for money :).
  • Simplest way to convert hex number to int:
    Int32.Parse("1AFF", NumberStyles.HexNumber, null)
  • Simplest way to display array of bytes as hex values:
    BitConverter.ToString(byteArray)
  • If you updated something on your computer and suddenly your .Net app misbehaves, it is possible to do an automatic rollback. The .Net Framework keeps track of the assemblies that were loaded by any managed app. This info is stored in an INI file in `LocalSettings\Application Data\ApplicationHistory` and is used by the .Net Application Restore tool. I think this is a great debugging aid too.
  • In the .Net world, zombies are not purely imaginary:
    class Person
    {
            static object HoldOnToMe;
            ~Person()
            {
                    HoldOnToMe = this;
                    GC.ReRegisterForFinalize(this);
            }
    }
    
  • Value types are allocated on the stack, but not when you have an array of value types. For example, `new Int32[100]` allocates 100 unboxed integers on the heap, not on the stack.
  • The Finally block is not always guaranteed to execute. If any of these 3 special exceptions happens, the code in Finally won't be executed: `OutOfMemoryException`, `StackOverflowException` and `ExecutionEngineException` (I've been fortunate enough to experience all of these). That means that if you had created some global kernel objects, they will hang around and may interfere when the user restarts your app. BTW, if you see code like `catch(Exception ex) {...}` or `catch{...}`, tell the developer that he has committed a sin.
  • Apparently GC.Collect() is not always a line of code you should be disgusted by. You might want to call it, especially when you own the process and have created loads of objects which won't be used any further (for example, when moving on to a new tab in a WinForms app). I used this in one of my projects to relieve memory pressure and felt really guilty about it, until recently.
            GC.Collect();
            //block my thread till objects needing finalization are done
            GC.WaitForPendingFinalizers();
            GC.Collect();
    
  • You should always strong name your assemblies, especially if they are going to be used by assemblies in multiple AppDomains in the same process, because only strong-named assemblies are shared between domains; otherwise each AppDomain will have its own copy. Why would anyone have multiple AppDomains, you ask? Well, if you are enabling your app to have 3rd party plugins, I strongly recommend loading all these plugins into a separate domain. This way you can not only control the security policy on these plugins but also unload the bad plugins without shutting down your app. This is often overlooked in various plugin architectures for .Net, but if you don't do this, your app might go down the same route as IE6.
  • If you have enabled your app or website for localization, don't forget to test it with the Turkish language. If your thread's CurrentCulture is Turkish (tr-TR) and you try to uppercase the letter i, you get İ instead of the normal English I (i.e. Unicode character U+0130 instead of U+0049). Scott Hanselman has first-hand experience.
  • Many of you know the Application.ThreadException event, which lets you capture unhandled exceptions in a WinForms app and do something like Windows Error Reporting. But the better way is probably the `AppDomain.UnhandledException` event, because it also notifies you of non-CLS-compliant exceptions and doesn't need a reference to the `Application` object.
  • The values of public constants that you reference from other assemblies are embedded in your own assembly. That means that if the other assembly changes the value of the constant afterwards, you must recompile your own assembly; otherwise you will still be using the old value of the constant. I think this is as critical a "bug" as lapsed event handlers.
  • Jagged arrays are not CLS compliant. If you are building a library that can be used by VB or C# guys, you can't have jagged arrays as public member type.
  • Visual Basic can do this:
    Try
            ...
    Catch e As Exception When x = 0
            ...
    End Try
    

Integrating RSS And Calendar Essay Available Now

The idea of wrapping calendar information into an RSS feed may sound very appealing. Almost every website owned by some kind of group or organization has its own event calendar. The thought that you could aggregate them into your "Calendar Aggregator" is just so geekily cool. What if people started putting up their weekend plans through some kind of RSS-Calendar and you could subscribe to them in your calendar program! I dug through dozens of W3C and other specs and half a dozen implementations to find out what has been done so far and why it hasn't happened yet. My findings and a possible solution are summarized in my essay, in reader-friendly writing.

Updates - Spring 2005

I will be writing all New York City related stuff at Metblogs rather than on my own blog. This makes sense because a lot of people who aren't in this region don't need to get those NYC stories. On the other hand, my NYC related writing will now reach a much larger audience. Check out some of my entries there about cool New York events, restaurants and such.

In other site news, you might have noticed the new skin and the more FireFox friendly design. I also decided to give away the engine that my website runs on (C# code I wrote almost 4 years ago) along with the entire source code for this website (that's in VB.Net, just for fun). Nothing special, but the main highlight of the engine is that it accepts a raw HTML file as the base template and embeds your dynamic ASP.Net WebForm content inside that HTML. It also provides a navigation control which runs off of XHTML templates and XML.

If you like my free utilities, don't forget to check out the massive updates in my Software section. It now has many more of the programs and utilities that I had kept to myself. Specifically, the one called Browser History Analyzer analyses your IE history (support for FireFox coming soon), builds an MS Access database and gives you tons of amusing info about your browsing habits, such as the queries you fired at search engines, how you refine your keywords progressively, how much time you usually spend on a page, how much time you spend browsing and so on. While still in development, it also features an extensible architecture that lets you make your own plugins. I've also put up the link to the article I wrote for CodeProject about how to show Explorer's progress dialog in your apps.

Finally, some Alaska trip photos have also been added. Enjoy :).

Using Windows Explorer Progress Dialog In Your Application

ProgressDialogDemo

When you copy a lot of files in Explorer, you see the standard Windows progress dialog with the "flying papers" animation and the estimated time remaining. This dialog is accessible to any Windows application through the IProgressDialog interface. This source code provides a managed .Net wrapper to easily and intuitively integrate the Windows progress dialog into your own applications. You can read more details in my original article on CodeProject. Also see the comments on that article. It looks like two of the guys really hit it off and have produced a standalone version.

Warning: This program was last updated on 13 Jan 2005 and is considered obsolete. There are no plans to update it and no support is provided. It exists here purely for its historical and nostalgic value.

WinProgressDialog is now archived on GitHub.