The latest version of the Stack Overflow Trilogy Creative Commons Data Dump is now available. This reflects all public data in …
- Stack Overflow
- Server Fault
- Super User
- Meta Stack Overflow
… up to March 2010.
Download the Stack Overflow Trilogy Creative Commons Data Dump via BitTorrent
Please note that the Stack Overflow trilogy data dumps are now hosted at LegalTorrents! You can subscribe via RSS and be notified every time a new dump is available.
Have fun remixing and reusing; all we ask is for proper attribution.
March 2nd, 2010 at 2:24 am
New to this dump are the email/IP user Gravatar hashes. The hash is of the user's email, if provided, and if not, of the user's last known IP address.
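As a rough illustration of that fallback rule, here is a minimal sketch. It assumes the standard Gravatar convention (MD5 of the trimmed, lowercased email) and assumes the IP fallback is hashed the same way as a plain string; the dump's exact procedure is not stated here, and `user_hash` is a hypothetical helper name.

```python
import hashlib


def user_hash(email, last_ip):
    """Hash the email if one was provided, otherwise the last known IP.

    Assumption: MD5 over the trimmed, lowercased value, as Gravatar does
    for emails. How the dump actually hashes IP addresses is not documented
    in this post, so the fallback branch is only a guess.
    """
    if email and email.strip():
        value = email.strip().lower()
    else:
        value = last_ip.strip()
    return hashlib.md5(value.encode("utf-8")).hexdigest()
```

Under these assumptions, two users who registered the same email with different capitalization would end up with the same hash, while a user without an email would get a hash that changes whenever their last known IP changes.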
March 2nd, 2010 at 12:22 pm
If you want a NoSQL way to play with the dumps and aren't digging the XML, check out the SO dump importer for MongoDB. It’s super simple and fast.
http://github.com/bgianfo/stackoverflow-mongodb
March 3rd, 2010 at 11:06 pm
I found an odd problem in the March dump: the comments.xml file appears to be incomplete for three of the four sites. The date of the last comment for each site is:
META – 2009-09-03
SU – 2010-02-24
SF – 2010-01-21
SO – 2010-02-28
It looks like some of the comments.xml files in the last dump are also truncated, but at different dates.
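For anyone who wants to repeat this check, here is a minimal sketch of one way to find the last comment date. It assumes each comment is a `<row>` element carrying an ISO 8601 `CreationDate` attribute (so plain string comparison orders the dates correctly); `last_comment_date` is a hypothetical helper name, not part of the dump.

```python
import xml.etree.ElementTree as ET


def last_comment_date(source):
    """Return the latest CreationDate found on any <row>, or None.

    `source` may be a file path or a binary file object. The file is
    streamed with iterparse so a large comments.xml need not be loaded
    into memory at once.
    """
    latest = None
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            created = elem.get("CreationDate")
            # ISO 8601 timestamps sort correctly as plain strings.
            if created is not None and (latest is None or created > latest):
                latest = created
        elem.clear()  # release the element's attributes and children
    return latest
```

Calling `last_comment_date("comments.xml")` on each site's file and comparing the result with the dump date should show whether a file stops early.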
March 4th, 2010 at 2:31 am
Good find Greg – the data dump for next month will contain all of the missing comments.
March 5th, 2010 at 7:07 am
… and if you are looking for an SQL way to play with the dumps, Rdbhost has the data online:
http://www.rdbhost.com/rdbadmin/main.html?r0000000767
March 5th, 2010 at 11:28 am
… and if you are looking for a Microsoft SQL Server way to play with the dumps, I’ve got the data online too:
http://www.brentozar.com/archive/2010/02/querying-the-stackoverflow-data-dump/