tom beeby blogs

May 6, 2010

free data (when you need a lot of it)

Filed under: Development — tombeeby @ 12:31 am

i’ve spent a lot of time lately searching around for data sets that are big enough to run search / indexing benchmarks. i’m reading up on lucene.net at the moment, and while it’s all very well to talk about lightning-fast performance, i’m not going to believe it till i see it. finding good data is not as easy as you might think. here are a couple of good sources i’ve been using:

http://www.archive.org/details/ol_data

among other things, archive.org provides library catalogues for download in various formats (json, xml, marc). some of these are very, very big. a particularly good one i’ve used a lot is the ‘open library json dump’: 20 GB (43,000,000+ records) of juicy data.
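a file that size won’t fit comfortably in memory, so it is worth reading it one record at a time. here is a minimal sketch in python, assuming the dump holds one json document per line, possibly preceded by tab-separated metadata columns. that layout is an assumption, so check the first few lines of whatever file you download. `iter_records` and the sample lines are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line, one line at a time."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Assumption: the JSON document is the last tab-separated field.
        # A line that holds only JSON also works, since it has no tabs.
        payload = line.rsplit("\t", 1)[-1]
        try:
            yield json.loads(payload)
        except json.JSONDecodeError:
            # Skip malformed lines rather than abort a long benchmark run.
            continue


# Tiny in-memory stand-in for the real dump (hypothetical records).
sample_lines = [
    '{"title": "first book"}',
    "",
    '/type/edition\t/books/OL1M\t{"title": "second book"}',
    "not json at all",
]
records = list(iter_records(sample_lines))
print(len(records), "records parsed")
```

for the real thing, pass an open file handle instead of the list (an open text file is already an iterable of lines), and feed each record to the indexer as it arrives.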

http://blog.stackoverflow.com/2009/06/stack-overflow-creative-commons-data-dump/

a big chunk of posts, comments, authors etc. good data, and very useful for simulating real-world scenarios.
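the same streaming idea works for xml. a minimal sketch, assuming each file in the dump is a single root element containing many `<row ... />` elements with the data stored as attributes. the attribute names in the sample (`Id`, `PostTypeId`, `Title`) are assumptions for illustration, and `iter_rows` is a made-up helper name.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator


def iter_rows(source: BinaryIO) -> Iterator[dict]:
    """Yield the attributes of each <row> element as a plain dict."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            # Copy the attributes before clearing the element.
            yield dict(elem.attrib)
            # Drop the element's attributes and children to keep memory low.
            elem.clear()


# Tiny in-memory stand-in for one file of the dump (hypothetical content).
sample_xml = (
    b'<posts>'
    b'<row Id="1" PostTypeId="1" Title="hello" />'
    b'<row Id="2" PostTypeId="2" />'
    b'</posts>'
)
rows = list(iter_rows(io.BytesIO(sample_xml)))
print(len(rows), "rows parsed")
```

`iterparse` reads the file incrementally, so a file opened in binary mode can be passed in place of the `BytesIO` object.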
