tom beeby blogs

May 6, 2010

free data (when you need a lot of it)

Filed under: Development — tombeeby @ 12:31 am

i’ve spent a lot of time lately searching around for data sets that are sufficiently big to run search / indexing benchmarks. i’m reading up on lucene.net at the moment, and while it’s all very well to talk about lightning performance, i’m not going to believe it till i see it. finding good data is not as easy as you might think. here are a couple of good sources i’ve been using:

http://www.archive.org/details/ol_data

among other things, archive.org provides library catalogues for download in various formats (json, xml, marc). some of these are very, very big. a particularly good one i’ve used a lot is the ‘open library json dump’ – 20gigs (43,000,000+ records) of juicy data.
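a dump this size won't fit comfortably in memory, so it helps to stream it one record at a time when feeding an indexer. here is a minimal python sketch of that idea. it assumes the dump has been unpacked to a text file with one json object per line, which may not match the actual layout of the file you download; `iter_records` and its `limit` parameter are hypothetical names made up for this example.

```python
import json
from itertools import islice


def iter_records(path, limit=None):
    """Yield one parsed record per non-empty line, never loading the whole file.

    Assumption: each non-empty line holds exactly one JSON object.
    """
    with open(path, encoding="utf-8") as handle:
        # Skip blank lines lazily so only one line is held at a time.
        lines = (line for line in handle if line.strip())
        # limit=None reads everything; a number caps the sample size.
        for line in islice(lines, limit):
            yield json.loads(line)
```

passing a small `limit` first is a cheap way to check the record shape before committing to a full indexing run.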

http://blog.stackoverflow.com/2009/06/stack-overflow-creative-commons-data-dump/

the stack overflow creative commons data dump: a big chunk of posts, comments, authors etc. good data, and very useful for simulating real-world scenarios.
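the same streaming approach works for xml. the sketch below uses python's standard `xml.etree.ElementTree.iterparse` to walk a file element by element. it assumes each record is a single element named `row` with its fields stored as attributes; check the files you actually download, since that layout is an assumption here, and `iter_rows` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def iter_rows(path, tag="row"):
    """Yield the attributes of each matching element as a plain dict.

    Assumption: records are elements named `tag` with fields as attributes.
    """
    # "end" events fire once an element has been fully read.
    for _event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == tag:
            yield dict(elem.attrib)
            # Release this row's attributes and children once it is consumed.
            elem.clear()
```

each yielded dict can then be mapped onto whatever document fields the index under test expects.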

May 5, 2010

otto

Filed under: Uncategorized — tombeeby @ 9:41 pm

Meet Otto @ 4 months.
