i’ve spent a lot of time lately searching around for data sets that are big enough to run search / indexing benchmarks against. i’m reading up on lucene.net at the moment, and while it’s all very well to talk about lightning performance, i’m not going to believe it till i see it. finding good data is not as easy as you might think. here are a couple of good sources i’ve been using:
http://www.archive.org/details/ol_data
among other things, archive.org provides library catalogues for download in various formats (json, xml, marc). some of these are very, very big. a particularly good one i’ve used a lot is the ‘open library json dump’: 20 gigs (43,000,000+ records) of juicy data.
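a file that size won’t fit in memory on most machines, so it needs to be read a record at a time. here’s a rough python sketch of how that might look. it assumes one record per line, with the json object starting at the first `{` on the line (so any tab-separated columns in front of it are ignored). i haven’t checked that against every dump, so treat the layout as an assumption. `iter_records` and `count_records` are just names made up for this example.

```python
import json
from typing import Iterator, TextIO


def iter_records(lines: TextIO) -> Iterator[dict]:
    """Yield one dict per line, skipping blank or malformed lines.

    Assumption: each line holds a single JSON object, possibly preceded
    by other columns; parsing starts at the first '{' on the line.
    """
    for line in lines:
        start = line.find("{")
        if start == -1:
            continue  # blank line or no JSON object on it
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # truncated or garbled line
        if isinstance(record, dict):
            yield record


def count_records(path: str) -> int:
    """Stream the file at `path` and count the parseable records."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in iter_records(f))
```

because `iter_records` is a generator, only one line is held in memory at a time, which makes it easy to feed records straight into whatever indexer is being benchmarked.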
http://blog.stackoverflow.com/2009/06/stack-overflow-creative-commons-data-dump/
a big chunk of posts, comments, authors etc. good data, and very useful for simulating real-world scenarios.