Free Amazonian Size Databases
Amazon just released an astounding amount of free, public information. I don’t even know how to comment on this yet! That's over one terabyte of information that we can mash up, data mine, search and just generally futz with.
In the interests of full disclosure, I have a small business relationship with Amazon.com and I encourage you to click on the widget on the right of the screen and buy, buy, BUY!
In the Economics category, we have added a set of transportation databases from the US Bureau of Transportation Statistics. Data and statistics are provided for aviation, maritime, highway, transit, rail, pipeline, bike & pedestrian, and other modes of transportation, all in CSV format. I was able to locate employment data for our hometown airline and found out that they employed 9,322 full-time and 1,122 part-time employees as of the end of 2007.
In the Encyclopedic category, we have added access to the DBpedia Knowledge Base, the Freebase Data Dump, and the Wikipedia Extraction, or WEX.
The DBpedia Knowledge Base currently describes more than 2.6 million things including 213,000 people, 328,000 places, 57,000 music albums, 36,000 films, and 20,000 companies. There are 274 million RDF triples in the 67 GB data set.
The 66 GB Freebase Data Dump is an open database of the world’s information, covering millions of topics in hundreds of categories.
The Wikipedia Extraction (WEX) is a processed, machine-readable dump of the English-language section of the Wikipedia. At nearly 67 GB, this is a handy and formidable data set. The data is provided in the TSV format as exported by PostgreSQL.
Finally, we have updated the NCBI’s Genbank data. Weighing in at a hefty quarter of a terabyte, this public data set contains information on over 85 billion bases and 82 million sequence records.
Amazon Web Services Blog: New AWS Public Data Sets – Economics, DBpedia, Freebase, and Wikipedia
Here are some definitions and more information:
The Freebase Wikipedia Extraction (WEX) is a processed dump of the English language Wikipedia. The wiki markup for each article is transformed into machine-readable XML, and common relational features such as templates, infoboxes, categories, article sections, and redirects are extracted in tabular form. Freebase WEX is provided as a set of database tables in TSV format for PostgreSQL, along with tables providing mappings between Wikipedia articles and Freebase topics, and corresponding Freebase Types.
Freebase Download
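For anyone who wants to poke at the WEX tables without loading them into PostgreSQL first, here is a minimal sketch of reading PostgreSQL-style TSV in Python. The sample rows and their three columns are made up for illustration and are not the real WEX schema. The one PostgreSQL-specific detail it handles is that the default text export writes NULL as `\N`.

```python
import csv
import io

# PostgreSQL's default text export writes NULL values as a backslash followed by N.
PG_NULL = r"\N"


def read_pg_tsv(handle):
    """Yield rows from a tab-separated PostgreSQL export, mapping \\N to None.

    Assumption: backslash escapes inside fields (such as an escaped tab or
    newline) are not decoded here; this sketch only handles the NULL marker.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield [None if field == PG_NULL else field for field in row]


# Hypothetical sample data: id, title, last-updated date (NOT the real WEX columns).
sample = io.StringIO("1\tAnarchism\t\\N\n2\tAutism\t2008-01-01\n")
rows = list(read_pg_tsv(sample))
print(rows)
```

With a real file you would pass an open file handle instead of the `StringIO` sample. Full article text can be very long, so you may also need to raise the `csv` module's per-field size limit with `csv.field_size_limit` before reading.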
DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link other data sets on the Web to Wikipedia data.
wiki.dbpedia.org : About
GenBank® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (Nucleic Acids Research, 2008 Jan;36(Database issue):D25-30). There are approximately 85,759,586,764 bases in 82,853,685 sequence records in the traditional GenBank divisions and 108,635,736,141 bases in 27,439,206 sequence records in the WGS division as of February 2008.
GenBank Overview
Of course, Jeff (I’m assuming Bezos) puts in the ol’ good word for the Amazon EC2 service, which, I might add, I’m starting to get curious about for my own plans of world domination.
Instantiating these data sets is basically trivial. You create a new EBS volume of the appropriate size, basing it on the snapshot id of the data. Next, you attach the volume to a running EC2 instance in the same availability zone. Finally, you create a mount point and mount the EBS volume on the instance. The last step can take a minute or two for a large volume; the other steps are essentially instantaneous. Instead of spending days or weeks downloading these data sets you can be up and running from a standing start in minutes. Once again, cloud computing reduces the friction between “I have a good idea” and “here’s the realization of my idea.” You don’t need loads of bandwidth, processing power, or local disk space in order to do interesting and significant work with these world-scale data sets.
Amazon Web Services Blog: New AWS Public Data Sets – Economics, DBpedia, Freebase, and Wikipedia
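The three steps in that quote (create a volume from the snapshot, attach it, mount it) can be sketched in Python roughly like this. This is my own illustration, not anything from the AWS post. It assumes you hand it a boto3 EC2 client, for example `boto3.client("ec2")`, and the snapshot id, instance id, device name and mount point are all placeholders you would replace. The mount commands are only returned as strings, since they have to be run on the instance itself.

```python
from dataclasses import dataclass


@dataclass
class MountPlan:
    volume_id: str
    shell_commands: list


def attach_public_data_set(ec2, snapshot_id, zone, instance_id,
                           device="/dev/sdf", mount_point="/mnt/dataset"):
    """Create an EBS volume from a snapshot and attach it to an instance.

    `ec2` is assumed to be a boto3 EC2 client. The default device name and
    mount point are illustrative guesses, not values from the AWS post.
    """
    # Step 1: create the volume from the public snapshot. When no Size is
    # given, the volume takes the size of the snapshot.
    volume = ec2.create_volume(SnapshotId=snapshot_id, AvailabilityZone=zone)
    volume_id = volume["VolumeId"]

    # The volume must be in the "available" state before it can be attached.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    # Step 2: attach it to a running instance in the same availability zone.
    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)

    # Step 3: these commands still need to be run on the instance (over SSH,
    # say). The device may show up under a different name on some instances.
    commands = [
        f"sudo mkdir -p {mount_point}",
        f"sudo mount {device} {mount_point}",
    ]
    return MountPlan(volume_id, commands)
```

Calling it would look something like `attach_public_data_set(boto3.client("ec2"), "snap-EXAMPLE", "us-east-1a", "i-EXAMPLE")`, where both ids are placeholders; the real snapshot id for each data set comes from its AWS public data set listing.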
