Using Public Data Sets
Amazon Web Services provides a repository of public data sets that can be seamlessly integrated into AWS cloud-based applications. Amazon stores the data sets at no charge to the community and, as with all AWS services, you pay only for the compute and storage you use for your own applications.
Public Data Set Concepts
Previously, large data sets such as the mapping of the Human Genome and the US Census data required hours or days to locate, download, customize, and analyze. Now, anyone can access these data sets from an EC2 instance and start computing on the data within minutes. You can also leverage the entire AWS ecosystem and easily collaborate with other AWS users. For example, you can produce or use prebuilt server images with tools and applications to analyze the data sets. By hosting this important and useful data with cost-efficient services such as Amazon EC2, AWS hopes to provide researchers across a variety of disciplines and industries with tools to enable more innovation, more quickly.
For more information, see the Public Data Sets on AWS page.
Available Public Data Sets
Public data sets are currently available in the following categories:
Biology—Includes Human Genome Project, GenBank, and other content.
Chemistry—Includes multiple versions of PubChem and other content.
Economics—Includes census data, labor statistics, transportation statistics, and other content.
Encyclopedic—Includes Wikipedia content from multiple sources and other content.
Finding Public Data Sets
Before you can use a public data set, you must locate the data set and determine which format the data set is hosted in. The data sets are available in two possible formats: Amazon EBS snapshots or Amazon S3 buckets.
To find a public data set and determine its format
Go to the Public Data Sets Page to see a listing of all available public data sets. You can also enter a search phrase on this page to query the available public data set listings.
Click the name of a data set to see its detail page.
On the data set detail page, look for either a snapshot ID, which identifies an Amazon EBS formatted data set, or an Amazon S3 URL, which identifies an Amazon S3 formatted data set.
Data sets that are in snapshot format are used to create new EBS volumes that you attach to an EC2 instance. For more information, see Creating a Public Data Set Volume from a Snapshot.
For data sets that are in Amazon S3 format, you can use the AWS SDKs or the HTTP query API to access the information, or you can use the AWS CLI to copy or synchronize the data to and from your instance. For more information, see Amazon S3 and Amazon EC2.
You can also use Amazon EMR to analyze and work with public data sets. For more information, see What is Amazon EMR?
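As a minimal sketch of the two formats described above, the following Python helper classifies an identifier copied from a data set detail page and, for an Amazon S3 formatted data set, assembles an aws s3 sync command that could be run from an instance. The function names, the bucket name, and the local path are hypothetical. The snapshot ID pattern (snap- followed by hexadecimal characters) and the s3:// URL form are assumptions about how the identifiers appear on the page.

```python
import re

# Assumed shape of an EBS snapshot ID: "snap-" followed by hex digits.
SNAPSHOT_ID = re.compile(r"^snap-[0-9a-f]+$")


def data_set_format(identifier):
    """Return "ebs-snapshot" or "s3" for an identifier from a detail page."""
    value = identifier.strip()
    if SNAPSHOT_ID.match(value):
        return "ebs-snapshot"
    if value.startswith("s3://"):
        return "s3"
    raise ValueError("unrecognized data set identifier: %r" % identifier)


def s3_sync_command(s3_url, local_dir):
    """Build the AWS CLI argument list that copies an S3 prefix to local_dir."""
    if data_set_format(s3_url) != "s3":
        raise ValueError("expected an s3:// URL")
    return ["aws", "s3", "sync", s3_url, local_dir]


if __name__ == "__main__":
    # Hypothetical bucket and path, for illustration only.
    print(" ".join(s3_sync_command("s3://example-public-data-set/", "/mnt/data")))
```

The returned list can be passed to subprocess.run on an instance where the AWS CLI is installed and configured. Building the list separately makes it easy to review the command before anything is copied.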
Creating a Public Data Set Volume from a Snapshot
To use a public data set that is in snapshot format, you create a new volume, specifying the snapshot ID of the public data set. You can create your new volume using the AWS Management Console as follows. If you prefer, you can use the create-volume AWS CLI command instead.
To create a public data set volume from a snapshot
Open the Amazon EC2 console.
From the navigation bar, select the region that your data set snapshot is located in.
Snapshot IDs are constrained to a single region, and you cannot create a volume from a snapshot that is located in another region. In addition, you can only attach an EBS volume to an instance in the same Availability Zone. For more information, see Resource Locations.
In the navigation pane, click Volumes.
Above the upper pane, click Create Volume.
In the Create Volume dialog box, in the Type list, select General Purpose SSD, Provisioned IOPS SSD, or Magnetic. For more information, see Amazon EBS Volume Types.
In the Snapshot field, start typing the ID or description of the snapshot for your data set. Select the snapshot from the list of suggested options.
If the snapshot ID you are expecting to see does not appear, you may have a different region selected in the Amazon EC2 console. If the data set you identified in Finding Public Data Sets does not specify a region on its detail page, it is likely contained in the US East (N. Virginia) region (us-east-1).
In the Size field, enter the size of the volume (in GiB or TiB), or verify that the default size of the snapshot is adequate.
If you specify both a volume size and a snapshot ID, the size must be equal to or greater than the snapshot size. When you select a volume type and a snapshot ID, minimum and maximum sizes for the volume are shown next to the Size list.
For Provisioned IOPS SSD volumes, in the IOPS field, enter the maximum number of input/output operations per second (IOPS) that the volume can support.
In the Availability Zone list, select the Availability Zone in which to create the volume. This must be the Availability Zone of the instance that will use the volume.
EBS volumes can only be attached to instances in the same Availability Zone.
Click Yes, Create.
If you created a larger volume than the default size for that snapshot (by specifying a size in Step 7), you need to extend the file system on the volume to take advantage of the extra space. For more information, see Modifying the Size, IOPS, or Type of an EBS Volume on Linux.
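For readers who prefer the create-volume AWS CLI command to the console, the procedure above can be sketched as a small Python function that assembles the command line and applies the same rules: the Availability Zone must be in the snapshot's region, a requested size must not be smaller than the snapshot, and Provisioned IOPS SSD volumes need an IOPS value. The function name, its parameters, and the snapshot ID in the usage line are hypothetical. The check that an Availability Zone name begins with its region name is an assumption that holds for names such as us-east-1a.

```python
# Console volume type names mapped to the values create-volume accepts.
VOLUME_TYPES = {
    "General Purpose SSD": "gp2",
    "Provisioned IOPS SSD": "io1",
    "Magnetic": "standard",
}


def create_volume_command(snapshot_id, availability_zone, region,
                          volume_type="General Purpose SSD",
                          size_gib=None, snapshot_size_gib=None, iops=None):
    """Build the argument list for `aws ec2 create-volume`."""
    # Assumption: Availability Zone names start with the region name.
    if not availability_zone.startswith(region):
        raise ValueError("Availability Zone is not in the snapshot's region")
    # A requested size must be equal to or greater than the snapshot size.
    if size_gib is not None and snapshot_size_gib is not None:
        if size_gib < snapshot_size_gib:
            raise ValueError("volume size is smaller than the snapshot")
    cli_type = VOLUME_TYPES[volume_type]
    command = ["aws", "ec2", "create-volume",
               "--region", region,
               "--availability-zone", availability_zone,
               "--snapshot-id", snapshot_id,
               "--volume-type", cli_type]
    # Omitting --size lets the volume take the snapshot's default size.
    if size_gib is not None:
        command += ["--size", str(size_gib)]
    if cli_type == "io1":
        if iops is None:
            raise ValueError("Provisioned IOPS SSD volumes need an IOPS value")
        command += ["--iops", str(iops)]
    return command


if __name__ == "__main__":
    # Hypothetical snapshot ID, for illustration only.
    print(" ".join(create_volume_command("snap-1a2b3c4d", "us-east-1a", "us-east-1")))
```

The snapshot size is passed in by the caller here. In practice it could be read from the output of the describe-snapshots command before the volume is created.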
Attaching and Mounting the Public Data Set Volume
After you create your new data set volume, attach it to an EC2 instance to access the data. The instance must be in the same Availability Zone as the new volume. For more information, see Attaching an Amazon EBS Volume to an Instance.
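The attach step can also be performed with the attach-volume AWS CLI command. The following sketch assembles that command and rejects a volume and instance that are in different Availability Zones. The function name, the IDs in the usage line, and the default device name /dev/sdf are illustrative assumptions.

```python
def attach_volume_command(volume_id, instance_id, volume_az, instance_az,
                          device="/dev/sdf"):
    """Build the argument list for `aws ec2 attach-volume`."""
    # EBS volumes can only be attached to instances in the same Availability Zone.
    if volume_az != instance_az:
        raise ValueError(
            "volume is in %s but instance is in %s" % (volume_az, instance_az))
    return ["aws", "ec2", "attach-volume",
            "--volume-id", volume_id,
            "--instance-id", instance_id,
            "--device", device]


if __name__ == "__main__":
    # Hypothetical volume and instance IDs, for illustration only.
    print(" ".join(attach_volume_command(
        "vol-1a2b3c4d", "i-1a2b3c4d", "us-east-1a", "us-east-1a")))
```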
After you have attached the volume to an instance, you need to mount the volume on the instance. For more information, see Making an Amazon EBS Volume Available for Use.
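On a Linux instance, mounting typically comes down to creating a mount point and running mount. The sketch below builds those two shell command lines. It assumes that the volume created from the snapshot already contains a file system, and that the attached volume appears on the instance as a device such as /dev/xvdf, which depends on the instance's kernel and block device driver. The function name and the mount point are hypothetical. Mounting read-only by default is a design choice for this example, since a public data set is usually only read.

```python
import shlex


def mount_commands(device, mount_point, read_only=True):
    """Return shell command lines that mount an attached volume."""
    options = "ro" if read_only else "rw"
    return [
        # Create the mount point if it does not exist yet.
        "sudo mkdir -p " + shlex.quote(mount_point),
        # Mount the existing file system from the snapshot-based volume.
        "sudo mount -o %s %s %s" % (
            options, shlex.quote(device), shlex.quote(mount_point)),
    ]


if __name__ == "__main__":
    # Hypothetical device name and mount point, for illustration only.
    for line in mount_commands("/dev/xvdf", "/mnt/public-data"):
        print(line)
```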