Amazon EMR
Developer Guide

Parse Data with HParser

Informatica's HParser is a tool you can use to extract data stored in heterogeneous formats and convert it into a form that is easy to process and analyze. For example, if your company has legacy stock trading information stored in custom-formatted text files, you could use HParser to read the text files and extract the relevant data as XML. In addition to text and XML, HParser can extract and convert data stored in proprietary formats such as PDF and Word files.

HParser is designed to run on top of the Hadoop architecture, which means you can distribute operations across many computers in a cluster to efficiently parse vast amounts of data. Amazon EMR makes it easy to run Hadoop in the Amazon Web Services (AWS) cloud. With Amazon EMR you can set up a Hadoop cluster in minutes and automatically terminate the resources when the processing is complete.

In our stock trade information example, you could use HParser running on top of Amazon EMR to efficiently parse the data across a cluster of machines. The cluster will automatically shut down when all of your files have been converted, ensuring you are only charged for the resources used. This makes your legacy data available for analysis, without incurring ongoing IT infrastructure expenses.

The following tutorial walks you through the process of using HParser hosted on Amazon EMR to process custom text files into an easy-to-analyze XML format. The parsing logic for this sample has been defined for you using HParser, and is stored in the transformation services file (services_basic.tar.gz). This file, along with other content needed to run this tutorial, has been preloaded onto Amazon Simple Storage Service (Amazon S3) at s3n://elasticmapreduce/samples/informatica/. You will reference these files when you run the HParser job.

For information about how to run HParser on Amazon EMR, see Parse Data with HParser on Amazon EMR.

For more information about HParser and how to use it, go to