Amazon CloudSearch
Developer Guide (API Version 2011-02-01)

cs-generate-sdf

NAME
cs-generate-sdf - Experimental tool for analyzing the data you want to index
                  and automatically generating SDF batches for indexing. 
  
SYNOPSIS
  cs-generate-sdf --source PATH|S3_URI [--output PATH|S3_URI]
                  [--modified-after yyyy-mm-ddTnn:nn]
                  [--exclude-metadata] [--exclude-content] 
                  [--single-doc-per-csv] [--sdf-format json|xml]
                  [--docid-prefix STRING] [--doc-version NUM]
                  [--batch-size MB] [--batch-docs NUM]
                  COMMON_OPTIONS

DESCRIPTION

Analyze your data and generate SDF (Search Data Format) batches that can be
submitted to Amazon CloudSearch for indexing using the cs-post-sdf
command. The generated SDF batches can be saved to your local file system 
or to an S3 bucket.

The cs-generate-sdf command can generate SDF batches from the following 
content types:

   text/csv
   text/html
   text/plain
   application/json
   application/msword
   application/pdf
   application/vnd.ms-excel
   application/vnd.ms-powerpoint
   application/vnd.openxmlformats-officedocument.presentationml.presentation
   application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
   application/vnd.openxmlformats-officedocument.wordprocessingml.document
   application/xhtml+xml
   application/xml

Generally, a single add document request is added to the SDF batch for each
source file. Where possible, the contents of the source file are parsed
into one or more index fields. If metadata is available for the file, 
an index field is added for each piece of metadata.

CSV source files are automatically parsed to generate a separate document
for each row in the file. The contents of the first row are used to define
the document fields. If you are processing multiple files, CSV files are
parsed row by row, and non-CSV files are treated as individual documents.
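
The row-by-row CSV handling described above can be sketched in Python.
This illustrates the documented behavior and is not the tool's
implementation. The function name csv_to_documents, the plain
concatenation of prefix and ID, row numbering that starts at 1, and the
removal of the docid column from the fields are all assumptions. The
document ID rule follows the --docid-prefix description under ADVANCED
SDF OPTIONS. Any normalization the tool applies to IDs is not modeled.

```python
import csv
import io


def csv_to_documents(csv_text, filename, docid_prefix=None):
    """Sketch: build one document per CSV row (hypothetical helper)."""
    # Documented rule: the filename is the prefix when none is given.
    prefix = docid_prefix if docid_prefix is not None else filename
    # DictReader takes the field names from the first row.
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_number, row in enumerate(reader, start=1):
        fields = dict(row)
        # Documented rule: use the docid column if present, otherwise
        # the row number. Dropping docid from the fields is an assumption.
        raw_id = fields.pop("docid", None) or str(row_number)
        # Assumption: prefix and ID are joined without a separator.
        documents.append({"id": f"{prefix}{raw_id}", "fields": fields})
    return documents
```

Calling csv_to_documents with a two-row CSV that has no docid column
yields two documents whose IDs end in 1 and 2.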

You can specify the --single-doc-per-csv option to override the default
behavior and treat each CSV file as a single document. Specifying 
the --single-doc-per-csv option has no effect on non-CSV files. 

Note: Currently, only CSV files are parsed to automatically extract 
custom field data and generate multiple documents. When processing 
XML and JSON files, each file is treated as a separate document and 
the contents of the file are used to populate a single text field.

COMMON OPTIONS

-a, --access-key STRING         Your AWS access key. Used in conjunction
                                with --secret-key. Required if you
                                do not use an AWS credential file.                 
 
-c, --aws-credential-file FILE  The path to the file that contains your AWS
                                credentials. Required if you have not
                                set the AWS_CREDENTIAL_FILE environment
                                variable or explicitly set your credentials
                                with --access-key and --secret-key.
 
-d, --domain-name STRING        The name of the domain that you are updating.
                                Optional.
 
-e, --endpoint URL              The endpoint for the Amazon CloudSearch
                                Configuration Service. Defaults to the 
                                CS_ENDPOINT environment variable or
                                cloudsearch.us-east-1.amazonaws.com
                                if the environment variable is not set. 
                                Optional.  
 
-h, --help                      Display this help message. Optional.
  
-k, --secret-key STRING         Your AWS secret key. Used in conjunction with
                                --access-key. Required if you do not
                                use an AWS credential file.
 
-ve, --verbose                  Display verbose log messages. Optional.
 
-v, --version                   Display the version number of the command
                                line tools. Optional.
 
BASIC SDF OPTIONS
 
-o, --output PATH|S3_URI        The local directory or S3 bucket where you
                                want to save the generated SDF batches.
                                Required unless you specify the
                                --domain-name option to upload the
                                generated SDF batches to a search domain.

-s, --source PATH|S3_URI        The local directory, file, or S3 bucket
                                that contains the data that you want
                                to create SDF batches from. You can
                                process data from multiple locations 
                                by specifying multiple --source options.
                                Accepts Apache Ant-style wildcards, such
                                as */**, for files and S3 prefixes.
                                Required.
 
ADVANCED SDF OPTIONS

-bd, --batch-docs NUM           The maximum number of documents in a batch.
                                Optional.

-bs, --batch-size MB            The maximum batch size in MB. Defaults to 5MB.
                                Optional.
                                
-sdpc, --single-doc-per-csv     Treat the CSV file as a single document.
                                If this option is specified, the contents of
                                the CSV file will be treated as a single text
                                field. This option has no effect on non-CSV
                                files. Optional.
 
-dp, --docid-prefix STRING      The prefix to prepend to the document ID 
                                while processing CSV data. If not specified,
                                the filename is used as the prefix.
                                The docid column is used as the document ID
                                if it is included in the CSV data; otherwise,
                                the row number is used as the document ID.
                                Optional.

-dv, --doc-version NUM          The version number to use for all of the
                                generated SDF documents. Defaults to 1.
                                Optional.

-ec, --exclude-content          Do not include the content of the source
                                files in the generated SDF documents;
                                process only the metadata. Optional.

-em, --exclude-metadata         Do not include the metadata of the source
                                files in the generated SDF documents;
                                process only the content. Optional.

-format, --sdf-format json|xml  The format of the generated SDF documents:
                                json or xml. Defaults to json. Optional.

-m, --modified-after TIMESTAMP  Only process files or S3 objects modified
                                after the specified date and time. Specified
                                as yyyy-mm-ddTnn:nn. Optional.
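
The effect of the --batch-docs and --batch-size limits can be sketched in
Python. This is an illustration under stated assumptions, not the tool's
algorithm: the function name split_into_batches is hypothetical, 1 MB is
taken to be 1024 * 1024 bytes, size is measured on the JSON serialization
of each document, and a document larger than the limit is placed in a
batch of its own.

```python
import json


def split_into_batches(documents, max_docs=None, max_bytes=5 * 1024 * 1024):
    """Sketch: group documents so each batch respects both limits."""
    batches, current = [], []
    current_size = 2  # the enclosing "[" and "]" of a JSON array
    for doc in documents:
        # +1 approximates the "," separator between array items.
        doc_size = len(json.dumps(doc).encode("utf-8")) + 1
        too_many = max_docs is not None and len(current) >= max_docs
        too_big = current_size + doc_size > max_bytes
        # Start a new batch only if the current one already holds a document.
        if current and (too_many or too_big):
            batches.append(current)
            current, current_size = [], 2
        current.append(doc)
        current_size += doc_size
    if current:
        batches.append(current)
    return batches
```

With max_docs=2, five small documents are grouped into batches of two,
two, and one.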

EXAMPLES

Generate an SDF batch from a plain text file:

  cs-generate-sdf --source c:\myAmazingDataSet\data1.txt
                  --output c:\myAmazingDataSet\SDF\batch
                  COMMON_OPTIONS

Generate a single document for each CSV file:

  cs-generate-sdf --source c:\myAmazingDataSet\*.csv -sdpc
                  --output c:\myAmazingDataSet\SDF\batch 
                  COMMON_OPTIONS
          
Generate an SDF batch from multiple documents:

  cs-generate-sdf --source c:\myAmazingDataSet\data1.xml
                  --source c:\myAmazingDataSet\data2.xml
                  --source c:\myAmazingDataSet\data3.xml
                  --output c:\myAmazingDataSet\SDF\batch 
                  COMMON_OPTIONS

Generate SDF batches from all HTML documents in a directory:

  cs-generate-sdf --source c:\myAmazingDataSet\*.html
                  --output c:\myAmazingDataSet\SDF\batch 
                  COMMON_OPTIONS
                
Generate SDF batches from all Word or PDF documents in a directory:

  cs-generate-sdf --source c:\myAmazingDataSet\*.doc
                  --source c:\myAmazingDataSet\*.pdf
                  --output c:\myAmazingDataSet\SDF\batch
                  COMMON_OPTIONS
                
Generate SDF batches from all recognized file types:

  cs-generate-sdf --source c:\myAmazingDataSet\*
                  --output c:\myAmazingDataSet\SDF\batch
                  COMMON_OPTIONS
                
Generate SDF batches and upload them to your domain:
                
  cs-generate-sdf -d mydomain --source c:\myAmazingDataSet\*
                  COMMON_OPTIONS
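
When the command is driven from a script, the argument list can be
assembled programmatically. The following Python sketch uses only options
documented on this page. The helper name build_generate_sdf_args is
hypothetical, and the check that either --output or --domain-name is
present reflects the --output description above. Credential options
(COMMON_OPTIONS) are not added here and must be appended by the caller.

```python
def build_generate_sdf_args(sources, output=None, domain_name=None,
                            single_doc_per_csv=False, sdf_format=None):
    """Sketch: build an argument list for cs-generate-sdf (not executed)."""
    if not sources:
        raise ValueError("at least one --source is required")
    if output is None and domain_name is None:
        raise ValueError("specify --output or --domain-name")
    args = ["cs-generate-sdf"]
    if domain_name is not None:
        args += ["--domain-name", domain_name]
    # Multiple locations are passed as repeated --source options.
    for source in sources:
        args += ["--source", source]
    if output is not None:
        args += ["--output", output]
    if single_doc_per_csv:
        args.append("--single-doc-per-csv")
    if sdf_format is not None:
        if sdf_format not in ("json", "xml"):
            raise ValueError("--sdf-format must be json or xml")
        args += ["--sdf-format", sdf_format]
    return args
```

The returned list can be passed to subprocess.run once the credential
options have been appended.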

SEE ALSO

cs-post-sdf