Configuring Text Analysis Schemes for Amazon CloudSearch

Amazon CloudSearch enables you to configure a language-specific analysis scheme for each text and text-array field. An analysis scheme controls how the contents of the field are processed during indexing. Although the defaults for each language work well in many cases, fine-tuning the analysis options enables you to optimize the search results based on your knowledge of the data you are searching. For a list of supported languages, see Supported Languages.

An analysis scheme specifies the language of the text to be processed and the following analysis options:

  • Algorithmic stemming—specifies the level of algorithmic stemming to perform. The available stemming levels vary depending on the language.

  • Japanese Tokenization Dictionary—specifies overrides of the algorithmic tokenization when processing Japanese. The dictionary specifies how particular sets of characters should be grouped into words.

  • Stemming dictionary—specifies overrides for the results of the algorithmic stemming. The dictionary maps specific related words to a common root word or stem.

  • Stopwords—specifies words that should be ignored during indexing and searching.

  • Synonyms—specifies words that have the same meaning as words that occur in your data and should produce the same search results.

During text processing, field values and search terms are converted to lowercase (case-folded), so stopwords, stems, and synonyms are not case-sensitive. For more information about how Amazon CloudSearch processes text during indexing and when handling search requests, see Text Processing in Amazon CloudSearch.

You must specify a language for each analysis scheme and configure an analysis scheme for each text and text-array field. When you configure fields through the Amazon CloudSearch console, the analysis scheme defaults to the _en_default_ analysis scheme. If you do not specify analysis options for an analysis scheme, Amazon CloudSearch uses the default options for the specified language. For information about the defaults for each language, see Language Specific Settings.

The easiest way to define analysis schemes is through the Analysis Schemes page in the Amazon CloudSearch console. You must apply an analysis scheme to a field for it to take effect. You can apply an analysis scheme to a field from the Indexing Options page. You can also define analysis schemes and configure an analysis scheme for each field through the command line tools and AWS SDKs.

When you apply a new analysis scheme to an index field or modify an analysis scheme that's in use, you must explicitly rebuild the index for the changes to be reflected in search results.

Stemming in Amazon CloudSearch

Stemming is the process of mapping related words to a common stem. A stem is typically the root or base word from which variants are derived. For example, run is the stem of running and ran. Stemming is performed during indexing as well as at query time. Stemming reduces the number of terms that are included in the index, and facilitates matches when the search term is a variant of a term that occurs in the content being searched. For example, if you map the term running to the stem run and then search for running, the request matches documents that contain run as well as running.

Amazon CloudSearch supports both algorithmic stemming and explicit stemming dictionaries. You configure algorithmic stemming by specifying the level of stemming that you want to use. The available levels of algorithmic stemming vary depending on the language:

  • none—disable algorithmic stemming

  • minimal—perform basic stemming by removing plural suffixes

  • light—target the most common noun/adjective inflections and derived suffixes

  • full—aggressively stem inflections and suffixes

In addition to controlling the degree of algorithmic stemming that's performed, you can specify a stemming dictionary that maps specific related words to a common stem. You specify the dictionary as a JSON object that contains a collection of string:value pairs that map a term to its stem, for example, {"term1": "stem1", "term2": "stem2", "term3": "stem3"}. The stemming dictionary is applied in addition to any algorithmic stemming. This enables you to override the results of the algorithmic stemming to correct specific cases of overstemming or understemming. The maximum size of a stemming dictionary is 500 KB. Stemming dictionary entries must be lowercase.

You use the StemmingDictionary key to define a custom stemming dictionary in an analysis scheme. Because you pass the dictionary to Amazon CloudSearch as a string, you must escape all double quotes within the string. For example, the following analysis scheme defines stems for running and jumping:

{ "AnalysisSchemeName": "myscheme", "AnalysisSchemeLanguage": "en", "AnalysisOptions": { "AlgorithmicStemming": "light", "StemmingDictionary": "{\"running\": \"run\",\"jumping\": \"jump\"}" } }
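The escaping does not have to be done by hand. The following Python sketch is one way to produce the scheme shown above. `build_stemming_scheme` is a hypothetical helper, not part of any AWS SDK. It serializes the dictionary with `json.dumps`, so the inner quotes are escaped automatically when the outer object is serialized, and it checks the lowercase and size constraints. How the 500 KB limit is measured is an assumption here (UTF-8 bytes of the serialized dictionary, with 1 KB taken as 1,024 bytes).

```python
import json

# Assumption: the limit applies to the UTF-8 size of the serialized dictionary.
MAX_DICTIONARY_BYTES = 500 * 1024


def build_stemming_scheme(name, language, level, stems):
    """Return an analysis scheme whose StemmingDictionary is a JSON string."""
    for term, stem in stems.items():
        if term != term.lower() or stem != stem.lower():
            raise ValueError(f"entries must be lowercase: {term!r} -> {stem!r}")
    dictionary = json.dumps(stems)
    if len(dictionary.encode("utf-8")) > MAX_DICTIONARY_BYTES:
        raise ValueError("stemming dictionary exceeds 500 KB")
    return {
        "AnalysisSchemeName": name,
        "AnalysisSchemeLanguage": language,
        "AnalysisOptions": {
            "AlgorithmicStemming": level,
            "StemmingDictionary": dictionary,
        },
    }


scheme = build_stemming_scheme(
    "myscheme", "en", "light", {"running": "run", "jumping": "jump"}
)
# Serializing the outer object escapes the double quotes in the inner string.
print(json.dumps(scheme, indent=2))
```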

If you do not specify the level of algorithmic stemming or a stemming dictionary in your analysis scheme, Amazon CloudSearch uses the default algorithmic stemming level for the specified language. While stemming can help users find relevant documents that might otherwise be excluded from the search results, overstemming can result in too many matches with questionable relevance. The default level of algorithmic stemming configured for each language works well for most use cases. In general, it's best to start with the default and then make adjustments to optimize the relevance of the search results for your use case. For information about the defaults for each language, see Language Specific Settings.

Stopwords in Amazon CloudSearch

Stopwords are words that should typically be ignored both during indexing and at search time because they are either insignificant or so common that including them would result in a massive number of matches.

During indexing, Amazon CloudSearch uses the stopword dictionary when it processes text and text-array fields. In most cases, stopwords are not included in the index. The stopword dictionary is also used to filter search requests.

A stopwords dictionary is a JSON array of terms, for example, ["a", "an", "the", "of"]. The stopwords dictionary must explicitly list each word that you want to ignore. Wildcards and regular expressions are not supported.

You use the Stopwords key to define a custom stopwords dictionary in an analysis scheme. Because you pass the dictionary to Amazon CloudSearch as a string, you must escape all double quotes within the string. For example, the following analysis scheme configures the stopwords a, an, and the:

{ "AnalysisSchemeName": "myscheme", "AnalysisSchemeLanguage": "en", "AnalysisOptions": { "Stopwords": "[\"a\",\"an\",\"the\"]" } }
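To make the effect of a stopwords dictionary concrete, here is a simplified model in Python. It is only an illustration: splitting on whitespace is an assumption made for brevity, and `remove_stopwords` is a hypothetical function, not the tokenizer that Amazon CloudSearch uses. It decodes the Stopwords string from the scheme above, case-folds the input, and drops every token that appears in the dictionary.

```python
import json

# The Stopwords value from the scheme above, as the string that is passed in.
stopwords_option = "[\"a\",\"an\",\"the\"]"
stopwords = set(json.loads(stopwords_option))


def remove_stopwords(text, stopwords):
    """Case-fold, split on whitespace, and drop tokens listed as stopwords."""
    return [token for token in text.lower().split() if token not in stopwords]


# "of" is kept because it is not in this dictionary.
print(remove_stopwords("The Return of the King", stopwords))
# prints ['return', 'of', 'king']
```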

If you do not specify a stopwords dictionary in your analysis scheme, Amazon CloudSearch uses the default stopword dictionary for the specified language. The default stopwords configured for each language work well for most use cases. In general, it's best to start with the default and then make adjustments to optimize the relevance of the search results for your use case. For information about the defaults for each language, see Language Specific Settings.

Synonyms in Amazon CloudSearch

You can configure synonyms for terms that appear in the data that you are searching. That way, if a user searches for the synonym rather than the indexed term, the results will include documents that contain the indexed term. For example, you might define custom synonyms to do the following:

  • Map common misspellings to the correct spelling

  • Define equivalent terms, such as film and movie

  • Map a general term to a more specific one, such as fish and barracuda

  • Map multiple words to a single word or vice versa, such as tool box and toolbox

When you define a synonym, the synonym is added to the index everywhere the base token occurs. For example, if you define fish as a synonym of barracuda, the term fish is added to every document that contains the term barracuda. Adding a large number of synonyms can increase both the size of the index and query latency: synonyms increase the number of matches, and the more matches there are, the longer it takes to process the results.

The synonym dictionary is used during indexing to configure mappings for terms that occur in text fields. No synonym processing is done on search requests. By default, Amazon CloudSearch does not define any synonyms.

You can specify synonyms in two ways:

  • As a conflation group where each term in the group is considered a synonym of every other term in the group.

  • As an alias for a specific term. An alias is considered a synonym of the specified term, but the term is not considered a synonym of the alias.

A synonym dictionary is specified as a JSON object that defines the synonym groups and aliases. The groups value is an array of arrays, where each sub-array is a conflation group. The aliases value is an object that contains a collection of string:value pairs where the string specifies a term and the array of values specifies each of the synonyms for that term. The following example includes both conflation groups and aliases:

{ "groups": [["1st", "first", "one"], ["2nd", "second", "two"]], "aliases": { "youth": ["child", "kid", "boy", "girl"], "adult": ["men", "women"] } }

Both groups and aliases support multiword synonyms. In the following example, multiword synonyms are used in a conflation group as well as an alias:

{ "groups": [["tool box", "toolbox"], ["band saw", "bandsaw"]], "aliases": { "workbench": ["work bench"]} }

You use the Synonyms key to define a custom synonym dictionary in an analysis scheme. Because you pass the dictionary to Amazon CloudSearch as a string, you must escape all double quotes within the string. For example, the following analysis scheme configures aliases for the term youth:

{ "AnalysisSchemeName": "myscheme", "AnalysisSchemeLanguage": "en", "AnalysisOptions": { "Synonyms": "{\"aliases\": {\"youth\": [\"child\",\"kid\"]}}" } }
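Because the synonym dictionary nests arrays inside an object, it is easy to get the shape wrong when writing the escaped string by hand. The Python sketch below checks the structure described above (groups as an array of arrays of strings, aliases as an object that maps each term to an array of strings) before serializing it. `synonyms_option` is a hypothetical helper name, not an AWS API.

```python
import json


def synonyms_option(groups=None, aliases=None):
    """Validate the dictionary shape and return it as a JSON string."""
    groups = groups or []
    aliases = aliases or {}
    for group in groups:
        if not isinstance(group, list) or not all(isinstance(t, str) for t in group):
            raise TypeError("each conflation group must be a list of strings")
    for term, values in aliases.items():
        if not isinstance(values, list) or not all(isinstance(v, str) for v in values):
            raise TypeError(f"aliases for {term!r} must be a list of strings")
    dictionary = {}
    if groups:
        dictionary["groups"] = groups
    if aliases:
        dictionary["aliases"] = aliases
    return json.dumps(dictionary)


options = {"Synonyms": synonyms_option(aliases={"youth": ["child", "kid"]})}
# Serializing the options escapes the quotes inside the Synonyms string.
print(json.dumps(options))
```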

Configuring Analysis Schemes Using the Amazon CloudSearch Console

You can define analysis schemes from the Analysis Schemes pane in the Amazon CloudSearch console.

To define an analysis scheme
  1. Open the Amazon CloudSearch console at https://console.aws.amazon.com/cloudsearch/home.

  2. From the left navigation pane, choose Domains.

  3. Choose the name of your domain to open its configuration.

  4. Go to the Advanced search options tab.

  5. In the Analysis schemes pane, choose Add analysis scheme.

  6. Specify a name for the analysis scheme and select a language.

  7. Choose Next.

  8. In the next three steps, configure the scheme's stopword, stemming, and synonym options. You can configure individual stopwords, stems, and synonyms, or edit the displayed dictionaries directly. The dictionaries are formatted in JSON. Stopwords are specified as an array of strings. Stems are specified as an object that contains one or more key:value pairs. Synonym aliases are also specified as a JSON object with one or more key:value pairs, where the alias values are specified as an array of strings. A synonym group is specified as a JSON array, so the groups value as a whole is an array of arrays.

    If you selected Japanese as the language, you also have the option of specifying a custom tokenization dictionary that overrides the default tokenization of specific phrases. For more information, see Customizing Japanese Tokenization.

  9. On the summary page, review the analysis scheme configuration and choose Save.

Important

To use an analysis scheme, you must apply it to one or more text or text-array fields and rebuild the index. You can configure a field's analysis scheme from the Indexing options tab. To rebuild your index, choose Actions, Run indexing.

Configuring Analysis Schemes Using the AWS CLI

You use the aws cloudsearch define-analysis-scheme command to define language-specific text processing options, including stemming options, stopwords, and synonyms. For information about installing and setting up the AWS CLI, see the AWS Command Line Interface User Guide.

You specify an analysis scheme as part of the configuration of each text or text-array field. For more information, see configure indexing options.

To define an analysis scheme
  • Run the aws cloudsearch define-analysis-scheme command and specify the --analysis-scheme option and a JSON object that contains your analysis options. The analysis scheme must be valid JSON. The analysis option key and value pairs must be enclosed in quotes, and all quotes within the option values must be escaped with a backslash. For the format of the analysis options, see define-analysis-scheme in the AWS CLI Command Reference. See Configuring Analysis Schemes for more information about specifying stemming, stopword, and synonym options.

    If you specify Japanese (ja) as the language, you also have the option of specifying a custom tokenization dictionary that overrides the default tokenization of specific phrases. For more information, see Customizing Japanese Tokenization.

    Tip

    The easiest way to configure an analysis scheme with the AWS CLI is to store the analysis scheme in a text file and specify that file as the --analysis-scheme value. This enables you to format the scheme so that it's easier to read. For example, the following scheme defines an English analysis scheme called myscheme that uses light algorithmic stemming and configures two stopwords:

    {
        "AnalysisSchemeName": "myscheme",
        "AnalysisSchemeLanguage": "en",
        "AnalysisOptions": {
            "AlgorithmicStemming": "light",
            "Stopwords": "[\"a\", \"the\"]"
        }
    }

    If you save this scheme in a text file called myscheme.txt, you can pass the file in as the value of the --analysis-scheme parameter:

    aws cloudsearch define-analysis-scheme --region us-east-1 --domain-name movies --analysis-scheme file://myscheme.txt
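Before passing a scheme file to the CLI, it can help to confirm that both the outer object and each embedded dictionary string are valid JSON. The Python sketch below is one way to do that locally. `check_scheme_text` is a hypothetical helper, and it only checks JSON syntax, not whether Amazon CloudSearch will accept the values.

```python
import json

# AnalysisOptions keys whose values are JSON documents embedded as strings.
EMBEDDED_JSON_KEYS = (
    "Stopwords",
    "StemmingDictionary",
    "Synonyms",
    "JapaneseTokenizationDictionary",
)


def check_scheme_text(text):
    """Parse a scheme and each embedded dictionary; raise ValueError on bad JSON."""
    scheme = json.loads(text)  # json.JSONDecodeError is a subclass of ValueError
    options = scheme.get("AnalysisOptions", {})
    return {key: json.loads(options[key]) for key in EMBEDDED_JSON_KEYS if key in options}


# Raw string so the backslashes reach the JSON parser unchanged.
scheme_text = r'''
{
    "AnalysisSchemeName": "myscheme",
    "AnalysisSchemeLanguage": "en",
    "AnalysisOptions": {
        "AlgorithmicStemming": "light",
        "Stopwords": "[\"a\", \"the\"]"
    }
}
'''
parsed = check_scheme_text(scheme_text)
print(parsed)

# To check a file instead:
#   with open("myscheme.txt", encoding="utf-8") as f:
#       check_scheme_text(f.read())
```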

Important

To use an analysis scheme, you must apply it to one or more text or text-array fields and rebuild the index. You can configure a field's analysis scheme with the aws cloudsearch define-index-field command. To rebuild the index, call aws cloudsearch index-documents.

Configuring Analysis Schemes Using the AWS SDKs

The AWS SDKs (except the Android and iOS SDKs) support all of the Amazon CloudSearch actions defined in the Amazon CloudSearch Configuration API, including DefineAnalysisScheme. For more information about installing and using the AWS SDKs, see AWS Software Development Kits.

Important

To use an analysis scheme, you must apply it to one or more text or text-array fields and rebuild the index. You can configure a field's analysis scheme with the define index field method. To rebuild your index, you use the index documents method.
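As a sketch of that sequence with the AWS SDK for Python (boto3), the function below defines a scheme, assigns it to a text field, and starts a rebuild. `apply_analysis_scheme` is a hypothetical wrapper, and the field name `title` in the usage comment is a placeholder. The three client methods are the boto3 CloudSearch equivalents of DefineAnalysisScheme, DefineIndexField, and IndexDocuments. The client is passed in so that the function can be exercised without AWS credentials.

```python
import json


def apply_analysis_scheme(client, domain_name, scheme, field_name):
    """Define a scheme, attach it to a text field, and rebuild the index."""
    client.define_analysis_scheme(DomainName=domain_name, AnalysisScheme=scheme)
    client.define_index_field(
        DomainName=domain_name,
        IndexField={
            "IndexFieldName": field_name,
            "IndexFieldType": "text",
            "TextOptions": {"AnalysisScheme": scheme["AnalysisSchemeName"]},
        },
    )
    # Changes are not reflected in search results until the index is rebuilt.
    return client.index_documents(DomainName=domain_name)


scheme = {
    "AnalysisSchemeName": "myscheme",
    "AnalysisSchemeLanguage": "en",
    "AnalysisOptions": {
        "AlgorithmicStemming": "light",
        "Stopwords": json.dumps(["a", "the"]),
    },
}

# Usage (requires boto3 and AWS credentials; names are placeholders):
#   import boto3
#   client = boto3.client("cloudsearch", region_name="us-east-1")
#   apply_analysis_scheme(client, "movies", scheme, "title")
```

The call sends a complete field definition, so include any other TextOptions that the field relies on rather than only the analysis scheme.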

Indexing Bigrams for Chinese, Japanese, and Korean in Amazon CloudSearch

Chinese, Japanese, and Korean do not have explicit word boundaries. Simply indexing individual characters (unigrams) can result in matches that aren't very relevant to a search query. One solution is to index bigrams. A bigram is a sequence of two adjacent characters in a string, and indexing bigrams means indexing every such sequence. For example, the string 我的氣墊船裝滿了鱔魚 produces the following bigrams:

我的  的氣  氣墊  墊船  船裝  裝滿  滿了  了鱔  鱔魚
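The definition can be expressed in a few lines of Python. This sketch only illustrates what a bigram is and is not how Amazon CloudSearch generates them internally. `bigrams` is a hypothetical function that treats each Unicode code point as one character.

```python
def bigrams(text):
    """Return every pair of adjacent characters in text, in order."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


# A string of n characters yields n - 1 bigrams.
print("  ".join(bigrams("我的氣墊船裝滿了鱔魚")))
```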

While indexing bigrams can improve search result quality, keep in mind that it can significantly increase the size of your index.

To index bigrams for Chinese, Japanese, and Korean
  1. Create a text analysis scheme and set the language to multiple languages (mul).

  2. Configure the index field that contains the CJK data to use your multi-language analysis scheme.

When you assign an analysis scheme that sets a field's language to mul, Amazon CloudSearch automatically generates bigrams for all Chinese, Japanese, and Korean text within the field.

For more information about creating and using analysis schemes, see Configuring Analysis Schemes.

If you are indexing Japanese content, you might also be interested in using a custom tokenization dictionary with the standard Japanese language processor. For more information, see Customizing Japanese Tokenization.

Customizing Japanese Tokenization in Amazon CloudSearch

If you need more control over how Amazon CloudSearch tokenizes Japanese, you can add a custom Japanese tokenization dictionary to your analysis scheme. Configuring a custom tokenization dictionary enables you to override how specific entries are tokenized by the standard Japanese language processor. This can improve search result accuracy in some cases, particularly when you need to index and retrieve domain-specific phrases.

A tokenization dictionary is a collection of entries where each entry specifies a set of characters, how the characters should be tokenized, how each token should be pronounced (readings), and a part-of-speech tag. You specify the dictionary as an array, and each dictionary entry is an array of strings. The entries are of the following form:

["<text>","<token 1> ... <token n>","<reading 1> ... <reading n>","<part-of-speech tag>"]

You must specify a reading for each token and a part-of-speech tag for the entry. For the part-of-speech tags that are treated as stopwords, see Japanese Part-of-Speech Tags.

You use the JapaneseTokenizationDictionary key to define a custom tokenization dictionary in an analysis scheme. Because you pass the tokenization dictionary to Amazon CloudSearch as a string, you must escape all double quotes within the string. For example, the dictionary in the following analysis scheme specifies segmentation overrides for Kanji and Katakana compounds, and a custom reading for a proper name:

{
    "AnalysisSchemeName": "jascheme",
    "AnalysisSchemeLanguage": "ja",
    "AnalysisOptions": {
        "Stopwords": "[\"a\", \"the\"]",
        "AlgorithmicStemming": "full",
        "JapaneseTokenizationDictionary": "[ [\"日本経済新聞\",\"日本 経済 新聞\",\"ニホン ケイザイ シンブン\",\"カスタム名詞\"],[\"トートバッグ\",\"トート バッグ\",\"トート バッグ\",\"かずカナ名詞\"],[\"朝青龍\",\"朝青龍\",\"アサショウリュウ\",\"カスタム人名\"] ]"
    }
}
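Since every token needs a reading, a small helper can catch mismatched entries before the dictionary is serialized. The Python sketch below builds two of the entries from the scheme above. `tokenization_entry` is a hypothetical helper, and joining tokens and readings with single spaces is an assumption based on the entry form shown earlier.

```python
import json


def tokenization_entry(text, tokens, readings, pos_tag):
    """Build one entry: [text, "tok1 tok2", "reading1 reading2", tag]."""
    if len(tokens) != len(readings):
        raise ValueError("each token needs exactly one reading")
    return [text, " ".join(tokens), " ".join(readings), pos_tag]


entries = [
    tokenization_entry(
        "日本経済新聞",
        ["日本", "経済", "新聞"],
        ["ニホン", "ケイザイ", "シンブン"],
        "カスタム名詞",
    ),
    tokenization_entry("朝青龍", ["朝青龍"], ["アサショウリュウ"], "カスタム人名"),
]

# ensure_ascii=False keeps the Japanese text readable instead of \uXXXX escapes.
dictionary = json.dumps(entries, ensure_ascii=False)
options = {"JapaneseTokenizationDictionary": dictionary}
print(json.dumps(options, ensure_ascii=False))
```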

When configuring an analysis scheme with the AWS CLI, you can store the analysis scheme in a text file and specify that file as the --analysis-scheme value. This enables you to format the scheme so that it's easier to read. For example, if you store the jascheme analysis scheme in a file called jascheme.txt, you can pass that file in when you call aws cloudsearch define-analysis-scheme:

aws cloudsearch define-analysis-scheme --region us-east-1 --domain-name mydomain --analysis-scheme file://jascheme.txt

For more information about creating and using analysis schemes, see Configuring Analysis Schemes.

Japanese Part-of-Speech Tags in Amazon CloudSearch

When you use a custom tokenization dictionary for Japanese, you specify a part-of-speech tag for each entry. If the part-of-speech tag matches one of the tags configured as a stop tag, the entry is treated as a stopword.

The following table shows the part-of-speech tags that are configured as stop tags in Amazon CloudSearch.

Stop Tags
Tag | Part-of-Speech | Description
助動詞 | Auxiliary-verb | A verb that adds functional or grammatical meaning to the clause in which it appears.
接続詞 | Conjunction | Conjunctions that can occur independently.
フィラー | Filler | Aizuchi that occurs during a conversation or sounds inserted as filler.
非言語音 | Non-verbal | Non-verbal sound.
その他-間投 | Other-interjection | Words that are hard to classify as noun-suffixes or sentence-final particles.
助詞-副詞化 | Particle-adverbializer | The "ni" and "to" that appear following nouns and adverbs.
助詞-連体化 | Particle-adnominalizer | The "no" that attaches to nouns and modifies non-inflectional words.
助詞-副助詞 | Particle-adverbial | An adverb used to show position, direction of movement, and so on.
助詞-副助詞/並立助詞/終助詞 | Particle-adverbial/conjunctive/final | The particle "ka" when unknown whether it is adverbial, conjunctive, or sentence final.
助詞-格助詞-連語 | Particle-case-compound | Compounds of particles and verbs that mainly behave like case particles.
助詞-格助詞-一般 | Particle-case-misc | Case particles.
助詞-格助詞-引用 | Particle-case-quote | The "to" that appears after nouns, a person’s speech, quotation marks, expressions of decisions from a meeting, reasons, judgements, conjectures, and so on.
助詞-格助詞 | Particle-case | Case particles where the subclassification is undefined.
助詞-接続助詞 | Particle-conjunctive | Conjunctive particles.
助詞-並立助詞 | Particle-coordinate | Coordinate particles.
助詞-係助詞 | Particle-dependency | Dependency particles.
助詞-終助詞 | Particle-final | Final particles.
助詞-間投助詞 | Particle-interjective | Particles with interjective grammatical roles.
助詞-特殊 | Particle-special | A particle that does not fit into any of the other classifications. This includes particles that are used in Tanka, Haiku, and other poetry.
助詞 | Particle | Unclassified particles.
記号-括弧閉 | Symbol-close_bracket | Close bracket: ].
記号-読点 | Symbol-comma | Comma: ,.
記号-一般 | Symbol-misc | A general symbol not in one of the other categories.
記号-括弧開 | Symbol-open_bracket | Open bracket: [.
記号-句点 | Symbol-period | Periods and full stops.
記号-空白 | Symbol-space | Full-width whitespace.
記号 | Symbol | Unclassified symbols.