Amazon CloudSearch
Developer Guide (API Version 2013-01-01)

Text Processing in Amazon CloudSearch

During indexing, Amazon CloudSearch processes text and text-array fields according to the analysis scheme configured for the field to determine what terms to add to the index. Before the analysis options are applied, the text is tokenized and normalized.

During tokenization, the stream of text in a field is split into separate tokens on detectable boundaries using the word break rules defined in the Unicode Text Segmentation algorithm. For more information, see Unicode Text Segmentation.

According to the word break rules, strings separated by whitespace such as spaces and tabs are treated as separate tokens. In many cases, punctuation is dropped and treated as whitespace. For example, strings are split at hyphens (-) and the at symbol (@). However, periods that are not followed by whitespace are considered part of the token.

Note that strings are not split on case boundaries, so a CamelCase string is kept as a single token.
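The splitting behavior described above can be illustrated with a rough sketch. This is not the Unicode Text Segmentation algorithm and not CloudSearch's tokenizer. It is a simplified regular expression that only mimics the cases mentioned here: whitespace, hyphens, and the at symbol separate tokens, while a period stays inside a token when it is directly followed by more word characters. The `tokenize` function name is hypothetical.

```python
import re

# Rough approximation only: a run of letters, digits, or underscores, optionally
# continued by a period that is immediately followed by more such characters.
# The real word break rules in Unicode Text Segmentation cover many more cases.
_TOKEN = re.compile(r"\w+(?:\.\w+)*")


def tokenize(text: str) -> list[str]:
    """Split text into tokens, dropping whitespace and most punctuation."""
    return _TOKEN.findall(text)


print(tokenize("state-of-the-art"))       # ['state', 'of', 'the', 'art']
print(tokenize("jdoe@example.com"))       # ['jdoe', 'example.com']
print(tokenize("End of sentence. Next"))  # ['End', 'of', 'sentence', 'Next']
print(tokenize("CamelCase"))              # ['CamelCase']
```

Because the sketch never looks at letter case, CamelCase comes back as one token, which matches the behavior described above.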

During normalization, upper case characters are converted to lower case. Accents are typically handled according to the stemming options configured in the field's analysis scheme. (The default analysis scheme for English removes accents.)
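As an illustration of what normalization does to a single token, the following sketch lowercases the token and optionally removes accents by decomposing characters and dropping combining marks. How CloudSearch handles accents internally depends on the analysis scheme, so the `normalize` function and its `strip_accents` flag are assumptions made for this example only.

```python
import unicodedata


def normalize(token: str, strip_accents: bool = True) -> str:
    """Lowercase a token and, if requested, remove accents.

    Accent removal here is an illustrative technique (NFD decomposition plus
    dropping combining marks), not CloudSearch's actual implementation.
    """
    lowered = token.lower()
    if not strip_accents:
        return lowered
    decomposed = unicodedata.normalize("NFD", lowered)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(normalize("Café"))                       # cafe
print(normalize("Café", strip_accents=False))  # café
print(normalize("CamelCase"))                  # camelcase
```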

Once tokenization and normalization are complete, the stemming options, stopwords, and synonyms specified in the analysis scheme are applied.

When you submit a search request, the text you're searching for undergoes the same text processing so that it can be matched against the terms that appear in the index. However, no text analysis is performed on the search term when you perform a prefix search. This means that a search for a prefix that ends in s typically won't match the singular version of the term when stemming is enabled. This can happen for any term that ends in s, not just plurals. For example, if you search the actor field in the sample movie data for Anders, there are three matching movies. If you search for Ander*, you get those movies as well as several others. However, if you search for Anders*, there are no matches. This is because the term is stored in the index as ander; the unstemmed form anders does not appear in the index.
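The mismatch between stemmed index terms and unanalyzed prefixes can be reproduced with a toy in-memory index. Everything in this sketch is hypothetical: the two documents are invented (they are not the sample movie data), `toy_stem` is a stand-in that only strips a trailing s, and lowercasing the prefix is an assumption made so that the queries mirror the ones above.

```python
def toy_stem(term: str) -> str:
    """Hypothetical stand-in for a stemmer: strips one trailing 's'."""
    return term[:-1] if len(term) > 1 and term.endswith("s") else term


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Index each document under the stemmed, lowercased form of its words."""
    index: dict[str, set[str]] = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index.setdefault(toy_stem(word), set()).add(doc_id)
    return index


def term_search(index: dict[str, set[str]], query: str) -> set[str]:
    """Regular search: the query term is analyzed (stemmed) like indexed text."""
    return index.get(toy_stem(query.lower()), set())


def prefix_search(index: dict[str, set[str]], prefix: str) -> set[str]:
    """Prefix search: the prefix is compared to index terms without stemming."""
    prefix = prefix.lower()  # assumption of this sketch
    hits: set[str] = set()
    for term, doc_ids in index.items():
        if term.startswith(prefix):
            hits |= doc_ids
    return hits


# Invented documents for illustration only.
index = build_index({"m1": "Anders Holm", "m2": "Mark Anderton"})

print(sorted(term_search(index, "Anders")))    # ['m1']
print(sorted(prefix_search(index, "Ander")))   # ['m1', 'm2']
print(sorted(prefix_search(index, "Anders")))  # []
```

The last query returns nothing because the index holds ander and anderton, and neither starts with anders.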

If stemming is preventing your wildcard searches from returning all of the relevant matches, you can suppress stemming for the text field by setting the AlgorithmicStemming option to none, or you can map the data to a literal field instead of a text field.
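One way to suppress stemming is to define an analysis scheme with AlgorithmicStemming set to none and assign it to the text field. The sketch below uses the AWS SDK for Python (boto3). The scheme name, domain name, and field name are hypothetical placeholders, and `apply_scheme` assumes that AWS credentials and a region are already configured. Treat it as a starting point rather than a complete deployment script.

```python
def stemming_disabled_scheme(name: str, language: str = "en") -> dict:
    """Build an analysis scheme definition with algorithmic stemming turned off."""
    return {
        "AnalysisSchemeName": name,
        "AnalysisSchemeLanguage": language,
        "AnalysisOptions": {"AlgorithmicStemming": "none"},
    }


def apply_scheme(domain_name: str, field_name: str, scheme: dict) -> None:
    """Create the scheme, attach it to a text field, and request reindexing."""
    import boto3  # imported here so the builder above works without the SDK

    client = boto3.client("cloudsearch")
    client.define_analysis_scheme(DomainName=domain_name, AnalysisScheme=scheme)
    client.define_index_field(
        DomainName=domain_name,
        IndexField={
            "IndexFieldName": field_name,
            "IndexFieldType": "text",
            "TextOptions": {"AnalysisScheme": scheme["AnalysisSchemeName"]},
        },
    )
    # Configuration changes typically take effect only after reindexing.
    client.index_documents(DomainName=domain_name)


scheme = stemming_disabled_scheme("no_stemming_en")  # hypothetical scheme name
print(scheme["AnalysisOptions"])  # {'AlgorithmicStemming': 'none'}

# Example call with hypothetical domain and field names:
# apply_scheme("movies", "actors", scheme)
```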

Language Specific Text Processing Settings in Amazon CloudSearch

Arabic (ar)

Algorithmic stemming options: light

Default analysis scheme: _ar_default_

  • Algorithmic stemming: light

  • Default stopword dictionary

Armenian (hy)

Algorithmic stemming options: full

Default analysis scheme: _hy_default_

  • Algorithmic stemming: full

  • Default stopword dictionary

Basque (eu)

Algorithmic stemming options: full

Default analysis scheme: _eu_default_

  • Algorithmic stemming: full

  • Default stopword dictionary

Bulgarian (bg)

Algorithmic stemming options: light

Default analysis scheme: _bg_default_

  • Algorithmic stemming: light

  • Default stopword dictionary

Catalan (ca)

Algorithmic stemming options: full

Elision filter enabled

Default analysis scheme: _ca_default_

  • Algorithmic stemming: full

  • Default stopword dictionary

Chinese - Simplified (zh-Hans)

Algorithmic stemming not supported

Stemming dictionary not supported

Default analysis scheme: _zh-Hans_default_

Chinese - Traditional (zh-Hant)

Algorithmic stemming not supported

Stemming dictionary not supported

Default analysis scheme: _zh-Hant_default_

Czech (cs)

Algorithmic stemming options: light

Default analysis scheme: _cs_default_

  • Algorithmic stemming: light

  • Default stopword dictionary

Danish (da)

Algorithmic stemming options: full

Default analysis scheme: _da_default_

  • Algorithmic stemming: full

  • Default stopword dictionary

Dutch (nl)

Algorithmic stemming options: full

Default analysis scheme: _nl_default_

  • Algorithmic stemming: full

  • Default stopword dictionary

  • Default stemming dictionary

English (en)

Algorithmic stemming options: minimal|light|full

Default analysis scheme: _en_default_

  • Algorithmic stemming: full

  • Default stopword dictionary

Finnish (fi)

Algorithmic stemming options: light|full

Default analysis scheme: _fi_default_

  • Algorithmic stemming: light

  • Default stopword dictionary

French (fr)

Algorithmic stemming options: minimal|light|full

Elision filter enabled

Default analysis scheme: _fr_default_

  • Algorithmic stemming: minimal

  • Default stopword dictionary

Galician (gl)

Algorithmic stemming options: minimal|full

Default analysis scheme: _gl_default_

  • Algorithmic stemming: minimal

  • Default stopword dictionary

German (de)

Algorithmic stemming options: minimal|light|full

Default analysis scheme: _de_default_

  • Algorithmic stemming: light

  • Default stopword dictionary

Greek (el)

Algorithmic stemming options: full

Default analysis scheme: _el_default_

  • Algorithmic stemming: full

  • Default stopword dictionary

Hebrew (he)

Algorithmic stemming options: full

Default analysis scheme: _he_default_

  • Algorithmic stemming: full

  • Default stopword dictionary

Hindi (hi)

Algorithmic stemming options: full

Default analysis scheme: _hi_default_

  • Algorithmic stemming: full

  • Default stopword dictionary

Hungarian (hu)

Algorithmic stemming options: light|full

Default analysis scheme: _hu_default_

  • Algorithmic stemming: light

  • Default stopword dictionary

Indonesian (id)

Algorithmic stemming options: light|full

Default analysis scheme: _id_default_

  • Algorithmic stemming: full

  • Default stopword dictionary

Irish (ga)

Algorithmic stemming options: full

Elision filter enabled

Default analysis scheme: _ga_default_

  • Algorithmic stemming: full

  • Default stopword dictionary

Italian (it)

Algorithmic stemming options: light|full

Elision filter enabled

Default analysis scheme: _it_default_

  • Algorithmic stemming: light

  • Default stopword dictionary

Japanese (ja)

Algorithmic stemming options: full

Algorithmic decompounding enabled

Optional tokenization dictionary

Default analysis scheme: _ja_default_

  • Algorithmic stemming: full

  • Default stopword dictionary

Korean (ko)

Algorithmic stemming not supported

Algorithmic decompounding enabled

Default analysis scheme: _ko_default_

  • Default stopword dictionary

Latvian (lv)

Algorithmic stemming options: light

Default analysis scheme: _lv_default_

  • Algorithmic stemming: light

  • Default stopword dictionary

Multiple (mul)

Algorithmic stemming not supported

Default analysis scheme: _mul_default_

  • Default stopword dictionary

Norwegian (no)

Algorithmic stemming options: minimal|light|full

Default analysis scheme: _no_default_

  • Algorithmic stemming: light

  • Default stopword dictionary

Persian (fa)

Algorithmic stemming not supported

Default analysis scheme: _fa_default_

  • Default stopword dictionary

Portuguese (pt)

Algorithmic stemming options: minimal|light|full

Default analysis scheme: _pt_default_

  • Algorithmic stemming: minimal

  • Default stopword dictionary

Romanian (ro)

Algorithmic stemming options: full

Default analysis scheme: _ro_default_

  • Algorithmic stemming: full

  • Default stopword dictionary

Russian (ru)

Algorithmic stemming options: light|full

Default analysis scheme: _ru_default_

  • Algorithmic stemming: light

  • Default stopword dictionary

Spanish (es)

Algorithmic stemming options: light|full

Default analysis scheme: _es_default_

  • Algorithmic stemming: light

  • Default stopword dictionary

Swedish (sv)

Algorithmic stemming options: light|full

Default analysis scheme: _sv_default_

  • Algorithmic stemming: light

  • Default stopword dictionary

Thai (th)

Algorithmic stemming not supported

Stemming dictionary not supported

Default analysis scheme: _th_default_

  • Default stopword dictionary

Turkish (tr)

Algorithmic stemming options: full

Default analysis scheme: _tr_default_

  • Algorithmic stemming: full

  • Default stopword dictionary
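When generating analysis schemes programmatically, it can help to check a requested stemming level against the settings listed above before sending it to the service. The sketch below encodes only an excerpt of the list. The helper names are hypothetical, and treating none as valid for every language is an assumption based on the AlgorithmicStemming option described earlier.

```python
# Excerpt of the per-language settings listed above; extend it as needed.
STEMMING_OPTIONS: dict[str, set[str]] = {
    "en": {"minimal", "light", "full"},
    "fr": {"minimal", "light", "full"},
    "de": {"minimal", "light", "full"},
    "fi": {"light", "full"},
    "ja": {"full"},
    "ko": set(),       # algorithmic stemming not supported
    "zh-Hans": set(),  # algorithmic stemming not supported
}


def supports_stemming(language: str, level: str) -> bool:
    """Return True if the stemming level is listed for the language."""
    if language not in STEMMING_OPTIONS:
        raise ValueError(f"language {language!r} is not in this excerpt")
    if level == "none":
        return True  # assumption: stemming can always be turned off
    return level in STEMMING_OPTIONS[language]


def default_scheme_name(language: str) -> str:
    """Build the default analysis scheme name, following the pattern above."""
    return f"_{language}_default_"


print(supports_stemming("en", "minimal"))  # True
print(supports_stemming("fi", "minimal"))  # False
print(supports_stemming("ko", "none"))     # True
print(default_scheme_name("he"))           # _he_default_
```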