Collations supported by Babelfish - Amazon Aurora

Collations supported by Babelfish

When you create an Aurora PostgreSQL DB cluster with Babelfish, you choose a collation for your data. A collation specifies the sort order and bit patterns that produce the text or characters in a given written human language. A collation includes rules comparing data for a given set of bit patterns. Collation is related to localization. Different locales affect character mapping, sort order, and the like. Collation attributes are reflected in the names of various collations. For information about attributes, see the Babelfish collation attributes table.

Babelfish maps SQL Server collations to comparable collations provided by Babelfish. Babelfish predefines Unicode collations with culturally sensitive string comparisons and sort orders. Babelfish also provides a way to translate the collations in your SQL Server DB to the closest-matching Babelfish collation. Locale-specific collations are provided for different languages and regions.

Some collations specify a code page that corresponds to a client-side encoding. Babelfish automatically translates from the server encoding to the client encoding depending on the collation of each output column.

Babelfish supports the collations listed in the Babelfish supported collations table. Babelfish maps SQL Server collations to comparable collations provided by Babelfish.

Babelfish uses version 153.80 of the International Components for Unicode (ICU) collation library. For more information about ICU collations, see Collation in the ICU documentation. To learn more about PostgreSQL and collation, see Collation Support in the the PostgreSQL documentation.

DB cluster parameters that control collation and locale

The following parameters affect collation behavior.

babelfishpg_tsql.default_locale

This parameter specifies the default locale used by the collation. This parameter is used in combination with attributes listed in the Babelfish collation attributes table to customize collations for a specific language and region. The default value for this parameter is en-US.

The default locale applies to all Babelfish collation names that start with "BBF" and to all SQL Server collations that are mapped to Babelfish collations. Changing the setting for this parameter on an existing Babelfish DB cluster doesn't affect the locale of existing collations. For the list of collations, see the Babelfish supported collations table.

babelfishpg_tsql.server_collation_name

This parameter specifies the default collation for the server (Aurora PostgreSQL DB cluster instance) and the database. The default value is sql_latin1_general_cp1_ci_as. The server_collation_name has to be a CI_AS collation because in T-SQL, the server collation determines how identifiers are compared.

When you create your Babelfish DB cluster, you choose the Collation name from the selectable list. These include the collations listed in the Babelfish supported collations table. Don't modify the server_collation_name after the Babelfish database is created.

The settings you choose when you create your Babelfish for Aurora PostgreSQL DB cluster are stored in the DB cluster parameter group associated with the cluster for these parameters and set its collation behavior.

Deterministic and nondeterministic collations and Babelfish

Babelfish supports deterministic and nondeterministic collations:

  • A deterministic collation evaluates characters that have identical byte sequences as equal. That means that x and X aren't equal in a deterministic collation. Deterministic collations can be case-sensitive (CS) and accent-sensitive (AS).

  • A nondeterministic collation doesn't need an identical match. A nondeterministic collation evaluates x and X as equal. Nondeterministic collations are case-insensitive (CI) and accent-insensitive (AI).

In the table following, you can find some behavior differences between Babelfish and PostgreSQL when using nondeterministic collations.

Babelfish PostgreSQL

Supports the LIKE clause for CI_AS collations.

Doesn't support the LIKE clause on nondeterministic collations.

Doesn't support the LIKE clause on AI collations.

Don't support pattern-matching operations on nondeterministic collations.

For a list of other limitations and behavior differences for Babelfish compared to SQL Server and PostgreSQL, see Collation limitations and behavior differences.

Babelfish and SQL Server follow a naming convention for collations that describe the collation attributes, as shown in the table following.

Attribute Description

AI

Accent-insensitive.

AS

Accent-sensitive.

BIN2

BIN2 requests data to be sorted in code point order. Unicode code point order is the same character order for UTF-8, UTF-16, and UCS-2 encodings. Code point order is a fast deterministic collation.

CI

Case-insensitive.

CS

Case-sensitive.

PREF

To sort uppercase letters before lowercase letters, use a PREF collation. If comparison is case-insensitive, the uppercase version of a letter sorts before the lowercase version, if there is no other distinction. The ICU library supports uppercase preference with colCaseFirst=upper, but not for CI_AS collations.

PREF can be applied only to CS_AS deterministic collations.

Collations supported by Babelfish

Use the following collations as a server collation or an object collation.

Collation ID Notes

bbf_unicode_general_ci_as

Supports case-insensitive comparison and the LIKE operator.

bbf_unicode_cp1_ci_as

Nondeterministic collation also known as CP1252.

bbf_unicode_CP1250_ci_as

Nondeterministic collation used to represent texts in Central European and Eastern European languages that use Latin script.

bbf_unicode_CP1251_ci_as

Nondeterministic collation for languages that use the Cyrillic script.

bbf_unicode_cp1253_ci_as

Nondeterministic collation used to represent modern Greek.

bbf_unicode_cp1254_ci_as

Nondeterministic collation that supports Turkish.

bbf_unicode_cp1255_ci_as

Nondeterministic collation that supports Hebrew.

bbf_unicode_cp1256_ci_as

Nondeterministic collation used to write languages that use Arabic script.

bbf_unicode_cp1257_ci_as

Nondeterministic collation used to support Estonian, Latvian, and Lithuanian languages.

bbf_unicode_cp1258_ci_as

Nondeterministic collation used to write Vietnamese characters.

bbf_unicode_cp874_ci_as

Nondeterministic collation used to write Thai characters.

sql_latin1_general_cp1250_ci_as

Nondeterministic single-byte character encoding used to represent Latin characters.

sql_latin1_general_cp1251_ci_as

Nondeterministic collation that supports Latin characters.

sql_latin1_general_cp1_ci_as

Nondeterministic collation that supports Latin characters.

sql_latin1_general_cp1253_ci_as

Nondeterministic collation that supports Latin characters.

sql_latin1_general_cp1254_ci_as

Nondeterministic collation that supports Latin characters.

sql_latin1_general_cp1255_ci_as

Nondeterministic collation that supports Latin characters.

sql_latin1_general_cp1256_ci_as

Nondeterministic collation that supports Latin characters.

sql_latin1_general_cp1257_ci_as

Nondeterministic collation that supports Latin characters.

sql_latin1_general_cp1258_ci_as

Nondeterministic collation that supports Latin characters.

chinese_prc_ci_as

Nondeterministic collation that supports Chinese (PRC).

cyrillic_general_ci_as

Nondeterministic collation that supports Cyrillic.

finnish_swedish_ci_as

Nondeterministic collation that supports Finnish.

french_ci_as

Nondeterministic collation that supports French.

japanese_ci_as

Nondeterministic collation that supports Japanese. Supported in Babelfish 2.1.0 and higher releases.

korean_wansung_ci_as

Nondeterministic collation that supports Korean (with dictionary sort).

latin1_general_ci_as

Nondeterministic collation that supports Latin characters.

modern_spanish_ci_as

Nondeterministic collation that supports Modern Spanish.

polish_ci_as

Nondeterministic collation that supports Polish.

thai_ci_as

Nondeterministic collation that supports Thai.

traditional_spanish_ci_as

Nondeterministic collation that supports Spanish (traditional sort).

turkish_ci_as

Nondeterministic collation that supports Turkish.

ukrainian_ci_as

Nondeterministic collation that supports Ukrainian.

vietnamese_ci_as

Nondeterministic collation that supports Vietnamese.

You can use the following collations as object collations.

Dialect Deterministic options Nondeterministic options

Arabic

Arabic_CS_AS

Arabic_CI_AS, Arabic_CI_AI

Chinese

Chinese_CS_AS

Chinese_CI_AS, Chinese_CI_AI

Cyrillic_General

Cyrillic_General_CS_AS

Cyrillic_General_CI_AS, Cyrillic_General_CI_AI

Estonian

Estonian_CS_AS

Estonian_CI_AS, Estonian_CI_AI

Finnish_Swedish

Finnish_Swedish_CS_AS

Finnish_Swedish_CI_AS, Finnish_Swedish_CI_AI

French

French_CS_AS

French_CI_AS, French_CI_AI

Greek

Greek_CS_AS

Greek_CI_AS, Greek_CI_AI

Hebrew

Hebrew_CS_AS

Hebrew_CI_AS, Hebrew_CI_AI

Japanese (Babelfish 2.1.0 and higher)

Japanese_CS_AS Japanese_CI_AI, Japanese_CI_AS

Korean_Wamsung

Korean_Wamsung_CS_AS

Korean_Wamsung_CI_AS, Korean_Wamsung_CI_AI

Modern_Spanish

Modern_Spanish_CS_AS

Modern_Spanish_CI_AS, Modern_Spanish_CI_AI

Mongolian

Mongolian_CS_AS

Mongolian_CI_AS, Mongolian_CI_AI

Polish

Polish_CS_AS

Polish_CI_AS, Polish_CI_AI

Thai

Thai_CS_AS

Thai_CI_AS, Thai_CI_AI

Traditional_Spanish

Traditional_Spanish_CS_AS

Traditional_Spanish_CI_AS, Traditional_Spanish_CI_AI

Turkish

Turkish_CS_AS

Turkish_CI_AS, Turkish_CI_AI

Ukranian

Ukranian_CS_AS

Ukranian_CI_AS, Ukranian_CI_AI

Vietnamese

Vietnamese_CS_AS

Vietnamese_CI_AS, Vietnamese_CI_AI

Default Collation in Babelfish

Earlier, the default collation of the collatable datatypes was pg_catalog.default. The datatypes and the objects that depends on these datatypes follows cases-sensitive collation. This condition potentially impacts the T-SQL objects of the data set with case-insensitive collation. Starting with Babelfish 2.3.0, the default collation for the collatable data types (except TEXT and NTEXT) is the same as the collation in the babelfishpg_tsql.server_collation_name parameter. When you upgrade to Babelfish 2.3.0, the default collation is picked automatically at the time of DB cluster creation, which doesn't create any visible impact.