Collations - SQL Server to Aurora MySQL Migration Playbook

Collations

Feature compatibility AWS SCT / AWS DMS automation level AWS SCT action code index Key differences

Three star feature compatibility

Four star automation level

Collations

UNICODE uses CHARACTER SET property instead of NCHAR or NVARCHAR data types.

SQL Server Usage

SQL Server collations define the rules for string management and storage in terms of sorting, case sensitivity, accent sensitivity, and code page mapping. SQL Server supports both ASCII and UCS-2 UNICODE data.

UCS-2 UNICODE data uses a dedicated set of UNICODE data types denoted by the prefix N: Nchar and Nvarchar. Their ASCII counterparts are CHAR and VARCHAR.

Choosing a collation and a character set has significant implications on data storage, logical predicate evaluations, query results, and query performance.

Note

To view all collations supported by SQL Server, use the fn_helpcollations function as shown following: SELECT * FROM sys.fn_helpcollations().

Collations define the actual bitwise binary representation of all string characters and the associated sorting rules. SQL Server supports multiple collations down to the column level. A table may have multiple string columns that use different collations. Collations for non-UNICODE character sets determine the code page number representing the string characters.

Note

UNICODE and non-UNICODE data types in SQL Server aren’t compatible. A predicate or data modification that introduces a type conflict is resolved using predefined collation precedence rules. For more information, see Collation Precedence in the SQL Server documentation.

Collations define sorting and matching sensitivity for the following string characteristics:

  • Case

  • Accent

  • Kana

  • Width

  • Variation selector

SQL Server uses a suffix naming convention that appends the option name to the collation name. For example, the collation Azeri_Cyrillic_100_CS_AS_KS_WS_SC, is an Azeri-Cyrillic-100 collation that is case-sensitive, accent-sensitive, kana type-sensitive, width-sensitive, and has supplementary characters.

SQL Server supports three types of collation sets: * Windows Collations use the rules defined for collations by the operating system locale where UNICODE and non-UNICODE data use the same comparison algorithms. * Binary Collations use the binary bit-wise code for comparison. Therefore, the locale doesn’t affect sorting. * SQL Server Collations provide backward compatibility with previous SQL Server versions. They aren’t compatible with the windows collation rules for non-UNICODE data.

You can define collations at various levels:

  • Server-level collations determine the collations used for all system databases and is the default for future user databases. While the system databases collation can’t be changed, an alternative collation can be specified as part of the CREATE DATABASE statement

  • Database-level collations inherit the server default unless the CREATE DATABASE statement explicitly sets a different collation. This collation is used as a default for all CREATE TABLE and ALTER TABLE statements.

  • Column-level collations can be specified as part of the CREATE TABLE or ALTER TABLE statements to override the database’s default collation setting.

  • Expression-level collations can be set for individual string expressions using the COLLATE function. For example, SELECT * FROM MyTable ORDER BY StringColumn COLLATE Latin1_General_CS_AS.

Note

SQL Server supports UCS-2 UNICODE only.

SQL Server 2019 adds support for UTF-8 for import and export encoding, and as database-level or column-level collation for string data. Support includes PolyBase external tables, and Always Encrypted (when not used with Enclaves). For more information, see Collation and Unicode Support in the SQL Server documentation.

Syntax

CREATE DATABASE <Database Name>
[ ON <File Specifications> ]
COLLATE <Collation>
[ WITH <Database Option List> ];
CREATE TABLE <Table Name>
(
    <Column Name> <String Data Type>
    COLLATE <Collation> [ <Column Constraints> ]...
);

Examples

The following example creates a database with a default Bengali_100_CS_AI collation.

CREATE DATABASE MyBengaliDatabase
ON
( NAME = MyBengaliDatabase_Datafile,
    FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\MyBengaliDatabase.mdf',
    SIZE = 100)
LOG ON
    ( NAME = MyBengaliDatabase_Logfile,
FILENAME = 'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\MyBengaliDblog.ldf',
    SIZE = 25)
COLLATE Bengali_100_CS_AI;

The following example creates a table with two different collations.

CREATE TABLE MyTable
(
    Col1 CHAR(10) COLLATE Hungarian_100_CI_AI_SC NOT NULL PRIMARY KEY,
    COL2 VARCHAR(100) COLLATE Sami_Sweden_Finland_100_CS_AS_KS NOT NULL
);

For more information, see Collation and Unicode support in the SQL Server documentation.

MySQL Usage

Amazon Aurora MySQL-Compatible Edition (Aurora MySQL) supports multiple character sets and a variety of collations that can be used for comparison. Similar to SQL Server, you can define collations at the server, database, and column level. Additionally, you can define collations at the table level in Aurora MySQL.

The paradigm of collations in Aurora MySQL is different than in SQL Server and consists of separate character set and collation objects. Aurora MySQL supports 41 different character sets and 222 collations. Seven different UNICODE character sets are supported including UCS-2, UTF-8 and UTF-32.

Note

Use UCS-2 which is compatible with SQL Server UNICODE types.

Each character set can have one or more associated collations with a single default collation.

Collation names have prefixes consisting of the name of their associated character set followed by suffixes that indicate additional characteristics.

To see all character sets supported by Aurora MySQL, use the INFORMATION_SCHEMA.CHARACTER_SETS table or the SHOW CHARACTER SET statement.

To see all collations for a character set, use the INFORMATION_SCHEMA.COLLATIONS table or the SHOW COLLATION statement.

Note

Character set and collation settings also affect client-to -server communications. You can set explicit collations for sessions using the SET command. For example, SET NAMES 'utf8'; causes Aurora MySQL to treat incoming object names as UTF-8 encoded.

You can set the default character set and collations at the server level using custom cluster parameter groups. For more information, see Server Options.

At the database level, you can set a default character set and collation with the CREATE DATABASE and ALTER DATABASE statements. Consider the following example:

CREATE DATABASE MyDatabase
CHARACTER SET latin1 COLLATE latin1_swedish_ci;

To view the default character set and collation for an Aurora MySQL databases, use the following statement:

SELECT DEFAULT_CHARACTER_SET_NAME,
    DEFAULT_COLLATION_NAME
FROM INFORMATION_SCHEMA.SCHEMATA
WHERE SCHEMA_NAME = '<Database Name>';
Note

In Aurora MySQL, a database is equivalent to an SQL Server schema. For more information, see Databases and Schemas.

Every string column in Aurora MySQL has a character set and an associated collation. If not explicitly specified, it will inherit the table default. To specify a non-default character set and collation, use the CHARACTER SET and COLLATE clauses of the CREATE TABLE statement.

CREATE TABLE MyTable
(
    StringColumn VARCHAR(5) NOT NULL
    CHARACTER SET latin1
    COLLATE latin1_german1_ci
);

At the expression level, similar to SQL Server, you can use the COLLATE function to explicitly declare a string’s collation. In addition, a prefix to the string can be used to denote a specific character set. Consider the following example:

SELECT _latin1'Latin non-UNICODE String',
_utf8'UNICODE String' COLLATE utf8_danish_ci;
Note

The Aurora MySQL term for this prefix or string header is introducer. It doesn’t change the value of the string; only the character set.

At the session level, the server’s setting determines the default character set and collation used to evaluate nonqualified strings.

Although the server’s character set and collation default settings can be modified using the cluster parameter groups, it is recommended that client applications don’t assume a specific setting and explicitly set the required character set and collation using the SET NAMES and SET CHARACTER SET statements.

For more information, see Connection Character Sets and Collations in the MySQL documentation.

Syntax

The following example creates a database-level collation.

CREATE DATABASE <Database Name>
[DEFAULT] CHARACTER SET <Character Set>
[[DEFAULT] COLLATE <Collation>];

The following example creates a table-level collation.

CREATE TABLE <Table Name>
(Column Specifications)
[DEFAULT] CHARACTER SET <Character Set>
[COLLATE <Collation>];

The following example creates a column collation.

CREATE TABLE <Table Name>
(
<Column Name> {CHAR | VARCHAR | TEXT} (<Length>)
CHARACTER SET CHARACTER SET <Character Set>
[COLLATE <Collation>];

The following example creates an expression collation.

_<Character Set>'<String>' COLLATE <Collation>

Examples

The following walkthrough describes how to change the cluster character set and collation.

  1. Log in to your Management Console, choose Amazon RDS , and then choose Parameter groups.

  2. Choose Create parameter group.

  3. For Parameter group family, choose aurora-mysql5.7.

  4. For Type, choose DB Cluster Parameter Group.

  5. For Group name, enter the identified for the DB parameter group.

  6. Choose Create.

  7. Choose the newly created group on the Parameter groups list.

  8. For Parameters, enter character_set_server in the search box and choose Edit parameters.

  9. Choose the server default character set.

  10. Delete the search term and enter collation. Select the desired default server collation and choose Preview changes.

  11. Check the values and choose Close, and then choose Save changes.

  12. Return to the Management Console dashboard and choose Create database.

  13. For Choose a database creation method, choose Easy create.

  14. For Engine type, choose Amazon Aurora .

  15. Enter the instance size, cluster identifier and username. Choose Create database.

  16. Modify the created instance to change the DB Parameter group.

Summary

The following table identifies similarities, differences, and key migration considerations.

Feature SQL Server Aurora MySQL

Unicode support

UTF 16 using NCHAR and NVARCHAR data types

8 UNICODE character sets, using the CHARACTER SET option

Collations levels

Server, Database, Column, Expression

Server, Database, Table, Column, Expression

View collation metadata

fn_helpcollation system view

INFORMATION_SCHEMA.SCHEMATA, SHOW COLLATION, SHOW CHARACTER SET

For more information, see Character Sets, Collations, Unicode in the MySQL documentation.