Oracle character sets and PostgreSQL encoding
With AWS DMS, you can migrate databases between different database platforms while handling character set and encoding differences. Oracle databases use character sets to define which characters are allowed, while PostgreSQL uses encodings for the same purpose.
Feature compatibility | AWS SCT / AWS DMS automation level | AWS SCT action code index | Key differences |
---|---|---|---|
| | N/A | N/A |
Oracle usage
Oracle supports most national and international encoded character set standards including extensive support for Unicode.
Oracle provides two scalar string-specific data types, illustrated in the sketch after this list:
- VARCHAR2 — Stores variable-length character strings with a length between 1 and 4,000 bytes. The Oracle database can be configured to use the VARCHAR2 data type to store either Unicode or non-Unicode characters.
- NVARCHAR2 — Scalar data type used to store Unicode data. Supports AL16UTF16 or UTF8, which is specified during database creation.
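A minimal sketch of a table using both types follows; the table and column names are illustrative only and not part of any standard schema.
CREATE TABLE translations (
    id          NUMBER PRIMARY KEY,
    source_txt  VARCHAR2(200 BYTE),   -- stored in the database character set
    target_txt  NVARCHAR2(200)        -- stored in the national character set (AL16UTF16 or UTF8)
);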
Character sets in Oracle are defined at the instance level (Oracle 11g) or the pluggable database level (Oracle 12c R2). In pre-12c R2 Oracle databases, the character sets of the root container and all of its pluggable databases were required to be identical.
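To check which character sets an existing instance or pluggable database uses, you can query the NLS_DATABASE_PARAMETERS view, as in the following sketch.
-- Show the database and national (NCHAR) character sets
SELECT parameter, value
FROM NLS_DATABASE_PARAMETERS
WHERE parameter IN ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');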
Oracle 18c updates the AL32UTF8 and AL16UTF16 character sets to Unicode standard version 9.0.
UTF8 Unicode
Oracle’s implementation uses the AL32UTF8 character set, which encodes ASCII and Latin characters as a single byte, characters from some European and Middle-Eastern languages as two bytes, and certain South and East Asian characters as three bytes. Therefore, Unicode storage requirements are usually higher compared with non-Unicode character sets.
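The following sketch illustrates this overhead in an AL32UTF8 database by comparing the character and byte lengths of a string containing a non-ASCII character; the sample string is illustrative.
-- 'ü' occupies two bytes in AL32UTF8, so the byte count exceeds the character count
SELECT LENGTH('München')  AS char_count,   -- returns 7
       LENGTHB('München') AS byte_count    -- returns 8
FROM dual;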
Character set migration
Two options exist for modifying existing instance-level or database-level character sets:
- Export/Import from the source Instance/PDB to a new Instance/PDB with a modified character set (a Data Pump sketch follows this list).
- Use the Database Migration Assistant for Unicode (DMU), which simplifies the migration process to the Unicode character set.
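A hedged sketch of the first option using Oracle Data Pump follows; the schema name, directory object, and file names are placeholders, and the target instance or PDB is assumed to have already been created with the desired character set.
-- Export from the source instance/PDB
expdp system SCHEMAS=hr DIRECTORY=data_pump_dir DUMPFILE=hr.dmp LOGFILE=hr_exp.log
-- Import into the new instance/PDB created with the target character set
impdp system SCHEMAS=hr DIRECTORY=data_pump_dir DUMPFILE=hr.dmp LOGFILE=hr_imp.log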
As of 2012, use of the CSALTER utility for character set migrations is deprecated.
Note
Oracle Database 12c Release 1 (12.1.0.1) complies with version 6.1 of the Unicode standard.
Oracle Database 12c Release 1 (12.1.0.2) extends the compliance to version 6.2 of the Unicode standard.
UTF-8 is supported through the AL32UTF8 character set and is valid as both the client and database character set.
UTF-16BE is supported through AL16UTF16 and is valid as the national (NCHAR) character set.
For more information, see Choosing a Character Set in the Oracle documentation.
PostgreSQL usage
PostgreSQL supports a variety of character sets, also known as encodings, including both single-byte and multi-byte languages. The default character set is specified when initializing your PostgreSQL database cluster with initdb. Each database created in the PostgreSQL cluster can use its own character set, defined as part of database creation.
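For a self-managed cluster, the default encoding can be set at initialization time, as in the sketch below; the data directory path and locale are placeholders, and on Amazon Aurora PostgreSQL the cluster is initialized for you.
initdb -D /usr/local/pgsql/data --encoding=UTF8 --locale=en_US.UTF-8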
Note
Starting with PostgreSQL 13, Windows versions support obtaining version information for collations (ordering rules) from the operating system. This option is relevant for self-managed PostgreSQL installations running on Windows.
Prior to version 13, querying collversion from pg_collation in PostgreSQL running on Windows returned no value reflecting the OS collation version. For example, version 11 running on Windows:
CREATE COLLATION german (provider = libc, locale = 'de_DE');
CREATE COLLATION

select oid,collname,collversion from pg_collation where collprovider='c' and collname='german';
  oid  | collname | collversion
-------+----------+-------------
 16394 | german   |
(1 row)

select pg_collation_actual_version(16394);
 pg_collation_actual_version
------------------------------

(1 row)
Starting with PostgreSQL 13 running on Windows:
CREATE COLLATION german (provider = libc, locale = 'de_DE');
CREATE COLLATION

select oid,collname,collversion from pg_collation where collprovider='c' and collname='german';
  oid  | collname | collversion
-------+----------+---------------
 32769 | german   | 1539.5,1539.5
(1 row)

select pg_collation_actual_version(32769);
 pg_collation_actual_version
------------------------------
 1539.5,1539.5
(1 row)
Note
All supported character sets can be used by clients. However, some character sets are client-only and are not supported for use within the server.
Unlike Oracle, PostgreSQL doesn’t support an NVARCHAR data type and doesn’t offer support for UTF-16.
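In practice, Oracle NVARCHAR2 columns are therefore typically mapped to regular character types in a UTF8-encoded PostgreSQL database; a minimal sketch with illustrative names follows.
-- In a UTF8-encoded database, varchar/text columns store Unicode data,
-- so an Oracle NVARCHAR2(200) column can map to varchar(200)
CREATE TABLE translations (
    id         integer PRIMARY KEY,
    target_txt varchar(200)
);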
Type | Function | Implementation level |
---|---|---|
Encoding | Defines the basic rules of how alphanumeric characters are represented in binary format, for example, Unicode encoding. | Database |
Locale | Superset that includes LC_COLLATE and LC_CTYPE, among others. LC_COLLATE defines how strings are sorted and must be a subset supported by the database encoding. LC_CTYPE is used to classify whether a character is a digit, letter, whitespace, punctuation, and so on. | Table-Column |
Examples
Create a database named test01 which uses the Korean EUC_KR encoding and the ko_KR locale.
CREATE DATABASE test01 WITH ENCODING 'EUC_KR' LC_COLLATE='ko_KR.euckr' LC_CTYPE='ko_KR.euckr' TEMPLATE=template0;
View the character sets configured for each database by querying the System Catalog.
select datname, datcollate, datctype from pg_database;
Changing character sets or encoding
In-place modification of the database encoding is neither recommended nor supported. You must export all data, create a new database with the new encoding, and import the data.
Export the data using the pg_dump utility.
pg_dump mydb1 > mydb1_export.sql
Rename or delete your current database.
ALTER DATABASE mydb1 RENAME TO mydb1_backup;
Create a new database using the modified encoding.
CREATE DATABASE mydb1_new_encoding WITH ENCODING 'UNICODE' TEMPLATE=template0;
Import the data using the pg_dump file previously created. Verify that you set your client encoding to the encoding of your old database.
PGCLIENTENCODING=OLD_DB_ENCODING psql -f mydb1_export.sql mydb1_new_encoding
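After the import completes, a quick sanity check of the new database’s encoding can be run against the system catalog, as in this sketch.
select datname, pg_encoding_to_char(encoding) from pg_database where datname = 'mydb1_new_encoding';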
Note
Using the client_encoding parameter overrides the use of PGCLIENTENCODING.
Client/server character set conversions
PostgreSQL supports conversion of character sets between server and client for specific character set combinations as described in the pg_conversion system catalog.
PostgreSQL includes predefined conversions. For a complete list, see Built-in Client/Server Character Set Conversions in the PostgreSQL documentation.
You can create a new conversion using the SQL command CREATE CONVERSION.
Examples
Create a conversion from UTF8 to LATIN1 using a custom-made myfunc1 function.
CREATE CONVERSION myconv FOR 'UTF8' TO 'LATIN1' FROM myfunc1;
Configure the PostgreSQL client character set.
psql
\encoding SJIS
SET CLIENT_ENCODING TO 'value';
View the client character set and reset it back to the default value.
SHOW client_encoding;
RESET client_encoding;
Table level collation
PostgreSQL supports specifying the sort order and character classification behavior on a per-column level.
Example
Specify specific collations for individual table columns.
CREATE TABLE test1 (col1 text COLLATE "de_DE", col2 text COLLATE "es_ES");
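Column collations apply automatically to comparisons and ORDER BY on those columns, and they can be overridden per expression, as in the following sketch (assuming the de_DE and es_ES locales are installed on the server).
-- Uses the column's default collation (de_DE)
SELECT col1 FROM test1 ORDER BY col1;
-- Overrides the column collation for this query only
SELECT col1 FROM test1 ORDER BY col1 COLLATE "es_ES";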
Summary
Feature | Oracle | Aurora PostgreSQL |
---|---|---|
View database character set | SELECT * FROM NLS_DATABASE_PARAMETERS; | select datname, pg_encoding_to_char(encoding), datcollate, datctype from pg_database; |
Modify the database character set | Export/Import into a new Instance/PDB or use the Database Migration Assistant for Unicode (DMU) | Export the data, create a new database with the new encoding, and import the data |
Character set granularity | Instance (11g) or pluggable database (12c R2) | Database |
UTF8 | Supported by the VARCHAR2 and NVARCHAR2 data types | Supported by the VARCHAR data type |
UTF16 | Supported by the NVARCHAR2 data type | Not supported |
NCHAR/NVARCHAR data types | Supported | Not supported |
For more information, see Character Set Support in the PostgreSQL documentation.