Set up language-specific sorting for Amazon Redshift query results using a scalar Python UDF - AWS Prescriptive Guidance

Set up language-specific sorting for Amazon Redshift query results using a scalar Python UDF

Created by Ethan Stark (AWS)

Environment: Production

Technologies: Analytics

AWS services: Amazon Redshift

Summary

This pattern provides steps and sample code for using a scalar Python UDF (user-defined function) to set up case insensitive linguistic sorting for Amazon Redshift query results. It’s necessary to use a scalar Python UDF because Amazon Redshift returns results based on binary UTF-8 ordering and doesn’t support language-specific sorting. A Python UDF is non-SQL processing code that’s based on a Python 2.7 program and runs in a data warehouse. You can run Python UDF code with a SQL statement in a single query. For more information, see the Introduction to Python UDFs in Amazon Redshift AWS Big Data Blog post.

The sample data in this pattern is based on the Turkish alphabet for demonstration purposes. The scalar Python UDF in this pattern is built to make the default query results of Amazon Redshift conform to the linguistic ordering of characters in the Turkish language. For more information, see Turkish language example in the Additional information section of this pattern. You can modify the scalar Python UDF in this pattern for other languages.

Prerequisites and limitations

Prerequisites

  • Amazon Redshift cluster with a database, schema, and tables

  • Amazon Redshift user with CREATE TABLE and CREATE FUNCTION permissions

  • Python 2.7 or later

Limitations

The linguistic sorting used by the queries in this pattern is case insensitive.

Architecture

Technology stack

  • Amazon Redshift

  • Python UDF

Tools

AWS services

  • Amazon Redshift is a managed petabyte-scale data warehouse service in the AWS Cloud. Amazon Redshift is integrated with your data lake, which enables you to use your data to acquire new insights for your business and customers.

Other tools

Epics

TaskDescriptionSkills required

Create a table for your sample data.

To create a table in Amazon Redshift and insert your sample data into the table, use the following SQL statements:

CREATE TABLE my_table (first_name varchar(30)); INSERT INTO my_table (first_name) VALUES ('ali'), ('Ali'), ('ırmak'), ('IRMAK'), ('irem'), ('İREM'), ('oğuz'), ('OĞUZ'), ('ömer'), ('ÖMER'), ('sedat'), ('SEDAT'), ('şule'),

Note: The first names in the sample data include special characters from the Turkish alphabet. For more information about Turkish language considerations for this example, see Turkish language example in the Additional information section of this pattern.

Data engineer

Check the default sorting of the sample data.

To see the default sorting of your sample data in Amazon Redshift, run the following query:

SELECT first_name FROM my_table ORDER BY first_name;

The query returns the list of first names from the table that you created earlier:

first_name --------------- Ali IRMAK OĞUZ SEDAT ali irem oğuz sedat ÖMER ömer İREM ırmak ŞULE şule

The query results aren’t in the correct order because the default binary UTF-8 ordering doesn’t accommodate the linguistic ordering of the Turkish special characters.

Data engineer

Create a scalar Python UDF.

To create a scalar Python UDF, use the following SQL code: 

CREATE OR REPLACE FUNCTION collate_sort (value varchar) RETURNS varchar IMMUTABLE AS $$ def sort_str(val): import string dictionary = { 'I': 'ı', 'ı': 'h~', 'İ': 'i', 'Ş': 's~', 'ş': 's~', 'Ğ': 'g~', 'ğ': 'g~', 'Ü': 'u~', 'ü': 'u~', 'Ö': 'o~', 'ö': 'o~', 'Ç': 'c~', 'ç': 'c~' } for key, value in dictionary.items(): val = val.replace(key, value) return val.lower() return sort_str(value) $$ LANGUAGE plpythonu;
Data engineer

Query the sample data.

To query the sample data by using the Python UDF, run the following SQL query:

SELECT first_name FROM my_table ORDER BY collate_order(first_name);

The query now returns the sample data in Turkish linguistic order:

first_name --------------- ali Ali ırmak IRMAK irem İREM oğuz OĞUZ ömer Ömer sedat SEDAT şule ŞULE
Data engineer

Related resources

Additional information

Turkish language example

Amazon Redshift returns query results based on binary UTF-8 sort ordering, not language-specific sort ordering. This means that if you query an Amazon Redshift table containing Turkish characters, then the query results aren’t sorted according to the linguistic ordering of the Turkish language. The Turkish language contains six special characters (ç, ı, ğ, ö, ş, and ü) that don’t appear in the Latin alphabet. These special characters are placed at the end of a sorted result set based on binary UTF-8 ordering, as the following table shows.

Binary UTF-8 ordering

Turkish linguistic ordering

a

a

b

b

c

c

d

ç (*)

e

d

f

e

g

f

h

g

i

ğ (*)

j

h

k

ı (*)

l

i

m

j

n

k

o

l

p

m

r

n

s

o

t

ö (*)

u

p

v

r

y

s

z

ş (*)

ç (*)

t

ğ (*)

u

ı (*)

ü (*)

ö (*)

v

ş (*)

y

ü (*)

z

Note: The asterisk (*) indicates a special character in the Turkish language.

As the table above illustrates, special character ç is between c and d in Turkish linguistic ordering, but appears after z in binary UTF-8 ordering. The scalar Python UDF in this pattern uses the following character replacement dictionary to replace the Turkish special characters with corresponding Latin-equivalent characters.

Turkish special character

Latin-equivalent character

ç

c~

ı

h~

ğ

g~

ö

o~

ş

s~

ü

u~

Note: A tilde (~) character is appended to the end of the Latin characters that replace their corresponding Turkish special characters.

Modify a scalar Python UDF function

To modify the scalar Python UDF function from this pattern so that the function accepts a locate parameter and supports a multiple transaction dictionary, use the following SQL code:

CREATE OR REPLACE FUNCTION collate_sort (value varchar, locale varchar) RETURNS varchar IMMUTABLE AS $$ def sort_str(val): import string # Turkish Dictionary if locale == 'tr-TR': dictionary = { 'I': 'ı', 'ı': 'h~', 'İ': 'i', 'Ş': 's~', 'ş': 's~', 'Ğ': 'g~', 'ğ': 'g~', 'Ü': 'u~', 'ü': 'u~', 'Ö': 'o~', 'ö': 'o~', 'Ç': 'c~', 'ç': 'c~' } # German Dictionary if locale == 'de-DE': dictionary = { .... .... } for key, value in dictionary.items(): val = val.replace(key, value) return val.lower() return sort_str(value) $$ LANGUAGE plpythonu;

The following example code shows how to query the modified Python UDF:

SELECT first_name FROM my_table ORDER BY collate_order(first_name, 'tr-TR');