Summary Prerequisites and limitations Architecture Tools Epics Related resources

Convert the Teradata NORMALIZE temporal feature to Amazon Redshift SQL

Source: Teradata data warehouse	Target: Amazon Redshift	R Type: Re-architect
Environment: Production	Technologies: Analytics; Databases; Migration	Workload: All other workloads
AWS services: Amazon Redshift

Summary

NORMALIZE is a Teradata extension to the ANSI SQL standard. When a SQL table includes a column that has a PERIOD data type, NORMALIZE combines values that meet or overlap in that column, to form a single period that consolidates multiple, individual period values. To use NORMALIZE, at least one column in the SQL SELECT list must be of Teradata's temporal PERIOD data type. For more information about NORMALIZE, see the Teradata documentation.

Amazon Redshift doesn’t support NORMALIZE, but you can implement this functionality by using native SQL syntax and the LAG window function in Amazon Redshift. This pattern focuses on using the Teradata NORMALIZE extension with the ON MEETS OR OVERLAPS condition, which is the most popular format. It explains how this feature works in Teradata and how it can be converted into Amazon Redshift native SQL syntax.

Prerequisites and limitations

Prerequisites

Basic Teradata SQL knowledge and experience
Amazon Redshift knowledge and experience

Architecture

Source technology stack

Teradata data warehouse

Target technology stack

Amazon Redshift

Target architecture

For a high-level architecture for migrating a Teradata database to Amazon Redshift, see the pattern Migrate a Teradata database to Amazon Redshift using AWS SCT data extraction agents. The migration doesn't automatically convert the Teradata NORMALIZE phrase to Amazon Redshift SQL. You can convert this Teradata extension by following the guidelines in this pattern.

Tools

Code

To illustrate the concept and functionality of NORMALIZE, consider the following table definition in Teradata:


CREATE TABLE systest.project
     (    emp_id        INTEGER,
          project_name  VARCHAR(20),
          dept_id       INTEGER,
          duration      PERIOD(DATE)
     );

Run the following SQL code to insert sample data into the table:


BEGIN TRANSACTION;

INSERT INTO systest.project VALUES (10, 'First Phase', 1000,  PERIOD(DATE '2010-01-10', DATE '2010-03-20') );
INSERT INTO systest.project VALUES (10, 'First Phase', 2000,  PERIOD(DATE '2010-03-20', DATE '2010-07-15') );

INSERT INTO systest.project VALUES (10, 'Second Phase', 2000,  PERIOD(DATE '2010-06-15', DATE '2010-08-18') );
INSERT INTO systest.project VALUES (20, 'First Phase', 2000,  PERIOD(DATE '2010-03-10', DATE '2010-07-20') );

INSERT INTO systest.project VALUES (20, 'Second Phase', 1000,  PERIOD(DATE '2020-05-10', DATE '2020-09-20') );

END TRANSACTION;

Results:


select * from systest.project order by 1,2,3;
 
 *** Query completed. 4 rows found. 4 columns returned.
 *** Total elapsed time was 1 second.
 
     emp_id  project_name              dept_id  duration
-----------  --------------------  -----------  ------------------------
         10  First Phase                  1000  ('10/01/10', '10/03/20')        
         10  First Phase                  2000  ('10/03/20', '10/07/15')
         10  Second Phase                 2000  ('10/06/15', '10/08/18')
         20  First Phase                  2000  ('10/03/10', '10/07/20')
         20  Second Phase                 1000  ('20/05/10', '20/09/20')

Teradata NORMALIZE use case

Now add the Teradata NORMALIZE SQL clause to the SELECT statement:


SELECT NORMALIZE ON MEETS OR OVERLAPS emp_id, duration 
FROM systest.project 
ORDER BY 1,2;

This NORMALIZE operation is performed on a single column (emp_id). For emp_id=10, the three overlapping period values in duration coalesce into a single period value, as follows:


     emp_id  duration
-----------  ------------------------
         10  ('10/01/10', '10/08/18')
         20  ('10/03/10', '10/07/20')
         20  ('20/05/10', '20/09/20')

The following SELECT statement performs a NORMALIZE operation on project_name and dept_id. Note that the SELECT list contains only one PERIOD column, duration.


SELECT NORMALIZE project_name, dept_id, duration 
FROM systest.project;

Output:


project_name              dept_id  duration
--------------------  -----------  ------------------------
First Phase                  1000  ('10/01/10', '10/03/20')
Second Phase                 1000  ('20/05/10', '20/09/20')
First Phase                  2000  ('10/03/10', '10/07/20')
Second Phase                 2000  ('10/06/15', '10/08/18')

Amazon Redshift equivalent SQL

Amazon Redshift currently doesn’t support the PERIOD data type in a table. Instead, you need to divide a Teradata PERIOD data field into two parts: start_date, end_date, as follows:


CREATE TABLE systest.project
     (    emp_id        INTEGER,
          project_name  VARCHAR(20),
          dept_id       INTEGER,
          start_date  DATE,
          end_date    DATE
     );

Insert sample data into the table:


BEGIN TRANSACTION;
 
INSERT INTO systest.project VALUES (10, 'First Phase', 1000,  DATE '2010-01-10', DATE '2010-03-20' );
INSERT INTO systest.project VALUES (10, 'First Phase', 2000,  DATE '2010-03-20', DATE '2010-07-15');
 
INSERT INTO systest.project VALUES (10, 'Second Phase', 2000,  DATE '2010-06-15', DATE '2010-08-18' );
INSERT INTO systest.project VALUES (20, 'First Phase', 2000,  DATE '2010-03-10', DATE '2010-07-20' );
 
INSERT INTO systest.project VALUES (20, 'Second Phase', 1000,  DATE '2020-05-10', DATE '2020-09-20' );
 
END TRANSACTION;

Output:


 emp_id | project_name | dept_id | start_date |  end_date
--------+--------------+---------+------------+------------
     10 | First Phase  |    1000 | 2010-01-10 | 2010-03-20
     10 | First Phase  |    2000 | 2010-03-20 | 2010-07-15
     10 | Second Phase |    2000 | 2010-06-15 | 2010-08-18
     20 | First Phase  |    2000 | 2010-03-10 | 2010-07-20
     20 | Second Phase |    1000 | 2020-05-10 | 2020-09-20
(5 rows)

To rewrite Teradata’s NORMALIZE clause, you can use the LAG window function in Amazon Redshift. This function returns the values for a row at a given offset above (before) the current row in the partition.

You can use the LAG function to identify each row that begins a new period by determining if a period meets or overlaps with the previous period (0 if yes and 1 if no). When this flag is cumulatively summed up, it provides a group identifier that can be used in the outer Group By clause to arrive at the desired result in Amazon Redshift.

Here’s a sample Amazon Redshift SQL statement that uses LAG():


SELECT emp_id, start_date, end_date, 
            (CASE WHEN start_date <= LAG(end_date) OVER (PARTITION BY emp_id ORDER BY start_date, end_date) THEN 0 ELSE 1 END) AS GroupStartFlag
FROM systest.project 
ORDER BY 1,2;

Output:


 emp_id | start_date |  end_date  | groupstartflag
--------+------------+------------+----------------
     10 | 2010-01-10 | 2010-03-20 |              1
     10 | 2010-03-20 | 2010-07-15 |              0
     10 | 2010-06-15 | 2010-08-18 |              0
     20 | 2010-03-10 | 2010-07-20 |              1
     20 | 2020-05-10 | 2020-09-20 |              1
(5 rows)

The following Amazon Redshift SQL statement normalizes only on the emp_id column:


SELECT T2.emp_id, MIN(T2.start_date) as new_start_date, MAX(T2.end_date) as new_end_date
FROM 
( SELECT T1.*, SUM(GroupStartFlag) OVER (PARTITION BY emp_id ORDER BY start_date ROWS UNBOUNDED PRECEDING) As GroupID
FROM ( SELECT emp_id, start_date, end_date, 
            (CASE WHEN start_date <= LAG(end_date) OVER (PARTITION BY emp_id ORDER BY start_date, end_date) THEN 0 ELSE 1 END) AS GroupStartFlag
FROM systest.project ) T1
) T2
GROUP BY T2.emp_id, T2.GroupID
ORDER BY 1,2;

Output:


 emp_id | new_start_date | new_end_date
--------+----------------+------------------------------------
     10 | 2010-01-10     | 2010-08-18
     20 | 2010-03-10     | 2010-07-20
     20 | 2020-05-10     | 2020-09-20
(3 rows)

The following Amazon Redshift SQL statement normalizes on both the project_name and dept_id columns:


SELECT T2.project_name, T2.dept_id, MIN(T2.start_date) as new_start_date, MAX(T2.end_date) as new_end_date
FROM 
( SELECT T1.*, SUM(GroupStartFlag) OVER (PARTITION BY project_name, dept_id ORDER BY start_date ROWS UNBOUNDED PRECEDING) As GroupID
FROM ( SELECT project_name, dept_id, start_date, end_date, 
            (CASE WHEN start_date <= LAG(end_date) OVER (PARTITION BY project_name, dept_id ORDER BY start_date, end_date) THEN 0 ELSE 1 END) AS GroupStartFlag
FROM systest.project ) T1
) T2
GROUP BY T2.project_name, T2.dept_id, T2.GroupID
ORDER BY 1,2,3;

Output:


 project_name | dept_id | new_start_date | new_end_date
--------------+---------+----------------+--------------
 First Phase  |    1000 | 2010-01-10     | 2010-03-20
 First Phase  |    2000 | 2010-03-10     | 2010-07-20
 Second Phase |    1000 | 2020-05-10     | 2020-09-20
 Second Phase |    2000 | 2010-06-15     | 2010-08-18
(4 rows)

Epics

Task	Description	Skills required
Create your Teradata SQL code.	Use the NORMALIZE phrase according to your needs.	SQL developer
Convert the code to Amazon Redshift SQL.	To convert your code, follow the guidelines in the "Tools" section of this pattern.	SQL developer
Run the code in Amazon Redshift.	Create your table, load data into the table, and run your code in Amazon Redshift.	SQL developer

References

Teradata NORMALIZE temporal feature (Teradata documentation)
LAG window function (Amazon Redshift documentation)
Migrate to Amazon Redshift (AWS website)
Migrate a Teradata database to Amazon Redshift using AWS SCT data extraction agents (AWS Prescriptive Guidance)
Convert the Teradata RESET WHEN feature to Amazon Redshift SQL (AWS Prescriptive Guidance)

Tools

AWS Schema Conversion Tool (AWS SCT)

Partners

AWS Migration Competency Partners

Warning Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Document Conventions

Calculate value at risk (VaR) by using AWS services

Convert RESET WHEN to Amazon Redshift SQL