Transfer large-scale Db2 z/OS data to Amazon S3 in CSV files - AWS Prescriptive Guidance

Transfer large-scale Db2 z/OS data to Amazon S3 in CSV files

Created by Bruno Sahinoglu (AWS), Ivan Schuster (AWS), and Abhijit Kshirsagar (AWS)

Code repository: Unload DB2 z/OS to S3

Environment: Production

Source: Db2

Target: Amazon S3

R Type: Replatform

Workload: IBM

Technologies: Mainframe; Data lakes; Databases; Software development & testing; Migration

AWS services: Amazon Aurora; AWS Glue; Amazon S3; AWS Transfer Family; Amazon Athena

Summary

A mainframe is still a system of record in many enterprises, holding massive amounts of data that include master data entities and records of current as well as historical business transactions. This data is often siloed and not easily accessible to the distributed systems within the same enterprise. With the emergence of cloud technology and the democratization of big data, enterprises are interested in using the insights hidden in mainframe data to develop new business capabilities.

With that objective, enterprises are looking to open their mainframe Db2 data to their Amazon Web Services (AWS) Cloud environment. The business reasons vary, and the transfer methods differ from case to case. You might prefer to connect your application directly to the mainframe, or you might prefer to replicate your data in near real time. If the use case is to feed a data warehouse or a data lake, having an up-to-the-second copy is no longer a concern, and the procedure described in this pattern might suffice, especially if you want to avoid third-party product licensing costs. Another use case is transferring mainframe data for a migration project, where the data is required to perform functional equivalence testing. The approach described in this pattern is a cost-effective way to transfer Db2 data to the AWS Cloud environment.

Because Amazon Simple Storage Service (Amazon S3) is one of the most integrated AWS services, you can access the data from there and gather insights directly by using other AWS services such as Amazon Athena, AWS Lambda functions, or Amazon QuickSight. You can also load the data into Amazon Aurora or Amazon DynamoDB by using AWS Glue or AWS Database Migration Service (AWS DMS). With that aim in mind, this pattern describes how to unload Db2 data into ASCII-format CSV files on the mainframe and transfer the files to Amazon S3.

For this purpose, mainframe scripts are provided to help you generate job control language (JCL) jobs that unload and transfer as many Db2 tables as you need.

Prerequisites and limitations

Prerequisites

  • An IBM z/OS operating system user with authorization to run Restructured Extended Executor (REXX) and JCL scripts.

  • Access to z/OS Unix System Services (USS) to generate SSH (Secure Shell) private and public keys.

  • A writable S3 bucket. For more information, see Create your first S3 bucket in the Amazon S3 documentation.

  • An AWS Transfer Family SSH File Transfer Protocol (SFTP)-enabled server using Service managed as the identity provider and Amazon S3 as the AWS storage service. For more information, see Create an SFTP-enabled server in the AWS Transfer Family documentation.

Limitations

  • This approach isn’t suitable for near real-time or real-time data synchronization.

  • Data can be moved only from Db2 z/OS to Amazon S3, not the other way around.

Architecture

Source technology stack

  • Mainframe running Db2 on z/OS

Target technology stack

  • AWS Transfer Family

  • Amazon S3

  • Amazon Athena

  • Amazon QuickSight

  • AWS Glue

  • Amazon Relational Database Service (Amazon RDS)

  • Amazon Aurora

  • Amazon Redshift

Source and target architecture

The following diagram shows the process for generating, extracting, and transferring Db2 z/OS data in ASCII CSV format to an S3 bucket.

  1. A list of tables is selected for data migration from the Db2 catalog.

  2. The list is used to drive the generation of unload jobs with the numeric and date columns converted to external format.

  3. The data is then transferred over to Amazon S3 by using AWS Transfer Family.

  4. An AWS Glue extract, transform, and load (ETL) job can transform the data and load it to a processed bucket in the specified format, or AWS Glue can feed the data directly into the database.

  5. Amazon Athena and Amazon QuickSight can be used to query and render the data to drive analytics.

The following diagram shows a logical flow of the entire process.

  1. The first JCL, called TABNAME, will use the Db2 utility DSNTIAUL to extract and generate the list of tables that you plan to unload from Db2. To choose your tables, you must manually adapt the SQL input to select and add filter criteria to include one or more Db2 schemas.

  2. The second JCL, called REXXEXEC, will use the JCL skeleton and the REXX program that are provided to process the table list created by the JCL TABNAME and generate one JCL per table name. Each JCL will contain one step for unloading the table and another step for sending the file to the S3 bucket by using the SFTP protocol.

  3. The last step consists of running the generated JCLs to unload the tables and transfer the files to AWS. The entire process can be automated by using a scheduler on premises or on AWS.

Tools

AWS services

  • Amazon Athena is an interactive query service that helps you analyze data directly in Amazon Simple Storage Service (Amazon S3) by using standard SQL.

  • Amazon Aurora is a fully managed relational database engine that's built for the cloud and compatible with MySQL and PostgreSQL.

  • AWS Glue is a fully managed extract, transform, and load (ETL) service. It helps you reliably categorize, clean, enrich, and move data between data stores and data streams.

  • Amazon QuickSight is a cloud-scale business intelligence (BI) service that helps you visualize, analyze, and report your data in a single dashboard.

  • Amazon Redshift is a managed petabyte-scale data warehouse service in the AWS Cloud.

  • Amazon Relational Database Service (Amazon RDS) helps you set up, operate, and scale a relational database in the AWS Cloud.

  • Amazon Simple Storage Service (Amazon S3) is a cloud-based object storage service that helps you store, protect, and retrieve any amount of data.

  • AWS Transfer Family is a secure transfer service that enables you to transfer files into and out of AWS storage services.

Mainframe tools

  • SSH File Transfer Protocol (SFTP) is a secure file transfer protocol that allows remote login to and file transfer between servers. SSH provides security by encrypting all traffic.

  • DSNTIAUL is a sample program provided by IBM for unloading data.

  • DSNUTILB is a utilities batch program provided by IBM for unloading data, with options that DSNTIAUL doesn't offer.

  • z/OS OpenSSH is a port of the open-source SSH software that runs on Unix System Services under the IBM z/OS operating system. SSH is a secure, encrypted connection program between two computers running on a TCP/IP network. It provides multiple utilities, including ssh-keygen.

  • A REXX (Restructured Extended Executor) script is used to automate JCL generation with the Db2 unload and SFTP steps.

Code

The code for this pattern is available in the GitHub unloaddb2 repository.

Best practices

For the first unload, the generated JCLs should unload the entire table data.

After the first full unload, perform incremental unloads for better performance and cost savings. Update the SQL query in the template JCL deck to accommodate any changes to the unload process.

You can convert the schema manually or by using a script on Lambda with the Db2 SYSPUNCH as input. For an industrial-scale process, the AWS Schema Conversion Tool (AWS SCT) is the preferred option.

Finally, use a mainframe-based scheduler or a scheduler on AWS with an agent on the mainframe to help manage and automate the entire process.

Epics

Task | Description | Skills required

Create the S3 bucket.

For instructions, see Create your first S3 bucket.

General AWS
Task | Description | Skills required

Create an SFTP-enabled server.

To create an SFTP-enabled server in the AWS Transfer Family console, do the following:

  1. On the Choose protocols page, select the SFTP (SSH File Transfer Protocol) – file transfer over Secure Shell check box.

  2. For the identity provider, choose Service managed.

  3. For the endpoint, choose Publicly accessible.

  4. For the domain, choose Amazon S3.

  5. On the Configure additional details page, keep the default settings.

  6. Create the server.

General AWS

Create an IAM role for Transfer Family.

To create an AWS Identity and Access Management (IAM) role for Transfer Family to access Amazon S3, follow the instructions in Create an IAM role and policy.

AWS administrator

Add an Amazon S3 service-managed user.

To add the Amazon S3 service-managed user, follow the instructions in the AWS documentation, and use your mainframe user ID.

General AWS
Task | Description | Skills required

Create the SSH key.

Under your mainframe USS environment, run the following command.

ssh-keygen -t rsa

Note: When prompted for a passphrase, keep it empty.

Mainframe developer

Give the right authorization levels to the SSH folder and key files.

By default, the public and private keys will be stored in the user directory /u/home/username/.ssh.

You must give the authorization 700 to the folder and 600 to the private key file, because OpenSSH refuses to use a private key that other users can read. The public key file (id_rsa.pub) can keep the authorization 644.

chmod 700 .ssh
chmod 600 .ssh/id_rsa
Mainframe developer

Copy the public key content to your Amazon S3 service-managed user.

To copy the USS-generated public key content, open the AWS Transfer Family console, and do the following:

  1. In the navigation pane, choose Servers.

  2. Choose the identifier in the Server ID column to see the server details.

  3. Under Users, choose a user name to see the user details.

  4. Under SSH public keys, choose Add SSH public key, and then enter your public key. The service validates the key before you can add it to the user.

  5. Choose Add key.

Mainframe developer
Task | Description | Skills required

Generate the in-scope Db2 table list.

Provide input SQL to create a list of the tables that are in scope for data migration. This step requires you to specify selection criteria that query the Db2 catalog table SYSIBM.SYSTABLES by using an SQL WHERE clause. The filters can be customized to include a specific schema, table names that start with a particular prefix, or a timestamp for incremental unloads. The output is captured in a physical sequential (PS) dataset on the mainframe. This dataset acts as input for the next phase, JCL generation.
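The selection criteria can be sketched as a small generator. This is an illustrative helper only (`build_table_list_sql` is not part of the pattern's scripts); it renders the same catalog query that the TABNAME job runs through DSNTIAUL:

```python
def build_table_list_sql(schema: str, prefix: str) -> str:
    """Render the DSNTIAUL input SQL that lists the in-scope tables
    from the Db2 catalog table SYSIBM.SYSTABLES."""
    return (
        "SELECT CHAR(CREATOR), CHAR(NAME)\n"
        "FROM SYSIBM.SYSTABLES\n"
        f"WHERE OWNER = '{schema}'\n"
        f"  AND NAME LIKE '{prefix}%'\n"
        "  AND TYPE = 'T';"
    )

print(build_table_list_sql("DSN8C10", "EMP"))
```

For an incremental unload, you would extend the WHERE clause with an additional timestamp predicate, as described above.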

Before you use the JCL TABNAME (you can rename it if necessary), make the following changes:

  1. Substitute <Jobcard> with a job class and a user that is authorized to run Db2 utilities.

  2. Substitute <HLQ1> or customize the output dataset names to meet your site standards.

  3. Update the STEPLIB stack of PDSEs (partitioned data set extended) according to your site standards. The example in this pattern uses the IBM defaults.

  4. Substitute the PLAN name and LIB with your installation-specific values.

  5. Substitute <Schema> and <Prefix> with your selection criteria for the Db2 catalog.

  6. Save the resultant JCL in a PDS (partitioned data set) library.

  7. Submit the JCL.

Db2 table list extraction job

<Jobcard>
//*
//* UNLOAD ALL THE TABLE NAMES FOR A PARTICULAR SCHEMA
//*
//STEP01   EXEC PGM=IEFBR14
//*
//DD1      DD DISP=(MOD,DELETE,DELETE),
//            UNIT=SYSDA,
//            SPACE=(1000,(1,1)),
//            DSN=<HLQ1>.DSN81210.TABLIST
//*
//DD2      DD DISP=(MOD,DELETE,DELETE),
//            UNIT=SYSDA,
//            SPACE=(1000,(1,1)),
//            DSN=<HLQ1>.DSN81210.SYSPUNCH
//*
//UNLOAD   EXEC PGM=IKJEFT01,DYNAMNBR=20
//SYSTSPRT DD SYSOUT=*
//STEPLIB  DD DISP=SHR,DSN=DSNC10.DBCG.SDSNEXIT
//         DD DISP=SHR,DSN=DSNC10.SDSNLOAD
//         DD DISP=SHR,DSN=CEE.SCEERUN
//         DD DISP=SHR,DSN=DSNC10.DBCG.RUNLIB.LOAD
//SYSTSIN  DD *
  DSN SYSTEM(DBCG)
  RUN PROGRAM(DSNTIAUL) PLAN(DSNTIB12) PARMS('SQL') -
      LIB('DSNC10.DBCG.RUNLIB.LOAD')
  END
//SYSPRINT DD SYSOUT=*
//*
//SYSUDUMP DD SYSOUT=*
//*
//SYSREC00 DD DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(32760,(1000,500)),
//            DSN=<HLQ1>.DSN81210.TABLIST
//*
//SYSPUNCH DD DISP=(NEW,CATLG,DELETE),
//            UNIT=SYSDA,SPACE=(32760,(1000,500)),
//            VOL=SER=SCR03,RECFM=FB,LRECL=120,BLKSIZE=12,
//            DSN=<HLQ1>.DSN81210.SYSPUNCH
//*
//SYSIN    DD *
  SELECT CHAR(CREATOR), CHAR(NAME)
  FROM SYSIBM.SYSTABLES
  WHERE OWNER = '<Schema>'
    AND NAME LIKE '<Prefix>%'
    AND TYPE = 'T';
/*
Mainframe developer

Modify the JCL templates.

The JCL templates that are provided with this pattern contain a generic job card and library names. However, most mainframe sites have their own naming standards for dataset names, library names, and job cards. For example, a specific job class might be required to run Db2 jobs. The Job Entry Subsystem implementations JES2 and JES3 can impose additional changes. Standard load libraries might have a first qualifier that differs from SYS1, which is the IBM default. Therefore, customize the templates to account for your site-specific standards before you run them.

Make the following changes in the skeleton JCL UNLDSKEL:

  1. Modify the job card with a job class and user that is authorized to run Db2 utilities.

  2. Customize output dataset names to meet your site standards.

  3. Update the STEPLIB stack of PDSEs according to your site standards. The example in this pattern uses the IBM defaults.

  4. Substitute <DSN> with your Db2 subsystem name and correlation ID.

  5. Save the resultant JCL in a PDS library that is part of your ISPSLIB stack, which is the standard skeleton template library for ISPF.

Unload and SFTP JCL skeleton

//&USRPFX.U JOB (DB2UNLOAD),'JOB',CLASS=A,MSGCLASS=A,
//          TIME=1440,NOTIFY=&USRPFX
//* DELETE DATASETS
//STEP01   EXEC PGM=IEFBR14
//DD01     DD DISP=(MOD,DELETE,DELETE),
//            UNIT=SYSDA,
//            SPACE=(TRK,(1,1)),
//            DSN=&USRPFX..DB2.PUNCH.&JOBNAME
//DD02     DD DISP=(MOD,DELETE,DELETE),
//            UNIT=SYSDA,
//            SPACE=(TRK,(1,1)),
//            DSN=&USRPFX..DB2.UNLOAD.&JOBNAME
//*
//* RUNNING DB2 EXTRACTION BATCH JOB FOR AWS DEMO
//*
//UNLD01   EXEC PGM=DSNUTILB,REGION=0M,
//            PARM='<DSN>,UNLOAD'
//STEPLIB  DD DISP=SHR,DSN=DSNC10.DBCG.SDSNEXIT
//         DD DISP=SHR,DSN=DSNC10.SDSNLOAD
//SYSPRINT DD SYSOUT=*
//UTPRINT  DD SYSOUT=*
//SYSOUT   DD SYSOUT=*
//SYSPUN01 DD DISP=(NEW,CATLG,DELETE),
//            SPACE=(CYL,(1,1),RLSE),
//            DSN=&USRPFX..DB2.PUNCH.&JOBNAME
//SYSREC01 DD DISP=(NEW,CATLG,DELETE),
//            SPACE=(CYL,(10,50),RLSE),
//            DSN=&USRPFX..DB2.UNLOAD.&JOBNAME
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  UNLOAD DELIMITED COLDEL ','
  FROM TABLE &TABNAME
  UNLDDN SYSREC01
  PUNCHDDN SYSPUN01
  SHRLEVEL CHANGE ISOLATION UR;
/*
//*
//* FTP TO AMAZON S3 BACKED FTP SERVER IF UNLOAD WAS SUCCESSFUL
//*
//SFTP     EXEC PGM=BPXBATCH,COND=(4,LE),REGION=0M
//STDPARM  DD *
SH cp "//'&USRPFX..DB2.UNLOAD.&JOBNAME'" &TABNAME..csv;
echo "ascii " >> uplcmd;
echo "PUT &TABNAME..csv " >> uplcmd;
sftp -b uplcmd -i .ssh/id_rsa &FTPUSER.@&FTPSITE;
rm &TABNAME..csv;
//SYSPRINT DD SYSOUT=*
//STDOUT   DD SYSOUT=*
//STDENV   DD *
//STDERR   DD SYSOUT=*

 

Mainframe developer

Generate the Mass Unload JCL.

This step involves running a REXX script under an ISPF environment by using JCL. Provide the list of in-scope tables created in the first step as input for mass JCL generation, against the TABLIST DD name. The JCL generates one new JCL per table name in a user-specified partitioned dataset, which is specified against the ISPFILE DD name and must be allocated beforehand. Each new JCL has two steps: one step to unload the Db2 table into a file, and one step to send the file to the S3 bucket.

Make the following changes in the JCL REXXEXEC (you can change the name):

  1. Substitute the job card user ID with a mainframe user ID that has unload authority on the tables. Substitute the <HLQ1> value in SYSPROC, ISPPLIB, ISPSLIB, ISPMLIB, and ISPTLIB, or customize the DSNs to meet your site standards. To find your installation-specific values, use the command TSO ISRDDN.

  2. Substitute <MFUSER> with a user ID that has job-running privileges in your installation.

  3. Substitute <FTPUSER> with a user ID that has USS and FTP privileges in your installation. It is assumed that this user ID and its SSH security keys are in place in the appropriate Unix System Services directory on the mainframe.

  4. Substitute <AWS TransferFamily IP> with the AWS Transfer Family IP address or the domain name. This address will be used for the SFTP step.

  5. Submit the JCL after applying your site-specific standards and updating the REXX program as described below.

Mass JCL generation job

//RUNREXX JOB (CREATEJCL),'RUNS ISPF TABLIST',CLASS=A,MSGCLASS=A,
//        TIME=1440,NOTIFY=&SYSUID
//* Most of the values required can be updated to your site specific
//* values using the command 'TSO ISRDDN' in your ISPF session.
//* Update all the lines tagged with //update marker to desired
//* site specific values.
//ISPF     EXEC PGM=IKJEFT01,REGION=2048K,DYNAMNBR=25
//SYSPROC  DD DISP=SHR,DSN=USER.Z23D.CLIST
//SYSEXEC  DD DISP=SHR,DSN=<HLQ1>.TEST.REXXLIB
//ISPPLIB  DD DISP=SHR,DSN=ISP.SISPPENU
//ISPSLIB  DD DISP=SHR,DSN=ISP.SISPSENU
//         DD DISP=SHR,DSN=<HLQ1>.TEST.ISPSLIB
//ISPMLIB  DD DSN=ISP.SISPMENU,DISP=SHR
//ISPTLIB  DD DDNAME=ISPTABL
//         DD DSN=ISP.SISPTENU,DISP=SHR
//ISPTABL  DD LIKE=ISP.SISPTENU,UNIT=VIO
//ISPPROF  DD LIKE=ISP.SISPTENU,UNIT=VIO
//ISPLOG   DD SYSOUT=*,RECFM=VA,LRECL=125
//SYSPRINT DD SYSOUT=*
//SYSTSPRT DD SYSOUT=*
//SYSUDUMP DD SYSOUT=*
//SYSDBOUT DD SYSOUT=*
//SYSHELP  DD DSN=SYS1.HELP,DISP=SHR
//SYSOUT   DD SYSOUT=*
//* Input list of tablenames
//TABLIST  DD DISP=SHR,DSN=<HLQ1>.DSN81210.TABLIST
//* Output pds
//ISPFILE  DD DISP=SHR,DSN=<HLQ1>.TEST.JOBGEN
//SYSTSIN  DD *
  ISPSTART CMD(ZSTEPS <MFUSER> <FTPUSER> <AWS TransferFamily IP>)
/*

Before you use the REXX script, make the following changes:

  1. Save the REXX script, with ZSTEPS as the member name, in a PDS library that is defined under the SYSEXEC stack in the JCL REXXEXEC that you edited in the previous step. If you want to rename the member, update the JCL accordingly.

  2. This script uses the trace option to print additional information in case there are errors. You can instead add error handling code after the EXECIO, ISPEXEC, and TSO statements, and remove the trace line.

  3. This script generates member names by using the LODnnnnn naming convention, which can support up to 100,000 members. If you have more than 100,000 tables, use a shorter prefix, and adjust the numbers in the tempjob statement.
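The LODnnnnn member naming in the REXX `tempjob` statement (`"LOD" !! RIGHT("0000" !! i, 5)`) pads the table index with zeros and keeps the rightmost five characters. A Python sketch of that logic, for illustration only:

```python
def member_name(i: int, prefix: str = "LOD") -> str:
    """Mirror the REXX expression "LOD" !! RIGHT("0000" !! i, 5):
    prepend zeros to the index, then keep the rightmost five characters."""
    return prefix + ("0000" + str(i))[-5:]

print(member_name(1))      # LOD00001
print(member_name(99999))  # LOD99999
```

With a shorter prefix you free up digits for the counter, which is why the note above suggests adjusting the `tempjob` statement when you have more than 100,000 tables.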

ZSTEPS REXX script

/*REXX - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
/* 10/27/2021 - added new parms to accommodate ftp                   */
Trace "o"
parse arg usrpfx ftpuser ftpsite
Say "Start"
Say "Ftpuser: " ftpuser "Ftpsite:" ftpsite
Say "Reading table name list"
"EXECIO * DISKR TABLIST (STEM LINE. FINIS"
DO I = 1 TO LINE.0
  Say I
  suffix = I
  Say LINE.i
  Parse var LINE.i schema table rest
  tabname = schema !! "." !! table
  Say tabname
  tempjob = "LOD" !! RIGHT("0000" !! i, 5)
  jobname = tempjob
  Say tempjob
  ADDRESS ISPEXEC "FTOPEN "
  ADDRESS ISPEXEC "FTINCL UNLDSKEL"
  /* member will be saved in ISPDSN library allocated in JCL */
  ADDRESS ISPEXEC "FTCLOSE NAME("tempjob")"
END
ADDRESS TSO "FREE F(TABLIST)"
ADDRESS TSO "FREE F(ISPFILE)"
exit 0
Mainframe developer
Task | Description | Skills required

Perform the Db2 Unload step.

After the JCL generation, you will have as many JCLs as there are tables to be unloaded.

This story uses a generated JCL example to explain the structure and the most important steps.

No action is required on your part. The following information is for reference only. If your intention is to submit the JCLs that you have generated in the previous step, skip to the Submit the LODnnnnn JCLs task.

When you unload Db2 data by using a JCL with the IBM-provided DSNUTILB Db2 utility, you must make sure that the unloaded data does not contain compressed numeric data. To accomplish this, use the DSNUTILB DELIMITED parameter.

The DELIMITED parameter supports unloading the data in CSV format by adding a character as the delimiter and double quotation marks around text fields, removing the padding in VARCHAR columns, and converting all numeric fields, including DATE fields, into external format.
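Because of this conversion, a file unloaded with DELIMITED parses with any standard CSV reader on the AWS side. A minimal Python sketch (the record content is illustrative, not actual unload output):

```python
import csv
import io

# One record as DSNUTILB DELIMITED writes it: character columns are
# double-quoted; numeric and date columns appear in external format.
record = '"000010","HAAS",52750.00,1965-01-01'

# Parse the record with Python's standard CSV reader.
row = next(csv.reader(io.StringIO(record)))
print(row)  # ['000010', 'HAAS', '52750.00', '1965-01-01']
```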

The following example shows what the unload step in the generated JCL looks like, using the comma character as a delimiter.

UNLOAD DELIMITED COLDEL ','
FROM TABLE SCHEMA_NAME.TBNAME
UNLDDN SYSREC01
PUNCHDDN SYSPUN01
SHRLEVEL CHANGE ISOLATION UR;
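The per-table control statement can also be rendered with a small helper, useful if you script the generation outside the mainframe. This is a hypothetical sketch, not part of the pattern's scripts:

```python
def unload_stmt(schema: str, table: str, coldel: str = ",") -> str:
    """Render the DSNUTILB UNLOAD control statement for one table,
    matching the skeleton used by the generated JCLs."""
    return (
        f"UNLOAD DELIMITED COLDEL '{coldel}'\n"
        f"FROM TABLE {schema}.{table}\n"
        "UNLDDN SYSREC01\n"
        "PUNCHDDN SYSPUN01\n"
        "SHRLEVEL CHANGE ISOLATION UR;"
    )

print(unload_stmt("SCHEMA_NAME", "TBNAME"))
```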
Mainframe developer, System engineer

Perform the SFTP step.

To use the SFTP protocol from a JCL, use the BPXBATCH utility. 

The SFTP utility can’t access the MVS datasets directly. You can use the copy command (cp) to copy the sequential file &USRPFX..DB2.UNLOAD.&JOBNAME to the USS directory, where it becomes &TABNAME..csv.

Run the sftp command using the private key (id_rsa) and using the RACF user ID as the user name to connect to the AWS Transfer Family IP address.

SH cp "//'&USRPFX..DB2.UNLOAD.&JOBNAME'" &TABNAME..csv;
echo "ascii " >> uplcmd;
echo "PUT &TABNAME..csv " >> uplcmd;
sftp -b uplcmd -i .ssh/id_rsa &FTPUSER.@&FTP_TF_SITE;
rm &TABNAME..csv;
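The step writes a small batch file (uplcmd) and feeds it to `sftp -b`: switch the transfer to ASCII, then PUT the CSV. A Python sketch of the batch-file content generation, for illustration only (`sftp_batch` is a hypothetical helper):

```python
def sftp_batch(csv_name: str) -> str:
    """Build the uplcmd batch-file content that the JCL step passes
    to `sftp -b`: switch to ASCII mode, then PUT the CSV file."""
    return f"ascii\nPUT {csv_name}\n"

print(sftp_batch("TBNAME.csv"), end="")
```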
Mainframe developer, System engineer

Submit the LODnnnnn JCLs.

The prior JCL generated one LODnnnnn JCL for each table that needs to be unloaded, transformed into CSV, and transferred to the S3 bucket.

Run the submit command on all the JCLs that have been generated.

Mainframe developer, System engineer

Related resources

For more information about the different tools and solutions used in this document, see the following:

Additional information

After you have your Db2 data on Amazon S3, you have many ways to develop new insights. Because Amazon S3 integrates with AWS data analytics services, you can freely consume or expose this data on the distributed side. For example, you can do the following:

  • Build a data lake on Amazon S3, and extract valuable insights by using query-in-place, analytics, and machine learning tools without moving the data.

  • Initiate a Lambda function by setting up a post-upload processing workflow that is integrated with AWS Transfer Family.

  • Develop new microservices that access the data in Amazon S3 or in fully managed databases by using AWS Glue, a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development.

In a migration use case, because you can transfer any data from the mainframe to S3, you can do the following:

  • Retire physical infrastructure, and create a cost-effective data archival strategy with Amazon S3 Glacier and S3 Glacier Deep Archive. 

  • Build scalable, durable, and secure backup and restore solutions with Amazon S3 and other AWS services, such as S3 Glacier and Amazon Elastic File System (Amazon EFS), to augment or replace existing on-premises capabilities.