EDUCAÇÃO E TECNOLOGIA

Loading TPC-DS files into SAP HANA Cloud from a cloud storage (Amazon S3)

Being the cloud-native product SAP HANA Cloud does not give you an access to the underlying OS shell and file system. Therefore it is not possible to load a file from the database server as you can do with on-prem SAP HANA. But through its broad data integration capabilities SAP HANA Cloud offers import and export data using popular cloud storage services.

In this post I show you steps done to import TPC-DS data files into my trial SAP HANA Cloud database’s tables in a regular schema and share some of the gotchas. My database instance is hosted in eu10 region of SAP Business Technology Platform (BTP), where infrastructure is provided by Amazon Web Services. So, I use Amazon Simple Storage Service (S3) as the cloud storage in this exercise.

At the moment of writing this post Object Store service is only available in SAP BTP productive accounts, and I plan to return to that service another time. So, here I will use S3 service directly from AWS. At this moment S3 is available in AWS account’s free tier for 12 months after an account creation.

Before we move on just a few words about the dataset and why.

What is TPC-DS?

The TPC Benchmark DS (TPC-DS) is “a decision support benchmark that models several generally applicable aspects of a decision support system“. While I am not going to use it for any benchmarking (which would not even make sense in the trial environment), I liked the fact that it provides the snowflake-modeled schema of a retail business’s decision system. It is much closer to the real customer case, then most of the single-file datasets out there.

Store%20Sales%20Snow%20Flake%20Schema

Store Sales Snow Flake Schema (source: The Making of TPC-DS)

Instead of providing a data set, it provides two generators:

What’s more the data generator allows you to create datasets of the different size, starting from 1GB of raw data. So, it is a volume that is nice to work with. I plan to play with it after this exercise too.

The flow of the exercise…

…in general:

  1. Generate TPC-DS data files using dsdgen program.
  2. Move files to S3 bucket using aws client.
  3. Import data from files into SAP HANA Cloud database using hdbsql client.

…and required tools

To accomplish my goal I am using a command line more than UIs to make this text compact and to script repetitive task. This is more efficient when working with more files and bigger data, like in this exercise.

While all required command-line tools can be installed and used from any OS shell, here I am going to use the Linux-based AWS CloudShell service — mostly for the performance reasons of moving generated files within the same region of the same cloud infrastructure.

AWS CloudShell is a free service, but with 1GB of persistent storage — the limitation I needed to take into the consideration when generating files.

At the moment CloudShell is not available in AWS Frankfurt region eu-central-1, so I had to open it in eu-west-1.

The code repository

All code files are available in the SAP Tech Bytes repository: https://github.com/SAP-samples/sap-tech-bytes/tree/2021-02-22-saphanacloud-db-import-tpcds-datafiles-from-s3.

You can create files following this post, you can copy their code from links provided in the post, or you can clone the code.

git clone https://github.com/SAP-samples/sap-tech-bytes.git \ --branch 2021-02-22-saphanacloud-db-import-tpcds-datafiles-from-s3 \ ~/Projects/saphc-hdb-import-tpcds

So, let’s go.

Create a file with project’s environment variables

Create a project folder ~/Projects/saphc-hdb-import-tpcds:

mkdir -p ~/Projects/saphc-hdb-import-tpcds
cd ~/Projects/saphc-hdb-import-tpcds

To make it consistent between different commands and scripts, and to make customizable for your needs let’s create a file ~/Projects/saphc-hdb-import-tpcds/env_tpcds.sh to store and configure environment variables.

myTemp_folder=~/tmp
myProjects_folder=~/Projects
tpcds_kit_folder=${myProjects_folder}/tpcds-kit
tpcds_kit_tools_folder=${tpcds_kit_folder}/tools
tpcds_scale_factor=1 datasets_root_folder=~/Datasets
prefixObj=sf$(printf "%04d" ${tpcds_scale_factor})/data aws_region=eu-central-1
s3bucket=tpcds4hdbc aws_ro_s3user=S3TPC-reader
aws_key_file_folder=${myTemp_folder}
aws_key_file=${aws_ro_s3user}-access-key.json HDB_USE_IDENT=AWSCloudShell
saphc_hdb_endpoint=8e1a286a-21d7-404d-8d7a-8c77d2a77050.hana.trial-eu10.hanacloud.ondemand.com:443

Modify the file to set your own:

  1. SAP HANA Cloud database endpoint in saphc_hdb_endpoint variable, and
  2. globally unique S3 bucket name in tpcds4hdbc.

Gotchas:

  • Only a content of the user’s home directory persisted between your AWS CloudShell sessions, so take this in considerations when setting folders structure
  • AWS CloudShell virtual machines are changing internal IP addresses and therefore host names. To prevent issues with SAP HANA Clients, the HDB_USE_IDENT environment variable is set.
  • I have experienced the AWS CloudShell resetting the user session once in a while and it is causing the work folder and environment variables to be reset. You will find me adding source ~/Projects/saphc-hdb-import-tpcds/env_tpcds.sh at the beginning of every block of OS commands to make sure they are executed properly.

Generate TPC-DS data files and store them in S3

I used TPC-DS generator from https://github.com/gregrahn/tpcds-kit, because it can be compiled on MacOS too (but that’s the different story).

In AWS CloudeShell run the following commands.

sudo yum install --assumeyes gcc make flex bison byacc git source ~/Projects/saphc-hdb-import-tpcds/env_tpcds.sh
git clone https://github.com/gregrahn/tpcds-kit.git ${tpcds_kit_folder}
cd ${tpcds_kit_folder}/tools
make OS=LINUX ls -l ds*

The make command will produce a lot of warnings, but it was Ok to ignore them and to move on.

Now you should have TPC-DS data generator dsdgen and query generator dsqgen available locally.

If executed without any parameters ./dsdgen will generate the smallest sample (1GB of size) with all data files in the current directory. The --scale option allows generating bigger samples, but even with 1GB we face the limitation of the CloudShell persistent storage.

The approach I took was to generate a data file for each table listed in tpcds.sql and to move (not copy) it to the S3 bucket defined as s3bucket variable in the file env_tpcds.sh. Remember to choose your own globally unique name for your S3 bucket!

I want the bucket to be close to my SAP HANA Cloud db instance, so I used a flag --region ${aws_region}, where I set this variable to eu-central-1 in env_tpcds.sh.

cd ~/Projects/saphc-hdb-import-tpcds/
source ~/Projects/saphc-hdb-import-tpcds/env_tpcds.sh aws s3 mb s3://${s3bucket} --region ${aws_region}
aws s3 ls ${s3bucket} --summarize

To script the creation and movement of data files to S3 bucket I created an executable Bash script ~/Projects/saphc-hdb-import-tpcds/gen_tpcds_s3_init.sh with the following code…

#!/bin/bash
set -e
source env_tpcds.sh
log_file=${myTemp_folder}/${0}__$(date +%Y%m%e%H%M).log if [ ! -f ${tpcds_kit_tools_folder}/dsdgen ]; then echo "You should have dsdgen installed and the environment configured in env_tpcds.sh!" exit 1
fi datasets_tpcds_files_folder=${datasets_root_folder}/tpcds-data/${prefixObj}/init
echo ${datasets_tpcds_files_folder}
mkdir -p ${datasets_tpcds_files_folder} mkdir -p ${myTemp_folder}
touch ${log_file} tables=$(grep table ${tpcds_kit_tools_folder}/tpcds.sql | grep --invert-match returns | cut -d' ' -f3)
for t in ${tables}
do echo "Working on the table: "$t | tee -a ${log_file} ${tpcds_kit_tools_folder}/dsdgen -QUIET Y -FORCE \ -TABLE $t -SCALE ${tpcds_scale_factor} \ -DIR ${datasets_tpcds_files_folder} -DISTRIBUTIONS ${tpcds_kit_tools_folder}/tpcds.idx for f in ${datasets_tpcds_files_folder}/*.dat; do tail -3 $f >> ${log_file} echo -e "$(wc -l $f) \n" >> ${log_file} mv -- "$f" "${f%.dat}.csv" done aws s3 mv --recursive ${datasets_root_folder}/tpcds-data s3://${s3bucket}
done

…and execute it.

Let’s check generated files and their total size in the bucket.

source ~/Projects/saphc-hdb-import-tpcds/env_tpcds.sh
aws s3 ls s3://${s3bucket} --recursive --human-readable --summarize

There are 25 files generated — one for each target database table.

Create a read-only user for the S3 bucket

SAP HANA Cloud will use S3’s REST API to get the data from the bucket. It will require an AWS user with an access key and required permissions.

It is a good idea to use an AWS user with a minimal permissions to read objects from the bucket.

Let’s create a template file S3TPC-reader-inline-policy-template.json following the example from AWS documentation, but allowing only read access to the bucket…

{ "Version": "2012-10-17", "Statement": [ { "Sid": "ListObjectsInBucket", "Effect": "Allow", "Action": "s3:ListBucket", "Resource": "arn:aws:s3:::${s3bucket}" }, { "Sid": "AllowObjectRead", "Effect": "Allow", "Action": "s3:GetObject", "Resource": "arn:aws:s3:::${s3bucket}/*" } ]
}

…and a script create_ro_s3user.sh to create a AWS user (S3TPC-reader) and its programatic access keys for an authentication:

#!/bin/bash
source env_tpcds.sh mkdir -p ${myTemp_folder}
inline_policy_template=$(cat S3TPC-reader-inline-policy-template.json)
echo ${inline_policy_template//'${s3bucket}'/${s3bucket}} > ${myTemp_folder}/${aws_ro_s3user}-inline-policy.json aws iam create-user --user-name ${aws_ro_s3user}
aws iam put-user-policy --user-name ${aws_ro_s3user} \ --policy-name inline_S3ReadTPCDSbucket \ --policy-document file://${myTemp_folder}/${aws_ro_s3user}-inline-policy.json
aws iam create-access-key --user-name ${aws_ro_s3user} > ${myTemp_folder}/${aws_ro_s3user}-access-key.json
cat ${myTemp_folder}/${aws_ro_s3user}-access-key.json

A copy of an access key is written to a file in the temporary directory and will be accessed during the execution of SQL statements to provide user’s key+secret values.

Import the data from S3 bucket into SAP HANA Cloud tables

Following steps could be done from SAP HANA Database Explorer or any other client. I will use command-line hdbsql within AWS CloudShell.

For this activity we need to do following steps:

  1. Install SAP HANA Clients
  2. Create a database schema and tables to store the data
  3. Set host certificates required by SAP HANA
  4. Import data into tables

Install SAP HANA Clients

wget --no-cookies \
--header "Cookie: eula_3_1_agreed=tools.hana.ondemand.com/developer-license-3_1.txt" \ "https://tools.hana.ondemand.com/additional/hanaclient-latest-linux-x64.tar.gz" \
-P /tmp tar -xvf /tmp/hanaclient-latest-linux-x64.tar.gz -C /tmp/ source ~/Projects/saphc-hdb-import-tpcds/env_tpcds.sh /tmp/client/hdbinst --batch --hostname=${HDB_USE_IDENT}

SAP HANA Clients should be installed in the default directory ~/sap/hdbclient thanks to the flag --batch.

Add your database user to the SAP HANA Clients user store

I will use a flag -i to interactively provide the password of my DBAdmin database user.

source ~/Projects/saphc-hdb-import-tpcds/env_tpcds.sh ~/sap/hdbclient/hdbuserstore -i -H ${HDB_USE_IDENT} \ SET HANACloudTrial_DBAdmin ${saphc_hdb_endpoint} DBAdmin

Let’s check if we can connect to the database with this user key.

~/sap/hdbclient/hdbsql -U HANACloudTrial_DBAdmin \ "SELECT Current_User FROM DUMMY"

As mentioned above, the use of -H ${HDB_USE_IDENT} in hdbuserstore will prevent loosing database user keys between AWS CloudShell sessions, as the internal host name is changing with each session.

Create a schema and tables

TPC-DS Kit provides a file tpcds-kit/tools/tpcds.sql with an ANSI-compliant SQL statements to create all required tables. I’ll create a dedicated schema TPCDS for them.

Logon to the database using hdbsql and a created db user key.

~/sap/hdbclient/hdbsql -A -U HANACloudTrial_DBAdmin

…and execute following statements.

CREATE SCHEMA TPCDS;
SET SCHEMA TPCDS;
\input "/home/cloudshell-user/Projects/tpcds-kit/tools/tpcds.sql"
SELECT TABLE_NAME FROM TABLES WHERE SCHEMA_NAME='TPCDS';

Setup certificates for S3 remote source in SAP HANA Cloud

The setup of trust follows defined steps described in https://help.sap.com/viewer/c82f8d6a84c147f8b78bf6416dae7290/2020_04_QRC/en-US/342122ab55684a20a2e076e5bee4109c.html.

While still in an hdbsql session connected to the database, check if a certificate collection (also called PSE, or Personal Security Environment) with REMOTE SOURCE purpose already exists.

\al off
SELECT * FROM PSES WHERE PURPOSE='REMOTE SOURCE';

If not, then let’s create it with the name TLS_SOURCES. But if it already exists, then skip the step of PSE creation and replace PSE name TLS_SOURCES in the SQL statements below with the one from your database instance.

CREATE PSE TLS_SOURCES;
SET PSE TLS_SOURCES PURPOSE REMOTE SOURCE;

Gotchas:

Exit the hdbsql session to return to the OS shell.

\quit

Now we need to create an entry for a public certificate of a virtual-hosted–style URI of the S3 bucket, which is https://tpcds4hdbc.s3.eu-central-1.amazonaws.com in my case.

To do this programatically install openssl with yum in AWS CloudShell.

sudo yum install --assumeyes openssl

…and use it to retrieve the certificate for SSL access to HTTPS port 443.

source ~/Projects/saphc-hdb-import-tpcds/env_tpcds.sh openssl x509 -in <(openssl s_client -connect ${s3bucket}.s3.${aws_region}.amazonaws.com:443 -prexit 2>/dev/null)

~/sap/hdbclient/hdbsql -U HANACloudTrial_DBAdmin \ "CREATE CERTIFICATE FROM '$(openssl x509 -in <(openssl s_client -connect ${s3bucket}.s3.${aws_region}.amazonaws.com:443 -prexit 2>/dev/null))' COMMENT 'S3-${aws_region}';"
~/sap/hdbclient/hdbsql -A -U HANACloudTrial_DBAdmin \ "SELECT CERTIFICATE_ID, SUBJECT_COMMON_NAME, COMMENT FROM CERTIFICATES WHERE COMMENT='S3-${aws_region}';"

Now that certificate is created in SAP HANA Cloud database, let’s add it to the certificate collection TLS_SOURCES (or your existing certificate collection with the REMOTE SOURCE purpose).

source ~/Projects/saphc-hdb-import-tpcds/env_tpcds.sh
export s3CertID=$(~/sap/hdbclient/hdbsql -U HANACloudTrial_DBAdmin -xa "SELECT CERTIFICATE_ID FROM CERTIFICATES WHERE COMMENT='S3-${aws_region}';")
~/sap/hdbclient/hdbsql -U HANACloudTrial_DBAdmin "ALTER PSE TLS_SOURCES ADD CERTIFICATE ${s3CertID};"
~/sap/hdbclient/hdbsql -U HANACloudTrial_DBAdmin -A "SELECT PSE_NAME, CERTIFICATE_ID, SUBJECT_NAME FROM PSE_CERTIFICATES WHERE PSE_NAME='TLS_SOURCES';"

While I promised we are focusing on working with the command line, it is good to know that results the same can be achieved from SAP HANA Cloud Cockpit.

Gotchas:

  • It is important to note that the certificate is expiring on September 1st of this year (i.e. 2021). Certificates should be periodically checked and updated in real systems to keep applications running!

Import data from files in S3 bucket into SAP HANA Cloud tables

After all the setup and configuration are done it is time to import data into SAP HANA Cloud tables.

To script the import of 25 data files into corresponding database tables I created an executable Bash script ~/Projects/saphc-hdb-import-tpcds/imp_s3_hdbc_init.sh with the following code…

#!/bin/bash
set -e source env_tpcds.sh
log_file=${myTemp_folder}/${0}__$(date +%Y%m%e%H%M).log if [ ! -f ~/sap/hdbclient/hdbsql ]; then echo "You should have ~/sap/hdbclient/hdbsql installed!" exit 1
fi keyid=$(jq -r ".AccessKey.AccessKeyId" ${aws_key_file_folder}/${aws_key_file})
keysecret=$(jq -r ".AccessKey.SecretAccessKey" ${aws_key_file_folder}/${aws_key_file}) tables=$(grep table ${tpcds_kit_tools_folder}/tpcds.sql | cut -d' ' -f3)
for t in ${tables}
do echo -n "Inserting into $t" query="IMPORT FROM CSV FILE 's3-${aws_region}://${keyid}:${keysecret}@${s3bucket}/${prefixObj}/init/${t}.csv' INTO TPCDS.${t} WITH FIELD DELIMITED BY '|' THREADS 2 FAIL ON INVALID DATA;" # echo ${query//${keysecret}/'***'} ~/sap/hdbclient/hdbsql -U HANACloudTrial_DBAdmin -f "TRUNCATE TABLE TPCDS.${t}" >> ${log_file} ~/sap/hdbclient/hdbsql -U HANACloudTrial_DBAdmin -f "${query}" >> ${log_file} echo ": " $(~/sap/hdbclient/hdbsql -U HANACloudTrial_DBAdmin -xa "SELECT COUNT(1) FROM TPCDS.${t}") " records"
done

…and execute.

Bingo! The data is loaded and we can run some sample queries, like those generated by TPC-DS…

SET SCHEMA TPCDS;
SELECT TOP 10 i_brand_id AS brand_id, i_brand AS brand, i_manufact_id, i_manufact, d_year, sum(ss_ext_sales_price) AS ext_price
FROM date_dim, store_sales, item, customer, customer_address, store
WHERE d_date_sk = ss_sold_date_sk AND ss_item_sk = i_item_sk AND i_manager_id = 8 AND d_moy = 11 AND ss_customer_sk = c_customer_sk AND c_current_addr_sk = ca_address_sk AND substr( ca_zip, 1, 5 ) <> substr( s_zip, 1, 5 ) AND ss_store_sk = s_store_sk
GROUP BY i_brand, i_brand_id, i_manufact_id, i_manufact, d_year
ORDER BY ext_price DESC, d_year DESC, i_brand ASC, i_brand_id ASC, i_manufact_id ASC, i_manufact ASC;

…but that’s a kind of the exercise for another time!


Till then,
-Vitaliy aka @Sygyzmundovych