Move data FAST from an SAP HANA Cloud database to a HANA Data Lake

If you’ve ever been stuck wondering how you can move data from your SAP HANA Cloud database to your SAP HANA Data Lake with minimal effort, this is for you. In fact, it might be the fastest way to move your data depending on your data lake’s configuration. Plus, it’s simple enough that a single Python function can do all the work for you!

This blog will outline how you can leverage remote servers in the data lake to make a connection to your HANA database and pull the data from HANA to data lake. I also experiment with different data lake configurations to see what parameters affect the speed of this data transfer the most, that way you know how to scale your data lake instance to achieve the best performance.

If you want to compare what I find in this blog to loading data from files, Jason Hinsperger has written an excellent blog here.

The Method

The method consists of 5 steps.

  1. Create a remote server from the HANA data lake to the HANA database, and a user with an external login to use it. This is most easily done through a SQL console that is directly connected to your data lake.
    CREATE SERVER MY_HDB CLASS 'HANAODBC' USING 'Driver=libodbcHDB.so;
    ConnectTimeout=0;
    CommunicationTimeout=15000000;
    RECONNECT=0;
    ServerNode=<HANA_DB_HOST>:<PORT>;
    ENCRYPT=TRUE;
    ssltruststore=<HANA_DB_HOST>;
    ssltrustcert=Yes;';

    CREATE USER <USERNAME> IDENTIFIED BY <PASSWORD> FORCE PASSWORD CHANGE OFF;

    CREATE EXTERNLOGIN <USERNAME> TO MY_HDB REMOTE LOGIN <REMOTE_USERNAME> IDENTIFIED BY <REMOTE_PASSWORD>;

    The rest of the steps will require creating a few objects related to the tables we want to move. For my example, I will be using a LINEITEM table. However, this can be done with any table of your choosing, just keep in mind that data type conversions between HANA and data lake IQ are not 1-to-1.

  2. Create an EXISTING LOCAL TEMPORARY table (effectively a virtual table) that points to the table in HANA that you want to move. For my LINEITEM table it looks like this.
    CREATE EXISTING LOCAL TEMPORARY TABLE <TEMP_TABLE>(
    L_ORDERKEY INTEGER,
    L_PARTKEY INTEGER,
    L_SUPPKEY INTEGER,
    L_LINENUMBER INTEGER,
    L_QUANTITY DECIMAL(15,2),
    L_EXTENDEDPRICE DECIMAL(15,2),
    L_DISCOUNT DECIMAL(15,2),
    L_TAX DECIMAL(15,2),
    L_RETURNFLAG VARCHAR(2),
    L_LINESTATUS VARCHAR(2),
    L_SHIPDATE DATE,
    L_COMMITDATE DATE,
    L_RECEIPTDATE DATE,
    L_SHIPINSTRUCT VARCHAR(25),
    L_SHIPMODE VARCHAR(10),
    L_COMMENT VARCHAR(44)) AT 'MY_HDB..<SCHEMA>.<TABLE>';
  3. Then we create the destination table in IQ where the data will persist.
    CREATE TABLE <DESTINATION_TABLE>(
    L_ORDERKEY INTEGER,
    L_PARTKEY INTEGER,
    L_SUPPKEY INTEGER,
    L_LINENUMBER INTEGER,
    L_QUANTITY DECIMAL(15,2),
    L_EXTENDEDPRICE DECIMAL(15,2),
    L_DISCOUNT DECIMAL(15,2),
    L_TAX DECIMAL(15,2),
    L_RETURNFLAG VARCHAR(2),
    L_LINESTATUS VARCHAR(2),
    L_SHIPDATE DATE,
    L_COMMITDATE DATE,
    L_RECEIPTDATE DATE,
    L_SHIPINSTRUCT VARCHAR(25),
    L_SHIPMODE VARCHAR(10),
    L_COMMENT VARCHAR(44));
  4. Then we run an INSERT…SELECT command to pull the data from HANA into data lake IQ.
    INSERT INTO <DESTINATION_TABLE> SELECT * FROM <TEMP_TABLE>;
  5. Finally, some cleanup: drop the temporary table we created in step 2. This step is optional, since local temporary tables are dropped automatically when the connection closes, but it is included for completeness.
    DROP TABLE <TEMP_TABLE>;

And boom! We’ve now moved our data from HANA to data lake IQ, and it turns out this is pretty fast too, depending on your data lake configuration. Note: if you want the SAP HANA container user that owns the container for the managed data lake to have access to this new table, you will need to grant SYSRDL#CG SELECT permissions on it.
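As a sketch, the grant might look like this (reusing the <DESTINATION_TABLE> placeholder from step 3; adjust for your schema):

```sql
-- Allow the managed data lake container user SYSRDL#CG to read the new table
GRANT SELECT ON <DESTINATION_TABLE> TO SYSRDL#CG;
```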

The Experiment

I wanted to know which data lake parameters would have the greatest impact on the performance of this data move, so I ran an (admittedly imperfect) experiment varying disk size, coordinator node vCPUs, and worker node vCPUs. Here are my results.

I used a HANA instance with 60GB of memory, 200GB of storage, and 4vCPUs for this experiment. As a benchmark, I started with a managed data lake with 16TB of storage, a 16vCPU worker node, and a 16vCPU coordinator node. I wanted to move my LINEITEM table of approximately 60 million rows (2GB) from HANA to the data lake. I took 3 samples for each configuration I tested and used the average completion time to make comparisons. There was some variability within each configuration, but I did not test for statistical significance.

Looking at the results, altering the number of worker nodes appeared to have no effect on the time it took to move the data. Decreasing the disk size degraded performance the most, and reducing the coordinator nodes had some impact. From these results I predicted that increasing the disk size from 16TB to 32TB should improve performance. Unfortunately, the average speed was the same as on a 16TB system.

So, it appears simply scaling disk size has an upper limit on how much faster it can load data. Now, what if I kept 32TB and had 32vCPU coordinators? Well, I got no performance increase compared to a 16TB and 16vCPU system. Interestingly, it appears at a certain point the performance gain of increasing your data lake’s size and number of vCPUs diminishes.

Then came a realization, prompted by a coworker: worker nodes are responsible for the “heavy lifting” when processing queries in the data lake. To test whether worker nodes really did influence performance, I drastically reduced them to 2vCPUs. In this case, the data move operation took much longer!

The Script

So, now we have some idea of which data lake configurations influence data movement performance, but I don’t like typing out all that SQL to move my data. Thankfully, with the HANA system view SYS.TABLE_COLUMNS we can look up any table’s schema and write a script that executes all that SQL for us!

As mentioned earlier, the data type conversions from HANA to data lake IQ are not 1-to-1. This means our script needs some logic to determine which data types to use on the data lake side when moving data. Luckily, my LINEITEM table has simple and mostly supported data types, and I was able to piece a script together in Python.
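To make the conversion logic concrete, here is a minimal sketch of that mapping as a standalone helper, covering only the cases used in this blog (CHAR-like types, DECIMAL, and pass-through types); the function name `to_iq_type` is my own, and the mapping is not complete for all HANA types:

```python
def to_iq_type(data_type_name, length, scale):
    """Sketch of a HANA -> data lake IQ type mapping. NOT complete;
    verify each type against the IQ documentation before relying on it."""
    if 'CHAR' in data_type_name:        # NVARCHAR, VARCHAR, CHAR, NCHAR
        return f'VARCHAR({length})'
    if data_type_name == 'DECIMAL':     # keep precision and scale
        return f'DECIMAL({length},{scale})'
    # INTEGER, DATE, etc. carry over unchanged in this simple sketch
    return data_type_name
```

Anything not matched by the first two branches is passed through unchanged, which works for my LINEITEM columns but would need extending for types like NCLOB or ST_GEOMETRY.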

Note: my script is not complete for all use-cases, but it’s a start to something a hard-working individual can make into a useful tool. So how does it work?

First, we define a function that takes in a schema name, table name, and remote server name.

def move_to_hdl(schema_name, table_name, remote_server='MY_HDB'):

Then, we use the hdbcli Python module to make a connection to our HANA database.

from hdbcli import dbapi

conn = dbapi.connect(
    address='<HANA_SQL_EP>',
    port=443,
    user='<USERNAME>',
    password='<PASSWORD>'
)
cursor = conn.cursor()

We will also need a pyodbc connection to our data lake instance. This will require that the data lake client is installed on your local system. Check out this blog if you’re not sure how to install the client.

import pyodbc

host = '<DATALAKE_SQL_EP>'
CONN_STR = f'DRIVER=Sybase IQ;UID=<USERNAME>;PWD=<PASSWORD>;host={host};ENC=TLS(tls_type=rsa;direct=yes)'
cnxn = pyodbc.connect(CONN_STR)
hdl_cursor = cnxn.cursor()

With the HANA cursor, we can now query the SYS.TABLE_COLUMNS system view to get the schema of our desired table and parse the results into a CREATE TABLE statement. Note: this is where my script is imperfect; it only handles the data type conversions needed for my use-case.

# Note: schema and table names are interpolated directly into the SQL,
# so only pass trusted input to this query.
cursor.execute(f"""
    SELECT COLUMN_NAME, POSITION, DATA_TYPE_NAME, LENGTH, SCALE
    FROM SYS.TABLE_COLUMNS
    WHERE SCHEMA_NAME = '{schema_name}' AND TABLE_NAME = '{table_name}'
    ORDER BY POSITION ASC;
""")
create_table_str = '('
for row in cursor:
    if 'CHAR' in row[2]:
        create_table_str += row[0] + f' VARCHAR({row[3]}),'
    elif row[2] == 'DECIMAL':
        create_table_str += row[0] + f' {row[2]}({row[3]},{row[4]}),'
    else:
        create_table_str += row[0] + ' ' + row[2] + ','
create_table_str = create_table_str[:-1] + ')'  # swap trailing comma for ')'

Lastly, we build the SQL that consolidates steps 2-5 from The Method section, execute it, and close our cursor connections.

temp_table = f'drop_me_{table_name}'
dest_table = f'hdl_{table_name}'
create_temp_table = (f"CREATE EXISTING TABLE {temp_table} {create_table_str} "
                     f"AT '{remote_server}..{schema_name}.{table_name}';")
create_dest_table = f'CREATE TABLE {dest_table} {create_table_str};'
move_table = f'INSERT INTO {dest_table} SELECT * FROM {temp_table};'
drop_temp_table = f'DROP TABLE {temp_table};'
hdl_cursor.execute(create_temp_table + create_dest_table + move_table
                   + drop_temp_table + 'commit;')
hdl_cursor.close()
cursor.close()
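One caveat: whether a single execute call can run several semicolon-separated statements like this depends on the ODBC driver. If your driver rejects the batch, a simple fallback (sketched here with a hypothetical `run_statements` helper of my own) is to execute the statements one at a time:

```python
def run_statements(cursor, statements):
    """Hypothetical fallback: execute a list of SQL statements one at a
    time, for drivers that reject multiple statements per execute() call.
    Returns the statements that were actually executed."""
    executed = []
    for stmt in statements:
        stmt = stmt.strip()
        if stmt:                 # skip empty fragments
            cursor.execute(stmt)
            executed.append(stmt)
    return executed
```

You would then call it with the four statements plus the commit, instead of concatenating them into one string.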

All together the function will look like this:

from hdbcli import dbapi
import pyodbc

def move_to_hdl(schema_name, table_name, remote_server='MY_HDB'):
    # Connect to the HANA database
    conn = dbapi.connect(
        address='<HANA_SQL_EP>',
        port=443,
        user='<USERNAME>',
        password='<PASSWORD>'
    )
    cursor = conn.cursor()

    # Connect to the data lake instance
    host = '<DATALAKE_SQL_EP>'
    CONN_STR = f'DRIVER=Sybase IQ;UID=<USERNAME>;PWD=<PASSWORD>;host={host};ENC=TLS(tls_type=rsa;direct=yes)'
    cnxn = pyodbc.connect(CONN_STR)
    hdl_cursor = cnxn.cursor()

    # Read the table's schema from HANA and build the column definitions
    cursor.execute(f"""
        SELECT COLUMN_NAME, POSITION, DATA_TYPE_NAME, LENGTH, SCALE
        FROM SYS.TABLE_COLUMNS
        WHERE SCHEMA_NAME = '{schema_name}' AND TABLE_NAME = '{table_name}'
        ORDER BY POSITION ASC;
    """)
    create_table_str = '('
    for row in cursor:
        if 'CHAR' in row[2]:
            create_table_str += row[0] + f' VARCHAR({row[3]}),'
        elif row[2] == 'DECIMAL':
            create_table_str += row[0] + f' {row[2]}({row[3]},{row[4]}),'
        else:
            create_table_str += row[0] + ' ' + row[2] + ','
    create_table_str = create_table_str[:-1] + ')'

    # Steps 2-5: create the virtual table and the destination table,
    # move the data, then drop the virtual table
    temp_table = f'drop_me_{table_name}'
    dest_table = f'hdl_{table_name}'
    create_temp_table = (f"CREATE EXISTING TABLE {temp_table} {create_table_str} "
                         f"AT '{remote_server}..{schema_name}.{table_name}';")
    create_dest_table = f'CREATE TABLE {dest_table} {create_table_str};'
    move_table = f'INSERT INTO {dest_table} SELECT * FROM {temp_table};'
    drop_temp_table = f'DROP TABLE {temp_table};'
    hdl_cursor.execute(create_temp_table + create_dest_table + move_table
                       + drop_temp_table + 'commit;')
    hdl_cursor.close()
    cursor.close()

And voila! Type in your schema name, table name, and remote server, then let the function take care of the rest. I was able to move my LINEITEM and ORDERS tables with this function. How great is that?

Summary

We can move data efficiently by pulling it from HANA into our data lake. Even better, the syntax is simple enough that it can be scripted and generalized for any table. We learned that scaling the storage size of our data lake had the biggest impact on performance, though with an upper bound on how fast the data can be moved. Similarly, the coordinator nodes had some minor impact on performance, subject to the same upper bound. We also noticed that worker nodes did not appear to affect data movement performance unless they were drastically scaled down to 2vCPUs.

I hope this helped provide you with a relatively simple and quick way to move data between your HANA database and data lake. Let me know what you think in the comments down below and if this helped you give it a like! Questions can be asked in the comments section or on the community questions page.