Databricks triggering calculations / ML in SAP Data Warehouse Cloud

If Databricks is part of your landscape, you might enjoy the option to use your familiar Databricks front ends to trigger data processing in your SAP Data Warehouse Cloud / SAP HANA Cloud systems.

Connect from Databricks to the data in SAP Data Warehouse Cloud / SAP HANA Cloud and carry out data explorations, preparations, calculations, and Machine Learning. The data remains in place, thereby avoiding unnecessary data movement and data duplication. Save your outcomes as semantic views or physical tables. And if needed, you can still extract the data into Databricks for further processing. However, you might not need to move all granular records, just the aggregated / prepared / filtered / predicted data that you require. Databricks can then further process the data and write the outcomes back to SAP Data Warehouse Cloud / SAP HANA Cloud.

If you would prefer to stay within the SAP Business Technology Platform instead, you have the familiar choices to deploy your Python / R code with SAP Data Intelligence, Cloud Foundry, or Kyma. Either way, SAP’s Python package hana_ml and R package hana.ml.r make it easy to trigger such an advanced analysis from any Python or R environment on the data held in SAP Data Warehouse Cloud / SAP HANA Cloud.

If parts of this use case seem familiar, you have probably already seen a related article, which explains a similar scenario with Azure Machine Learning as the front end.

Let’s go through two examples.

  • First we will use Python code in a notebook within Databricks to explore the data in SAP Data Warehouse Cloud, aggregate it and create a time-series forecast on the aggregates. All triggered from Databricks, executed in SAP Data Warehouse Cloud.
  • The second example uses R code in a notebook within Databricks to calculate a correlation matrix in SAP Data Warehouse Cloud.

In this scenario we are assuming that you are working with SAP Data Warehouse Cloud. However, an implementation with SAP HANA Cloud would be very similar:

  • You need access to a SAP Data Warehouse Cloud system with three virtual CPUs (free trial is not sufficient)
  • The script server must be activated, see the documentation
  • Create a Database User in your SAP Data Warehouse Cloud space, with the options “Enable Automated Predictive Library and Predictive Analysis Library”, “Enable Read Access” and “Enable Write Access” selected, see the documentation
  • To trigger the Machine Learning in SAP Data Warehouse Cloud, you need that user’s password and the system’s host name, see this tutorial

The Databricks environment used in this blog is running on Azure, but the implementation should be very similar, if not identical, when Databricks runs on a different platform. The system was created according to this getting started guidance. A few notable settings:

  • Azure Databricks workspace created with pricing tier “Trial”.
  • Cluster mode: Standard
  • Databricks Runtime Version: 10.4 LTS ML (includes Apache Spark 3.2.1, Scala 2.12) – without GPU

The Python example also works well with the Databricks Community Edition. The Community Edition doesn’t seem to be sufficient for the R example, though: any attempt to run a global init script there resulted in a “Global init script failure“.

With SAP Data Warehouse Cloud and Databricks up and running, install SAP’s hana_ml library on your Databricks cluster. This library allows Databricks to trigger calculations in SAP Data Warehouse Cloud from Python.

Open the configuration of the Databricks compute cluster and open the “Library” tab. Install these two libraries with the Library Source set to “PyPI”.

  • hana-ml==2.13.22072200
  • shapely

shapely is a dependency of the hana_ml library that we don’t actually use here. The code in this blog would still work without it, but you would receive a warning that shapely is missing. So shapely is installed only to avoid that warning.

Then create a Notebook with default language Python.

Verify that the notebook can use the hana_ml library that was installed earlier. You should see the version of hana_ml displayed.

import hana_ml
print(hana_ml.__version__)

With the hana_ml library installed, you can establish a connection to SAP Data Warehouse Cloud. As a quick test, enter your database user’s details directly (see the prerequisites above). This code should return True, confirming that the connection was created. Should this fail, you might still need to add the IP address of your cluster to the “IP Allowlist” in SAP Data Warehouse Cloud. To keep the logon credentials secure, you can leverage Databricks’ Secret Management.

import hana_ml.dataframe as dataframe
conn = dataframe.ConnectionContext(address='[YOURDBUSERSHOSTNAME]', port=443, user='[YOURDBUSER]', password='[YOURDBUSERSPASSWORD]')
conn.connection.isconnected()
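As a sketch of the Secret Management approach: the helper below collects the ConnectionContext arguments from a secret lookup function such as dbutils.secrets.get. The secret scope and key names used here (sap-dwc, host, user, password) are hypothetical and must match whatever you configured with the Databricks CLI.

```python
# Sketch only: read the logon credentials from a Databricks secret scope
# instead of hard-coding them. Scope and key names are assumptions.
def get_connection_kwargs(get_secret, scope='sap-dwc'):
    """Build the keyword arguments for hana_ml's ConnectionContext
    from a secret lookup function such as dbutils.secrets.get."""
    return dict(
        address=get_secret(scope, 'host'),
        port=443,
        user=get_secret(scope, 'user'),
        password=get_secret(scope, 'password'),
    )

# On a Databricks cluster you would then call, for example:
# conn = dataframe.ConnectionContext(**get_connection_kwargs(dbutils.secrets.get))
```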

In your own projects you might be working with data that is already in SAP Data Warehouse Cloud. To follow this example, load some data into the system for us to work with. Load the Power consumption of Tetouan city Data Set from the UCI Machine Learning Repository into a pandas DataFrame. (Salam, A., & El Hibaoui, A. (2018, December). “Comparison of Machine Learning Algorithms for the Power Consumption Prediction: Case Study of Tetouan City”. In 2018 6th International Renewable and Sustainable Energy Conference (IRSEC) (pp. 1-5). IEEE)

The data shows the power consumption of the Moroccan city in 10-minute intervals for the year 2017. We will use the data to create a time-series forecast.

import pandas as pd
df_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/00616/Tetuan%20City%20power%20consumption.csv')
df_data.columns = map(str.upper, df_data.columns)
df_data.columns = df_data.columns.str.replace(' ', '_')
df_data.DATETIME = pd.to_datetime(df_data.DATETIME)

Use the hana_ml library and the connection that you have already established, to load this data into SAP Data Warehouse Cloud. The table and the appropriate columns will be created automatically.

import hana_ml.dataframe as dataframe
df_remote = dataframe.create_dataframe_from_pandas(connection_context=conn, pandas_df=df_data, table_name='POWERCONSUMPTION', force=True, replace=False)

Create a hana_ml DataFrame, which points to the table in SAP Data Warehouse Cloud. This object references data that remains in SAP Data Warehouse Cloud, and we will be using it extensively. Its collect() method returns the data as a pandas DataFrame. To have a peek at the data, just select 5 rows.

df_remote = conn.table('POWERCONSUMPTION')
df_remote.head(5).collect()

You can check on the column types.

df_remote.dtypes()

Have SAP Data Warehouse Cloud calculate some statistics about the data content.

df_remote.describe().collect()

The above distribution statistics were all calculated in SAP Data Warehouse Cloud; only the results were transferred to your notebook. You can see the SELECT statement that was created by the describe() method.

print(df_remote.describe().select_statement)

You can also request further calculations on the data, or use ready-made exploratory plots. Have SAP Data Warehouse Cloud calculate a correlation matrix on the granular data. As you would expect, there is for instance a fair positive correlation between temperature and power consumption.

import matplotlib.pyplot as plt
from hana_ml.visualizers.eda import EDAVisualizer
f = plt.figure()
ax1 = f.add_subplot(111) # 111 refers to 1x1 grid, 1st subplot
eda = EDAVisualizer(ax1)
ax, corr_data = eda.correlation_plot(data=df_remote, cmap='coolwarm')

Now plot part of the data’s time series. Download the records of the last 7 days and plot them locally. The expression 6 * 24 * 7 specifies how many of the most recent rows we are interested in: the data is in 10-minute intervals, so 6 rows make one hour, times 24 for one day, times 7 for one week. The power consumption clearly has a strong daily pattern.

df_data = df_remote.tail(n=6*24*7, ref_col='DATETIME').collect()
df_data.drop('DATETIME', axis=1).plot(subplots=True, figsize=(10, 14));
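That row arithmetic can be sanity-checked on its own:

```python
# 10-minute intervals: 6 rows per hour, 144 per day, 1008 per week
rows_per_hour = 6
rows_per_day = rows_per_hour * 24    # 144
rows_per_week = rows_per_day * 7     # 1008
print(rows_per_week)  # 1008
```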

For further analysis, add DATE and MONTH columns. These columns are added logically to the hana_ml DataFrame; the original table remains unchanged, and no data gets duplicated.

df_remote = conn.table('POWERCONSUMPTION')
df_remote = df_remote.select('*', ('TO_DATE(DATETIME)', 'DATE'))
df_remote = df_remote.select('*', ('LEFT(DATE, 7)', 'MONTH'))
df_remote.head(5).collect()
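The MONTH column relies on the dates being rendered in ISO format, where the first 7 characters are year and month; the SQL expression LEFT(DATE, 7) then corresponds to a simple string slice, as this plain-Python illustration shows:

```python
# Equivalent of SQL LEFT(DATE, 7) on an ISO-formatted date string
date_str = '2017-08-15'
month = date_str[:7]
print(month)  # 2017-08
```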

With the MONTH column, we can now create a boxplot that shows how consumption changes through the year. August has the highest median consumption.

import matplotlib.pyplot as plt
from hana_ml.visualizers.eda import EDAVisualizer
f = plt.figure()
ax1 = f.add_subplot(111) # 111 refers to 1x1 grid, 1st subplot
eda = EDAVisualizer(ax1)
ax, cont = eda.box_plot(data = df_remote, column='ZONE_1_POWER_CONSUMPTION', groupby='MONTH', outliers=True)
ax.legend(bbox_to_anchor=(1, 1));

Let’s get ready for a daily forecast. Have SAP Data Warehouse Cloud aggregate the 10-minute intervals to daily totals and plot the consumption of the last 3 weeks. By the way, such transformations can also be saved as tables or views in SAP Data Warehouse Cloud, where other processes and programs could continue working with the aggregated data.

df_remote_daily = df_remote.agg([('sum', 'ZONE_1_POWER_CONSUMPTION', 'ZONE_1_POWER_CONSUMPTION_SUM')], group_by='DATE')
df_data = df_remote_daily.tail(n=21, ref_col='DATE').collect()
import seaborn as sns
sns.set_theme()
sns.set(rc={'figure.figsize':(15,5)})
sns.lineplot(data=df_data, x="DATE", y="ZONE_1_POWER_CONSUMPTION_SUM")
plt.xticks(rotation=45);

SAP Data Warehouse Cloud has an Automated ML framework built in, as well as 100+ individual algorithms. The hana_ml library’s documentation has all the details. For our time-series forecast we choose AdditiveModelForecast, which is an implementation of the Prophet algorithm, and train a model.

from hana_ml.algorithms.pal.tsa.additive_model_forecast import AdditiveModelForecast
amf = AdditiveModelForecast(growth='linear')
amf.fit(data=df_remote_daily.select('DATE', 'ZONE_1_POWER_CONSUMPTION_SUM'))

To create predictions, you just need to specify the dates for which predictions should be produced. Use the historic data (on which the model was trained) to dynamically obtain a pandas DataFrame listing the 21 future dates for which we want to create a prediction.

import pandas as pd
strLastDate = str( df_remote_daily.select('DATE').max())
df_topredict = pd.date_range(strLastDate, periods=1 + 21, freq="D", closed='right').to_frame()
df_topredict.columns = ['DATE']
df_topredict['DUMMY'] = 0
df_topredict

Use the hana_ml library to load this data as a temporary table into SAP Data Warehouse Cloud.

import hana_ml.dataframe as dataframe
df_rem_topredict = dataframe.create_dataframe_from_pandas(connection_context=conn, pandas_df=df_topredict, table_name='#TOPREDICTDAILY', force=True, replace=False)

Use the temporary table as input for the prediction.

df_rem_predicted = amf.predict(data=df_rem_topredict)

With the prediction now available as a hana_ml DataFrame, you are free to use the predictions as required by your use case. Download them, for instance, into a pandas DataFrame and plot them. The weekly pattern is nicely represented.

df_data = df_rem_predicted.collect()
from matplotlib import pyplot as plt
plt.plot(df_data['DATE'], df_data['YHAT'])
plt.fill_between(df_data['DATE'],df_data['YHAT_LOWER'], df_data['YHAT_UPPER'], alpha=.3);

Or inspect the predictions directly from Databricks, here with the prediction columns renamed before display.

df_rem_predicted = df_rem_predicted.rename_columns(['DATE', 'PREDICTION', 'PREDICTION_LOWER', 'PREDICTION_UPPER'])
df_rem_predicted.collect()

Or save the predictions as a table in SAP Data Warehouse Cloud, where SAP Analytics Cloud, or another tool of your choice, could pick up the predictions to share them with a broader user base.

df_rem_predicted.save('POWERCONSUMPTION_21DAYPRED')

You have used Databricks to explore data in SAP Data Warehouse Cloud, prepare the data, create a forecast, and share the predictions to SAP Analytics Cloud. No data got duplicated along the way, thereby reducing architectural and governance complexity. Of course, this logic can be further automated so that Databricks regularly produces and shares the latest predictions.

You can also trigger calculations in SAP Data Warehouse Cloud from R, by using the hana.ml.r package. The following steps make this package available in Databricks.

Configuration step 1 of 3: Download SAP HANA’s JDBC driver from SAP Development Tools (currently the file is called ngdbc-2.13.9.jar). Then upload that file through the same “Install library” option, already used in the above Python example. Just select the options “Upload” and “Jar”, then select the file ngdbc-2.13.9.jar and proceed with the upload.

Once uploaded, take note of the displayed path of the JDBC driver. You will need this information later on.

Note that the Python libraries hana_ml and shapely are not required for R; it’s fine if you haven’t installed them. However, this R example assumes that you went through the earlier Python scenario, as it uses the data that was uploaded from Python to SAP Data Warehouse Cloud.

Configuration step 2 of 3: Upload the hana.ml.r library into Databricks. You can obtain this file by downloading the SAP HANA Client from SAP Development Tools (currently the file hanaclient-2.13.21-linux-x64.tar.gz). It takes a couple of extraction steps to get the file we need: extracting the download gives a client.tar file; extracting this file, you find HANAMLR.TGZ in the “client” folder; extracting this gives HANAMLR.tar; and extracting that finally gives the hana.ml.r library we need to install (currently hana.ml.r_2.13.22051100.tar.gz).

This file needs to be uploaded to Databricks, which is easy with the DBFS File Browser. You might need to activate this browser first (“Settings” -> “Admin Console” -> “Workspace settings” -> activate “DBFS File Browser” and refresh the page).

With the DBFS File Browser activated, you can click on the “Data” entry in the Databricks menu on the left, which now shows the DBFS option to upload files into the Databricks File System. Use this option to upload the file hana.ml.r_2.13.22051100.tar.gz.

After uploading the file, right-click on it in the same interface and select “Copy path”. Copy the path in “File API Format”, i.e. here “/dbfs/FileStore/hana_ml_r_2_13_22051100_tar.gz”.
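Databricks displays DBFS paths in two formats: Spark API format (dbfs:/FileStore/…) and File API format (/dbfs/FileStore/…). The conversion between the two is mechanical; a small hypothetical helper illustrates it:

```python
def to_file_api_path(spark_path: str) -> str:
    """Convert a DBFS path from Spark API format (dbfs:/...)
    to File API format (/dbfs/...). Paths already in File API
    format are returned unchanged."""
    prefix = 'dbfs:/'
    if spark_path.startswith(prefix):
        return '/dbfs/' + spark_path[len(prefix):]
    return spark_path

print(to_file_api_path('dbfs:/FileStore/hana_ml_r_2_13_22051100_tar.gz'))
# /dbfs/FileStore/hana_ml_r_2_13_22051100_tar.gz
```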

Configuration step 3 of 3: All the necessary files for the connection from R to SAP Data Warehouse Cloud are now on Databricks. Now use a global init script to install additional R libraries, configure the Java environment, and install the hana.ml.r library on all nodes of the cluster. Go to “Settings” -> “Admin Console” -> “Global init scripts” and create and activate a new script with the following content. You should only need to modify the last line if you downloaded a different version of hana.ml.r. This script uses the steps mentioned in Databricks’ knowledge base article “Install rJava and RJDBC” to install rJava, which is a dependency of hana.ml.r.

#!/bin/bash
ls -l /usr/bin/java
ls -l /etc/alternatives/java
ln -s /usr/lib/jvm/zulu8-ca-amd64 /usr/lib/jvm/default-java
R CMD javareconf
R -e 'install.packages(c("rJava", "RJDBC"), repos = "https://cloud.r-project.org")'
R -e 'dyn.load("/usr/lib/jvm/zulu8-ca-amd64/jre/lib/amd64/server/libjvm.so")'
R -e 'library(rJava)'
R -e 'library(RJDBC)'
R -e 'install.packages(c("R6", "futile.logger", "sets", "uuid"), repos = "https://cloud.r-project.org")'
R -e 'install.packages("/dbfs/FileStore/hana_ml_r_2_13_22051100_tar.gz", repos=NULL, type="source")'

Now restart the cluster in the “Compute” menu so that the init script gets executed. The configuration is complete.

Implementation: Once the cluster is running again, create a new Notebook with default language R.

Create a connection to SAP Data Warehouse Cloud through the hana.ml.r library and send a simple SELECT statement as a test. Use the database user’s logon credentials. The jdbcDriver parameter is the path that was displayed earlier for the uploaded SAP HANA JDBC driver (ngdbc-2.13.9.jar). However, you have to change the beginning of the path from dbfs:/FileStore to /dbfs/FileStore (the path was displayed in Spark API format, but the parameter needs the File API format). You should receive the result 12345.

dyn.load('/usr/lib/jvm/zulu8-ca-amd64/jre/lib/amd64/server/libjvm.so')
library(hana.ml.r)
conn.jdbc <- hanaml.ConnectionContext(dsn='[YOURDBUSERSHOSTNAME]:443', user='[YOURDBUSER]', password='[YOURDBUSERSPASSWORD]', odbc=FALSE, jdbcDriver='/dbfs/FileStore/jars/8fdab417_9e28_47c5_9b34_45acec6da607-ngdbc_2_13_9-189ea.jar')
conn.jdbc$sql('SELECT 12345 FROM DUMMY')

Create a hana.ml.r DataFrame pointing to the table that was loaded in the Python example and show the column names and types.

df_remote <- conn.jdbc$table(table='POWERCONSUMPTION')
as.data.frame(do.call(rbind, df_remote$dtypes()))

Have SAP Data Warehouse Cloud calculate some statistics about the content of these columns.

options(width = 300)
print(df_remote$Describe())

The above table is the result of a SELECT statement that was sent to SAP Data Warehouse Cloud.

df_remote$Describe()$select.statement

As a simple test of the Predictive Analysis Library, request a correlation matrix of the numerical columns. Non-numerical columns (here DATETIME) must not be part of the DataFrame that is passed into the method.

hanaml.PearsonrMatrix(df_remote$Drop('DATETIME'))

You have used Python and R in Databricks to work with data that remains in SAP Data Warehouse Cloud (or SAP HANA Cloud). Leveraging those calculation and ML capabilities from Databricks avoids the complexity and risks of moving data around and duplicating it in multiple places, while you can still work with Databricks as the front end. For your own implementation, check the documentation of Python’s hana_ml and R’s hana.ml.r packages.