
Creating a HANA (Cloud Foundry) Connection with SAP Data Intelligence and Applying Random Forest

Data is one of the most important assets of any enterprise, so its exploration and analysis become crucial.

SAP Data Intelligence is a powerful tool that lets you perform complex processing on that data.

What is SAP Data Intelligence and how does it relate to Data Hub? – link

In this blog, you will connect a HANA database as a service with SAP Data Intelligence, explore the data via the Metadata Explorer, and apply the Random Forest Classifier algorithm to it.

For this you will need a HANA database as a service running on SAP Cloud Platform (Cloud Foundry) and a running instance of SAP Data Intelligence.

If you are new to this platform, I would highly recommend reading the blog by Andreas Forster.

So let's get started.

Open the SAP Cloud Platform Cockpit, navigate to the global account, then to the subaccount, and finally to the space where your HANA instance is running, and open the HANA dashboard.

Click Edit and then allow all IP addresses; this makes sure your SAP Data Intelligence instance can access the HANA instance.

Now log in to your SAP Data Intelligence tenant, navigate to Connection Management, and create a connection of type HANA_DB:

User, Password – the username and password for logging into the HANA database

Host, Port – the direct SQL connectivity host and port, which can be found on the HANA DB dashboard from the step above
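Before building the full notebook, you can check that the connection was saved correctly by reading it back from Connection Management inside a Data Intelligence notebook. This is a minimal sketch, assuming the connection was created with the id "hana" (the same call is used in the notebook code further down):

# Read the HANA_DB connection back from Connection Management and print
# the host and port to confirm the details were saved correctly.
import notebook_hana_connector.notebook_hana_connector

di_connection = notebook_hana_connector.notebook_hana_connector.get_datahub_connection(id_="hana")
print(di_connection["contentData"]["host"], di_connection["contentData"]["port"])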

Now we are going to create a Jupyter notebook.

For the analysis, my database table (loaded from a file) looks like this:

User ID   Gender   Age   Salary   Purchased
1         Male     19    19000    0
2         Male     25    24000    1
3         Male     36    25000    0
4         Female   37    87000    1
5         Female   29    89000    0
6         Female   27    90000    1

For the analysis I will use only the Age and Salary columns to predict the Purchased column.

Now open a Jupyter notebook from the ML Scenario Manager and install these libraries one by one:

pip install scikit-learn
pip install hdbcli
pip install matplotlib

Code for the Jupyter notebook (note: if any library is missing, install it using the step above).

Two things to configure:

  1. HANA connection id – the id_ parameter passed to get_datahub_connection
  2. Table name (Schema.TableName) – the path variable
import notebook_hana_connector.notebook_hana_connector

# Get the connection details from Connection Management (enter the id of your connection)
di_connection = notebook_hana_connector.notebook_hana_connector.get_datahub_connection(id_="hana")

from hdbcli import dbapi

conn = dbapi.connect(
    address=di_connection["contentData"]['host'],
    port=di_connection["contentData"]['port'],
    user=di_connection["contentData"]['user'],
    password=di_connection["contentData"]["password"],
    encrypt='true',
    sslValidateCertificate='false'
)

path = "ML_TEST.PURCHASE"  # enter table name (Schema.TableName)
sql = 'SELECT * FROM ' + path
cursor = conn.cursor()
cursor.execute(sql)

# Use Age (row[2]) and Salary (row[3]) as features and Purchased (row[4]) as the label
X = []
y = []
for row in cursor:
    d_r = []
    d_r.append(row[2])
    d_r.append(row[3])
    y.append(row[4])
    X.append(d_r)

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Fitting Random Forest Classification to the Training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred.tolist())
print(cm)

# Plot the decision boundary over the training data (Age vs. Salary)
arrx = np.array(X_train)
y_set = np.array(y_train)
from matplotlib.colors import ListedColormap
X1, X2 = np.meshgrid(np.arange(start=arrx[:, 0].min() - 1, stop=arrx[:, 0].max() + 1, step=0.1),
                     np.arange(start=arrx[:, 1].min() - 1, stop=arrx[:, 1].max() + 1, step=1000))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha=0.75, cmap=ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(arrx[y_set == j, 0], arrx[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.legend()
plt.show()

You should be able to view the results in a graph.
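If you also want a single-number summary besides the confusion matrix and the plot, you can ask the fitted classifier for its accuracy on the held-out test set; an optional addition to the notebook:

# Optional: accuracy of the classifier on the 25% held-out test set
print("Test accuracy:", classifier.score(X_test, y_test))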

(Please refer to the blog on how to create a pipeline and deploy it as well.)

Now let us create a pipeline from the ML Scenario Manager for creating the model.

First, create a pipeline from the Python Producer template.

Some changes to the components are needed to get the data from HANA:

  1. Constant Generator – feeds in the SQL query; please see the configuration below. In this case the query is
    SELECT * FROM ML_TEST.PURCHASE
  2. HANA Client – to connect with HANA; things to note: Connection and Table name, and if you scroll down, set the Column header option to None
  3. JS Operator – extracts only the body of the message, i.e. the rows:
    $.setPortCallback("input", onInput);

    function isByteArray(data) {
        switch (Object.prototype.toString.call(data)) {
            case "[object Int8Array]":
            case "[object Uint8Array]":
                return true;
            case "[object Array]":
            case "[object GoArray]":
                return data.length > 0 && typeof data[0] === 'number';
        }
        return false;
    }

    function onInput(ctx, s) {
        var msg = {};
        var inbody = s.Body;
        var inattributes = s.Attributes;

        // convert the body into a string if it is bytes
        if (isByteArray(inbody)) {
            inbody = String.fromCharCode.apply(null, inbody);
        }

        msg.Attributes = {};
        msg.Body = inbody;
        $.output(msg.Body);
    }
  4. To String Converter – use the inInterface port to send the data from the JS operator to the Python operator

Python file for training the model and saving it:

# Example Python script to perform training on input data & generate Metrics & Model Blob

def on_input(data):
    import pandas as pd
    import io
    from io import BytesIO
    import os
    import numpy as np
    import json

    dataset = json.loads(data)

    # Use Age and Salary as features and Purchased as the label
    X = []
    y = []
    for j in dataset:
        x_temp = []
        x_temp.append(j["AGE"])
        x_temp.append(j["SALARY"])
        y.append(j["PURCHASED"])
        X.append(x_temp)

    # Splitting the dataset into the Training set and Test set
    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    # Fitting Random Forest Classification to the Training set
    from sklearn.ensemble import RandomForestClassifier
    classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
    classifier.fit(X_train, y_train)

    # Predicting the Test set results
    y_pred = classifier.predict(X_test)

    # Making the Confusion Matrix
    from sklearn.metrics import confusion_matrix
    cm = confusion_matrix(y_test, y_pred.tolist())

    # to send metrics to the Submit Metrics operator, create a Python dictionary of key-value pairs
    metrics_dict = {"confusion matrix": str(cm)}

    # send the metrics to the output port - Submit Metrics operator will use this to persist the metrics
    api.send("metrics", api.Message(metrics_dict))

    # create & send the model blob to the output port - Artifact Producer operator will use this
    # to persist the model and create an artifact ID
    import pickle
    model_blob = pickle.dumps(classifier)
    api.send("modelBlob", model_blob)

api.set_port_callback("input", on_input)
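The api object is injected by the Python operator at runtime, so the script above cannot run outside a pipeline as-is. If you want to smoke-test the training logic locally (for example in the Jupyter notebook), a rough sketch with a stand-in api object could look like this; the FakeApi class and the sample rows below are purely illustrative and not part of SAP Data Intelligence:

import json

# Stand-in for the `api` object that the Python operator normally injects.
# Purely for local testing - not part of SAP Data Intelligence.
class FakeApi:
    class Message:
        def __init__(self, body=None, attributes=None):
            self.body = body
            self.attributes = attributes or {}

    @staticmethod
    def send(port, message):
        # print what would leave the operator on the given output port
        body = getattr(message, "body", message)
        print("port:", port, "->",
              "%d-byte model blob" % len(body) if isinstance(body, bytes) else body)

    @staticmethod
    def set_port_callback(port, callback):
        pass

api = FakeApi()

# ... paste the on_input function and the api.set_port_callback line from the script above here ...

# A few rows shaped like the JS operator's output (upper-case column names, as in the table above)
sample = json.dumps([
    {"AGE": 19, "SALARY": 19000, "PURCHASED": 0},
    {"AGE": 25, "SALARY": 24000, "PURCHASED": 1},
    {"AGE": 36, "SALARY": 25000, "PURCHASED": 0},
    {"AGE": 37, "SALARY": 87000, "PURCHASED": 1},
    {"AGE": 29, "SALARY": 89000, "PURCHASED": 0},
    {"AGE": 27, "SALARY": 90000, "PURCHASED": 1},
])
on_input(sample)  # should print a metrics message and the size of the pickled model blob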

Wiretaps have been used to check the output; you may skip those blocks.

For running the pipeline, you may need a Dockerfile (see the blog).

Content of the Dockerfile:

FROM python:3.6.4-slim-stretch
RUN pip install tornado==5.0.2
RUN python3.6 -m pip install numpy==1.16.4
RUN python3.6 -m pip install pandas==0.24.0
RUN python3.6 -m pip install sklearn
RUN groupadd -g 1972 vflow && useradd -g 1972 -u 1972 -m vflow
USER 1972:1972
WORKDIR /home/vflow
ENV HOME=/home/vflow

Now create tags for the Dockerfile (a custom tag blogFile is created) and tag your Python operator with this tag as well. Then build the Dockerfile.

Now we can run the pipeline and store the artifact (please provide a name).

Now we have to create another pipeline to expose an API so that the model can be consumed. For this, use the Python Consumer template.

As in the step above, tag the Python operator and update the script:

import json
import io
import numpy as np
import pickle

# Global vars to keep track of model status
model = None
model_ready = False

# Validate input data is JSON
def is_json(data):
    try:
        json_object = json.loads(data)
    except ValueError as e:
        return False
    return True

# When Model Blob reaches the input port
def on_model(model_blob):
    global model
    global model_ready
    model = pickle.loads(model_blob)
    model_ready = True

# Client POST request received
def on_input(msg):
    error_message = ""
    success = False
    try:
        attr = msg.attributes
        request_id = attr['message.request.id']
        api.logger.info("POST request received from Client - checking if model is ready")
        if model_ready:
            api.logger.info("Model Ready")
            api.logger.info("Received data from client - validating json input")
            user_data = msg.body.decode('utf-8')
            # Received message from client, verify json data is valid
            if is_json(user_data):
                api.logger.info("Received valid json data from client - ready to use")
                # obtain your results
                feed = json.loads(user_data)
                data_to_predict = np.array(feed['data'])
                api.logger.info(str(data_to_predict))
                prediction = model.predict(data_to_predict)
                prediction = (prediction > 0)
                success = True
            else:
                api.logger.info("Invalid JSON received from client - cannot apply model.")
                error_message = "Invalid JSON provided in request: " + user_data
                success = False
        else:
            api.logger.info("Model has not yet reached the input port - try again.")
            error_message = "Model has not yet reached the input port - try again."
            success = False
    except Exception as e:
        api.logger.error(e)
        error_message = "An error occurred: " + str(e)

    if success:
        # apply carried out successfully, send a response to the user
        result = json.dumps({'Results': str(prediction)})
    else:
        result = json.dumps({'Error': error_message})

    request_id = msg.attributes['message.request.id']
    response = api.Message(attributes={'message.request.id': request_id}, body=result)
    api.send('output', response)

api.set_port_callback("model", on_model)
api.set_port_callback("input", on_input)
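Before deploying, you can exercise the two callbacks locally in the same way as the training script, feeding in a small pickled model and a fake request. Again, the FakeApi class and the tiny model below are only stand-ins for local testing, not part of SAP Data Intelligence:

import json
import logging
import pickle
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the `api` object injected by the Python operator - local testing only.
class FakeApi:
    logger = logging.getLogger("consumer-test")

    class Message:
        def __init__(self, body=None, attributes=None):
            self.body = body
            self.attributes = attributes or {}

    @staticmethod
    def send(port, message):
        print("port:", port, "->", message.body)

    @staticmethod
    def set_port_callback(port, callback):
        pass

api = FakeApi()

# ... paste the consumer script from above here ...

# Simulate the Artifact Consumer delivering a (tiny) pickled model
tiny_model = RandomForestClassifier(n_estimators=10, random_state=0)
tiny_model.fit([[19, 19000], [37, 87000]], [0, 1])
on_model(pickle.dumps(tiny_model))

# Simulate a client POST with the JSON body used for testing later in this blog
request = FakeApi.Message(body=json.dumps({"data": [[47, 25000]]}).encode("utf-8"),
                          attributes={"message.request.id": "test-1"})
on_input(request)  # prints the JSON response that would be returned to the client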

Now you can deploy the pipeline. Once it is done, you will get a URL which you can use for testing your model; make sure to append /v1/uploadjson/ to the URL.

Deployment of the pipeline can take a while.

Post the data below to test the model.

Headers of the call; Authorization is Basic authentication with your username and password:

[{"key":"X-Requested-With","value":"XMLHttpRequest","description":""},{"key":"Authorization","value":"Add your authentication here":""},{"key":"Content-Type","value":"application/json","description":""}]

Body of the request, containing Age and Salary:

{ "data":[[47,25000]] }

!!!!! Congratulations !!!!!

You have successfully created and deployed a model using HANA DB as the data source.

Some Blogs related to SAP Data Intelligence

https://blogs.sap.com/2020/03/20/sap-data-intelligence-development-news-for-3.0/

https://blogs.sap.com/2020/03/20/sap-data-intelligence-next-evolution-of-sap-data-hub/

https://blogs.sap.com/2019/07/17/sap-data-hub-and-sap-data-intelligence-streamlining-data-driven-intelligence-across-the-enterprise/