Computer vision with SAP Data Intelligence

In the time that I have spent coding, not a single day has gone by without opening up stack overflow to copy code. And on days that I use SAP Data Intelligence (DI), not a single day has gone by without scouring the SAP community hoping to find the code I need, waiting to be copied. Sometimes I have found the code snippets that I need in the many blogs in the community. Sometimes I have run into zilch online. Those are the rare days when even Google doesn’t know the answers. This blog was born a long time ago, from one of those frustrating experiences of being unable to copy code and having to write one that works, from scratch. I am unable to share the entire code publicly, but if you are internal to SAP, please write to me and I am happy to share our github repository for this piece of work.

Specifically, this blog is a list of things that worked for us while setting up image classification in DI, should you like to try the same on your instance. This is of course not the only way to get things working but hopefully serves as a cheat sheet for those starting out afresh. This is also not a guide to building your first ML model in DI, so I do not explain all steps of the way. There are many useful blogs for that in the community and I have left links to these in the references section. This blog assumes you know the basics of setting up ML models in DI and only delves on what might be different when you deal with images.

Debugging errors is not the most fun part of development, but there is an undeniable sense of victory when you find that fix. So I have also thrown in a few general errors we faced along the way, and what worked for us to fix them. If you find this useful, give us a like and share, and pass it along to those to whom it might be of help. If you have faced other errors and have figured out the fix, do share it in the comments below.

The Exploration Phase

Reading from the data lake

Our training & test images were uploaded in the DI Data Lake. The below code works in both Jupyter Notebooks as well as the Python operator in pipeline. But first we tested out that it works in Notebooks. The code loops over all images in a folder in the DI data lake, reads each image as a numpy array and appends the incoming arrays as it goes into a list.

train_images = [] train_labels = []
shape = (200,200) directory='/shared/ml/data/<insert your path here>/AllImages/'
client = InsecureClient('http://datalake:50070') for filename in client.list(directory): with client.read(directory+filename) as reader: # Spliting file names and storing the labels for image in list train_labels.append(filename.split('_')[0]) # Read and resize all images to a specific shape img = BytesIO(reader.read()) img = np.array(Image.open(img)) img = cv2.resize(img,shape) train_images.append(img)

If you see an error as below in Jupyter, it means you need someone with an admin role to increase the memory allocation to Jupyter notebooks. Alternately, if you can change the code to use less memory you can sail through.

Kernel restarting abruptly

Subsequently, you can continue with the rest of your training code as you would on your local system.

x_train1,x_test,y_train1,y_test = train_test_split(train_images, train_labels, test_size=0.2, random_state=1,stratify=train_labels)

Setting up the training pipeline

Docker Build

Here’s what our docker image looked like.

FROM $com.sap.sles.base
RUN python3.6 -m pip install pandas --user
RUN python3.6 -m pip install --upgrade pip setuptools wheel --user
RUN python3.6 -m pip install opencv-python-headless --user RUN python3.6 -m pip install numpy==1.16.4 --user
RUN python3.6 -m pip install pandas==1.0.3 --user
RUN python3.6 -m pip install scikit-learn --user
RUN python3.6 -m pip install hdfs --user
RUN python3.6 -m pip install mahotas --user
RUN python3.6 -m pip install Pillow --user

And this what the associated tags looked like. The order of the tags is immaterial.

Docker%20Tags

Docker Tags

It is essential to mention the version for tornado. Else you may not be able to execute the pipeline and you will see this error in your status tab. Essentially it means the pipeline was unable to find your tagged docker file.

Status%20error

Status error

If pip install opencv-python fails for you, try pip install opencv-python-headless. As the documentation here suggests, “these packages are smaller than the two other (opencv) packages because they do not contain any GUI functionality (not compiled with Qt / other GUI components). This means that the packages avoid a heavy dependency chain to X11 libraries and you will have for example smaller Docker images as a result. You should always use these packages if you do not use cv2.imshow et al. or you are using some other package (such as PyQt) than OpenCV to create your GUI.” Given we don’t need to visualise in our training operators, the headless package is ideal for us.

Training Pipeline

The most convenient (and perhaps less error prone) approach to setting up a pipeline is to create one from the pre-configured templates. We use a Python Producer template for the training. Keep all operators in the suggested template intact as much as possible. The Training operator can be edited to follow the training code that worked for you in Jupyter. Once the model (clf in our case) is ready, save it as a pickle file and pass it to the model port.

pickled_clf = pickle.dumps(clf)
api.send("modelBlob", model_blob)

Setting up the inference pipeline

Editing the Python operator

Create the inference pipeline from the Python Consumer template. The only changes you need to make are in the Python operator where your inference code needs to go in. The key thing to identify is how your client will pass the image to the DI API call. If your client will pass a base64 image, your inference code should carry below code. You will notice the sample code from the template before the image read code. I personally find the log an easy way to debug how far along the code executed successfully before running into an error and exiting the operator. The sample inference template provides a simple framework to establish this and liberal usage of log along the code makes debugging easier during inference. The rest of the code would be similar to how you do inference on your model in Jupyter notebook.

if is_json(user_data): api.logger.info("Received valid json data from client - ready to use") log = log + 'data is json. ' image = base64.b64decode(user_data) log = log + 'image ready\n\n' jpg_as_np = np.frombuffer(image, dtype=np.uint8) log = log + 'buffer ready\n\n' img = cv2.imdecode(jpg_as_np, flags=1) log = log + 'decoded ready\n\n'

Once you prediction is ready, pass it to the output port as below.

if success: msg.body = json.dumps({'Results': result, 'Prediction' : str(prediction[0]), 'Log ' : log}) else: msg.body = json.dumps({'Results': False, 'Error': error_message, 'Log': log}) new_attributes = {'message.request.id': msg.attributes['message.request.id']} msg.attributes = new_attributes api.send('output', msg)

Inferencing in Jupyter

The inference code, in DI’s Jupyter will be as below.

import base64
import requests
import json
import io
from io import BytesIO
from hdfs import InsecureClient
client = InsecureClient('http://datalake:50070') credential = "default\\<insert DI username>:<insert DI password>" #With base64 image
with client.read('/shared/ml/data/<insert path to inference image>/Inf_img1.jpeg') as reader: payload_arr = BytesIO(reader.read()) payload=base64.b64encode(payload_arr.getvalue()).decode() url = "<insert DI deployment URL here>v1/uploadjson" headers = { 'Content-Type': 'application/json', 'X-Requested-With': 'XMLHttpRequest', 'Authorization': 'Basic '+str(base64.b64encode(credential.encode("utf-8")), "utf-8")
} response = requests.request("POST", url, headers=headers, data = payload)
response.json().get("Prediction")

The inference on your local Jupyter notebook could read from your local file. In which case it would look as below.

import base64
import requests
credential = "default\\<insert DI username>:<insert DI password>" with open("<insert source pathname of image file>", "rb") as image_file: payload = base64.b64encode(image_file.read()) url = "<insert DI deployment URL here>/v1/uploadjson" headers = { 'Content-Type': 'application/json', 'X-Requested-With': 'XMLHttpRequest', 'Authorization': 'Basic '+str(base64.b64encode(credential.encode("utf-8")), "utf-8")
} response = requests.request("POST", url, headers=headers, data = payload)
response.json().get("Prediction")

Inferencing in postman

To check in Postman, just take a json dump of your payload so you can later copy it over into the Body of the post call.

with open('/Users/<insert pathname to destination file>/Payload.json', 'w') as outfile: json.dump(payload, outfile)

Within postman, set the call to POST and add the DI deployment URL. Under Auth select Basic Type and enter your credentials. The username format should be tenant_name\user_name. Headers should be entered as in image below.

Headers tab in postman

Open your payload json and copy over the contents to the body of Postman. If you see double quotes before and after the string, remove them.

Body tab in postman

You should see the response from DI if you click Send.

Packing and unpacking payload

In your API call to DI, if you pass the image in any other format, you will just need to make sure the input is unpacked correctly in DI. A simple way to check this would be to just try it out in Jupyter. For eg. when you create your payload in Jupyter, unpack it with the same code you use in DI and see if it works with your subsequent inference code. If the unpacking works in Jupyter, it should work in DI.

For eg, below code packs and unpacks the base64 payload and can be verified in DI’s Jupyter.

#Create payload
with client.read('/shared/ml/data/<insert path to inference image here>/Inf_img1.jpeg') as reader: payload_arr = BytesIO(reader.read()) payload=base64.b64encode(payload_arr.getvalue()).decode() #Read payload image = base64.b64decode(payload)
jpg_as_np = np.frombuffer(image, dtype=np.uint8)
img = cv2.imdecode(jpg_as_np, flags=1)

Also, below code packs and unpacks image as numpy array and can also be verified in DI’s Jupyter.

#Create payload
import cv2
from PIL import Image
shape = (200,200)
with client.read('/shared/ml/data/<insert path to inference image>/Inf_img1.jpeg') as reader: payload_arr = BytesIO(reader.read()) payload = np.array(Image.open(payload_arr)) payload = cv2.resize(payload,shape) payload = json.dumps(payload.tolist()) #Read payload
img_arr = np.asarray(json.loads(payload),dtype=np.uint8)

Both times, the create payload will go before point of inference in python. Read payload will go into DI’s Python operator. In the second case, adjust body of postman with the numpy array as input.

Conclusion

This covers all that would be different when you deal with images in SAP Data Intelligence, as compared to structured data. The same extends to voice and video files as well. I hope you found it useful and saved you a little time today. Drop your questions (or new bugs you’re facing) into the comments section below and I will try to answer them if I know the answers.

References

Create your first ML scenario by Andreas Foster

Deploy your first HANA ML pipelines by Andreas Foster

Hands on tutorials by Denys van Kempen

DI with tensor flow by Frank Schuler