Creating a New Image Dataset and Training with SAP Data Intelligence

Computer vision is at the forefront of artificial intelligence (AI) research. Numerous models, most of them based on convolutional neural networks (CNNs), have been developed to perform image processing tasks such as image classification, image segmentation, and object detection, and on some benchmarks their accuracy matches or exceeds that of human annotators.

Behind all these sophisticated models is data: a large number of annotated images showing instances of the objects the model will encounter in real life. In this article I’ll show how to use SAP Data Intelligence, SAP’s data management platform, to ingest a labeled image dataset as a first step toward building innovative computer vision models.

What Is SAP Data Intelligence

SAP Data Intelligence is a complete data management platform that identifies, connects, enriches, and organizes distributed data assets into useful business insights at scale. It allows you to create a data warehouse from heterogeneous enterprise data. It also helps you manage data streams from IoT devices and implement machine learning models.

SAP Data Intelligence helps integrate, process, and manage enterprise data in a comprehensive, unified way. 

SAP Data Intelligence lets you:

  • Discover data and connect from anywhere at any time via a single enterprise data platform.
  • Transform and enrich complex data, regardless of the data type, and compile searchable data catalogs.
  • Orchestrate complex and enriched data flows to implement data intelligence processes using repeatable, scalable, production-grade ML pipelines.
  • Integrate with GPU hardware acceleration for deep learning scenarios.

How to Create a New Image Dataset

An image dataset is a curated collection of data for a machine learning project. It contains digital images used to test, train, and evaluate the performance of computer vision algorithms.
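As a rough illustration of the train, evaluate, and test roles mentioned above, the sketch below splits a list of image file names into three subsets. The function name `split_dataset`, the 70/15/15 ratios, and the file names are assumptions made for this example; they are not part of SAP Data Intelligence.

```python
import random


def split_dataset(image_paths, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle image paths reproducibly and split them into train/val/test lists."""
    if not 0 < train_frac < 1 or not 0 <= val_frac < 1 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty share for the test set")
    paths = sorted(image_paths)          # sort first so the shuffle is deterministic
    random.Random(seed).shuffle(paths)   # local RNG, global random state is untouched
    n_train = int(len(paths) * train_frac)
    n_val = int(len(paths) * val_frac)
    return {
        "train": paths[:n_train],
        "val": paths[n_train:n_train + n_val],
        "test": paths[n_train + n_val:],
    }


# Hypothetical file names, for illustration only.
images = [f"img_{i:03d}.jpg" for i in range(20)]
splits = split_dataset(images)
print({name: len(items) for name, items in splits.items()})
```

Fixing the seed keeps the split stable between runs, so the test images never leak into training when the script is rerun.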

Whenever you train a custom model, the quality of the image data fed into the algorithm matters most: the model can only be as accurate as its training images allow. Here are a few guidelines that can help you build a high-quality image dataset.

Where to get images

A good source of images is Google. There are various techniques for downloading images in bulk from Google image search (make sure to follow Google’s terms of service). Another option is to review public image datasets created by other researchers and select a subset of images that meet your needs.

Image labeling

Once you obtain relevant images, you will need to annotate them. If you are using a public dataset, images may already be annotated, which is a big time saver. Images are typically annotated with human input, and the process can sometimes be semi-automated. Labels should be predetermined by data scientists constructing the model, to ensure that the images provide the algorithm with sufficient information about objects in the image. 

Image annotation process and tools

An image annotation project begins by deciding what to label in each image and instructing annotators on how to use the image annotation tool. Because different companies have different image labeling requirements, annotators should be trained in the specifications and guidelines for each annotation project.

The annotation process also depends on the image annotation tool you use. There are several free image annotation tools available. For example, LabelMe is an open source labeling tool written in Python, which enables manual image annotation for image classification, object detection, and segmentation.
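To give a feel for what such annotations look like once exported, here is a minimal sketch that reads a LabelMe-style JSON annotation and derives a bounding box for each labeled shape. It assumes the commonly seen layout with a `shapes` list whose entries carry `label` and `points`; check this against the files your LabelMe version produces. The sample annotation is made up.

```python
import json


def labelme_bounding_boxes(annotation):
    """Return (label, x_min, y_min, x_max, y_max) for each shape in a LabelMe-style dict."""
    boxes = []
    for shape in annotation.get("shapes", []):
        xs = [point[0] for point in shape["points"]]
        ys = [point[1] for point in shape["points"]]
        boxes.append((shape["label"], min(xs), min(ys), max(xs), max(ys)))
    return boxes


# Made-up annotation in the general shape of a LabelMe JSON file (assumed layout).
sample = json.loads("""
{
  "imagePath": "img_001.jpg",
  "imageHeight": 480,
  "imageWidth": 640,
  "shapes": [
    {"label": "cat", "shape_type": "polygon",
     "points": [[100, 120], [220, 110], [240, 300], [90, 280]]},
    {"label": "dog", "shape_type": "rectangle",
     "points": [[300, 50], [500, 400]]}
  ]
}
""")

for box in labelme_bounding_boxes(sample):
    print(box)
```

Reducing polygons to boxes like this is one simple way to reuse segmentation-style annotations for an object detection model.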

How to Add Your New Dataset to SAP Data Intelligence

Uploading dataset information to SAP Data Intelligence requires a connection of type METADATA_IMPORT. This connection is created through the metadata API rather than through the Connection Management option.

Adding Connection To Upload Dataset

Adding the connection requires sending JSON-formatted data to the REST API endpoint /catalog/connections. The API is described in the SAP API Business Hub. You can send the data either with Postman or by running a script on the command line. The command-line option requires a Python installation.
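For readers who want to see what such a call might look like in Python, the sketch below builds, but does not send, a POST request for /catalog/connections using only the standard library. Several details are assumptions rather than facts from the API description: the `api_prefix` value is a placeholder for the real base path, and the basic-auth login in the form `tenant\user` is a guess to be checked against the API reference. The helper name `build_connection_request` is hypothetical.

```python
import base64
import json
import urllib.request


def build_connection_request(base_url, api_prefix, tenant, user, password, connection):
    """Build (but do not send) a POST request carrying the connection JSON."""
    url = base_url.rstrip("/") + "/" + api_prefix.strip("/") + "/catalog/connections"
    # Assumption: HTTP basic auth with a "tenant\user" login name.
    credentials = f"{tenant}\\{user}:{password}".encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Basic " + base64.b64encode(credentials).decode("ascii"),
    }
    body = json.dumps(connection).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_connection_request(
    base_url="https://vsystem.ingress.xxx.shoot.live.k8s-hana.ondemand.com",
    api_prefix="/api/placeholder",  # hypothetical; take the real prefix from the API reference
    tenant="default",
    user="demo-user",
    password="demopassword123",
    connection={"id": "COUCHDB_IMPORT", "type": "METADATA_IMPORT",
                "contentData": {"type": "METADATA_IMPORT"}},
)
print(request.get_method(), request.full_url)
# Sending would be urllib.request.urlopen(request); it is deliberately not executed here.
```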

To add a connection through the command line:

  1. Open the command line and install the diadmin package. Quote the requirement so that the shell does not interpret >= as output redirection:
pip install "diadmin>=0.0.72"

  2. Create a YAML file named config.yaml in the working directory to hold the configuration, and paste the following into it:

PWD: demopassword123
TENANT: default
URL: https://vsystem.ingress.xxx.shoot.live.k8s-hana.ondemand.com
USER: demo-user

  3. Create a JSON file with the connection details, name it demo-connection.json, and paste the following in it:

{
  "id": "COUCHDB_IMPORT",
  "description": "Demo Metadata Import from CouchDB",
  "type": "METADATA_IMPORT",
  "contentData": {
    "type": "METADATA_IMPORT"
  },
  "tags": [
    "import"
  ],
  "gatewayId": "",
  "cloudConnectorLocationId": "",
  "licenseRelevant": false,
  "readOnly": true
}

The contentData/type field of the JSON file specifies the connection type and must be set to METADATA_IMPORT.

  4. Run the script through the package with the command:

dicatalog connections demo-connection.json --upload
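If you prefer to generate the connection file rather than paste it, the following sketch writes the same JSON from Python and then reloads it to confirm that both type fields are set to METADATA_IMPORT. The helper names `write_connection_file` and `check_connection_file` are made up for this example; the field values are copied from the JSON in step 3.

```python
import json
import os
import tempfile


def write_connection_file(path, connection_id, description):
    """Write a METADATA_IMPORT connection definition and return it as a dict."""
    connection = {
        "id": connection_id,
        "description": description,
        "type": "METADATA_IMPORT",
        "contentData": {"type": "METADATA_IMPORT"},
        "tags": ["import"],
        "gatewayId": "",
        "cloudConnectorLocationId": "",
        "licenseRelevant": False,
        "readOnly": True,
    }
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(connection, handle, indent=2)
    return connection


def check_connection_file(path):
    """Reload the file and confirm both type fields are METADATA_IMPORT."""
    with open(path, encoding="utf-8") as handle:
        connection = json.load(handle)
    return (connection.get("type") == "METADATA_IMPORT"
            and connection.get("contentData", {}).get("type") == "METADATA_IMPORT")


# Written to a temporary directory here so the example leaves no files behind.
with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "demo-connection.json")
    write_connection_file(target, "COUCHDB_IMPORT", "Demo Metadata Import from CouchDB")
    print(check_connection_file(target))
```

In real use you would write the file into your working directory, next to config.yaml, before running the upload command.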

Uploading Dataset Through Connection

The dataset is uploaded through the connection to the SAP Data Intelligence catalog via the REST API endpoint /catalog/datasets. The API returns a status code indicating whether the upload succeeded, together with the upload task ID.

A status code of 202 means that the job has been accepted but is not yet complete. Check the Browse Catalog option in SAP Data Intelligence to see whether the datasets were uploaded.
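A small helper can make these status codes easier to read in a script. The sketch below treats 202 as "still in progress", as described above, and falls back on the general HTTP convention for the other ranges; which codes the catalog API actually returns beyond 202 is an assumption, and the function name `describe_upload_status` is hypothetical.

```python
def describe_upload_status(status_code):
    """Map an HTTP status code to a short description of the upload state."""
    if status_code == 202:
        return "accepted, still in progress"
    if 200 <= status_code < 300:
        return "completed"
    if 400 <= status_code < 500:
        return "rejected (client error)"
    if 500 <= status_code < 600:
        return "failed (server error)"
    return "unexpected status"


for code in (202, 200, 401, 500):
    print(code, describe_upload_status(code))
```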

To upload a dataset through the REST API:

  1. Create a directory in the working directory and name it metadata_datasets.
  2. Create a JSON file named demo-dataset-metadata.json and paste the metadata information of the dataset inside it.
  3. Upload the dataset from the command line:
dicatalog datasets demo-dataset-metadata.json --upload

  4. Check the Browse Catalog option in SAP Data Intelligence to see whether the datasets were uploaded.

That’s it! You have successfully uploaded your dataset to SAP Data Intelligence.
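Before running the upload, it can help to confirm that the metadata file is well-formed JSON. The sketch below is one way to do that. It assumes the file lives inside the metadata_datasets directory, which the steps above do not state explicitly, and the metadata written in the demo is a placeholder, not the real dataset schema. The function name `check_metadata_file` is hypothetical.

```python
import json
import os
import tempfile


def check_metadata_file(path):
    """Return the parsed metadata if the file holds a non-empty JSON object or list."""
    with open(path, encoding="utf-8") as handle:
        metadata = json.load(handle)   # raises json.JSONDecodeError on malformed JSON
    if not isinstance(metadata, (dict, list)) or not metadata:
        raise ValueError(f"{path} does not contain any metadata")
    return metadata


# Demo in a temporary directory; the metadata content is a placeholder only.
with tempfile.TemporaryDirectory() as workdir:
    dataset_dir = os.path.join(workdir, "metadata_datasets")
    os.makedirs(dataset_dir, exist_ok=True)
    path = os.path.join(dataset_dir, "demo-dataset-metadata.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"placeholder": "replace with real dataset metadata"}, handle)
    print(check_metadata_file(path))
```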

Deploying Your Model

You can build your computer vision model using a familiar deep learning framework such as TensorFlow. Then, follow these steps to deploy your model using SAP Data Intelligence with the dataset you uploaded:

  1. Create a Machine Learning Scenario and start a Jupyter Notebook instance to train your model.
  2. Save the trained model as a pickle file and export it as a ZIP artifact, which you can use to deploy and serve the model.
  3. Deploy the exported ZIP into SAP Data Intelligence, which automatically exposes a REST endpoint for real-time inference. This step uses the Model Serving Operator.
  4. You can now serve your deployed model through the exposed REST API endpoint.
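Step 2 above can be sketched with the standard library alone. The example pickles a stand-in "model" (a plain dictionary rather than a real TensorFlow model) and stores it in a ZIP archive, then reads it back to confirm the round trip. The archive layout and the member name `model.pkl` are assumptions; the structure SAP Data Intelligence expects for its artifact is not specified here, and some frameworks provide their own save formats that may be a better fit than pickle.

```python
import io
import pickle
import zipfile


def export_model_zip(model, zip_target, member_name="model.pkl"):
    """Pickle the model and store it as a single member of a ZIP archive."""
    data = pickle.dumps(model)
    with zipfile.ZipFile(zip_target, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(member_name, data)


def load_model_zip(zip_source, member_name="model.pkl"):
    """Read the pickled model back from the ZIP archive."""
    with zipfile.ZipFile(zip_source) as archive:
        return pickle.loads(archive.read(member_name))


# Stand-in for a trained model; a real model object would be pickled the same way.
toy_model = {"labels": ["cat", "dog"], "weights": [0.25, 0.75]}

buffer = io.BytesIO()              # in-memory target; a file path works as well
export_model_zip(toy_model, buffer)
buffer.seek(0)
print(load_model_zip(buffer) == toy_model)
```

Only unpickle archives you created yourself, since loading a pickle can execute arbitrary code.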

Learn more about this workflow in the great post by Suresh Kumar Raju.

Conclusion

In this article, I explained the basics of SAP Data Intelligence and showed how to:

  • Create an image dataset for a computer vision model
  • Upload your dataset to SAP Data Intelligence
  • Create a Machine Learning Scenario and train your model
  • Export your model as a ZIP artifact and deliver it over an API endpoint

This illustrates how SAP Data Intelligence can help you manage the entire data science workflow from end to end.