Import External Datasets to SAP Data Intelligence Catalog

It is one of the hidden secrets of SAP Data Intelligence that you can manage datasets of external data sources (i.e. sources not connected via Connection Management) with the SAP Data Intelligence Catalog.

I have already discussed the openness of the Catalog in a previous blog, Open Catalog of SAP Data Intelligence, but did not cover this aspect there. In my most recent blog about connecting to external data sources, Connecting Semi-structured Data Sources to SAP Data Intelligence, I announced this blog because it complements the processing-focused connection: before you can access the data, you need to know that the dataset exists, along with some basic metadata. This blog is about how to achieve that.

There are two steps we need to do:

  1. Create a connection with a specific Connection Type in Connection Management. This is essential, because in SAP Data Intelligence datasets are organised based on Connection IDs.
  2. Upload the metadata information to the catalog.

Then you can manage the external datasets within limits as usual by adding tags or glossaries.

As a use case I take up the example from my previous blog of integrating a CouchDB data source.

Before you can upload dataset information, you need to create a connection of connection type METADATA_IMPORT. Because the Metadata API is the only option for uploading dataset information to SAP Data Intelligence, you cannot add this connection using Connection Management.

To add the connection you use the REST API endpoint /catalog/connections described on the SAP API Business Hub, passing JSON-formatted data. For this you can either use a tool such as Postman or use my command-line script. For the latter you need a Python installation on your computer and the package diadmin:

pip install "diadmin>=0.0.72"

and run the script:

dicatalog connections <json-file> --upload
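If you prefer to call the REST API directly instead of using the script, a minimal sketch with Python's standard library might look like this. Note that the API base path and the tenant\user basic-auth scheme are assumptions based on my system; verify them against the SAP API Business Hub before use:

```python
import base64
import json
import urllib.request

# Assumed base path of the Metadata API; check the SAP API Business Hub
# for the exact path on your SAP Data Intelligence version.
API_BASE = "/app/datahub-app-metadata/api/v1"

def catalog_url(base_url: str, path: str) -> str:
    """Build the full URL of a catalog endpoint."""
    return f"{base_url}{API_BASE}{path}"

def basic_auth_header(tenant: str, user: str, pwd: str) -> str:
    """SAP Data Intelligence expects 'tenant\\user' as the basic-auth user."""
    token = base64.b64encode(f"{tenant}\\{user}:{pwd}".encode()).decode()
    return f"Basic {token}"

def upload_connection(base_url: str, tenant: str, user: str, pwd: str,
                      connection: dict) -> int:
    """POST a connection definition (e.g. METADATA_IMPORT) to /catalog/connections."""
    req = urllib.request.Request(
        catalog_url(base_url, "/catalog/connections"),
        data=json.dumps(connection).encode(),
        headers={
            "Authorization": basic_auth_header(tenant, user, pwd),
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

The credentials correspond to the USER, PWD, TENANT and URL values of the config.yaml shown below.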

You need a config file config.yaml with the credentials and the URL of your DI system:

PWD: pwd123
TENANT: default
URL: https://vsystem.ingress.xxx.shoot.live.k8s-hana.ondemand.com
USER: user

and a folder “connections” in your working directory with a JSON file containing your connection details, e.g.:

{
  "id": "COUCHDB_IMPORT",
  "description": "CouchDB Metadata Import",
  "type": "METADATA_IMPORT",
  "contentData": { "type": "METADATA_IMPORT" },
  "tags": [ "import" ],
  "gatewayId": "",
  "cloudConnectorLocationId": "",
  "licenseRelevant": false,
  "readOnly": true
}

This file contains the minimal data you need for creating a new connection. It is important to include the element "contentData"/"type": "METADATA_IMPORT", although it seems redundant.

If you want to delete a hidden connection, call:

dicatalog connections <connection_id> --remove

To verify that you have added the connection successfully, you can download all connections:

dicatalog connections all --download

If you would like to see the details of how I implemented the script, you can have a look at my personal public GitHub repository: thhapke/diadmin

After this preparation we can start to upload the actual dataset information.

Again we call a REST API to upload data to the SAP Data Intelligence catalog: /catalog/datasets. As response you get a status code and a task ID. If the status code is 202, you may have to wait a couple of seconds until the job is completed. You can either send another request to /catalog/importTasks/{taskid} to check whether the job was successful, or check in the SAP Data Intelligence “Browse Catalog” whether datasets appear under the connection created in the first step.

Again you can use my script:

dicatalog datasets <json-file> --upload

In this case you need to create a folder “metadata_datasets” in your working directory with a JSON file that contains the metadata information of the datasets. In my example it is:

[
  {
    "remoteObjectReference": {
      "qualifiedName": "/visits",
      "remoteObjectType": "VIEW",
      "connection": { "id": "COUCHDB_IMPORT", "type": "METADATA_IMPORT" },
      "name": "visits",
      "size": 2345789
    },
    "schema": {
      "genericType": "TABLE",
      "tableBasedRepresentation": {
        "attributes": [
          { "name": "KEY", "datatype": "STRING", "templateType": "string", "length": 50, "labelAttribute": "String" },
          { "name": "LOCATIONID", "datatype": "STRING", "templateType": "string", "length": 30, "labelAttribute": "String" },
          { "name": "VISITID", "datatype": "STRING", "templateType": "string", "length": 30, "labelAttribute": "string" },
          { "name": "TIMESTAMP", "datatype": "DATETIME", "templateType": "timestamp", "labelAttribute": "DateTime" },
          { "name": "YEAROFBIRTH", "datatype": "INTEGER", "templateType": "int16", "labelAttribute": "Integer" }
        ],
        "uniqueKeys": [ { "attributeReferences": [ "KEY" ] } ]
      }
    },
    "descriptions": [ { "value": "CouchDB Visits", "type": "SHORT", "origin": "REMOTE" } ]
  },
  {
    "remoteObjectReference": {
      "qualifiedName": "/booking",
      "remoteObjectType": "VIEW",
      "size": 73923,
      "connection": { "id": "COUCHDB_IMPORT", "type": "METADATA_IMPORT" }
    },
    "schema": {
      "genericType": "TABLE",
      "tableBasedRepresentation": {
        "attributes": [
          { "qualifiedName": "timestamp", "name": "timestamp", "datatype": "DATETIME", "templateType": "timestamp" },
          { "qualifiedName": "location_id", "name": "location_id", "datatype": "STRING", "templateType": "string" },
          { "qualifiedName": "name", "name": "name", "datatype": "STRING", "templateType": "string" },
          { "qualifiedName": "booking_time", "name": "booking_time", "datatype": "DATETIME", "templateType": "timestamp" }
        ]
      }
    },
    "descriptions": [ { "value": "CouchDB Booking", "type": "SHORT", "origin": "REMOTE" } ]
  }
]
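Before uploading, a quick sanity check of the JSON file can save a failed import round trip. The required fields below are inferred from the example above, not from the official OpenAPI model, so treat them as an assumption:

```python
import json

def check_datasets(datasets: list) -> list:
    """Return a list of problems found in a dataset definition list.

    The checked fields are inferred from the example payload in this blog;
    consult the OpenAPI models on the SAP API Business Hub for the full schema.
    """
    problems = []
    for i, ds in enumerate(datasets):
        for key in ("remoteObjectReference", "schema"):
            if key not in ds:
                problems.append(f"dataset {i}: missing '{key}'")
        conn = ds.get("remoteObjectReference", {}).get("connection", {})
        if conn.get("type") != "METADATA_IMPORT":
            problems.append(f"dataset {i}: connection type is not METADATA_IMPORT")
    return problems

def check_file(path: str) -> list:
    """Load a JSON file from the 'metadata_datasets' folder and check it."""
    with open(path, encoding="utf-8") as f:
        return check_datasets(json.load(f))
```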

You find the details of the OpenAPI definitions on the SAP API Business Hub, tab “Models View”, for each of the elements:

Once you have uploaded the datasets you see them when you browse the catalog:

In my example I have uploaded two datasets under connection ID COUCHDB_IMPORT: booking and visits.

With the Metadata API you can add datasets of connection types that are not supported in Connection Management to the catalog. Once a dataset is in the catalog, you can add tags and link glossary terms.

This allows you to manage external datasets that you might further process with pipelines, as described in my previous blog: Connecting Semi-structured Data Sources to SAP Data Intelligence.