Federated Machine Learning using SAP Data Warehouse Cloud & Google Cloud Vertex AI 2.0

SAP Federated-ML, or FedML, is a library that enables businesses and data scientists to build, train, and deploy machine learning models on hyperscalers, thereby eliminating the need to replicate or migrate data out of its original source.

If you would like to know more about the FedML library and the data federation architecture of SAP Data Warehouse Cloud upon which it is built, please reference our overview blog here.

Training a Model on VertexAI with FedML GCP

In this article, we focus on building a machine learning model on Google Cloud Platform VertexAI by federating the training data via SAP Data Warehouse Cloud without the need for replicating or moving the data from the original data storage.

To learn how to set up a connection between SAP S/4HANA and SAP Data Warehouse Cloud, or between SAP HANA (on-premise and cloud) and SAP Data Warehouse Cloud, please refer here. Please also create a view in Data Warehouse Cloud with consumption turned on.

If you would like to use local tables in Data Warehouse Cloud instead of connecting SAP S/4HANA or SAP HANA on-premise or SAP HANA Cloud to Data Warehouse Cloud, please refer here. Please create a view in Data Warehouse Cloud with consumption turned on.

Once you have the data, you can merge these tables into a single view and run your FedML experiment with the merged dataset.
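As an illustration, if the federated views have been read into pandas DataFrames (the column names below are hypothetical, not from the FedML library), the merge might look like this:

```python
import pandas as pd

# Hypothetical example: two views read from Data Warehouse Cloud,
# joined on a shared key column to form the training dataset.
sales = pd.DataFrame({'ORDER_ID': [1, 2, 3], 'AMOUNT': [100.0, 250.0, 75.0]})
customers = pd.DataFrame({'ORDER_ID': [1, 2, 3], 'REGION': ['EMEA', 'APJ', 'AMER']})

# Inner join on the key column; the result feeds the FedML experiment.
merged = sales.merge(customers, on='ORDER_ID', how='inner')
```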

  1. Set up your environment
    1. Follow this guide to create a new Vertex AI notebook instance
    2. Create a Cloud Storage bucket to store your training artifacts
  2. Install the FedML GCP Library in your notebook instance with the following command
pip install fedml-gcp
  3. Load the libraries with the following imports
from fedml_gcp import DwcGCP
import numpy as np
import pandas as pd
  4. Create a new DwcGCP class instance with the following (replace the project name and bucket name)
dwc = DwcGCP(project_name='your-project-name', bucket_name='your-bucket-name')
  5. Make a tar bundle of your training script files
    1. You can find example training script files here
      1. Open a folder and drill down to the trainer folder (which contains the scripts)
    2. More information about the GCP training application structure can be found here
dwc.make_tar_bundle('your_training_app_name.tar.gz', 'path_of_training_folder', 'gcp_bucket_path/training/')
  6. Create your training inputs
    1. More info about training inputs can be found here
    2. Replace ‘DATA_VIEW’ with the name of the table you exposed in Data Warehouse Cloud
training_inputs = {
    'scaleTier': 'BASIC',
    'packageUris': ['gs://gcp_bucket_path/training/your_training_app_name.tar.gz'],
    'pythonModule': 'trainer.task',
    'args': ['--table_name', 'DATA_VIEW', '--table_size', '1', '--bucket_name', 'fedml-bucket'],
    'region': 'us-east1',
    'jobDir': 'gs://gcp_bucket_path',
    'runtimeVersion': '2.5',
    'pythonVersion': '3.7',
    'scheduling': {'maxWaitTime': '3600s', 'maxRunningTime': '7200s'}
}
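The `args` list in `training_inputs` is forwarded to the `trainer.task` entry point inside your tar bundle. A minimal sketch of how such a module might parse them (the argument names mirror the example above; the data-access and training logic are omitted, since they depend on your FedML connection setup):

```python
import argparse

def get_args(argv=None):
    # Parse the arguments delivered via training_inputs['args'].
    parser = argparse.ArgumentParser()
    parser.add_argument('--table_name', required=True,
                        help='View exposed in SAP Data Warehouse Cloud')
    parser.add_argument('--table_size', default='1',
                        help='Portion of the view to fetch for training')
    parser.add_argument('--bucket_name', required=True,
                        help='Cloud Storage bucket for training artifacts')
    return parser.parse_args(argv)

# The same values that training_inputs['args'] would deliver:
args = get_args(['--table_name', 'DATA_VIEW', '--table_size', '1',
                 '--bucket_name', 'fedml-bucket'])
```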
  7. Submit your training job (note that each job must have a unique job ID)
dwc.train_model('your_training_job_id', training_inputs)
  8. Deploy your model.

Option 1: Deploy to GCP Platform directly.

dwc.deploy(model_name='unique_model_name', model_location='/demo/model/', version='v1', region='us-east4')

Option 2: Deploy to SAP BTP Kyma.

This method requires a GCP service account with permissions for Cloud Storage and Container Registry. Download the service key and place it in your notebook environment. You will also need to download your Kyma configuration file from the Kyma dashboard.

dwc.deploy_to_kyma('example@example-project.gserviceaccount.com', 'key.json', 'example-model-name', model_location='demo/model/model.pkl')
  9. Now that the model is deployed, you can invoke your endpoint in two ways.

Option 1: Invoke GCP Endpoint

dwc.predict(sample_data, 'your-model-name')

Option 2: Invoke Kyma Endpoint (currently FedML GCP only supports passing data as JSON)

dwc.invoke_kyma_endpoint(api="https://your-kyma-endpoint.com/predict", payload_path="sample_data.json")
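Since the Kyma endpoint accepts JSON, you can serialize your sample records to a payload file first. A sketch, assuming a simple list-of-records format (the exact schema your model expects, and the feature names below, are hypothetical):

```python
import json

# Hypothetical inference rows; keys must match the features the model
# was trained on.
sample_rows = [
    {'FEATURE_A': 1.5, 'FEATURE_B': 'x'},
    {'FEATURE_A': 2.0, 'FEATURE_B': 'y'},
]

# Write the payload file referenced by payload_path above.
with open('sample_data.json', 'w') as f:
    json.dump(sample_rows, f)
```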
  10. Finally, since we now have a working model and can run predictions, we can write our prediction results back to Data Warehouse Cloud for further use and analysis.

First, you’ll need to create a table

db.create_table("CREATE TABLE <table_name> (ID INTEGER PRIMARY KEY, <column_name> <datatype>, …)")

You’ll then want to prepare your prediction results to follow the format of your CREATE TABLE statement, ensuring the proper data types and column names.
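For example, if the table was created with an `ID` primary key and a `PREDICTION` column (hypothetical names and values), the results could be shaped like this before insertion:

```python
import pandas as pd

# Hypothetical prediction results from the deployed model.
predictions = [0.87, 0.12, 0.95]

# Column names and dtypes must match the CREATE TABLE statement.
dwc_data = pd.DataFrame({
    'ID': range(1, len(predictions) + 1),  # INTEGER PRIMARY KEY
    'PREDICTION': predictions,             # e.g. a DOUBLE column
})
```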

Once your dataset is ready, you can start inserting it into your table. Depending on the size of your dataset, insertion might take some time.

db.insert_into_table('<table_name>', dwc_data)

You can also drop the table like so:

db.drop_table('<table_name>')

Once the table is created and the insertions are done, you will have a local table in Data Warehouse Cloud with the data you inserted. You can deploy this table in Data Warehouse Cloud, create a view, and run further analytics on it using SAP Analytics Cloud if you would like.

More information about the FedML GCP library and more examples can be found here.

In summary, FedML makes it extremely convenient for data scientists and developers to perform cross-platform ETL and train machine learning models on hyperscalers, without the hassle of data replication and migration.

For more information about this topic or to ask a question, please leave a comment below or contact us at paa@sap.com.