EDUCATION AND TECHNOLOGY

Federated Machine Learning using SAP Data Warehouse Cloud and Azure Machine Learning


Background

Large-scale distributed data has become the foundation for analytics and informed decision-making processes in most businesses. A large amount of this data is also utilized for predictive modeling and building machine learning models.

There has been a rise in the number and variety of hyperscaler platforms providing machine learning and modeling capabilities, along with data storage and processing. Businesses that use these platforms for data storage can now seamlessly utilize them for efficient training and deployment of machine learning models.

Training machine learning models on most hyperscaler platforms is smoother when the training data resides in the platform's native data storage. Because the machine learning services are tightly coupled with the native storage, data held elsewhere has to be migrated or replicated. Migrating data from non-native storage is a complex and time-consuming process. Moreover, replicating the data to the hyperscaler-native storage just to access its machine learning capabilities leads to redundant storage, which incurs additional storage costs and the overhead of keeping the copies consistent.

Proposed Solution

Federated-ML or FedML is a library built to address these challenges. The library applies the data federation architecture with SAP Data Warehouse Cloud for intelligently sourcing the data in real-time from data storages. The library provides functionality that enables businesses and data scientists to build, train and deploy machine learning models on hyperscalers, without the hassle of replicating or migrating the data from the original data storage. The library also provides the capabilities to build machine learning models by sourcing the data stored across multiple data storages in real-time.

By abstracting the data connection, data load and model training on these hyperscalers, the FedML library provides end-to-end integration with just a few lines of code.

In this article, we focus on building a machine learning model on Azure by federating the training data from Amazon Athena and Google BigQuery via SAP Data Warehouse Cloud without the need for replicating or moving the data from the original data storages.

The article is organized into the following sections:

  1. Federate data from Amazon Athena and Google BigQuery.
  2. Pre-requisites for the Federated ML Library for Azure ML.
  3. Steps to build a machine learning model on Azure using Federated ML library for Azure.

1. Federate data from Amazon Athena and Google BigQuery:

1.1 Follow the steps up to step 4 in this guide to integrate Amazon Athena with SAP Data Warehouse Cloud.

Complete the steps up to step 5 in this guide to integrate Google BigQuery with SAP Data Warehouse Cloud.

1.2 Create a remote table in SAP Data Warehouse Cloud:

In Data Builder, click New Graphical View.

Select Sources, then open Connections, the Google BigQuery connection, the project and the dataset, and select the BigQuery table. Drag the BigQuery table to the SQL editor.

Similarly, open Connections, the Athena connection and the database, and select the Athena table. Drag the Athena table over the BigQuery table on the SQL editor and select the join operation.

Click the inner join icon and check that the relevant columns are mapped in the join properties.

Click the view, create an Analytical Dataset and add the desired measures in the view properties.

Save and deploy the Analytical Dataset.
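To make the effect of the inner join concrete, here is a minimal local sketch using Python's built-in sqlite3 module. It is only an illustration: the table names, columns and rows are hypothetical stand-ins for the BigQuery and Athena tables, and the join in SAP Data Warehouse Cloud is configured in the graphical view, not written by hand like this.

```python
import sqlite3

# Hypothetical stand-ins for the two remote tables; the names, columns
# and rows are illustrative and not taken from the original setup.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bigquery_orders (customer_id INTEGER, amount REAL)")
conn.execute("CREATE TABLE athena_customers (customer_id INTEGER, region TEXT)")
conn.executemany(
    "INSERT INTO bigquery_orders VALUES (?, ?)",
    [(1, 10.0), (2, 25.5), (3, 7.0)],
)
conn.executemany(
    "INSERT INTO athena_customers VALUES (?, ?)",
    [(1, "EMEA"), (2, "APJ"), (4, "AMER")],
)


def joined_rows(connection):
    """Return only the rows whose customer_id exists in both tables."""
    cursor = connection.execute(
        "SELECT o.customer_id, o.amount, c.region "
        "FROM bigquery_orders AS o "
        "INNER JOIN athena_customers AS c ON o.customer_id = c.customer_id "
        "ORDER BY o.customer_id"
    )
    return cursor.fetchall()


rows = joined_rows(conn)
print(rows)
```

Customers 3 and 4 each appear in only one of the two tables, so the inner join drops them. The same applies to the federated view: only records that match on the mapped columns reach the training data.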

2. Pre-requisites for the Federated ML Library for Azure ML:

2.1 Create a new Azure subscription or access existing subscription information from the Azure portal.

2.2 Create an AzureML workspace by referring to the guide. Once the workspace is created, download the ‘config.json’ file and take note of the configuration values it contains.

2.3 Create an AzureML Notebook and compute by referring to this guide.

2.4 Create a new file named ‘config.json’ in the AzureML Notebooks section by following the guide.

Copy the contents of config.json from the pre-requisite section of the documentation into the newly created ‘config.json’ file in the AzureML Notebooks section and provide appropriate values for each of the fields specified.

Note: If a Database User does not already exist in SAP Data Warehouse Cloud, create one by referring to the guide.

The values for the fields in the newly created ‘config.json’ can be found in the Database User details (step 4 of the guide) by clicking the info icon.
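Before moving on, it can help to confirm that no field in the connection file was left blank. The sketch below is one possible check using only the standard library. It deliberately makes no assumption about the field names, which are defined in the library documentation; `field_a`, `field_b` and `field_c` are hypothetical placeholders, and so is the convention that an unfilled value looks like `<...>`.

```python
import json


def find_unfilled_fields(config_text):
    """Return the sorted names of top-level fields that are empty or still a <placeholder>."""
    config = json.loads(config_text)
    unfilled = []
    for key, value in config.items():
        text = str(value).strip()
        is_placeholder = text.startswith("<") and text.endswith(">")
        if value is None or text == "" or is_placeholder:
            unfilled.append(key)
    return sorted(unfilled)


# Hypothetical file content; real field names come from the library documentation.
sample = '{"field_a": "some-value", "field_b": "", "field_c": "<fill_me>"}'
missing = find_unfilled_fields(sample)
print(missing)
```

For a real file, read its text with `open('config.json').read()` and pass it to `find_unfilled_fields`; an empty result means every field has a value.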

3. Steps to build a Machine Learning Model on Azure using Federated ML Library for Azure ML:

3.1 Download the Federated ML Library for Azure ML:

Download the library using the link. It will be downloaded to your local system as a .whl file.

3.2 Install the library in the AzureML Notebook:

Upload the library file to the Notebooks section of AzureML Studio (link).

Install the library using the following command:

pip install fedml_azure-1.0.0-py3-none-any.whl --force-reinstall

3.3 Initialization of AzureML resources required for training:

The following steps provide a simple way to create the resources required for training:

Note: If the resources for training are already created, skip to step 3.4.

3.3.1 Initialize the workspace:

The ‘subscription_id’, ‘resource_group’ and ‘workspace_name’ in the cell below must be replaced with the values from the ‘config.json’ downloaded in step 2.2.

Refer to the documentation on the ‘create_workspace’ method and its parameters.

from fedml_azure import create_workspace
workspace = create_workspace(workspace_args={
    "subscription_id": "<subscription_id>",
    "resource_group": "<resource_group>",
    "workspace_name": "<workspace_name>"
})
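Instead of copying the three values by hand, they could be read from the workspace ‘config.json’ downloaded in step 2.2. The following is a minimal sketch under that assumption; `extract_workspace_args` is a hypothetical helper, not part of the FedML library, and the sample values are made up.

```python
import json

REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def extract_workspace_args(config):
    """Pick the three workspace settings out of a parsed config.json dictionary."""
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise KeyError("workspace config is missing: " + ", ".join(missing))
    return {key: config[key] for key in REQUIRED_KEYS}


# Made-up content standing in for the downloaded file. For a real file use:
#     with open("config.json") as handle:
#         config = json.load(handle)
config = json.loads(
    '{"subscription_id": "0000-1111", '
    '"resource_group": "my-rg", '
    '"workspace_name": "my-ws"}'
)
workspace_args = extract_workspace_args(config)
print(workspace_args)
```

The resulting dictionary has the same shape as the `workspace_args` argument in the cell above, so it could be passed as `create_workspace(workspace_args=workspace_args)`.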

3.3.2 Create a Compute target:

Provide the desired ‘compute_name’ in the cell below.

Refer to the documentation on the ‘create_compute’ method and its parameters.

from fedml_azure import create_compute
compute = create_compute(
    workspace=workspace,
    compute_type='AmlComputeCluster',
    compute_args={
        'vm_size': 'Standard_D12_v2',
        'vm_priority': 'lowpriority',
        'compute_name': '<compute_name>',
        'min_nodes': 0,
        'max_nodes': 4
    })

3.3.3 Create an Environment:

Provide the desired ‘environment_name’ in the cell below. Pass the path of the library file uploaded in step 3.2 to ‘pip_wheel_files’ as a list.

Refer to the documentation on the ‘create_environment’ method and its parameters.

from fedml_azure import create_environment
environment = create_environment(
    workspace=workspace,
    environment_type='CondaPackageEnvironment',
    environment_args={
        'name': '<environment_name>',
        'conda_packages': ['scikit-learn'],
        'pip_wheel_files': ['<path_to_fedml_azure-1.0.0-py3-none-any.whl>']
    })

3.4 Training the model

3.4.1 Instantiate the training class which assigns the resources required for training:

Provide the desired ‘experiment_name’ in the cell below.

Refer to the documentation on the ‘DwcAzureTrain’ class.

from fedml_azure import DwcAzureTrain
train = DwcAzureTrain(
    workspace=workspace,
    environment=environment,
    experiment_args={'name': '<experiment_name>'},
    compute=compute)

3.4.2 Read the federated data from Athena and BigQuery via SAP Data Warehouse Cloud in the training script.

Create a training folder under the Notebooks section of AzureML Studio (link) to hold the files and scripts for training.

Create a training script inside the training folder by referring to the guide. A sample training script can be found here.

Use the following code in the training script to get the federated data from Athena and BigQuery via SAP Data Warehouse Cloud.

import pandas as pd
from fedml_azure import DbConnection

db = DbConnection()
# The query should ideally fetch only the data needed to train the model.
train_data = db.execute_query('<your_query_to_fetch_train_data>')
# Index 0 holds the rows and index 1 the column names.
train_data = pd.DataFrame(train_data[0], columns=train_data[1])

Refer to the documentation for more details on the DbConnection class.
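Since the query should fetch only what the model needs, one option is to build it from an explicit column list rather than using `SELECT *`. The helper below is a hypothetical sketch, not part of the FedML library; the schema, view and column names are made up, and it assumes the target database accepts double-quoted identifiers and a `LIMIT` clause.

```python
def build_select(view_name, columns, schema=None, limit=None):
    """Build a SELECT statement that names only the columns needed for training."""
    if not columns:
        raise ValueError("at least one column is required")

    def quote(identifier):
        # Double any embedded quotes so the identifier stays a single token.
        return '"' + identifier.replace('"', '""') + '"'

    if schema is None:
        target = quote(view_name)
    else:
        target = quote(schema) + "." + quote(view_name)

    query = "SELECT " + ", ".join(quote(c) for c in columns) + " FROM " + target
    if limit is not None:
        query += " LIMIT " + str(int(limit))
    return query


# Made-up names for illustration only.
query = build_select("SALES_VIEW", ["REGION", "AMOUNT"], schema="MY_SCHEMA", limit=1000)
print(query)
```

The resulting string could then be passed to `db.execute_query(query)` in the training script.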

3.4.3 Generate the run config which packages together the configuration information needed to submit a run in Azure ML

Pass the file path of the ‘config.json’ created in step 2.4 to ‘config_file_path’ in the cell below. This configuration file is used to connect to SAP Data Warehouse Cloud.

Pass the path of the training folder created in step 3.4.2 to ‘source_directory’ and the name of the training script to ‘script’.

Pass the name of the model file to be created to ‘model_file_name’, along with any other optional arguments.

Refer to the documentation on the ‘generate_run_config’ method and its parameters.

src = train.generate_run_config(
    config_file_path='<path_to_config_file>',
    config_args={
        'source_directory': '<path_to_training_folder>',
        'script': '<training_script_name>',
        'arguments': [
            '--model_file_name', '<model_file_name>.pkl',
            '--table_name', '<table/view_to_be_queried>'
        ]
    })
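The entries in ‘arguments’ reach the training script as command-line arguments. A minimal sketch of how a training script might read them with the standard argparse module follows; the function name and the sample values are hypothetical, and the actual sample script may handle this differently.

```python
import argparse


def parse_training_args(argv=None):
    """Parse the arguments that the run config passes to the training script."""
    parser = argparse.ArgumentParser(description="Training script arguments")
    parser.add_argument("--model_file_name", required=True,
                        help="File name of the model to write, e.g. model.pkl")
    parser.add_argument("--table_name", required=True,
                        help="Table or view to query for training data")
    # With argv=None, argparse reads sys.argv, which is what a submitted run uses.
    return parser.parse_args(argv)


# Made-up values, passed explicitly so the sketch can run outside a submitted job.
args = parse_training_args(["--model_file_name", "model.pkl", "--table_name", "MY_VIEW"])
print(args.model_file_name, args.table_name)
```

Inside the real script, calling `parse_training_args()` with no argument picks up whatever was listed under ‘arguments’ in the run config.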

3.4.4 Submit the training job with the option to download the model outputs

Refer to the documentation on the ‘submit_run’ method and its parameters.

run=train.submit_run(src,is_download=True)

3.4.5 Register the model

Set ‘model_path’ to ‘outputs/<model_file_name>.pkl’, where ‘model_file_name’ is the name of the .pkl model file specified in step 3.4.3.

Provide the desired ‘model_name’ in the cell below.

Refer to the documentation on the ‘register_model’ method and its parameters.

model = train.register_model(
    run=run,
    model_args={
        'model_name': '<model_name>',
        'model_path': 'outputs/<model_file_name>.pkl'
    },
    resource_config_args={'cpu': 1, 'memory_in_gb': 0.5},
    is_sklearn_model=True)
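For the ‘model_path’ above to resolve, the training script has to write the model file into an ‘outputs’ folder, which Azure ML collects with the run. The sketch below shows one way to do that with the standard pickle module. It is an assumption, not the library's own code: `save_model` is a hypothetical helper, the sample training script may use a different serializer, and a plain dictionary stands in for a trained model.

```python
import os
import pickle
import tempfile


def save_model(model, model_file_name, output_dir="outputs"):
    """Serialize the model into output_dir and return the path that was written."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, model_file_name)
    with open(path, "wb") as handle:
        pickle.dump(model, handle)
    return path


# Demonstration in a temporary folder so nothing is left behind on disk.
# In a training script, output_dir would keep its default value "outputs".
stand_in_model = {"coef": [0.5, 1.5]}
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_model(stand_in_model, "model.pkl",
                            output_dir=os.path.join(tmp, "outputs"))
    with open(saved_path, "rb") as handle:
        restored = pickle.load(handle)

print(restored)
```

With the default `output_dir`, a file name of ‘<model_file_name>.pkl’ ends up at ‘outputs/<model_file_name>.pkl’, matching the ‘model_path’ used when registering the model.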

More information on the use of the library and sample notebooks with the corresponding training scripts can be found here.

In summary, the Federated Machine Learning library provides an effective and convenient way to federate data from multiple data storages, perform cross-platform ETL and train machine learning models on hyperscalers, without the overhead of data migration or replication.

If you have any questions, please reach out in the comments section.