Hands-On Tutorial: Manage & Run Machine Learning Models in R through Docker and in SAP BTP, Kyma runtime

I wrote my bachelor thesis about startup success, where I tried to find the best model to predict whether a startup will be successful or not. Success, in this case, is defined as an event that brings the founders and investors large sums of money through an M&A (Merger and Acquisition) or an IPO (Initial Public Offering). An unsuccessful startup, on the other hand, had to shut down. I trained several logistic regression and random forest models on the dataset to compare which would perform best in classifying the startups. The best of those models was the random forest, which we will be using in this blog post.

If I want to share my model with other data scientists, simply sending it by email is rarely enough, because the environment I used to train the model most likely differs from the environments used by fellow data scientists. To solve this problem we will use Docker to define and share the dependencies of the model. If I want to share my model now, I can share my Docker image to make sure everybody executes the code in the same environment and the model behaves exactly as it did in the environment it was trained in. But that might not be enough.

Imagine sending the model to someone who wants to run a very large dataset through it. This might require a lot of extra processing power and memory, so one Docker container might not be enough. But there’s a solution to that as well. Kubernetes lets us add and manage multiple Docker containers, which gives us the possibility to scale our model to bigger datasets. Since Kubernetes needs a bit of work to set up and handle, we will use Kyma 2.0, which basically manages the Kubernetes environment for us so we can focus on building the model and sharing it easily with our colleagues.

This blog post is inspired by the Hands-On blog post from Yannick Schaper where he explains how to deploy and manage a Machine Learning Model with Python on the SAP BTP, Kyma runtime. Because there are some small but essential differences if you want to deploy your model from R I decided to write this blog post. My intention is to show you an easy start on how to prepare a model in R to be deployed in the free tier option for SAP BTP, Kyma runtime.

I will show you how to deploy an R application with our trained model in Docker, so that you can follow Yannick’s blog post if you want to deploy it on the free tier option for SAP BTP, Kyma runtime. To do this, we need some technologies, for which I will give a brief overview.

In Docker we can build a so-called container. Within that container you can use and specify the required versions of software and libraries for your machine learning model. So even if you are updating the software that is running on your local computer to another version, the code in your container will always run the same way as it did when you set it up in the first place. You could think of the container like a terrarium with its own ecosystem in it. If you don’t have Docker running yet, you can install Docker for Windows under the following link and create an account under Docker Hub.

Kyma is an open-source project built on top of Kubernetes. It is a fully managed Kubernetes-based runtime that allows you to build extensions through microservices and serverless functions. This blog post describes in detail how you can build a Docker image and run it via Kyma as part of the SAP BTP.

Beakr is a minimalist web framework which enables you to build web services in R. With Beakr we will develop a small application that provides the prediction for a newly observed startup. Beakr also enables us to handle API requests, which we need in order to send the new startup data to our model and to return the prediction.

Further I recommend having a quick look at the following blog posts to get a brief introduction to the technologies we will be using in this blog post:

What you will learn in this blog post:

  1. Create an API endpoint to be able to communicate with the application and the trained model
  2. Containerize the R script in a Docker image with all the requirements and dependencies
  3. Upload the Docker image to Docker Hub to enable access to the image

In case you want to take a look at the used code I’ve set up a GitHub repository from which you can download all necessary files to follow this blog post.

Let’s start by creating our R script. In this case I will name the file “Startup.R”. To get our trained model running, we first have to import the required libraries and load the model from our working directory. Next we define a function that takes the data provided by the GET request, converts the variables into formats the model can handle and builds a dataframe for the observation. The function then calculates the prediction for the observation and returns the result. Finally, we create a new Beakr server with the API endpoint “predict” that listens on port 8001. This is where our function will be executed for each incoming GET request.

# load required libraries
library(beakr)
library(mlr)

# load trained model named 'rforest' from the current working directory
load("RandomForest.RData")

# function that gets all the parameters of an observation, combines them in a
# dataframe, calculates a prediction with the previously loaded model and returns
# the result within a string
get_prediction = function(longitude, latitude, affy, alfy, afmy, almy,
                          relationships, funding_rounds, funding_total_usd,
                          milestones, is_CA, is_NY, is_MA, is_otherstate,
                          is_software, is_web, is_mobile, is_enterprise,
                          is_advertising, is_othercategory, has_VC, has_angel,
                          has_roundA, has_roundB, has_roundC, has_roundD,
                          is_top500){
  # combine parameters in data.frame
  startup_obs = data.frame(
    # latitude of the startup
    latitude = as.numeric(as.vector(latitude)),
    # longitude of the startup
    longitude = as.numeric(as.vector(longitude)),
    # age in years at the first funding round
    age_first_funding_year = as.numeric(as.vector(affy)),
    # age in years at the last funding round
    age_last_funding_year = as.numeric(as.vector(alfy)),
    # age in years at reaching the first milestone
    age_first_milestone_year = as.numeric(as.vector(afmy)),
    # age in years at reaching the last milestone
    age_last_milestone_year = as.numeric(as.vector(almy)),
    # number of relationships (ex. accountants, vendors, investors, mentors,...)
    relationships = as.integer(as.vector(relationships)),
    # number of executed funding rounds
    funding_rounds = as.integer(as.vector(funding_rounds)),
    # overall raised money during funding rounds
    funding_total_usd = as.numeric(as.vector(funding_total_usd)),
    # number of reached milestones
    milestones = as.integer(as.vector(milestones)),
    # 1 if startup is located in California, else 0
    is_CA = factor(is_CA, levels = c(0, 1)),
    # 1 if startup is located in New York, else 0
    is_NY = factor(is_NY, levels = c(0, 1)),
    # 1 if startup is located in Massachusetts, else 0
    is_MA = factor(is_MA, levels = c(0, 1)),
    # 1 if startup is located in another state, else 0
    is_otherstate = factor(is_otherstate, levels = c(0, 1)),
    # 1 if startup is working in the software industry, else 0
    is_software = factor(is_software, levels = c(0, 1)),
    # 1 if startup is working in the web industry, else 0
    is_web = factor(is_web, levels = c(0, 1)),
    # 1 if startup is working in the mobile industry, else 0
    is_mobile = factor(is_mobile, levels = c(0, 1)),
    # 1 if startup is working in the enterprise industry, else 0
    is_enterprise = factor(is_enterprise, levels = c(0, 1)),
    # 1 if startup is working in the advertising industry, else 0
    is_advertising = factor(is_advertising, levels = c(0, 1)),
    # 1 if startup is working in another industry, else 0
    is_othercategory = factor(is_othercategory, levels = c(0, 1)),
    # 1 if startup has raised venture capital, else 0
    has_VC = factor(has_VC, levels = c(0, 1)),
    # 1 if startup has a business angel, else 0
    has_angel = factor(has_angel, levels = c(0, 1)),
    # 1 if startup conducted 1 funding round, else 0
    has_roundA = factor(has_roundA, levels = c(0, 1)),
    # 1 if startup conducted 2 funding rounds, else 0
    has_roundB = factor(has_roundB, levels = c(0, 1)),
    # 1 if startup conducted 3 funding rounds, else 0
    has_roundC = factor(has_roundC, levels = c(0, 1)),
    # 1 if startup conducted 4 funding rounds, else 0
    has_roundD = factor(has_roundD, levels = c(0, 1)),
    # 1 if startup was ranked among the top 500, else 0
    is_top500 = factor(is_top500, levels = c(0, 1)))

  # calculate prediction for the observed startup
  pred = predict(rforest, newdata = startup_obs)

  # return the prediction
  return(paste0("The prediction for the observed startup is: ",
                as.numeric(as.vector(pred$data$response)), "\n",
                "With a probability of ", 100 * pred$data$prob.1,
                "% for a prediction of a successful startup (Threshold is 50%)"))
}

# generate new beakr server
newBeakr() %>%
  # Respond to GET requests at the "/predict" route
  httpGET(path = "/predict", decorate(get_prediction)) %>%
  # Handle any errors with a JSON response
  handleErrors() %>%
  # Start the server on port 8001
  listen(host = "0.0.0.0", port = 8001)

To test the API endpoint, we can use Postman, for example. Insert the hostname and port you used to start the Beakr server and execute the following GET request:

http://<hostname>:<port>/predict?longitude=-75&latitude=43&affy=3&alfy=5&afmy=2&almy=4&relationships=7&funding_rounds=5&funding_total_usd=500000&milestones=4&is_CA=0&is_NY=1&is_MA=0&is_otherstate=0&is_software=1&is_web=0&is_mobile=0&is_enterprise=0&is_advertising=0&is_othercategory=0&has_VC=1&has_angel=1&has_roundA=1&has_roundB=1&has_roundC=1&has_roundD=1&is_top500=1

You should now see a response containing the prediction and the probability of a successful startup.

At this point we have already managed to load the model and generate a prediction for an observed startup by sending its data to the API via a GET request.
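
If you prefer the command line over Postman, the same request can be sketched with curl. This is only a minimal sketch, assuming the Beakr server from “Startup.R” is reachable at http://localhost:8001. The helper build_predict_url is a hypothetical name introduced here for illustration, and the parameter values are the ones from the example request above.

```shell
#!/bin/sh
# Hypothetical helper: join key=value pairs into the query string of /predict.
build_predict_url() {
  base="$1"; shift
  # "$*" joins all remaining arguments with the first character of IFS ('&')
  query=$(IFS='&'; printf '%s' "$*")
  printf '%s/predict?%s\n' "$base" "$query"
}

# Assumption: the server listens on localhost:8001, as configured in Startup.R
url=$(build_predict_url "http://localhost:8001" \
  longitude=-75 latitude=43 affy=3 alfy=5 afmy=2 almy=4 \
  relationships=7 funding_rounds=5 funding_total_usd=500000 milestones=4 \
  is_CA=0 is_NY=1 is_MA=0 is_otherstate=0 \
  is_software=1 is_web=0 is_mobile=0 is_enterprise=0 \
  is_advertising=0 is_othercategory=0 \
  has_VC=1 has_angel=1 has_roundA=1 has_roundB=1 has_roundC=1 has_roundD=1 \
  is_top500=1)

# Show the assembled request URL
printf '%s\n' "$url"
```

With the server running, `curl -s "$url"` would send the request and print the string returned by get_prediction. Keeping the parameters as separate key=value arguments makes it easier to change a single value than editing one long URL by hand.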

The next step is to containerize the script by creating a Docker image with the specified requirements and the actions needed to deploy the application.

The Dockerfile looks like the following. You can create it by creating a new text document and removing the “.txt” extension after saving it. Please keep in mind that the file must be named “Dockerfile”.

# Use R version 3.6.3 as base image
FROM rocker/r-ver:3.6.3

# Install required ubuntu libraries for 'mlr'
RUN apt-get update -qq && apt-get install -y \
    libgdal-dev libgeos-dev libproj-dev r-cran-udunits2 libgsl-dev \
    libgmp-dev libglu-dev r-cran-rjags libmpfr-dev libopenmpi-dev

# Install required libraries
RUN R -e "install.packages('beakr')"
RUN R -e "install.packages('mlr')"
RUN R -e "install.packages('randomForest')"

# Expose the used port from beakr
EXPOSE 8001

# Load script with model
ADD . /app

# Set current working directory to the added app directory
WORKDIR /app

# Run the R script that contains the application
CMD ["Rscript", "./Startup.R"]

In the Dockerfile we use R version 3.6.3 as the base image. Because the library “mlr” requires some Ubuntu libraries to work properly, we need to install those before installing “mlr”. You can find the required system libraries on the CRAN page of each library, for example by visiting the CRAN page of “mlr” and looking for “System Requirements”. Finally, we expose the port of our Beakr app and run it. Make sure you have all the needed files in one folder, including the R script, the Dockerfile and the machine learning model. Please be aware that this Dockerfile always installs the latest versions of the libraries “mlr”, “randomForest” and “beakr”. Later updates of these libraries could lead to unexpected errors due to version differences. In this blog post the following versions are being used:

  • mlr: 2.19.0
  • randomForest: 4.7-1
  • beakr: 0.4.3

If you want to make sure you always work in the same environment, I suggest you have a look at the package “remotes”, which enables you to install specific versions of the required libraries.
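
As a rough sketch of that idea, the following shell snippet prints one install command per library, pinned to the versions listed above and using remotes::install_version. The CRAN mirror URL and the helper name pinned_install_cmd are assumptions made for illustration and are not part of the original setup.

```shell
#!/bin/sh
# Assumption: CRAN mirror to install from (adjust as needed)
CRAN_REPO="https://cloud.r-project.org"

# Hypothetical helper: print the R command line that installs one package
# at a fixed version via remotes::install_version().
pinned_install_cmd() {
  pkg="$1"; version="$2"
  printf "R -e \"remotes::install_version('%s', version = '%s', repos = '%s')\"\n" \
    "$pkg" "$version" "$CRAN_REPO"
}

# 'remotes' itself has to be installed before it can be used
printf "R -e \"install.packages('remotes', repos = '%s')\"\n" "$CRAN_REPO"

# Versions used in this blog post
pinned_install_cmd mlr 2.19.0
pinned_install_cmd randomForest 4.7-1
pinned_install_cmd beakr 0.4.3
```

Each printed line could be placed after RUN in the Dockerfile in place of the corresponding install.packages call. Note that this pins only the listed libraries; their own dependencies are still resolved when the image is built.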

Let’s open the command prompt in our folder to build the image. On Windows this can be done by clicking on the address bar of the folder and typing “cmd”.

The next step is to build the image by typing:

docker build -t ml-app .

Please be aware that this might take some time.

To run the container we execute the following command.

docker run -p 8001:8001 ml-app

At this point you can check if everything is working so far by sending a GET request with Postman as we did before.

The next step is to upload the image to Docker Hub. This enables us to access it from multiple servers and devices.

For this you need to log in by executing the following command and entering your login credentials.

docker login

Next, we tag the Docker image.

docker tag ml-app <yourDockerID>/ml-app-image

Now you can push the image to your Docker repository by executing

docker push <yourDockerID>/ml-app-image

In your Docker Hub account you should be able to see the image if the push was successful.

At this point we have managed to containerize our application and push it to our Docker Hub account. The next step is to deploy the model on the SAP BTP, Kyma runtime.

To do this, please navigate to your Kyma environment in the SAP BTP cockpit, choose “Namespaces” and create a new one.

Afterwards click on “Deploy new workload” and choose “Create deployment”.

Navigate to “Advanced” and enter a name for the deployment. Then add the Docker image, given as your Docker ID and the name of the image separated by a “/”. Because the application needs a bit more memory than the default settings provide, we have to change them. Here I will change the values of “Memory Request” and “Memory Limits” to 1G. In addition, we have to expand “Service”, tick the checkbox and change “Port” and “Target Port” to 8001, which is the port used by our application.

Then click “Create” and wait until the deployment is running successfully. You can click on the Pod to see further details.
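
For readers who prefer the command line, the same settings could be sketched with kubectl instead of the dashboard. This assumes kubectl is installed and configured with the kubeconfig of your Kyma cluster. The namespace ml-demo and the deployment name ml-app are hypothetical, and DOCKER_ID stands for your Docker ID. The snippet only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical names; replace them with your own values
NAMESPACE="ml-demo"
DEPLOYMENT="ml-app"
DOCKER_ID="${DOCKER_ID:-yourDockerID}"
IMAGE="${DOCKER_ID}/ml-app-image"

# Print the command by default; execute it only when DRY_RUN=0
run() {
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  else
    printf '%s\n' "$*"
  fi
}

deploy_ml_app() {
  # Create the namespace and a deployment that listens on the Beakr port
  run kubectl create namespace "$NAMESPACE"
  run kubectl create deployment "$DEPLOYMENT" --image="$IMAGE" --port=8001 -n "$NAMESPACE"
  # Raise memory request and limit to 1G, as in the dashboard settings
  run kubectl set resources deployment "$DEPLOYMENT" --requests=memory=1G --limits=memory=1G -n "$NAMESPACE"
  # Expose the deployment as a service on port 8001
  run kubectl expose deployment "$DEPLOYMENT" --port=8001 --target-port=8001 -n "$NAMESPACE"
}

deploy_ml_app
```

Running the script as is lists the four kubectl commands so you can review them first. The API Rule is a Kyma-specific resource and is not covered by this sketch.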

Next we have to create a new API Rule to be able to communicate with the application. To do so, select “API Rules” in the navigation menu on the left side and create a new API Rule.

Add a name and a subdomain, both in lowercase letters, and choose your service.

Copy the API rule to your clipboard and use it to execute a new GET request with Postman.

<your copied API rule>/predict?longitude=-75&latitude=43&affy=2&alfy=5&afmy=2&almy=4&relationships=7&funding_rounds=5&funding_total_usd=500000&milestones=4&is_CA=0&is_NY=1&is_MA=0&is_otherstate=0&is_software=1&is_web=0&is_mobile=0&is_enterprise=0&is_advertising=0&is_othercategory=0&has_VC=1&has_angel=1&has_roundA=1&has_roundB=1&has_roundC=1&has_roundD=1&is_top500=1

(Be careful when copying the request. It might have some unexpected characters (space, tab or new line) added to the end. You can simply remove them when editing the GET request in Postman.)

At this point, we have managed to build an application with R, deploy it in a local Docker container and upload the image to our Docker Hub account. We then used this image to deploy our application in SAP BTP, Kyma runtime and sent new observations to the application to create new predictions.

I hope this blog post helps you to successfully deploy your own R applications in SAP BTP, Kyma runtime. If you have any questions, please feel free to comment on this blog post or to post a question under the main tag (SAP BTP, Kyma runtime). Further, if you think this was helpful, feel free to like this blog post.

Finally, I want to thank Yannick Schaper and Sarah Detzler for their support while writing this blog post, and I wish you a successful and happy deployment.

Cheers,

Patrick