EDUCAÇÃO E TECNOLOGIA

SAP Data Intelligence : Hive Custom Operator (Read) for Secured Kerberos Connection


This blog post explains how we Built a Custom Operator to incorporate python hivejdbc in the SAP Data Intelligence Pipeline Modeler (Version: 3.1-2010) by building a docker image that includes all the pre-rec for the hivejdbc

Background:

We got an opportunity to work for a large Bank in Philippines who were looking for model risk management and governance requirement. SAP proposed below solution, Integrating with HIVE on Cloudera was a keystone to the project ,that too using Https and Kerberos authentication over VPN tunnel. Since there is no standard operator we were exploring the options on how we can write a python code or build a custom operator to address this requirement, and then we came across the product team git repository https://github.com/SAP-samples/datahub-integration-examples/tree/master/HiveOperator.

Architecture

Architecture

Got very excited that there is an already a working operator that we can quickly solve the customer problem, but our excitement was short lived. Even though the product team had made effort to build a working operator the operator lacks the security aspect of the connection to be deployed in a Production environment. The customer was not willing to compromise on the security and they were only allowing an SSL connection with Kerberos.

We used the product team’s Custom Operator as a Base and started addressing the missing Security and SSL connection.

If you want to know more about the business requirement, architecture or the deployed solution as a whole , they you can reach out to me or Ankit Garg at ankit.garg03@sap.com

The Journey Began !!!!

We explored multiple python hive implementation that make use of Kerberos and SSL and finally zeroed-in on https://pypi.org/project/hivejdbc/ .

Even though the documentation tried to simplify things, we had tough time figuring out the way we can build and test in SAP Data Intelligence. Especially the docker image need to have lot of pre-rec mentioned in the documentation .

The Docker Image

The toughest part was to get a docker image with all the pre-rec especially the Java runtime.

Got the standard python Docker image (More info on the standard Python Docker can be found here : https://hub.docker.com/_/python) and started adding the missing pre-rec. Here is the final docker image I came up with. Building Docker image in SAP DI is bit tricky as there is not much logs to look at. I would recommend to you to use Docker on your Laptop to check the build and then if the build works on your Laptop then you can use the same build file in SAP DI.

Note: Laptop in my case its Windows, You may try on Mac / Linux also.

————————————————————————————————————————————–

FROM python:3.6.4-slim-stretch ENV DEBIAN_FRONTEND=noninteractive RUN mkdir -p /usr/share/man/man1mkdir -p /usr/share/man/man1 && \
apt-get update && \
apt install -y python3-pip && \
apt-get install -y python3-dev && \
apt-get install -y krb5-user && \
apt-get install -y libsasl2-dev && \
apt-get install -y libsasl2-modules-gssapi-mit && \
apt-get install -y libsasl2-2 && \
apt-get install -y openjdk-8-jre-headless && \
apt-get install -y openjdk-8-jdk && \
apt-get install -y wget && \
apt-get install -y zip unzip && \
mkdir /hive_operator && \
mkdir /keytabs && \
mkdir /usr/local/hadoop && \
mkdir /usr/local/hive RUN pip3 install pandas 'six==1.12.0' 'bit_array==0.1.0' 'thrift==0.9.3' 'thrift_sasl==0.2.1' 'sasl==0.2.1' 'hivejdbc' 'tornado' RUN wget -P /usr/local/ https://repo1.maven.org/maven2/org/apache/hive/hive-jdbc/2.1.1/hive-jdbc-2.1.1-standalone.jar RUN wget -P /usr/local/ https://github.com/timveil/hive-jdbc-uber-jar/releases/download/v1.8-2.6.3/hive-jdbc-uber-2.6.3.0-235.jar #Intall Beeline - This is just to test the connectivity ( not really required in the image)
RUN wget -P /usr/local/ https://archive.apache.org/dist/hadoop/core/hadoop-2.7.3/hadoop-2.7.3.tar.gz
RUN wget -P /usr/local/ https://archive.apache.org/dist/hive/hive-2.1.1/apache-hive-2.1.1-bin.tar.gz
RUN tar -xvzf /usr/local/hadoop-2.7.3.tar.gz -C /usr/local/hadoop/
RUN tar -xvzf /usr/local/apache-hive-2.1.1-bin.tar.gz -C /usr/local/hive/ ENV HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.3
ENV HIVE_HOME=/usr/local/hive/apache-hive-2.1.1-bin
ENV PATH=$PATH:$HIVE_HOME/bin RUN groupadd -g 1972 vflow && useradd -g 1972 -u 1972 -m vflow
USER 1972:1972 ENV HOME=/home/vflow
ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/jre/
ENV CLASSPATH=/usr/local/* Tags: { "hive_python": "", "python36": "\"\"", "tornado": "5.0.2"
}

Custom Operator

We used Python based Operator and we build all the necessary interface to HIVE using Kerberos authentication using krb.

Refer the following link to create the custom operator.

https://help.sap.com/viewer/29ff74dc606c41acad117003f6034ac7/1.2.latest/en-US/049d2f3cc69c4281a3f4570c0d2d066e.html

Here are some screenshot of the Custom Operator created using Base Python Operator.

Operator-Ports

Operator-Ports

Operator-Tags

Operator-Tags

Operator-Config

Operator-Config

Operator-Script

Operator-Script

You can refer the script from the link : https://github.com/SAP-samples/datahub-integration-examples/blob/master/HiveOperator/src/vrep/vflow/subengines/com/sap/python36/operators/examples/HiveOperator/script.py

************************************************************************************************************

You need to upload the Kerberos krb , truststore.jks and keytab file to the docker image via Operator Upload option to the docker image.

The files will be uploaded to default /vrep/vflow/subengines/com/sap/python36/operators/ubp/com/sap/python36/hive/ location on the docker image. You can move them to any location using the python script in the Operator.

Operator-FileUploadOption

Operator-FileUploadOption

Test Graph

With the Docker Image and the Operator ready we can now test the Operator. The Operator will take the input from Terminal as SQL statement and send the response from the HIVE to be written the the Terminal and File Location.

TestGraph

TestGraph

Observation

  • The script in the operator shared in this blog post is very minimal and it is just intend to test the connection to HIVE and see if the HIVE respond back to our SQL request.
  • Please note that the script don’t check for any syntax and hence the graph will fail if the SQL syntax is incorrect.
  • You will not get the header for the output, if you need the output with the header you need to add some more logic in the script using DESCRIBE TableName and then append to the result set.
  • Never run Select * on a table having huge volume of data without out any filter, the query may not return/respond and graph might fail.
  • This Operator and Graph was primarily designed for Read from HIVE, however this can be expanded to carry out INSERT bulk data from a File to HIVE. We made many enhancement to the python code, you can reach out to me in comments if you need more info.
  • One bug was observed : If we have any NULL value in the records then the SELECT output after the NULL value were set to NULL. e.g if the table had 100 records and there was a NULL value in one of the column at records 50 then all records after records 50 were set to NULL. We over came this issue by using nvl(columname,0) in the SELECT statement.

To Conclude

SAP Data Intelligence Cloud with its Open Technology support was a great tool to build a working Operator with all the necessary security aspect and able to integrate HIVE data with other source data and successfully address the critical Business Use case.