SAP Community Question Analysis (E-Mail automation)

In the first blog post of this series, we looked at how to extract data from the SAP Community question forum and store it in a database. The extracted questions are either answered or unanswered. Distributing the open questions to the appropriate people previously required manual work, so in this post we automate sending e-mails with open questions to the right persons. Special thanks go to Thorsten Hapke, who came up with the initial idea for the e-mail automation.

This blog post shows the implementation of an e-mail automation based on the extracted questions that are unanswered. The goal of the e-mail automation is to send each open question to the right person, meaning the person with the highest probability of being able to answer it. There are various ways to implement this goal. I decided to take the set of questions an author has already answered and merge these questions with their extracted keywords. This merge gives an overview of the topics on which the author has already answered questions. An overview of the complete pipeline is shown here:

Overview of Data Intelligence Pipeline for e-mail automation

In the following sections, the exact procedure implemented in SAP Data Intelligence will be outlined.

For the data aggregation, I distinguish two parts. The first part combines the authors who answered a question with the keywords associated with that question. The second part gathers the questions that are unanswered.

Answer author and question keywords

For the first part of the data foundation, we use the Table Consumer operator and the Data Transform operator. The Table Consumer operator extracts the data from the database in which it has been stored. As we want to get data from two tables, we use two Table Consumer operators. For both we specify the connection where the data resides and define which table we want to extract.

After extracting the two tables, we want to join them. For that I used the Data Transform operator, which makes it easy to combine and transform different data sets. A detailed overview of this operator and of the Table Consumer operator can be found in the following blog post from Jens Rannacher.

Data Transformation to enrich answer author data with keywords

The two data sets are joined with a left join on the field QUESTIONID. For the data passed to the next operator we select three columns: QUESTIONID, AUTHORNAME and KEYWORD. These are the columns used by our e-mail creation operator.

Join Configuration of AnswerAuthor table with Keyword table
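
For readers who think in pandas rather than in the graphical operator, the same left join can be sketched roughly as follows. The small tables and their values are made up for illustration. Only the column names QUESTIONID, AUTHORNAME and KEYWORD come from the pipeline, and it is an assumption that the answer author table is the left side of the join.

```python
import pandas as pd

# Made-up sample rows; only the column names mirror the pipeline.
answer_authors = pd.DataFrame({
    "QUESTIONID": [101, 102, 103],
    "AUTHORNAME": ["alice", "bob", "alice"],
})
keywords = pd.DataFrame({
    "QUESTIONID": [101, 101, 102],
    "KEYWORD": ["pipeline", "docker", "hana"],
})

# Left join: every answer row is kept, even if no keyword matches it.
joined = answer_authors.merge(keywords, on="QUESTIONID", how="left")

# Keep only the three columns passed on to the next operator.
joined = joined[["QUESTIONID", "AUTHORNAME", "KEYWORD"]]
print(joined)
```

Question 103 has no keyword in this toy data, so the left join keeps the row with an empty KEYWORD value instead of dropping it.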

The next section covers the second input and explains the configuration used to extract open questions.

Unanswered questions

To extract the unanswered questions we use the Table Consumer operator and configure it similarly to the configuration outlined in the section "Answer author and question keywords". For the extraction we use the questiondata table. To look only at open questions, we can define a filter directly within the Table Consumer operator: we select the column Answered and enter the string No after the equals sign, so that only questions that nobody has answered or commented on are kept. The following picture shows the filter configuration.

Filter configuration to extract unanswered questions
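
Expressed in pandas, the filter corresponds roughly to the following sketch. The sample rows are invented, and the upper-case column name ANSWERED is an assumption made to match the other column names in this post.

```python
import pandas as pd

# Invented sample of the questiondata table; ANSWERED holds "Yes" or "No".
questions = pd.DataFrame({
    "QUESTIONID": [101, 102, 103],
    "ANSWERED": ["Yes", "No", "No"],
})

# Keep only the rows whose ANSWERED column equals the string "No".
open_questions = questions[questions["ANSWERED"] == "No"]
print(open_questions)
```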

The next section explains in detail how, based on these two extracted tables, the search for the author who is best placed to answer a question can be implemented.

Preparation

The process for generating automated e-mails is based on a Python script that is deployed in the Python operator. The script is divided into three parts: the first prepares the input, the second identifies the author who is most likely to be able to answer the question, and the third prepares the e-mail. To run the Python operator we needed to install the following packages via this Dockerfile:

FROM $com.sap.sles.base
RUN python3 -m pip install --user pandas
RUN python3 -m pip install --user torch
RUN python3 -m pip install --user sentence-transformers

Preparing Input

The Python operator uses two inputs. One input carries the unanswered questions, the other the data on which questions each author has answered. Both tables are converted into a format that pandas can read into a data frame. For both inputs the data is wrapped in a StringIO object (from Python's io module), and the pandas function read_csv then creates the data frame. The template for reading an input looks as follows.

sampleData = StringIO(input1.body)
sampleDf = pd.read_csv(sampleData, names = [column1, column2, column3, …, column n])
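
A runnable variant of this template could look like the sketch below. The string body is a made-up stand-in for input1.body, and the three column names are the ones selected in the join above. The actual inputs may have more columns.

```python
from io import StringIO

import pandas as pd

# Stand-in for input1.body: header-less CSV text as it might arrive at the port.
body = "101,alice,pipeline\n102,bob,hana\n"

# Wrap the text in a file-like object so read_csv can consume it.
sample_data = StringIO(body)

# The CSV has no header row, so the column names are passed explicitly.
sample_df = pd.read_csv(sample_data, names=["QUESTIONID", "AUTHORNAME", "KEYWORD"])
print(sample_df)
```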

To match the author ID with the author's e-mail address we create a pandas data frame that holds the author ID in one column and the corresponding e-mail address in the other.
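
Such a lookup table might be sketched as follows. The column names AUTHORNAME and EMAIL and the addresses are hypothetical placeholders. Converting the frame into a plain dictionary is my own convenience here, not necessarily how the original script does it.

```python
import pandas as pd

# Hypothetical authors and addresses, for illustration only.
author_mails = pd.DataFrame({
    "AUTHORNAME": ["alice", "bob"],
    "EMAIL": ["alice@example.com", "bob@example.com"],
})

# A dictionary makes the later lookup of one author's address a one-liner.
mail_lookup = author_mails.set_index("AUTHORNAME")["EMAIL"].to_dict()
print(mail_lookup["bob"])
```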

After preparing the data, we look at how the matching of the most fitting author is realized.

Selecting author

To select the right person to answer an open question we rely on the keywords associated with an author and with a question. To compare keywords semantically, we need to turn each of them into a vector. For this project we use the all-mpnet-base-v2 model, which we load via the sentence-transformers library:

model = SentenceTransformer('all-mpnet-base-v2')

To get all keywords associated with one author, we take the input containing the answer authors, group it by author (the AUTHORNAME column) and turn the keywords of each group into a list.

keywordAuthorList = answerData.groupby('AUTHORNAME').KEYWORD.apply(list)
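
To make the result of this grouping tangible, here is the same statement applied to a few invented rows. The data is made up; the column names are the ones used in the pipeline.

```python
import pandas as pd

# Invented rows in the shape of the joined author/keyword table.
answer_data = pd.DataFrame({
    "AUTHORNAME": ["alice", "alice", "bob"],
    "KEYWORD": ["pipeline", "docker", "hana"],
})

# One entry per author, holding all of that author's keywords as a list.
keyword_author_list = answer_data.groupby("AUTHORNAME").KEYWORD.apply(list)

for author, author_keywords in keyword_author_list.items():
    print(author, author_keywords)
```

If the left join produced rows without a keyword, dropping them beforehand with dropna(subset=["KEYWORD"]) keeps empty values out of the lists.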

To assign to each author the questions he or she should answer, we create a Python dictionary that uses the author ID as key and stores the question IDs as a list under the respective key. To fill it, we iterate over all currently open questions. The first step of each iteration is to encode the question's list of keywords into vector representations using the loaded model. After that, we want to determine which author is the most suitable one for the question.

To determine the most suitable author for a question we iterate over all authors who agreed to participate in the automatic e-mail sending. Their associated keywords are transformed into vector representations in the same way. With these vectors we compute the cosine similarity between the author keywords and the question keywords, which yields a matrix with the similarity of every question keyword to every author keyword. To limit the impact of keywords from another semantic area we only keep the five highest similarity scores for each question keyword. These top scores are averaged to estimate how similar the author's previously answered questions are to the open question. The resulting score is then stored in an author dictionary that maps every author to their similarity score.
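
The scoring idea can be tried out without downloading the model. The following sketch uses NumPy instead of torch and sentence-transformers, and tiny hand-made 2-D vectors instead of real embeddings, so it illustrates the calculation rather than reproducing the original script. The function name top_k_mean_similarity is hypothetical.

```python
import numpy as np


def top_k_mean_similarity(question_vecs, author_vecs, k=5):
    """Average the k highest cosine similarities of each question keyword."""
    # Normalise the rows so that a dot product equals the cosine similarity.
    q = question_vecs / np.linalg.norm(question_vecs, axis=1, keepdims=True)
    a = author_vecs / np.linalg.norm(author_vecs, axis=1, keepdims=True)
    scores = q @ a.T  # shape: (question keywords, author keywords)
    # An author may have fewer than k keywords, so cap k accordingly.
    k = min(k, scores.shape[1])
    # Sort each row ascending and keep its k largest values.
    top = np.sort(scores, axis=1)[:, -k:]
    return float(top.mean())


# Toy "embeddings"; real ones would come from model.encode(...).
question_vecs = np.array([[1.0, 0.0]])
author_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

score = top_k_mean_similarity(question_vecs, author_vecs, k=2)
print(round(score, 4))
```

With k=2 the unrelated second author keyword (similarity 0) is ignored, which is exactly the effect the top-5 restriction is meant to have.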

To decide which author gets the open question forwarded, we do not simply take the author with the highest similarity score. Before selecting the author, we apply a threshold (0.25 in the script below) that filters out authors without a sufficiently high similarity score. If no author's score exceeds the threshold, the question is assigned to an internal distribution list of Data Intelligence experts. Otherwise we select the remaining author with the highest similarity score as the preferred answerer. Finally, the question ID is appended to the author's entry in a Python dictionary called questionDict, which has the following structure:

{
    'authorId1': [questionId1, questionId2, questionId3],
    'authorId2': [questionId4, questionId7],
    …
}

In the following you see the full script implementing the author question match.

import torch
from sentence_transformers import util

# questionDict is assumed to exist already, e.g. as collections.defaultdict(list)
for index, question in notAnsweredQuestions.iterrows():
    questionId = question[['QUESTIONID']].values[0]
    keywordList = question[['KEYWORD1', 'KEYWORD2', 'KEYWORD3', 'KEYWORD4', 'KEYWORD5']].to_list()
    questionKeywordEmbedding = model.encode(keywordList, convert_to_tensor=True)

    # Score every participating author against this question
    authorDict = {}
    for authorName, authorKeywords in keywordAuthorList.items():
        authorKeywordEmbedding = model.encode(authorKeywords, convert_to_tensor=True)
        cosine_scores = util.pytorch_cos_sim(questionKeywordEmbedding, authorKeywordEmbedding)
        # Mean of the top matches per question keyword; k is capped because
        # torch.topk fails when an author has fewer than 5 keywords
        k = min(5, cosine_scores.shape[1])
        author_cosine_score = torch.mean(torch.topk(cosine_scores, k)[0])
        authorDict[authorName] = author_cosine_score.item()

    # Keep only authors above the similarity threshold
    authorDict = {key: value for key, value in authorDict.items() if value > 0.25}
    if not authorDict:
        relevantAuthor = 'blackbelts'  # fallback: internal distribution list
    else:
        relevantAuthor = max(authorDict, key=authorDict.get)
    questionDict[relevantAuthor].append(questionId)
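
The threshold-and-fallback step of the script can also be looked at in isolation. The sketch below uses invented similarity scores. The helper name pick_author and the use of defaultdict for the result are my assumptions, while the threshold 0.25 and the fallback name blackbelts are taken from the script.

```python
from collections import defaultdict

THRESHOLD = 0.25         # similarity threshold used in the script
FALLBACK = "blackbelts"  # internal distribution list used in the script


def pick_author(author_scores, threshold=THRESHOLD, fallback=FALLBACK):
    """Return the best author above the threshold, or the fallback list."""
    candidates = {a: s for a, s in author_scores.items() if s > threshold}
    if not candidates:
        return fallback
    return max(candidates, key=candidates.get)


# Invented scores per open question, standing in for the model output.
scores_per_question = {
    101: {"alice": 0.61, "bob": 0.40},
    102: {"alice": 0.10, "bob": 0.20},
}

# defaultdict(list) creates the list on the first append for a new author.
question_dict = defaultdict(list)
for question_id, scores in scores_per_question.items():
    question_dict[pick_author(scores)].append(question_id)

print(dict(question_dict))
```

Question 102 ends up with the distribution list because neither author reaches the threshold.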

Preparing e-mail

After the dictionary is filled with all authors and their open questions, we prepare the e-mails by iterating over its keys. For each open question of an author we append the question title and the link to the question to the mail content, one line per question. Once the content is complete, the JSON structure required by the Send E-Mail operator is filled and stored in a variable, in our case called mail. We then send this structure to the downstream Send E-Mail operator with the following statement:

api.send("outputQuestionAuthor", mail)
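
How the mail content for one author could be assembled is sketched below. The helper build_mail_content, the lookup structure with title and link, the line layout and the example.com links are all hypothetical. The original script may format the lines differently.

```python
def build_mail_content(question_ids, questions):
    """Create one line per open question: the title followed by the link."""
    lines = []
    for question_id in question_ids:
        question = questions[question_id]
        lines.append(f"{question['title']} ({question['link']})")
    return "\n".join(lines)


# Invented question details, keyed by question ID.
questions = {
    101: {"title": "Scheduling a graph", "link": "https://example.com/q/101"},
    102: {"title": "Custom Dockerfile tags", "link": "https://example.com/q/102"},
}

mail_content = build_mail_content([101, 102], questions)
print(mail_content)
```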

After all authors are iterated over, a success message is sent to a graph terminator that stops the execution of the pipeline.

The Send E-Mail operator provides the functionality to send out e-mails. Yuliya Reich wrote a detailed blog post about the Send E-Mail operator, which can be found here. The operator needs an existing connection of type SMTP and expects the following JSON structure as input:

mail = {
    "Attributes": {
        "email.from": yourEmail,
        "email.to": [authorMail],
        "email.subject": "Open SAP Community Questions"
    },
    "Body": mailContent
}
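
Put together, filling this structure for every author and handing it to the output port might look like the following sketch. The function send is only a stand-in that records messages, because api.send exists solely inside the Python operator. The helper build_mail, the addresses and the mail texts are hypothetical.

```python
def build_mail(sender, recipient, content):
    """Fill the structure expected by the Send E-Mail operator."""
    return {
        "Attributes": {
            "email.from": sender,
            "email.to": [recipient],
            "email.subject": "Open SAP Community Questions",
        },
        "Body": content,
    }


sent = []


def send(port, message):
    """Stand-in for api.send: records the message instead of sending it."""
    sent.append((port, message))


# Hypothetical addresses and prepared mail texts per author.
mail_lookup = {"alice": "alice@example.com", "bob": "bob@example.com"}
mail_contents = {"alice": "Question A (link A)", "bob": "Question B (link B)"}

for author, content in mail_contents.items():
    mail = build_mail("me@example.com", mail_lookup[author], content)
    send("outputQuestionAuthor", mail)

print(len(sent), "mails prepared")
```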

After an e-mail has been sent successfully, the operator sends a success message containing the recipients of the e-mail to its output.

This blog post showed one way to implement an automated workflow that needs to be executed periodically. Data Intelligence not only lets you transfer data but also lets you implement workflows that were previously carried out manually. This is the second post of the blog series on SAP Community analysis with SAP Data Intelligence. The third post of the series covers the visualization and clustering of the keywords.

This blog post and the related use case would not have been possible without my colleagues. Special thanks go to Britta Thoelking, Sarah Detzler, Daniel Ingenhaag and Yannick Schaper for their continuous support during the project and the preparation of this blog post.