A real-world case study: Worker Behavioral Analytics and the “Connected Workforce”, Part 3/3

This is the fifth article of a series of posts on Worker and People Behavioral Analytics and the “Connected Workforce”, which started with the introductory post (link to the post is here) covering the following fundamental questions:

Q1) WHERE IS THE “CONNECTED WORKFORCE” APPLICABLE?

Q2) WHAT MAKES THE “CONNECTED WORKFORCE” VALUABLE?

Q3) WHAT MAKES THE “CONNECTED WORKFORCE” CHALLENGING?

Q4) BUT THEN, WHAT MAKES THE “CONNECTED WORKFORCE” POSSIBLE?

Q5) WHAT WILL WE SEE IN THE NEXT POSTS OF THE SERIES?

After the introductory article, the second article (link to the post is here) described a specific use case for a client, and covered the questions (also providing our main ALGORITHM PART 1 towards Episode Extraction):

Q6) WHAT IS THE OVERALL STRUCTURE OF OUR REAL-WORLD DATASET?

Q7) HOW TO EXAMINE THE ENTRY/EXIT LOG (X1) IN FURTHER DETAIL?

And then, the fourth article (link to the post is here) continued, also providing the main ALGORITHM PART 2 (Investigation of unclear episodes and classification into probable non-violation or violation), while answering the questions:

Q8) HOW TO EXAMINE THE TRACKER LOG (X2) TOWARDS OUR INITIAL GOALS?

Q9) HOW TO UTILIZE THE TRACKER LOG (X2) FOR OUR INITIAL GOALS?

Q10) WHAT ARE THE DESIDERATA, STEPS, AND DETAILS OF ALGORITHM PART 2?

Another post (which can be found here) covers the implementation of the dashboards in SAP Analytics Cloud (SAC), as well as of ALGORITHMS PART 1 & 2 in HANA.

As promised in the fourth article, here, in this fifth article of the blog post series, we will show how HANA-ML (Machine Learning) can be used for further processing, towards the initial steps of a roadmap for more advanced use cases (beyond worker analytics and overstay estimation). We will illustrate how to build worker behavioral profiles and behavioral similarity assessments, which can then be used for abnormality detection, as well as for hierarchical clustering and multi-dimensional scaling, at multiple levels, among other use cases. So, let us start, in our usual format of structuring questions:

Q11) HOW TO CONNECT PYTHON JUPYTER NOTEBOOKS TO HANA?

In our use case, the three tables (two handed over by the customer: the Entry/Exit log (X1) and the Tracker log (X2), plus the Episode log (X3) created through our ALGORITHM PART 1&2 described in the previous blog posts in this series and implemented in SQL) reside in HANA. We chose to use HANA-ML, accessed through Python for further development and experimentation, via Jupyter notebooks running on a local Windows PC in the Visual Studio Code 1.63.2 development environment.

In such a setting, it is very important to realize that sometimes all the data remains in HANA and is processed there (the relevant HANA-ML Python commands are simply translated to equivalent statements run on the remote HANA machine), but at other times data is transferred to our local PC and processed there, or even transferred back to the remote HANA cloud server. Being clear about what resides where, and when transfers take place, is of utmost importance when designing such code.

As a simple first piece of knowledge, it is worth noting that when accessing a remote table which resides on the HANA server, we are effectively only fetching a “pointer” to it locally; not the whole table. We can then write local Python commands that take such pointers as arguments, in order to execute remotely on the data in-situ, without any transfer. However, when we call the method .collect() on a pointer to a remote HANA-ML dataframe, we actually initiate the transfer of its contents over the network, so that they are copied into a local Python dataframe.

So, let us see how our code starts, by obtaining, in our local Jupyter Python environment, pointers to the appropriate tables residing on the HANA Cloud machine:


Importing HANA-ML Dataframe pointers from HANA Cloud to local Jupyter notebook using Python
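The connection code itself was shown above as a screenshot; as a minimal sketch (assuming the hana-ml Python package, and using placeholder connection parameters and table names rather than the project's actual ones), it could look like:

```python
# Sketch: obtaining HANA-ML DataFrame "pointers" to remote tables.
# All connection parameters and table names below are placeholders, not the
# actual project values.
def get_hana_pointers(address, port, user, password):
    from hana_ml import dataframe  # requires the hana-ml package

    conn = dataframe.ConnectionContext(
        address=address, port=port, user=user, password=password
    )
    # Each call returns a lazy DataFrame backed by the remote table;
    # no rows are fetched until .collect() is called on it.
    x = conn.table("ENTRY_EXIT_LOG")  # X1: Entry/Exit log (placeholder name)
    y = conn.table("TRACKER_LOG")     # X2: Tracker log (placeholder name)
    z = conn.table("EPISODE_LOG")     # X3: Episode log (placeholder name)
    return conn, x, y, z
```

Note that the function above only creates lazy pointers; the actual rows stay on the HANA server until a .collect() is issued.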

Then, we can easily start inspecting the tables without transferring them in full, by running the .head() method on the remote HANA server and only using .collect() afterwards, i.e. by entering in the Jupyter notebook:

y.head(10).collect()

and then running the block, we will get the first ten rows of the table, as well as the column names.

We can furthermore inspect the unique values of each column very easily; for example, we can see the unique Plant ID’s (and similarly for any other column) using:

y_plant = y.select('PLANTNAME')

plantUIDs_ = y_plant.distinct('PLANTNAME')

plantUIDs_.collect()

Sometimes we need to rename the columns too; for example, to replace their names with ALL CAPS and NO SPACES versions, in order to simplify access in some cases. In our case this is done by:

Renaming the table columns so that they are ALL-CAPS and contain NO-SPACES
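The renaming code was shown as a screenshot; a small sketch of the idea, building the ALL-CAPS/NO-SPACES mapping on illustrative column names (the helper name normalize_column_names is ours, not from the project code), could be:

```python
import pandas as pd

def normalize_column_names(columns):
    """Map each column name to ALL-CAPS with spaces replaced by underscores."""
    return {c: c.upper().replace(" ", "_") for c in columns}

# Local illustration on a pandas frame; the same mapping dictionary can also
# be passed to the rename_columns() method of a HANA-ML DataFrame.
df = pd.DataFrame(columns=["Plant Name", "Employee Id"])
df = df.rename(columns=normalize_column_names(df.columns))
print(list(df.columns))  # ['PLANT_NAME', 'EMPLOYEE_ID']
```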

And now the real processing begins. We will start by collecting data by worker, in order to create Behavioral Profiles, through which we will be able to assess Behavioral Similarity (of a worker with another worker, or of a worker in one time period as compared to the same worker in another time period, and many such variations). Most importantly, the behavioral profiles will have selectable attributes, as we shall see. But what exactly will we include in the “Behavioral Profile” of each worker?

Q12) HOW CAN WE DEFINE WORKER BEHAVIORAL PROFILES?

The approach we take here is multi-leveled. The full behavioral profile is more of a “lifelog” (in the DARPA sense) of a worker. This of course might consist of many different types of elements, including:

  • Entries / Exits to the plant: Which gates does he use? At what times?

  • Tracker traces in the plant: Which beacons? At what locations? With what sub-plant ID’s?

  • Relations that can be defined in terms of other workers: For example, with which other workers is he often co-located, for a specific radius and time overlap?

Of course this is just the beginning of a much longer list. There are three observations to be made here:

O9) First, one can either keep whole histories of events (such as entry/exits, tracker traces, co-location events), or keep different types of summaries, retaining more information (less summarization) or less information (more summarization). What is better? It all depends on the data and the application, as we shall see. The first three types of summaries we will use here, in decreasing order of summarization, are:

  • Single-element summaries: For example, the most frequent element of a type, such as the most frequent Entry Gate ID, the most frequent Beacon ID in terms of traces per day, or the co-worker with whom a specific worker has the most co-location (in terms of a specific measure of total co-spatio-temporality). And not only can we define single-element categorical summaries (such as the above), but we can also have single-element continuous-valued summaries: for example, the median time of entry to the plant. Also, in other datasets, elements might include not only places and movements, but also logged actions and/or activities and/or states of workers.
  • Set-level summaries: Here, we can for example keep the set of all Gate ID’s that have been used by a worker for Entry; or all the Beacon ID’s that he has ever visited, or a thresholded subset of those: for example, the top-10 Beacon ID’s, or all Beacon ID’s above a specific minimum frequency of visitation, etc. Similarly, we can create sets of numerical values, and not just categorical ones.
  • Set&Count-level (Histogram-like) summaries: A stereotypical example would be keeping all the Gate ID’s that have been used by a worker, together with the number of times each of them has been entered through – that is, we are essentially keeping a histogram (an estimate of a discrete probability distribution) in this case. In a similar way, one can keep histograms of continuous-valued events, for example times, after defining appropriate bins (for example: bins of duration of 10 minutes, say 7:00:00-7:09:59, 7:10:00-7:19:59, etc.).
  • Other types of summaries: Especially for continuous values (such as times), or even vectors, one can create many different types of summaries: a classic example being approximations through parametric probability distributions, such as Mixtures-of-Gaussians, etc.
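To make the first three summary types concrete, here is a tiny self-contained illustration (with made-up gate events, not data from our dataset):

```python
from collections import Counter

# Entry Gate IDs observed for one worker (illustrative, made-up data)
events = ["G1", "G3", "G1", "G1", "G2", "G3"]

single = Counter(events).most_common(1)[0][0]  # single-element summary (mode)
set_summary = set(events)                      # set-level summary
hist_summary = dict(Counter(events))           # set&count-level (histogram)

print(single)        # 'G1'
print(set_summary)   # {'G1', 'G2', 'G3'}
print(hist_summary)  # {'G1': 3, 'G3': 2, 'G2': 1}
```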

O10) Second, it is important to note that if we do not focus on single events (such as a Gate Entry event or a single Tracker Trace (beacon sighting) event), but rather on sequences of events in time, for example sequences of Beacon ID’s or Plant ID’s with specific timestamps, then there are multiple levels of abstraction one might use:

  • Totally discarding sequence, relative timing, and absolute timing: For example, bags-of-words with or without counts. Imagine that you want to summarize the Beacon ID’s ticked on a single day by a single worker; then you forget about their temporal sequence and timing, and just represent them as a set-level summary or a set-and-count-level summary (see above).
  • Focusing on sequence, but discarding relative and absolute timing: For example, by using NLP-like n-gram descriptions; i.e. probabilities of single events, then probabilities of ordered pairs of events, then probabilities of ordered triads of events, etc. One can also try to create probabilistic Finite State Machine (FSM)-like generative models that fit the observed events, and of course one could also assume the Markovian property if appropriate.
  • Focusing on sequence and relative timing (time durations between consecutive events), but not absolute timing: For example, one could discretize time and use the self-transition probabilities of a finite state machine to model durations; or one could use full Hidden Markov Models, with emission and transition probabilities, if required also augmented with learnable probability distributions for the transition durations.
  • Taking into account sequence, relative timing, and absolute timing: One can further augment the above models and, instead of having additional objects for the distributions of the transition durations, have additional objects for the distributions of the absolute time (for example, as a time-of-day in a 24-hour frame) of the actual events (and not of the transitions between the events).
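As a small illustration of the second abstraction level (sequence without timing), an n-gram-style summary of a made-up beacon sequence could be computed as:

```python
from collections import Counter

# One day's sequence of Beacon IDs for a worker (illustrative, made-up data)
seq = ["B1", "B2", "B2", "B3", "B1", "B2"]

unigrams = Counter(seq)
bigrams = Counter(zip(seq, seq[1:]))  # ordered pairs of consecutive events

# Markov-style conditional probability estimate P(next="B2" | current="B1").
# (Here "B1" never occurs as the last element, so the unigram count is a
# valid denominator; in general one would count only non-final occurrences.)
p = bigrams[("B1", "B2")] / unigrams["B1"]
print(p)  # 1.0 - every observed "B1" is followed by "B2"
```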

I will not go into further detail here; there are many neat mathematical formalisms one can use, as well as more recently developed ML tools, for all of the above: for example, I have not touched upon many very powerful graph- or lattice-based methods. The point I wanted to make here is just to illustrate what is kept and what is not kept depending on the summarization method used (where “what is kept or not” refers to {sequence, relative timing, absolute timing}).

O11) Finally, as a very quick side-note, if we did have real spatial locations in our tracker set (i.e. real latitude/longitude pairs for all the thousands of beacons, instead of just 30 rough centers), then one could start using real spatio-temporal methods (2D space + 1D time in this case), instead of sequences of (unlocalized) Beacon ID’s over time. And for the case of closed spaces and buildings (potentially with multiple floors), many more tools become relevant (for example, discretized floor-level height (z) representations, connectivity graphs between rooms, etc.).

After this small parenthesis with the three observations O9-O11, which illustrate the wide range of possibilities for what a “behavioral profile” might consist of, let us go back to HANA-ML and our real-world case-study dataset, and create some behavioral profiles; and then also create behavioral similarity metrics, before proceeding.

Q13) HOW CAN WE CREATE WORKER BEHAVIORAL PROFILES?

First, let us obtain and keep the Unique ID’s of each worker in our dataset, in a similar way that we gathered the Plant ID’s in the beginning of this post:

xID = x.select('EMPLOYEE_ID')
UID = xID.distinct('EMPLOYEE_ID')
pUID = UID.collect()

The last command above will also display the ID’s; of course, we can also display just a subset (the first 10), or count them:

pUID.head(10)

UID.count()

And now, let us create set-level and set&count-level (histogram) summaries of the Entry/Exit Gate ID’s, per worker:

Creating Sets and Histograms of Entry and Exit Gates per Worker

Now, we have created:

GateInSetPerUID: Set-level description of Entry gates per Worker ID

nGateInElemPerUID: Number of elements (i.e. number of distinct Entry gates ever used) of each of the above sets per Worker ID

GateInCountPerUID: Number of times each of the gates in the respective GateInSetPerUID was entered through

And similarly for exit gates (GateOutSetPerUID etc.)
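The actual summary-creation code is in the screenshot above; as a hedged local sketch of the same idea (on a made-up pandas stand-in for the collected Entry/Exit log, with illustrative column names), the entry-gate summaries could be produced as:

```python
import pandas as pd

# Illustrative stand-in for the collected Entry/Exit log (X1); column names
# follow the renamed ALL-CAPS convention, but the data itself is made up.
px = pd.DataFrame({
    "EMPLOYEE_ID": ["W1", "W1", "W1", "W2", "W2"],
    "GATE_IN":     ["G1", "G1", "G2", "G3", "G3"],
})

# Set-level summary: the set of distinct entry gates per worker
GateInSetPerUID = px.groupby("EMPLOYEE_ID")["GATE_IN"].agg(set).to_dict()

# Number of distinct entry gates ever used, per worker
nGateInElemPerUID = {k: len(v) for k, v in GateInSetPerUID.items()}

# Histogram summary: entry counts per gate, per worker
GateInCountPerUID = {
    uid: grp["GATE_IN"].value_counts().to_dict()
    for uid, grp in px.groupby("EMPLOYEE_ID")
}

print(GateInSetPerUID)    # {'W1': {'G1', 'G2'}, 'W2': {'G3'}}
print(GateInCountPerUID)  # {'W1': {'G1': 2, 'G2': 1}, 'W2': {'G3': 2}}
```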

An important note here is that, as the above process takes some time to run, we can save its results either in local memory or in local files. For the first case, the way to save variables in local memory is to use the %store magic for lightweight persistence:

%store GateInSetPerUID
%store nGateInElemPerUID
%store GateInCountPerUID
%store GateOutSetPerUID
%store nGateOutElemPerUID
%store GateOutCountPerUID

And then, one can simply load them back, if required, by using the same statements with the -r (read) argument, such as:

%store -r GateInSetPerUID

Now, if one wants to save to files, then a way to do so is through the pickle library:

import pickle
file_name = "GateInSetPerUID"
# Open the file for writing and serialize the variable into it
with open(file_name, 'wb') as my_file_obj:
    pickle.dump(GateInSetPerUID, my_file_obj)

And then one can read back from the file to the local variable through:

file_name = "GateInSetPerUID"
# Open the file for reading
file_object = open(file_name, 'rb')
# Load the object from the file into the variable
GateInSetPerUID = pickle.load(file_object)
file_object.close()
print(GateInSetPerUID)

Now, after this interlude on storing/loading variables to memory and/or files, let us continue with the worker behavioral profiles, by creating set- and histogram-level summaries of the Beacon ID’s and Plant ID’s per worker:
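The corresponding code for the Beacon ID and Plant ID summaries is analogous to the gate summaries above; a minimal self-contained sketch (with made-up data and a helper of our own naming) could be:

```python
import pandas as pd

def set_and_hist_summaries(df, key, col):
    """Per-key set-level and histogram-level summaries of one column."""
    sets = df.groupby(key)[col].agg(set).to_dict()
    hists = {k: g[col].value_counts().to_dict() for k, g in df.groupby(key)}
    return sets, hists

# Illustrative stand-in for the Tracker log (X2); names and values are made up.
py = pd.DataFrame({
    "EMPLOYEE_ID": ["W1", "W1", "W2"],
    "PLANTNAME":   ["P1", "P2", "P1"],
    "BEACON_ID":   ["B7", "B7", "B9"],
})

PlantNameSetPerUID, PlantNameCountPerUID = set_and_hist_summaries(py, "EMPLOYEE_ID", "PLANTNAME")
BeaconIDSetPerUID, BeaconIDCountPerUID = set_and_hist_summaries(py, "EMPLOYEE_ID", "BEACON_ID")
print(PlantNameSetPerUID)  # {'W1': {'P1', 'P2'}, 'W2': {'P1'}}
```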

And again, we can save the results using %store magic:

%store PlantNameSetPerUID
%store nPlantNameElemPerUID
%store PlantNameCountPerUID
%store BeaconIDSetPerUID
%store nBeaconIDElemPerUID
%store BeaconIDCountPerUID

But how do we define similarity metrics between two summaries, which we will later also use at the whole worker profile level, either for all time or for a limited time period? Let us see!

Q14) HOW CAN WE DEFINE AND CREATE THE SIMILARITY OF BEHAVIORAL PROFILES?

Regarding definitions, we usually need the range of output values of a similarity method to be between 0 and 1, where 1 means “maximum similarity”, which often translates to identity. The similarity calculation method accepts two behavioral profiles as input arguments, and outputs the similarity value. The specifics chosen for its calculation depend heavily on the form of the behavioral profiles: various mathematical formulas and computational algorithms can be employed, depending on the form of their input arguments.

In the code of the previous section, we produced set-level and histogram-level summaries of the entry gates, exit gates, Beacon IDs, and Plant IDs that each worker was passing through, in a chosen time period. So let us concentrate here on definitions of similarity methods that accept set-level or histogram-level arguments as inputs. The main principles employed in the methods we created are:

  • Set-level similarity: We calculate the relative size of the intersection of the two sets, as compared to their union (the Jaccard index).
  • Histogram-level similarity: We treat the two histograms as vectors in the n-D space generated by the categories (i.e. all Gate IDs or all Plant IDs etc), with the value in each dimension being equal to the count for that specific histogram bin. Then, we normalize to unit length, and calculate the dot product of the two vectors (each representing a histogram-level summary).

We furthermore create methods that call the two above methods consecutively, in order to create a similarity matrix containing the similarities of each possible pair of workers:
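The similarity methods and the matrix construction were shown as screenshots; a self-contained sketch of the two principles (Jaccard for sets, normalized dot product for histograms) and of a pairwise matrix builder, on made-up profiles, could be:

```python
import math

def set_similarity(a, b):
    """Jaccard index: |intersection| / |union|, in [0, 1]."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def hist_similarity(h1, h2):
    """Dot product of two unit-normalized count histograms (dicts of
    category -> count), i.e. cosine similarity."""
    dot = sum(h1.get(k, 0) * h2.get(k, 0) for k in set(h1) | set(h2))
    n1 = math.sqrt(sum(v * v for v in h1.values()))
    n2 = math.sqrt(sum(v * v for v in h2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def similarity_matrix(profiles, sim):
    """Symmetric worker-by-worker matrix of pairwise similarities."""
    uids = sorted(profiles)
    return {(u, v): sim(profiles[u], profiles[v]) for u in uids for v in uids}

# Illustrative histogram-level profiles (made-up counts)
profiles = {"W1": {"G1": 3, "G2": 1}, "W2": {"G2": 2, "G3": 2}}
M = similarity_matrix(profiles, hist_similarity)
print(set_similarity({"G1", "G2"}, {"G2", "G3"}))  # ≈ 0.333
```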

Now, we can create eight different similarity matrices; four at the set-level and four at the histogram-level: one each for Entry Gates, Exit Gates, Beacon IDs, and Plant IDs (per worker, of course). How can we combine these in order to define an “overall” behavioral similarity?

There are many possible ways; the way we have chosen here is through weighted sums.

So, let’s first create all eight matrices, and store them locally:

And now, let us create the “combined” similarity matrix: First, we create a method for weighted-sum combination:
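The actual method was shown as a screenshot; a hedged reconstruction of CombineThreeSimMatrix as a simple weighted sum (the argument order follows its call further below; the matrix values here are made up) could be:

```python
import numpy as np

def CombineThreeSimMatrix(m1, w1, m2, w2, m3, w3):
    """Weighted sum of three similarity matrices (weights expected to sum to 1)."""
    return w1 * np.asarray(m1) + w2 * np.asarray(m2) + w3 * np.asarray(m3)

# Tiny illustration with made-up 2x2 similarity matrices
A = np.array([[1.0, 0.2], [0.2, 1.0]])
B = np.array([[1.0, 0.4], [0.4, 1.0]])
C = np.array([[1.0, 0.6], [0.6, 1.0]])
CMatrix1 = CombineThreeSimMatrix(A, 0.25, B, 0.25, C, 0.5)
print(CMatrix1[0, 1])  # 0.25*0.2 + 0.25*0.4 + 0.5*0.6 ≈ 0.45
```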

And then, we apply it to the histograms of Gate Entry, Gate Exit, and Plant ID. Why these three, out of the eight? Because, through experimentation, the Beacon ID’s were found to be too irregular; and because, in general, histogram-level similarities were found to capture the behavioral differences of workers in more detail than set-level ones. (Note that set-level descriptions lose too much information when the distribution is far from uniform: a single visit to an element counts as much as twenty visits, so they are susceptible to low-frequency rare events, giving them equal importance to highly frequent events.)

Thus, we create the following combination code:

w1=0.25 #Similarity of Distribution of Gates used for GATE_IN (or BIOMETRIC_IN)
w2=0.25 #Similarity of Distribution of Gates used for GATE_OUT (or BIOMETRIC_OUT)
w3=0.5 #Similarity of Distribution of Plant-ID Tracker Ticks for this person
CMatrix1 = CombineThreeSimMatrix(GateInHistSimMatrix,w1,GateOutHistSimMatrix,w2,PlantNameHistSimMatrix,w3)

Finally, we send the similarity matrices that were created back to HANA, so that they can be displayed in the SAP Analytics Cloud Dashboard Pages. But before sending them, we convert them to pivot format, by first creating the following method:
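The pivot-conversion method itself was shown as a screenshot; a sketch of the idea (with our own helper name and placeholder column/table names), flattening a similarity matrix into long rows and then uploading via hana-ml's create_dataframe_from_pandas, could be:

```python
import numpy as np
import pandas as pd

def matrix_to_pivot(matrix, uids, value_name="SIMILARITY"):
    """Flatten a worker-by-worker similarity matrix into long (pivot) rows."""
    m = np.asarray(matrix)
    rows = [
        {"UID_A": uids[i], "UID_B": uids[j], value_name: float(m[i, j])}
        for i in range(len(uids)) for j in range(len(uids))
    ]
    return pd.DataFrame(rows)

# Tiny illustration with a made-up 2x2 combined similarity matrix
pivot_df = matrix_to_pivot(np.array([[1.0, 0.45], [0.45, 1.0]]), ["W1", "W2"])

# Sending the pivot back to HANA (requires a live ConnectionContext `conn`;
# the table name is a placeholder):
# from hana_ml.dataframe import create_dataframe_from_pandas
# create_dataframe_from_pandas(conn, pivot_df, "SIMILARITY_PIVOT", force=True)
```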

And then, calling the above method with the matrices as arguments in order to create the required pivots, we send the results back to HANA:

And thus, the full circle of this part of our processing is closed; having created pointers to the required tables in HANA at the beginning of the article, and having used HANA-ML to do processing in-situ on the cloud server and then to fetch some partial results for further processing locally using python, we now send back these results to HANA, to make them directly accessible to SAC for dashboard interactive visualizations.

RECAP – What we have seen in this article

In this article, we illustrated the use of HANA-ML on Jupyter notebooks with Python, towards creating behavioral profiles of workers on the basis of their Entry/Exit and Beacon ID/Plant ID patterns. We structured the article around the following four questions:

Q11) HOW TO CONNECT PYTHON JUPYTER NOTEBOOKS TO HANA?

Giving specific examples from our case-study real-world project code, and then:

Q12) HOW CAN WE DEFINE WORKER BEHAVIORAL PROFILES?

Discussing the various options available, in terms of elements (entries/exits, tracker ticks, actions, activities, etc.), in terms of summary types (single-value, set-level, histogram-like, and more), and also in terms of how to deal with temporal sequences, relative timing, and absolute timing.

Q13) HOW CAN WE CREATE WORKER BEHAVIORAL PROFILES?

Giving specific examples of code from our case study.

Q14) HOW CAN WE DEFINE AND CREATE THE SIMILARITY OF BEHAVIORAL PROFILES?

Discussing two of the mathematical/algorithmic options available, and illustrating them through code from our case study, and then applying them to all pairs of workers in order to get similarity matrices, and finally combining them and transforming them to pivot format, in order to send them back to HANA so that SAP Analytics Cloud can provide interactive visualizations of them.

In the next article of this series, we will have a look at some of the applications of similarity of behavioral profiles: For example, towards clustering of workers and towards abnormality detection: how can we have an alarm raised (usually for a human operator / supervisor to notice, investigate, and take action), whenever the behavioral profile of a worker (or a group of workers) changes?