Data Yoga: It is all about finding the right balance

A big lifestyle topic these days is finding the right balance, whether that is your work-life balance or, with yoga and mindfulness, your inner balance. In data science this is not always so easy. Many data sets are naturally imbalanced, so we deal with them very often.

Let’s start at the beginning: what does imbalanced mean for data sets? We often have data sets where we want to predict a certain category, for example whether a customer churns or not, and the categories do not appear equally often in the data. Instead, one class has significantly more records than the other. From a business point of view this is a good thing: we don’t want 50% of our customers to churn, we would much rather it be only 10%. The same holds for predictive quality: we don’t want our factories to produce 50% bad quality, we would rather it be less than 2%. So here we are happy with unbalanced results and do not pursue balance. The problem appears from a statistical point of view, and especially in the predictive quality example: when the cases of interest are rare, it is harder to find patterns in the data, and this makes it more challenging to predict such events with machine learning. But we still want to use machine learning to reduce our churn rate or the share of bad quality even further, so what can we do?

The first impulse is to add more data. Good idea, but what can we do if we are already using all the data we have and the machine learning model still doesn’t give us sufficient results? Here over- and undersampling can help.

For oversampling, we randomly select cases from the rare class (in our examples the churned customers or the bad quality) and add copies of them to our data set. For undersampling, we randomly select cases from the majority class (our loyal customers or the good quality) and remove them from the data set. In both cases we end up with a less imbalanced data set. If we repeat this often enough, we can even generate a completely balanced data set, but in my experience this is not necessary most of the time.
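To make the two ideas concrete, here is a minimal sketch in plain Python. It does not use hana_ml; the helper names `random_oversample` and `random_undersample` and the toy churn data are made up for illustration only.

```python
import random
from collections import Counter

def random_oversample(rows, label_of, minority, n_extra, seed=0):
    """Append n_extra randomly chosen copies of minority rows (hypothetical helper)."""
    rng = random.Random(seed)
    minority_rows = [r for r in rows if label_of(r) == minority]
    return rows + [rng.choice(minority_rows) for _ in range(n_extra)]

def random_undersample(rows, label_of, majority, n_remove, seed=0):
    """Drop n_remove randomly chosen majority rows (hypothetical helper)."""
    rng = random.Random(seed)
    majority_idx = [i for i, r in enumerate(rows) if label_of(r) == majority]
    drop = set(rng.sample(majority_idx, n_remove))
    return [r for i, r in enumerate(rows) if i not in drop]

# Toy data (assumption): 90 loyal customers and 10 churned customers.
data = [("loyal", i) for i in range(90)] + [("churned", i) for i in range(10)]
label = lambda row: row[0]

over = random_oversample(data, label, "churned", n_extra=20)
under = random_undersample(data, label, "loyal", n_remove=50)

print(Counter(map(label, over)))   # 90 loyal, 30 churned
print(Counter(map(label, under)))  # 40 loyal, 10 churned
```

Neither result is perfectly balanced, which matches the point above: moving from 10% to 20% or 25% minority share may already be enough.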

So the bottom line is that over- and undersampling are pretty cool and sometimes necessary, as you can see in the visualization (I mean, it is hard enough to hold that pose without balancing the unbalanced bowls). Therefore, I am very glad that hana_ml, the Python machine learning client for SAP HANA, offers not only cool algorithms to train your predictive models but also functions for over- and undersampling.

For oversampling, the Synthetic Minority Over-sampling Technique (SMOTE) is used. In standard oversampling you would simply copy points from the less frequent category and add them to your data set, creating a lot of duplicates. The idea behind SMOTE is not to generate duplicates, but rather to create synthetic data points that are slightly different from the original ones. The technical documentation is pretty good, so if you want to try it on your own right away:

hana_ml.algorithms.pal package — hana-ml 2.13.220715 documentation (sap.com) 
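If you want a feeling for what SMOTE does before opening the documentation, the following is a rough sketch of the core idea in plain Python: pick a minority point, pick one of its nearest minority neighbors, and place a new point somewhere on the line between them. This is not the hana_ml implementation; the function `smote_sketch`, its parameters and the sample points are assumptions for illustration.

```python
import math
import random

def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        partner = rng.choice(neighbors)
        gap = rng.random()  # random position on the segment base -> partner
        synthetic.append(tuple(b + gap * (q - b) for b, q in zip(base, partner)))
    return synthetic

# Toy minority class (assumption): five points with two numeric features.
minority_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.8)]
new_points = smote_sketch(minority_points, n_new=4)

for point in new_points:
    print(point)
```

Each synthetic point lies between two existing minority points, so it is similar to the originals without being an exact duplicate.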

 

For undersampling, the Tomek links method is used. The idea is to detect pairs of points that are each other’s nearest neighbors but belong to different classes, so-called Tomek links. Such points are then removed: you can either remove both points of a link or only one of them (traditionally the one belonging to the majority class). My suggestion is to try what works best on your data set. The technical documentation for this can be found here:

 hana_ml.algorithms.pal package — hana-ml 2.13.220722 documentation (sap.com) 
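As a small illustration of the Tomek links idea, here is a sketch in plain Python that finds mutual nearest neighbors with different labels and then removes only the majority member of each link. Again, this is not the hana_ml API; the helper names and the toy quality data are hypothetical.

```python
import math

def nearest_index(points, i):
    """Index of the point closest to points[i], excluding i itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: math.dist(points[i], points[j]))

def tomek_links(points, labels):
    """Pairs (i, j) that are mutual nearest neighbors with different labels."""
    nn = [nearest_index(points, i) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]

def remove_majority_in_links(points, labels, majority):
    """Drop only the majority-class member of every Tomek link."""
    drop = {idx for pair in tomek_links(points, labels)
            for idx in pair if labels[idx] == majority}
    keep = [i for i in range(len(points)) if i not in drop]
    return [points[i] for i in keep], [labels[i] for i in keep]

# Toy data (assumption): four good parts and two bad parts on a line.
points = [(0.0, 0.0), (1.0, 0.0), (2.5, 0.0), (5.0, 0.0), (5.4, 0.0), (9.0, 0.0)]
labels = ["good", "good", "good", "good", "bad", "bad"]

print(tomek_links(points, labels))  # [(3, 4)]
kept_points, kept_labels = remove_majority_in_links(points, labels, "good")
print(kept_labels)  # ['good', 'good', 'good', 'bad', 'bad']
```

Removing both members of a link instead would only require dropping the label check in `remove_majority_in_links`.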

 

Of course, you can combine both methods. Luckily there is already such a combined procedure prepared for you in the hana_ml library: 

hana_ml.algorithms.pal package — hana-ml 2.13.220722 documentation (sap.com) 

 

I hope that after this quick overview of over- and undersampling with hana_ml you are now super motivated to try it right away. My colleague Yannick Schaper wrote an amazing blog post on how to get started with training your first machine learning model in HANA. His use case is predictive quality and, as usual for such a case, the data set is highly unbalanced. Hence, my suggestion is to build on Yannick’s blog post and use case and challenge yourself to achieve better results using over- or undersampling. Have fun.