HANA Cloud: advanced feature importance with the help of HGBT


Introduction

This blog post is about advanced feature importance with the help of tree analysis. We will look at the well-known WINE dataset (an easy multi-class classification dataset) and upload it to a HANA Cloud instance. On top of it, we will build a dedicated class for analysing feature importance.

Main Idea

Decision trees, and gradient boosting in particular, are powerful algorithms for classification and regression tasks. It is also very useful to look at feature importance. But let's add a twist: instead of looking only at the basic importance, let's compute it for different max_depth values. This can bring additional insight, something like a level of importance per tree depth. The features most important for splits already show up at max_depth = 1; features that only become important at max_depth = 2 and beyond point to interaction effects with other features. Finally, we can visualise the whole picture as a heatmap.
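To make the idea concrete before touching HANA, here is a minimal local sketch using scikit-learn's gradient boosting instead of HANA's HGBT (the parameters and column names here are illustrative, not the blog's HANA code): train one model per max_depth and collect feature_importances_ into one table per depth.

```python
# Sketch of depth-wise feature importance with scikit-learn (not HANA HGBT):
# one gradient boosting model per max_depth, importances collected per depth.
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd

X, y = load_wine(return_X_y=True, as_frame=True)

scores = {}
for depth in range(1, 4):
    model = GradientBoostingClassifier(n_estimators=10, max_depth=depth,
                                       random_state=0)
    model.fit(X, y)
    # importances of the 13 WINE features for this tree depth
    scores[f'IMPScore_d{depth}'] = model.feature_importances_

imp = pd.DataFrame(scores, index=X.columns)
print(imp.round(3))
```

Features whose score is near zero at depth 1 but non-zero at depth 2 or 3 are the interaction candidates this post is after.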

Realisation

Let's look at the code. As a first step, we need some libraries:

import hana_ml as hml
from hana_ml.algorithms.pal.trees import HybridGradientBoostingClassifier
from sklearn.datasets import load_wine
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

After that, we load some data and open a connection to HANA Cloud:

cc = hml.dataframe.ConnectionContext(address=..., port=443, user=...,
                                     password=..., encrypt=True)
data = load_wine(as_frame=True)
data.data['target'] = data.target
data.data.head()

Let’s upload it to HANA Cloud

hml.dataframe.create_dataframe_from_pandas(cc,data.data,'WINE')

And now we can grab this data and check:

df = hml.dataframe.DataFrame(cc,'SELECT * FROM WINE')
df.head().collect()

Now we can add a class that computes feature importance for a varying max_depth:

class FImpy:
    def __init__(self, data: hml.dataframe.DataFrame, target_name: str,
                 max_depth: int):
        self.data = data
        self.target_name = target_name
        self.max_depth = max_depth + 1
        self._calc_importance()

    def plot(self):
        # heatmap of the per-depth importance scores, one row per feature
        plt.figure(figsize=(12, int(self.df.shape[0] / 1.65)))
        sns.heatmap(self.df[self.df.columns[1:]], linewidths=.5,
                    yticklabels=self.df['VARIABLE_NAME'], annot=True)

    def _calc_importance(self):
        # collect importance scores for max_depth = 1 .. max_depth
        self.df = self.get_importance(1)
        for depth in range(2, self.max_depth):
            self.df = self.df.merge(self.get_importance(depth),
                                    how='left', on='VARIABLE_NAME')

    def get_importance(self, max_depth: int):
        params = {'n_estimators': 10,
                  'fold_num': 1,
                  'max_depth': max_depth}
        hgbc = HybridGradientBoostingClassifier(**params,
                                                evaluation_metric='error_rate',
                                                ref_metric=['auc'],
                                                calculate_importance=True)
        # fit on the instance's data, not a global df
        hgbc.fit(self.data, label=self.target_name)
        hres = hgbc.feature_importances_.collect() \
                   .sort_values('IMPORTANCE', ascending=False)
        fscore_name = f'IMPScore_d{max_depth}'
        hres = hres.rename(mapper={'IMPORTANCE': fscore_name}, axis=1)
        return hres

    def mean(self):
        # average the per-depth scores into one ranking
        cols = self.df.columns[1:]
        res = self.df[['VARIABLE_NAME']].copy()
        res['IMPScore_mean'] = self.df[cols].mean(axis=1)
        return res.sort_values('IMPScore_mean', ascending=False).style \
                  .bar(color='lightgreen')

Then we create a new instance for feature importance analysis, where df is our data, 'target' is the name of the target column, and 5 is the maximum max_depth (depths 1 to 5 will be analysed):

fimportance = FImpy(df,'target',5)

Now we can look at the mean importance as easily as:

fimportance.mean()
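What mean() computes can be sketched in plain pandas on a toy importance table (the numbers below are hypothetical, not results from the WINE run): average the per-depth score columns and sort descending.

```python
# Plain-pandas sketch of the mean() aggregation, on hypothetical scores:
import pandas as pd

df = pd.DataFrame({
    'VARIABLE_NAME': ['flavanoids', 'proline', 'hue'],
    'IMPScore_d1':   [0.6, 0.3, 0.1],
    'IMPScore_d2':   [0.4, 0.4, 0.2],
})

cols = df.columns[1:]                      # all per-depth score columns
res = df[['VARIABLE_NAME']].copy()
res['IMPScore_mean'] = df[cols].mean(axis=1)
res = res.sort_values('IMPScore_mean', ascending=False)
print(res)
```

The class version additionally wraps the result in a pandas Styler with green bars; the ranking logic is the same.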

And if we want the full view, we call the plot() method:

fimportance.plot()

We can see that many features are not particularly useful, and it is very interesting that color_intensity only shines at depth 3 and beyond, so this feature needs interactions with others to contribute.
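Observations like the color_intensity one can also be extracted programmatically: features scoring near zero at depth 1 but non-trivially at deeper trees are the interaction-dependent ones. A small hedged sketch on hypothetical per-depth scores (the thresholds 0.01 and 0.05 are arbitrary choices, not part of the method above):

```python
# Flag "late bloomer" features: near-zero importance at depth 1,
# non-trivial importance at depth 3 (hypothetical numbers, arbitrary cutoffs).
import pandas as pd

imp = pd.DataFrame({
    'VARIABLE_NAME': ['flavanoids', 'color_intensity', 'ash'],
    'IMPScore_d1':   [0.55, 0.00, 0.00],
    'IMPScore_d3':   [0.35, 0.12, 0.00],
})

late = imp[(imp['IMPScore_d1'] < 0.01) & (imp['IMPScore_d3'] > 0.05)]
print(late['VARIABLE_NAME'].tolist())
```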

The End

Try it yourself and share your thoughts about this method.