Python hana_ml: Classification Training with APL(GradientBoostingBinaryClassifier)

I am writing this blog to show training with APL using python package hana_ml.  With APL, you can automate preprocessing to some extent.

Environment is as below.

  • Python: 3.7.14(Google Colaboratory)
  • HANA: Cloud Edition 2022.16
  • APL: 2209

Python packages and their versions.

  • hana_ml: 2.14.22091801
  • pandas: 1.3.5
  • scikit-learn: 1.0.2

As for HANA Cloud, I activated scriptserver and created my users.  Though I don’t recognize other special configurations, I may miss something since our HANA Cloud was created long time before.

I didn’t use HDI here to make environment simple.

1. Install Python packages

Install python package hana_ml, which is not pre-installed on Google Colaboratory.

As for pandas and scikit-learn, I used pre-installed ones.

!pip install hana_ml

2. Import modules

Import python package modules.

import pprint from hana_ml.algorithms.apl.apl_base import get_apl_version
from hana_ml.algorithms.apl.gradient_boosting_classification \ import GradientBoostingBinaryClassifier
from hana_ml.algorithms.pal.partition import train_test_val_split
from hana_ml.dataframe import ConnectionContext, create_dataframe_from_pandas
from hana_ml.model_storage import ModelStorage
from hana_ml.visualizers.unified_report import UnifiedReport
import pandas as pd
from sklearn.datasets import make_classification

3. Connect to HANA Cloud

Connect to HANA Cloud and check its version.

ConnectionContext class is for connection to HANA.  You can check the APL version with get_apl_version function.

HOST = '<HANA HOST NAME>'
SCHEMA = USER = '<USER NAME>'
PASS = '<PASSWORD>'
conn = ConnectionContext(address=HOST, port=443, user=USER, password=PASS, schema=SCHEMA) print(conn.hana_version()) # APL.Version.ServicePack is APL
print(get_apl_version(conn))
4.00.000.00.1660640318 (fa/CE2022.16) name value
0 APL.Version.Major 4
1 APL.Version.Minor 400
2 APL.Version.ServicePack 2209
3 APL.Version.Patch 1
4 APL.Info Automated Predictive Library
5 AFLSDK.Version.Major 2
6 AFLSDK.Version.Minor 16
7 AFLSDK.Version.Patch 0
8 AFLSDK.Info 2.16.0
9 AFLSDK.Build.Version.Major 2
10 AFLSDK.Build.Version.Minor 13
11 AFLSDK.Build.Version.Patch 0
12 AutomatedAnalytics.Version.Major 10
13 AutomatedAnalytics.Version.Minor 2209
14 AutomatedAnalytics.Version.ServicePack 1
15 AutomatedAnalytics.Version.Patch 0
16 AutomatedAnalytics.Info Automated Analytics
17 HDB.Version 4.00.000.00.1660640318
18 SQLAutoContent.Date 2022-04-19
19 SQLAutoContent.Version 4.400.2209.1
20 SQLAutoContent.Caption Automated Predictive SQL Library for Hana Cloud

4. Create test data

Create test data using scikit-learn.

There are 3 features and 1 target variable.

def make_df(): X, y = make_classification(n_samples=1000, n_features=3, n_redundant=0) df = pd.DataFrame(X, columns=['X1', 'X2', 'X3']) df['CLASS'] = y return df df = make_df()
print(df)
df.info()

Here is dataframe overview.

 X1 X2 X3 CLASS
0 0.964229 1.995667 0.244143 1
1 -1.358062 -0.254956 0.502890 0
2 1.732057 0.261251 -2.214177 1
3 -1.519878 1.023710 -0.262691 0
4 4.020262 1.381454 -1.582143 1
.. ... ... ... ...
995 -0.247950 0.500666 -0.219276 1
996 -1.918810 0.183850 -1.448264 0
997 -0.605083 -0.491902 1.889303 0
998 -0.742692 0.265878 -0.792163 0
999 2.189423 0.742682 -2.075825 1 [1000 rows x 4 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 X1 1000 non-null float64 1 X2 1000 non-null float64 2 X3 1000 non-null float64 3 CLASS 1000 non-null int64 dtypes: float64(3), int64(1)
memory usage: 31.4 KB

5. define table and upload data

Define HANA Table and upload data using function “create_dataframe_from_pandas”.

The function is very useful, since it automatically define table and upload at the same time.  Please check options for further detail.

TRAIN_TABLE = 'PAL_TRAIN'
dfh = create_dataframe_from_pandas(conn, df, TRAIN_TABLE, schema=SCHEMA, force=True, # True: truncate and insert replace=True) # True: Null is replaced by 0

6. Split data into train and test dataset

Split dataset using function “train_test_val_split”.  The function needs key columns, so I added key column using function “add_id”.

train, test, _ = train_test_val_split(dfh.add_id(), testing_percentage=0.2, validation_percentage=0)
print(f'Train shape: {train.shape}, Test Shape: {test.shape}')
Train shape: [8000, 5], Test Shape: [2000, 5]

7. Training

Train with random forest by using class “GradientBoostingClassifier”.  Please make sure class AutoClassifier is deprecated.

model = GradientBoostingBinaryClassifier()
model.fit(train, label='CLASS', key='ID', build_report=True)

8. Training result

8.1. Unified Report

Model report shows with the below code.  Please see another article “Python hana_ml: PAL Classification Training(UnifiedClassification)” for the report content, which is basically same.

model.generate_notebook_iframe_report()
model.generate_html_report('apl')

8.2. Score

Score function returns mean average accuracy.

# score: mean average accuracy. cannot output other metrics
score = model.score(test)
print(score)

8.3. Summary

get_summary function returns model summary.

model.get_summary().deselect('OID').collect()

8.4. Metrics

get_performance_metrics function returns metrics information.

>> pprint.pprint(model.get_performance_metrics()) {'AUC': 0.991, 'BalancedClassificationRate': 0.964590677634156, 'BalancedErrorRate': 0.03540932236584404, 'BestIteration': 69, 'ClassificationRate': 0.9646017699115044, 'CohenKappa': 0.9291813552683117, 'GINI': 0.4823, 'KS': 0.9195, 'LogLoss': 0.12414480396790141, 'PredictionConfidence': 0.991, 'PredictivePower': 0.982, 'perf_per_iteration': {'LogLoss': [0.617163, 0.554102, 0.499026,
<omit> 0.125448, 0.125588]}}

8.5. Statistical Report

get_debrief_report function returns several type of statistical reports.  Please See Statistical Reports in the SAP HANA APL Reference Guide.

reports = ['Statistics_Partition', 'Statistics_Variables', 'Statistics_CategoryFrequencies', 'Statistics_GroupFrequencies', 'Statistics_ContinuousVariables', 'ClassificationRegression_VariablesCorrelation', 'ClassificationRegression_VariablesContribution', 'ClassificationRegression_VariablesExclusion', 'Classification_BinaryClass_ConfusionMatrix'] for report in reports: print('\n'+report) display(model.get_debrief_report(report).deselect('Oid').head(3).collect())

8.6. Indicators

get_indicators function returns all indicators with unified format.

model.get_indicators().collect()

8.7. Model info

get_model_info function returns several type of reports.

for model_info in model.get_model_info(): print('\n', model_info.source_table['TABLE_NAME']) display(model_info.deselect('OID').head(3).collect())

9. Predict

You can predict with function predict.

>> model.set_params(extra_applyout_settings={'APL/ApplyExtraMode': 'Individual Contributions'})
>> apply_out = model.predict(test)
>> print(apply_out.head(3).collect()) ID TRUE_LABEL PREDICTED gb_score_CLASS gb_contrib_X1 gb_contrib_X2 gb_contrib_X3 gb_contrib_constant_bias
0 12 0 0 2.592326 -0.222146 3.193908 -0.383197 0.003759
1 13 1 1 -4.876161 0.141867 -4.717393 -0.304394 0.003759
2 19 1 1 -4.074210 0.433828 -4.438335 -0.073464 0.003759

10. Save model

Just save model with class “ModelStorage” and function “save_model”.

ms = ModelStorage(conn)
# ms.clean_up()
model.name = 'My classification model name'
ms.save_model(model, if_exists='replace')

You can see the saved model.

# display(ms.list_models())
pprint.pprint(ms.list_models().to_dict())
{'CLASS': {0: 'hana_ml.algorithms.apl.gradient_boosting_classification.GradientBoostingBinaryClassifier'}, 'JSON': {0: '{"model_attributes": {"name": "My classification model name", ' '"version": 1, "log_level": 8, "model_format": "bin", "language": ' '"en", "label": "CLASS", "auto_metric_sampling": false}, ' '"fit_params": {}, "artifacts": {"schema": "I348221", ' '"model_tables": ["HANAML_APL_MODELS_DEFAULT"], "library": ' '"APL"}, "pal_meta": {}}'}, 'LIBRARY': {0: 'APL'}, 'MODEL_REPORT': {0: None}, 'MODEL_STORAGE_VER': {0: 1}, 'NAME': {0: 'My classification model name'}, 'SCHEDULE': {0: '{"schedule": {"status": "inactive", "schedule_time": "every ' '1 hours", "pid": null, "client": null, "connection": ' '{"userkey": "your_userkey", "encrypt": "false", ' '"sslValidateCertificate": "true"}, "hana_ml_obj": ' '"hana_ml.algorithms.pal.xx", "init_params": {}, ' '"fit_params": {}, "training_dataset_select_statement": ' '"SELECT * FROM YOUR_TABLE"}}'}, 'STORAGE_TYPE': {0: 'default'}, 'TIMESTAMP': {0: Timestamp('2022-09-21 08:57:33')}, 'VERSION': {0: 1}}

11. Close connection

Last but not least, close the connection.

conn.close()