Predicting bad SAP performance (Part 4)

In my last part I described how I started collecting data from FRUN. Now the next question is:

Which metrics will I use for evaluating the performance of an SAP system?

I don’t have much time to check the FRUN metrics individually, and the productive FRUN system is already under heavy load, so I cannot easily activate new metrics for monitoring. I have to stick with what is already available.

Table MAI_UDM_STORE has a column CATEGORY, and I select the rows where CATEGORY = ’PERFORM’. I further limit the selection to productive systems and ABAP stacks. This leaves me with 26 so-called TYPES, which are the performance metrics I will use. May the machine learning model later decide which of these are relevant for predicting the overall performance and which ones are not:

  • ABAP_INST_ASR_DIA_TOTAL
  • ABAP_INST_ASR_HTTPS_TOTAL
  • ABAP_INST_ASR_RFC_TOTAL
  • ABAP_INST_ASR_UPD_TOTAL
  • ABAP_INST_BTC_QUEUE_UTILIZATION
  • ABAP_INST_DIALOG_LONGRUNNING
  • ABAP_INST_DIALOG_RESPONSE_TIME_EVENT
  • ABAP_INST_ICM_CONN_USAGE
  • ABAP_INST_ICM_THREAD_USAGE
  • ABAP_INST_MEMORY_EM_USED
  • ABAP_INST_MEMORY_PAGING_AREA_USED
  • ABAP_INST_UPDATE_LONGRUNNING
  • ABAP_SYS_TRANSACTION_RESPONSETIME
  • ABAP_UPDATE1_RESPONSE_TIME
  • ABAP_UPDATE2_RESPONSE_TIME
  • BATCH_RESOURCES
  • DIALOG_FRONTEND_RESPONSE_TIME
  • DIALOG_QUEUE_TIME
  • DIALOG_RESOURCES
  • DIALOG_RESPONSE_TIME
  • ICM_RESOURCES
  • NUMBER_OF_FREE_DIALOG_WORK_PRO
  • SYSTEM_PERFORMANCE
  • UPDATE2_QUEUE_UTILIZATION
  • UPDATE_RESPONSE_TIME
  • UPDATE_RESSOURCES

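As a sketch, the selection above can be expressed as a simple filter. Only the column names CATEGORY and TYPE are taken from MAI_UDM_STORE; the columns for system role and stack (here ROLE and STACK) and the sample values are my own placeholders for illustration:

```python
# Hypothetical sample rows standing in for MAI_UDM_STORE; ROLE/STACK column
# names are assumptions, CATEGORY and TYPE come from the actual table.
sample_rows = [
    {"TYPE": "DIALOG_RESPONSE_TIME", "CATEGORY": "PERFORM",   "ROLE": "PRD", "STACK": "ABAP"},
    {"TYPE": "SOME_OTHER_METRIC",    "CATEGORY": "EXCEPTION", "ROLE": "PRD", "STACK": "ABAP"},
    {"TYPE": "DIALOG_QUEUE_TIME",    "CATEGORY": "PERFORM",   "ROLE": "DEV", "STACK": "ABAP"},
]

# Keep only performance metrics of productive ABAP systems, deduplicated.
perform_types = sorted({
    r["TYPE"] for r in sample_rows
    if r["CATEGORY"] == "PERFORM" and r["ROLE"] == "PRD" and r["STACK"] == "ABAP"
})
print(perform_types)  # ['DIALOG_RESPONSE_TIME']
```

On the real system this is of course a SELECT DISTINCT over the table rather than an in-memory filter.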
This brings me to another important point: how do I determine whether performance is actually good or bad? This question is crucial, because I do not have the time to manually evaluate all the available data to identify and label the cases of actual bad performance. I want to automate this and generate lots of labeled data for machine learning.

In this case, I simply use the standard evaluation from SAP FRUN for these metrics. Each of these 26 metrics gets a simple rating of OK, WARNING or CRITICAL, depending on preconfigured thresholds. While this standard rating from SAP FRUN might not be ideal, it is a valid starting point and a huge time saver.

At each point in time, I simply define the “performance health” based on how the 26 metrics are rated: a rating of OK is awarded 2 points, WARNING 1 point, and CRITICAL 0 points. The sum is then normalized to a value between 0 and 100%.

If all performance evaluations are OK, the performance health rating is 100% (the best case); if all are CRITICAL, the rating is 0% (the worst case). This can easily be calculated with a simple SQL statement on the database. As a next step, I can identify the incidents where SAP systems encountered bad performance, and even get some idea of how long a problem persisted and how severe the impact was.
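The scoring scheme can be sketched in a few lines of Python (the function name is mine; on the real system this happens in SQL over the rating column):

```python
# Points per FRUN rating: OK = 2, WARNING = 1, CRITICAL = 0.
POINTS = {"OK": 2, "WARNING": 1, "CRITICAL": 0}

def performance_health(ratings):
    """ratings: one 'OK'/'WARNING'/'CRITICAL' string per metric.
    Returns the health score normalized to 0..100%."""
    score = sum(POINTS[r] for r in ratings)
    return 100.0 * score / (2 * len(ratings))  # 2 points per metric = maximum

print(performance_health(["OK"] * 26))                     # 100.0 (best case)
print(performance_health(["CRITICAL"] * 26))               # 0.0 (worst case)
print(performance_health(["OK"] * 13 + ["WARNING"] * 13))  # 75.0
```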

In a way, I now have something I could call “Anomaly Detection”: in my big database, I can identify incidents of abnormally bad SAP performance. This will be the basis for the next steps in the series, where I tackle “Anomaly Prediction”.
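A minimal sketch of that detection step: scan the health time series for stretches below a threshold, yielding the duration and the worst health value of each incident. The threshold of 50% and the example series are assumptions, not values from my actual setup:

```python
def find_incidents(health_series, threshold=50.0):
    """Return (start_index, end_index, min_health) for every stretch of
    consecutive points where the health score is below the threshold."""
    incidents, start = [], None
    for i, h in enumerate(health_series):
        if h < threshold and start is None:
            start = i                       # an incident begins
        elif h >= threshold and start is not None:
            incidents.append((start, i - 1, min(health_series[start:i])))
            start = None                    # the incident ends
    if start is not None:                   # series ends mid-incident
        incidents.append((start, len(health_series) - 1, min(health_series[start:])))
    return incidents

series = [98, 96, 40, 35, 42, 97, 99, 30, 95]  # e.g. one value per interval
print(find_incidents(series))  # [(2, 4, 35), (7, 7, 30)]
```

The index span gives how long a problem persisted, and the minimum health gives a rough measure of its impact.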