Help! Saptune says my system is degraded!

Recently we have again received questions about the system state in the output of saptune status, so it’s time to talk about it.

If everything is fine, you should get an output like this:

# saptune status
...
system state: running
...

But if not, you will get:

# saptune status
...
system state: degraded
... 

A degraded system sounds awful!

The state comes directly from the command systemctl is-system-running. If we check the man page of systemctl, we find:

...
is-system-running
    Checks whether the system is operational. This returns success (exit code 0) when the system is fully up and running, specifically not in startup, shutdown or maintenance mode, and with no failed services.
...
degraded
    The system is operational but one or more units failed.
... 
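In a script you can react to this state directly: systemctl is-system-running reports it on stdout and, as the man page says, signals anything other than a fully running system through a non-zero exit code. A minimal sketch; the explain_state helper and the "degraded" fallback (used only where systemctl is unavailable) are our own additions, not part of saptune or systemd:

```shell
#!/bin/sh
# Hypothetical helper: map a systemd system state to a short
# explanation, mirroring the states described in "man systemctl".
explain_state() {
    case "$1" in
        running)     echo "fully up and running, no failed units" ;;
        degraded)    echo "operational, but one or more units failed" ;;
        starting)    echo "still in startup" ;;
        maintenance) echo "rescue or emergency mode" ;;
        *)           echo "see 'man systemctl' for state '$1'" ;;
    esac
}

# Ask systemd for the live state; the fallback is an assumption so the
# sketch also runs where systemctl is absent.
state=$(systemctl is-system-running 2>/dev/null)
[ -n "$state" ] || state="degraded"
echo "$state: $(explain_state "$state")"
```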

So the reason for a degraded system is usually a failed unit.
To figure out which unit failed, you can run either saptune_check or systemctl list-units --state=failed:

# saptune_check
...
[WARN] System is in status "degraded". Failed services are: saptune.service -> Check the cause and reset the state with systemctl reset-failed!
...

# systemctl list-units --state=failed
UNIT              LOAD   ACTIVE SUB    DESCRIPTION
● saptune.service loaded failed failed Optimise system for running SAP workloads

LOAD   = Reflects whether the unit definition was properly loaded.
ACTIVE = The high-level unit activation state, i.e. generalization of SUB.
SUB    = The low-level unit activation state, values depend on unit type.

1 loaded units listed.
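If you need those unit names in a script, the bullet and the extra columns can be stripped with a little awk. A sketch, run here against the sample line from above so it works anywhere; on a live system you would pipe the output of systemctl list-units --state=failed in directly:

```shell
# Extract just the unit names from list-units output. The sample line
# stands in for live output here so the snippet runs everywhere.
sample='● saptune.service loaded failed failed Optimise system for running SAP workloads'

failed_units=$(printf '%s\n' "$sample" |
    awk '$1 == "●" { print $2; next } { print $1 }')
echo "$failed_units"
```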

In this case saptune.service failed for some reason and caused the degraded state.
If we investigate further, we can see why:

# systemctl status saptune.service
● saptune.service - Optimise system for running SAP workloads
Loaded: loaded (/usr/lib/systemd/system/saptune.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Thu 2022-08-11 14:52:20 CEST; 12min ago
Process: 2048 ExecStart=/usr/sbin/saptune service apply (code=exited, status=1/FAILURE)
Main PID: 2048 (code=exited, status=1/FAILURE)

Aug 11 14:52:20 sles4sap15sp3 systemd[1]: Starting Optimise system for running SAP workloads...
Aug 11 14:52:20 sles4sap15sp3 saptune[2048]: ERROR: found an active sapconf, so refuse any action
Aug 11 14:52:20 sles4sap15sp3 systemd[1]: saptune.service: Main process exited, code=exited, status=1/FAILURE
Aug 11 14:52:20 sles4sap15sp3 systemd[1]: saptune.service: Failed with result 'exit-code'.
Aug 11 14:52:20 sles4sap15sp3 systemd[1]: Failed to start Optimise system for running SAP workloads. 

Ah, saptune refused to start because sapconf has already tuned the system!

And this is the very reason why saptune prints the system state in the first place.
In the past we often saw customer setups where both tools had been mixed together, with strange results.
To spot such an easily solvable problem, both saptune status and saptune_check report issues with systemd’s system state.

But it is not always sapconf.service or saptune.service that show up as failed; sometimes other units do.
In such cases saptune status has found issues which most likely have nothing to do with saptune itself and which, most of the time, will not even prevent saptune from doing its job.

So, if saptune works anyway, why report it and raise concerns?

Well, not reporting it would mean deliberately hiding potential problems.
We think you should know if something might be wrong: there could be a problem lurking in the shadows which you haven’t spotted yet, waiting to strike at the most inconvenient time!

The feedback we have received so far confirms this decision. Lately it even helped to discover and fix a bug in a service that had nothing to do with saptune and might otherwise not have been found for some time. Mission accomplished.

So, if you see a degraded system state and neither sapconf.service nor saptune.service is involved, your tuning for SAP workloads is most certainly fine. Best check it with saptune note verify to be on the safe side.
Nevertheless, you should investigate the reasons for the failed units to make sure they don’t indicate a bigger problem.
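Such an investigation can be sketched as a small loop: assuming a systemd-based system, it walks over every failed unit and shows its status and recent journal entries. The inspect_failed_units name is our own invention, and reset-failed is left commented out so nothing is cleared before the cause is understood:

```shell
#!/bin/sh
# Sketch: inspect every failed unit. Guarded with `command -v`,
# so it is simply a no-op on systems without systemd.
inspect_failed_units() {
    command -v systemctl >/dev/null 2>&1 || return 0
    systemctl list-units --state=failed --plain --no-legend 2>/dev/null |
    while read -r unit _; do
        systemctl status "$unit" --no-pager       # why did it fail?
        journalctl -u "$unit" --no-pager -n 20    # recent log lines
    done
    # Once the cause is fixed, clear the failed state:
    #   systemctl reset-failed
}

inspect_failed_units
```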

By the way, in the upcoming version 3.1 we will rename it to systemd system status and add a few explanatory lines to the output, so that it is more obvious what is going on and what to do next.

So long!