One day, one of our Exadata compute nodes was evicted from the cluster because a critical cluster process hung.
From the CRS log files we found that the server was evicted because its heartbeat no longer responded.
We decided to file a service request with My Oracle Support.
After we logged the service request and sent the Trace File Analyzer output, the support engineer found that the heartbeat process had hung because of a “pstack” execution from diagsnap.pl.
Explanation from the support engineer:
diagsnap.pl is part of the Cluster Health Monitor (starting with GI 220.127.116.11) and under certain conditions, this script executes the “pstack” command against critical clusterware processes. The “pstack” command execution and locking can lead these key clusterware processes to hang (especially ocssd.bin) which can trigger clusterware crashes/node evictions.
The workaround is to disable the execution of diagsnap.pl using the oclumon utility, like this:
<GRID_HOME>/bin/oclumon manage -disable diagsnap
According to the support engineer, this disables diagsnap execution cluster-wide.
After disabling it with the oclumon utility, you have to check for running diagsnap.pl processes and kill them.
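The check for leftover processes could look roughly like the sketch below. It is not from the support engineer; it assumes a Linux node where `ps -eo pid,args` is available, and the function names `filter_diagsnap_pids` and `list_diagsnap_pids` are made up for this example.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Grid Infrastructure.

# Print the PIDs of processes whose command line mentions diagsnap.pl.
# Expects "PID ARGS..." lines (the output format of `ps -eo pid,args`) on stdin.
# The numeric check on the first field skips the ps header line.
filter_diagsnap_pids() {
    awk '$1 ~ /^[0-9]+$/ && /diagsnap\.pl/ { print $1 }'
}

# List the PIDs of diagsnap.pl processes running on this node.
list_diagsnap_pids() {
    ps -eo pid,args | filter_diagsnap_pids
}

# Possible usage, on each compute node, as root or the process owner:
#   list_diagsnap_pids                   # review the PIDs first
#   list_diagsnap_pids | xargs -r kill   # -r (GNU xargs): do nothing if the list is empty
```

Listing the PIDs before killing them keeps the step reviewable, and since the check only looks at the local process table it probably has to be repeated on every node.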
Keep in mind that the execution of diagsnap.pl can be enabled again on every newly added compute node or after installing Grid Infrastructure patches.