[gpfsug-discuss] Mmhealth events longwaiters_found and deadlock_detected
Anna Greim
Anna.Greim at de.ibm.com
Thu Apr 16 11:55:56 BST 2020
Hi Heiner,
I can't really give you insight into how the states of these events were
decided. Maybe somebody else is able to answer that here.
But regarding your question about triggering debug data collection, please
have a look at this documentation page:
https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.4/com.ibm.spectrum.scale.v5r04.doc/bl1adv_createscriptforevents.htm
This feature has been in the product since the 5.0.x versions and should be
helpful here.
It runs your eventsCallback script whenever an event is raised. One of the
script's arguments is the event name, so you can write a script that checks
for the event name longwaiters_found, then runs mmdiag --deadlock and
writes the output to a txt file.
The script call has a hard timeout of 60 seconds so that it does not
interfere too much with the mmsysmon internals, but a run time of less than
1 second would be better.
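As a rough illustration, such a callback could look like the Python sketch
below. It is only a sketch under assumptions, not the documented reference
implementation. The output directory /var/tmp/longwaiters is a hypothetical
choice. /usr/lpp/mmfs/bin/mmdiag is the usual install path but should be
checked on your nodes. The position of the event name in the argument list
is not assumed: the script simply looks for longwaiters_found among all
arguments, so please check the documentation page above for the exact
argument order and for where the script has to be installed. To stay well
below the 1 second target, mmdiag is started as a detached background
process and the script returns without waiting for it.

```python
#!/usr/bin/env python3
"""Sketch of an eventsCallback script: dump `mmdiag --deadlock` on longwaiters_found."""
import subprocess
import sys
import time
from pathlib import Path

EVENT_NAME = "longwaiters_found"
# Assumptions: adjust both values to your installation.
MMDIAG_CMD = ["/usr/lpp/mmfs/bin/mmdiag", "--deadlock"]
OUTPUT_DIR = Path("/var/tmp/longwaiters")  # hypothetical location


def should_collect(args):
    """Return True if any argument passed to the script equals the event name."""
    return EVENT_NAME in args


def collect(output_dir=OUTPUT_DIR, command=MMDIAG_CMD):
    """Start the command detached and send its output to a timestamped txt file.

    Returns the path of the output file without waiting for the command.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    target = output_dir / time.strftime("mmdiag_deadlock_%Y%m%d_%H%M%S.txt")
    with open(target, "ab") as out:
        # start_new_session detaches the child, so the callback can return
        # immediately while the command keeps running in the background.
        subprocess.Popen(
            command,
            stdout=out,
            stderr=subprocess.STDOUT,
            stdin=subprocess.DEVNULL,
            start_new_session=True,
        )
    return target


def main(args):
    """Collect debug output only for the longwaiters_found event."""
    if should_collect(args):
        collect()


if __name__ == "__main__":
    main(sys.argv[1:])
```

Because the script does not wait for mmdiag, a slow or hanging mmdiag call
does not count against the 60 second timeout of the callback itself.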
Kind regards
Anna Greim
Software Engineer, Spectrum Scale Development
IBM Systems
IBM Deutschland Research & Development GmbH / Chairman of the Supervisory
Board: Gregor Pillen
Management: Dirk Wittkopp
Registered office: Böblingen / Registration court: Amtsgericht Stuttgart,
HRB 243294
From: "Billich Heinrich Rainer (ID SD)" <heinrich.billich at id.ethz.ch>
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date: 16/04/2020 10:36
Subject: [EXTERNAL] [gpfsug-discuss] Mmhealth events
longwaiters_found and deadlock_detected
Sent by: gpfsug-discuss-bounces at spectrumscale.org
Hello,
I'm puzzled about the difference between the two mmhealth events
longwaiters_found ERROR Detected Spectrum Scale long-waiters
and
deadlock_detected WARNING The cluster detected a Spectrum Scale
filesystem deadlock
In particular, why does the latter have level WARNING only while the first
has level ERROR? longwaiters_found is based on the output of "mmdiag
--deadlock" and occurs much more often on our clusters, while the latter is
probably triggered by an external event rather than an internal mmsysmon
check? Deadlock detection is handled by mmfsd? Whenever a deadlock is
detected some debug data is collected, which is not true for
longwaiters_found. Hm, so why is no deadlock detected whenever mmdiag
--deadlock shows waiting threads? Shouldn't the severities be the other way
round?
Finally: can we trigger some debug data collection whenever a
longwaiters_found event happens? Just getting the output of "mmdiag
--deadlock" on the single node could give some hints. Without it I don't
see any real chance to take any action.
Thank you,
Heiner
--
=======================
Heinrich Billich
ETH Zürich
Informatikdienste
Tel.: +41 44 632 72 56
heinrich.billich at id.ethz.ch
========================