[gpfsug-discuss] Unkillable snapshots
Nathan Falk
nfalk at us.ibm.com
Thu Feb 20 22:13:56 GMT 2020
Good point, Simon. Yes, it is a "file system quiesce" not a "fileset
quiesce" so it is certainly possible that mmfsd is unable to quiesce
because there are processes keeping files open in another fileset.
Nate Falk
IBM Spectrum Scale Level 2 Support
Software Defined Infrastructure, IBM Systems
From: Simon Thompson <S.J.Thompson at bham.ac.uk>
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date: 02/20/2020 04:39 PM
Subject: [EXTERNAL] Re: [gpfsug-discuss] Unkillable snapshots
Sent by: gpfsug-discuss-bounces at spectrumscale.org
Hi Nate,
So we're trying to clean up snapshots from the GUI ... we've found that if
it fails to delete one night for whatever reason, it then doesn't go back
another day and clean up 😊
But yes, essentially running this by hand to clean up.
What I have found is that lsof hangs on some of the "suspect" nodes. But
if I strace it, it's hanging on a process which is using a different
fileset. For example, the fileset we can't delete is:
rds-projects-b which is mounted as /rds/projects/b
But on some suspect nodes, running strace lsof /rds hangs at a process which
has open files in:
/rds/projects/g which is a different fileset.
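The strace check described above can be made less painful by bounding it with a timeout and keeping the trace in a file. The following is only a sketch: the 30-second limit, the /tmp/lsof.strace path and the helper name last_path_in_trace are assumptions for illustration, not anything from the original message. The helper just prints the last quoted absolute path in the trace, which should be the file lsof was touching when it stopped.

```shell
# Sketch only. The real invocation is left commented out so that pasting
# this block does not start a trace by accident:
#   timeout 30 strace -f -o /tmp/lsof.strace lsof /rds
# (-f also follows any helper processes lsof forks; -o writes the trace to a file.)

# last_path_in_trace FILE
# Prints the last double-quoted absolute path found in an strace output file.
last_path_in_trace() {
    grep -o '"/[^"]*"' "$1" | tail -n 1 | tr -d '"'
}
```

If the path printed is under /rds/projects/g rather than /rds/projects/b, that matches what was seen by hand here.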
What I'm wondering is whether it's these hanging processes in the "g" fileset
that are killing us rather than something in the "b" fileset. Looking at the "g"
processes, they look like a weather model and look to be dumping a lot of
files in a shared directory, so I wonder if the mmfsd process is busy
servicing that and so, whilst it's not got "b" locks, it's just too slow to
respond?
Does that sound plausible?
Thanks
Simon
From: gpfsug-discuss-bounces at spectrumscale.org
<gpfsug-discuss-bounces at spectrumscale.org> on behalf of nfalk at us.ibm.com
<nfalk at us.ibm.com>
Sent: 20 February 2020 21:26:39
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] Unkillable snapshots
Hello Simon,
Sadly, that "1036" is not a node ID, but just a counter.
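Since the value is a counter, about the only thing that can be read from it is whether it is still increasing. As a rough sketch, the helper below pulls the number out of saved 'mmdiag --tokenmgr' output, relying only on the line format quoted in this thread. The function name thrash_count, the /tmp file names and the 60-second interval are assumptions for illustration, and whether a rising count is significant on a given cluster is a judgment call.

```shell
# thrash_count FILE
# Reads saved `mmdiag --tokenmgr` output and prints the number that follows
# "mnode thrashing detected" (line format as quoted in this thread).
thrash_count() {
    sed -n 's/.*mnode thrashing detected \([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Hypothetical usage, commented out: two samples a minute apart, then the difference.
#   mmdiag --tokenmgr > /tmp/tok.1; sleep 60; mmdiag --tokenmgr > /tmp/tok.2
#   echo $(( $(thrash_count /tmp/tok.2) - $(thrash_count /tmp/tok.1) ))
```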
These are tricky to troubleshoot. Usually, by the time you realize it's
happening and try to collect some data, things have already timed out.
Since this mmdelsnapshot isn't something that's on a schedule from cron or
the GUI and is a command you are running, you could try some heavy-handed
data collection.
You suspect a particular fileset already, so maybe have a 'mmdsh -N all
lsof /path/to/fileset' ready to go in one window, and the 'mmdelsnapshot'
ready to go in another window? When the mmdelsnapshot times out, you can
find the nodes it was waiting on in the file system manager
mmfs.log.latest and see what matches up with the open files identified by
lsof.
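The matching step at the end of that procedure can be sketched as a small filter. This assumes the mmdsh output was captured to a file and that each line is prefixed with "nodename:", and that the nodes named in mmfs.log.latest have been copied by hand into a second file, one per line. The function name open_files_on_suspects and the /tmp paths are hypothetical.

```shell
# Commands from the thread, commented out; run them by hand in two windows:
#   window 1:  mmdsh -N all lsof /path/to/fileset > /tmp/lsof.all 2>&1
#   window 2:  mmdelsnapshot ...   (with whatever arguments you normally use)

# open_files_on_suspects NODES_FILE LSOF_FILE
# NODES_FILE: one node name per line, copied by hand from mmfs.log.latest.
# LSOF_FILE:  captured mmdsh output, ASSUMED to be prefixed "nodename:".
# Prints only the lsof lines that came from nodes listed in NODES_FILE.
open_files_on_suspects() {
    awk -F: 'NR == FNR { if ($1 != "") want[$1] = 1; next } ($1 in want)' "$1" "$2"
}
```

Usage would be along the lines of 'open_files_on_suspects /tmp/suspects /tmp/lsof.all', after which the surviving lines show which processes held files open on the nodes the quiesce was waiting for.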
It sounds like you already know this, but the <c0n42> type of internal
node names in the log messages can be translated with 'mmfsadm dump
tscomm' or also plain old 'mmdiag --network'.
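Before translating the names, it can help to see which internal names turn up most often across the retries. A minimal sketch, assuming only that the names have the <c0n42> shape mentioned above; the function name internal_nodes is made up for illustration, and the log path is whatever copy of mmfs.log.latest you have to hand.

```shell
# internal_nodes LOG_FILE
# Lists each <cNnM>-style internal node name found in LOG_FILE with a count,
# most frequent first.
internal_nodes() {
    grep -o '<c[0-9][0-9]*n[0-9][0-9]*>' "$1" | sort | uniq -c | sort -rn
}
```

Any name that recurs near the top of the list is a reasonable first candidate to look up in the 'mmdiag --network' output.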
Thanks,
Nate Falk
IBM Spectrum Scale Level 2 Support
Software Defined Infrastructure, IBM Systems
From: Simon Thompson <S.J.Thompson at bham.ac.uk>
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date: 02/20/2020 03:14 PM
Subject: [EXTERNAL] Re: [gpfsug-discuss] Unkillable snapshots
Sent by: gpfsug-discuss-bounces at spectrumscale.org
Hmm ... mmdiag --tokenmgr shows:
Server stats: requests 195417431 ServerSideRevokes 120140
nTokens 2146923 nranges 4124507
designated mnode appointed 55481 mnode thrashing detected 1036
So how do I convert "1036" to a node?
Simon
From: gpfsug-discuss-bounces at spectrumscale.org
<gpfsug-discuss-bounces at spectrumscale.org> on behalf of Simon Thompson
<S.J.Thompson at bham.ac.uk>
Sent: 20 February 2020 19:45:02
To: gpfsug main discussion list
Subject: [gpfsug-discuss] Unkillable snapshots
Hi,
We have a snapshot which is stuck in the state "DeleteRequired". When
deleting, it goes through the motions but eventually gives up with:
Unable to quiesce all nodes; some processes are busy or holding required
resources.
mmdelsnapshot: Command failed. Examine previous error messages to
determine cause.
And in the mmfs log on the FS manager there are a bunch of retries and
"failure to quiesce" on nodes. However, in each retry it's never the same set
of nodes. I suspect we have one HPC job somewhere killing us.
What's interesting is that we can delete other snapshots OK; the problem
appears to be limited to one particular fileset.
My old go-to "mmfsadm dump tscomm" isn't showing any particular node, and
the waiters just tend to point to the FS manager node.
So ... any suggestions? I'm assuming it's some workload holding a lock open
or some such, but tracking it down is proving elusive!
Generally the FS is also "lumpy" ... at times it feels like a wifi
connection on a train using a terminal, I guess it's all related though.
Thanks
Simon
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss