From anacreo at gmail.com  Sat Sep  2 19:10:47 2023
From: anacreo at gmail.com (Alec)
Date: Sat, 2 Sep 2023 11:10:47 -0700
Subject: [gpfsug-discuss] How to properly debug CES / Ganesha?
In-Reply-To: 
References: <3a2dbed3-3f88-97a2-e588-a5300f74d32a@psi.ch>
Message-ID: 

Maybe a good reminder that you can also bring a tcpdump file (-w) into
Wireshark for a GUI-based analysis.

Also I've seen IBM support do a lot with a packet capture on a ticket, as
well as a snap.

Alec

On Tue, Aug 29, 2023, 2:17 AM Helge Hauglin wrote:

> Hi,
>
> To identify which address sends the most packets to and from a protocol
> node, I use a variation of this:
>
> | tcpdump -c 20000 -i <interface> 2>/dev/null | grep IP | cut -d' ' -f3 |
>   sort | uniq -c | sort -nr | head -10
>
> (Collect 20,000 packets, pick out sender address and port, sort and
> count those, make a top-10 list.)
>
> You could limit it to only NFS traffic by adding "port nfs" at the end of
> the "tcpdump" command, but then you would not see e.g. SMB clients with a
> lot of traffic, if there are any of those.
>
> > Hallo,
> >
> > for some time we have had seemingly random issues with a particular
> > customer accessing data over Ganesha / CES (5.1.8). What happens is
> > that the CES server owning their IP gets a very high CPU load, and
> > every operation on the NFS clients becomes sluggish. It does not seem
> > related to throughput, and looking at the metrics [*] I do not see a
> > correlation with e.g. increased NFS ops. I see no events in GPFS, and
> > nothing suspicious in the ganesha and gpfs log files.
> >
> > What would be a good procedure to identify the misbehaving client (I
> > suspect NFS, as it seems there is only one idle SMB client)? I have now
> > put LOGLEVEL=INFO in ganesha to see if I catch anything interesting,
> > but I would be curious how this kind of apparently random issue could
> > be better debugged and narrowed down to a client.
> >
> > Thanks a lot!
> >
> > regards
> >
> > leo
> >
> > [*]
> >
> > for i in read write; do for j in ops queue lat req err; do mmperfmon
> > query "ces-server|NFSIO|/export/path|NFSv41|nfs_${i}_$j"
> > 2023-08-25-14:40:00 2023-08-25-15:05:00 -b60; done; done
>
> --
> Regards,
>
> Helge Hauglin
>
> ----------------------------------------------------------------
> Mr. Helge Hauglin, Senior Engineer
> System administrator
> Center for Information Technology, University of Oslo, Norway
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at gpfsug.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
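Putting Alec's -w suggestion and Helge's one-liner together, a capture on the
CES node that owns the affected address might look like this (a sketch only;
the interface name, the CES address and the output path are assumptions to
adapt):

IFACE=bond0          # interface carrying the CES address on the loaded node
CESIP=192.0.2.10     # CES IP the affected customer mounts (placeholder)

# Rotating capture for later Wireshark analysis: 10 files of ~200 MB each,
# NFS traffic only, snaplen 256 bytes so headers are kept but files stay small.
tcpdump -i "$IFACE" -s 256 -C 200 -W 10 -w /var/tmp/ces-nfs.pcap \
        "host $CESIP and port 2049"

# Quick top-talkers view in the meantime: packets per client source address.
tcpdump -c 20000 -i "$IFACE" -nn "port 2049" 2>/dev/null \
        | awk '{print $3}' | sort | uniq -c | sort -nr | head -10

Either capture file can then be opened in Wireshark and filtered per client
or per NFS operation.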
From TROPPENS at de.ibm.com  Tue Sep  5 15:27:51 2023
From: TROPPENS at de.ibm.com (Ulf Troppens)
Date: Tue, 5 Sep 2023 14:27:51 +0000
Subject: [gpfsug-discuss] User Meeting in NYC at Sep 27, 2023
Message-ID: 

Greetings!

IBM is organizing a Spectrum Scale User Meeting in New York City. We have an
exciting agenda covering user stories, roadmap update, the latest insights
into data fabrics, data orchestration and data management architectures,
plus access to IBM experts and your peers.

We look forward to welcoming you to this event. Please register here:
https://www.spectrumscaleug.org/event/nyc-user-meeting-2023/

Draft Agenda:

9:00  Doors open, light breakfast and coffee
9:30  User Meeting
      - Welcome and introduction
      - Discussion of IBM Storage Defender
      - Storage Scale Strategy - What's New
      - Lightning Talk - Modernization of Storage Scale
      - Sneak previews
      - MCOT & MROT
      - Performance Engineering
      - ESS3500
      - Fabric Hospital & Disk Hospital
      - High density Tape for Backup and Archive solutions
      - High Performance SMB with Tuxera
      - Customer Presentations
5:00-7:00  Reception - Wine/Beer, Hors d'oeuvres and networking

Ulf Troppens
Product Manager - IBM Storage for Data and AI, Data-Intensive Workflows

IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Gregor Pillen / Geschäftsführung: David Faller
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From jonathan.buzzard at strath.ac.uk  Wed Sep  6 10:02:17 2023
From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard)
Date: Wed, 6 Sep 2023 10:02:17 +0100
Subject: [gpfsug-discuss] mmbackup feature request
Message-ID: 

Would it be possible to have the mmbackup output display the percentage
progress when backing up files?

So at the top you see something like this:

Tue Sep  5 23:13:35 2023 mmbackup:changed=747204, expired=427702,
unsupported=0 for server [XXXX]

Then, after it does the expiration, you see lines like this during the
backup:

Wed Sep  6 02:43:53 2023 mmbackup:Backing up files: 527024 backed up,
426018 expired, 4408 failed. (Backup job exit with 4)

It would IMHO be helpful if it looked like:

Wed Sep  6 02:43:53 2023 mmbackup:Backing up files: 527024 (70.5%)
backed up, 426018 (100%) expired, 4408 failed. (Backup job exit with 4)

Just based on the number of files. Though as I look at it now I am curious
about the discrepancy in the number of files expired, given that the
expiration stage allegedly concluded with no errors?

Tue Sep  5 23:21:49 2023 mmbackup:Completed policy expiry run with 0
policy errors, 0 files failed, 0 severe errors, returning rc=0.
Tue Sep  5 23:21:49 2023 mmbackup:Policy for expiry returned 0 Highest
TSM error 0


JAB.

-- 
Jonathan A. Buzzard                  Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG

From st.graf at fz-juelich.de  Wed Sep  6 10:26:52 2023
From: st.graf at fz-juelich.de (Stephan Graf)
Date: Wed, 6 Sep 2023 11:26:52 +0200
Subject: [gpfsug-discuss] mmbackup feature request
In-Reply-To: 
References: 
Message-ID: <4de6d5a1-0f29-25a0-94c7-3bd85613f2a7@fz-juelich.de>

Hi,

I think it should be possible, because mmbackup knows how many files are
to be backed up, how many have already been processed, and how many are
still to go.

BTW, it would also be nice to have an option in mmbackup to generate a
machine-readable log file, like JSON or CSV.

But the right way to ask for a new feature, or to check whether there is
already a request open, is the IBM Ideas portal:

https://ideas.ibm.com

Stephan

On 9/6/23 11:02, Jonathan Buzzard wrote:
>
> Would it be possible to have the mmbackup output display the percentage
> progress when backing up files?
>
> So at the top you see something like this:
>
> Tue Sep  5 23:13:35 2023 mmbackup:changed=747204, expired=427702,
> unsupported=0 for server [XXXX]
>
> Then, after it does the expiration, you see lines like this during the
> backup:
>
> Wed Sep  6 02:43:53 2023 mmbackup:Backing up files: 527024 backed up,
> 426018 expired, 4408 failed. (Backup job exit with 4)
>
> It would IMHO be helpful if it looked like:
>
> Wed Sep  6 02:43:53 2023 mmbackup:Backing up files: 527024 (70.5%)
> backed up, 426018 (100%) expired, 4408 failed. (Backup job exit with 4)
>
> Just based on the number of files. Though as I look at it now I am
> curious about the discrepancy in the number of files expired, given that
> the expiration stage allegedly concluded with no errors?
>
> Tue Sep  5 23:21:49 2023 mmbackup:Completed policy expiry run with 0
> policy errors, 0 files failed, 0 severe errors, returning rc=0.
> Tue Sep  5 23:21:49 2023 mmbackup:Policy for expiry returned 0 Highest
> TSM error 0
>
>
> JAB.
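Until such an option exists, a rough percentage can be scraped from the log
lines quoted above; a minimal sketch (the log location is an assumption, and
GNU grep is assumed for the -P flag):

LOG=/var/adm/ras/mmbackup.fs1.log     # placeholder -- point at your mmbackup log

# Total files mmbackup decided to send (the "changed=" figure at the top).
total=$(grep -oP 'mmbackup:changed=\K[0-9]+' "$LOG" | tail -1)

# Latest running count of files backed up so far.
sofar=$(grep 'mmbackup:Backing up files:' "$LOG" | tail -1 \
        | grep -oP '[0-9]+(?= backed up)')

if [ -n "$total" ] && [ -n "$sofar" ] && [ "$total" -gt 0 ]; then
    awk -v d="$sofar" -v t="$total" \
        'BEGIN { printf "backed up %d of %d files (%.1f%%)\n", d, t, 100*d/t }'
fi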
> -- Stephan Graf Juelich Supercomputing Centre Phone: +49-2461-61-6578 Fax: +49-2461-61-6656 E-mail: st.graf at fz-juelich.de WWW: http://www.fz-juelich.de/jsc/ --------------------------------------------------------------------------------------------- --------------------------------------------------------------------------------------------- Forschungszentrum Juelich GmbH 52425 Juelich Sitz der Gesellschaft: Juelich Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498 Vorsitzender des Aufsichtsrats: MinDir Volker Rieke Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender), Karsten Beneke (stellv. Vorsitzender), Dr. Astrid Lambrecht, Prof. Dr. Frauke Melchior --------------------------------------------------------------------------------------------- --------------------------------------------------------------------------------------------- -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5938 bytes Desc: S/MIME Cryptographic Signature URL: From marcus at koenighome.de Wed Sep 6 10:32:36 2023 From: marcus at koenighome.de (Marcus Koenig) Date: Wed, 6 Sep 2023 21:32:36 +1200 Subject: [gpfsug-discuss] mmbackup feature request In-Reply-To: <4de6d5a1-0f29-25a0-94c7-3bd85613f2a7@fz-juelich.de> References: <4de6d5a1-0f29-25a0-94c7-3bd85613f2a7@fz-juelich.de> Message-ID: I'm using this one liner to get the progress grep 'mmbackup:Backup job finished'|cut -d ":" -f 6|awk '{print $1}'|awk '{s+=$1}END{print s}' That can be compared to the files identified during the scan. On Wed, 6 Sept 2023, 21:29 Stephan Graf, wrote: > Hi > > I think it should be possible because mmbackup know, how many files are > to be backed up, which have been already processed and how many are > still to go. > > BTW it would also be nice to have an option in mmbackup to generate > machine readable log file like JSON or CSV. > > But the right way to ask for a new feature or to look if there is > already a request open is the IBM IDEA portal: > > https://ideas.ibm.com > > Stephan > > On 9/6/23 11:02, Jonathan Buzzard wrote: > > > > Would it be possible to have the mmbackup output display the percentage > > output progress when backing up files? > > > > So at the top we you see something like this > > > > Tue Sep 5 23:13:35 2023 mmbackup:changed=747204, expired=427702, > > unsupported=0 for server [XXXX] > > > > Then after it does the expiration you see during the backup lines like > > > > Wed Sep 6 02:43:53 2023 mmbackup:Backing up files: 527024 backed up, > > 426018 expired, 4408 failed. (Backup job exit with 4) > > > > It would IMHO be helpful if it looked like > > > > Wed Sep 6 02:43:53 2023 mmbackup:Backing up files: 527024 (70.5%) > > backed up, 426018 (100%) expired, 4408 failed. (Backup job exit with 4) > > > > Just based on the number of files. Though as I look at it now I am > > curious about the discrepancy in the number of files expired, given that > > the expiration stage allegedly concluded with no errors? > > > > Tue Sep 5 23:21:49 2023 mmbackup:Completed policy expiry run with 0 > > policy errors, 0 files failed, 0 severe errors, returning rc=0. > > Tue Sep 5 23:21:49 2023 mmbackup:Policy for expiry returned 0 Highest > > TSM error 0 > > > > > > > > JAB. 
> > > > -- > Stephan Graf > Juelich Supercomputing Centre > > Phone: +49-2461-61-6578 > Fax: +49-2461-61-6656 > E-mail: st.graf at fz-juelich.de > WWW: http://www.fz-juelich.de/jsc/ > > --------------------------------------------------------------------------------------------- > > --------------------------------------------------------------------------------------------- > Forschungszentrum Juelich GmbH > 52425 Juelich > Sitz der Gesellschaft: Juelich > Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498 > Vorsitzender des Aufsichtsrats: MinDir Volker Rieke > Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender), > Karsten Beneke (stellv. Vorsitzender), Dr. Astrid Lambrecht, > Prof. Dr. Frauke Melchior > > --------------------------------------------------------------------------------------------- > > --------------------------------------------------------------------------------------------- > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From christian.petersson at isstech.io Wed Sep 6 10:44:56 2023 From: christian.petersson at isstech.io (Christian Petersson) Date: Wed, 6 Sep 2023 11:44:56 +0200 Subject: [gpfsug-discuss] mmbackup feature request In-Reply-To: References: <4de6d5a1-0f29-25a0-94c7-3bd85613f2a7@fz-juelich.de> Message-ID: Just a follow up question, how do you backup multiple filesets? We have a 50 filesets to backup, at the moment do we have a text file that contains all of them and we run a for loop. But that is not at all scalable. Is it any other ways that are much better? /Christian ons 6 sep. 2023 kl. 11:35 skrev Marcus Koenig : > I'm using this one liner to get the progress > > grep 'mmbackup:Backup job finished'|cut -d ":" -f 6|awk '{print $1}'|awk > '{s+=$1}END{print s}' > > That can be compared to the files identified during the scan. > > On Wed, 6 Sept 2023, 21:29 Stephan Graf, wrote: > >> Hi >> >> I think it should be possible because mmbackup know, how many files are >> to be backed up, which have been already processed and how many are >> still to go. >> >> BTW it would also be nice to have an option in mmbackup to generate >> machine readable log file like JSON or CSV. >> >> But the right way to ask for a new feature or to look if there is >> already a request open is the IBM IDEA portal: >> >> https://ideas.ibm.com >> >> Stephan >> >> On 9/6/23 11:02, Jonathan Buzzard wrote: >> > >> > Would it be possible to have the mmbackup output display the percentage >> > output progress when backing up files? >> > >> > So at the top we you see something like this >> > >> > Tue Sep 5 23:13:35 2023 mmbackup:changed=747204, expired=427702, >> > unsupported=0 for server [XXXX] >> > >> > Then after it does the expiration you see during the backup lines like >> > >> > Wed Sep 6 02:43:53 2023 mmbackup:Backing up files: 527024 backed up, >> > 426018 expired, 4408 failed. (Backup job exit with 4) >> > >> > It would IMHO be helpful if it looked like >> > >> > Wed Sep 6 02:43:53 2023 mmbackup:Backing up files: 527024 (70.5%) >> > backed up, 426018 (100%) expired, 4408 failed. (Backup job exit with 4) >> > >> > Just based on the number of files. Though as I look at it now I am >> > curious about the discrepancy in the number of files expired, given >> that >> > the expiration stage allegedly concluded with no errors? 
>> > >> > Tue Sep 5 23:21:49 2023 mmbackup:Completed policy expiry run with 0 >> > policy errors, 0 files failed, 0 severe errors, returning rc=0. >> > Tue Sep 5 23:21:49 2023 mmbackup:Policy for expiry returned 0 Highest >> > TSM error 0 >> > >> > >> > >> > JAB. >> > >> >> -- >> Stephan Graf >> Juelich Supercomputing Centre >> >> Phone: +49-2461-61-6578 >> Fax: +49-2461-61-6656 >> E-mail: st.graf at fz-juelich.de >> WWW: http://www.fz-juelich.de/jsc/ >> >> --------------------------------------------------------------------------------------------- >> >> --------------------------------------------------------------------------------------------- >> Forschungszentrum Juelich GmbH >> 52425 Juelich >> Sitz der Gesellschaft: Juelich >> Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498 >> Vorsitzender des Aufsichtsrats: MinDir Volker Rieke >> Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender), >> Karsten Beneke (stellv. Vorsitzender), Dr. Astrid Lambrecht, >> Prof. Dr. Frauke Melchior >> >> --------------------------------------------------------------------------------------------- >> >> --------------------------------------------------------------------------------------------- >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org >> > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From st.graf at fz-juelich.de Wed Sep 6 11:26:26 2023 From: st.graf at fz-juelich.de (Stephan Graf) Date: Wed, 6 Sep 2023 12:26:26 +0200 Subject: [gpfsug-discuss] mmbackup feature request In-Reply-To: References: <4de6d5a1-0f29-25a0-94c7-3bd85613f2a7@fz-juelich.de> Message-ID: We have written a python scheduler which is using the mmlsfileset command. By options we can decide how many filesets are backed up in parallel. Stephan On 9/6/23 11:44, Christian Petersson wrote: > Just a follow up question, how do you backup multiple filesets? > We have a 50 filesets to backup, at the moment do we have a text file > that contains all of them and we run a for loop. But that is not at all > scalable. > > Is it any other ways that are much better? > > /Christian > > ons 6 sep. 2023 kl. 11:35 skrev Marcus Koenig >: > > I'm using this one liner to get the progress > > grep 'mmbackup:Backup job finished'|cut -d ":" -f 6|awk '{print > $1}'|awk '{s+=$1}END{print s}' > > That can be compared to the files identified during the scan. > > On Wed, 6 Sept 2023, 21:29 Stephan Graf, > wrote: > > Hi > > I think it should be possible because mmbackup know, how many > files are > to be backed up, which have been already processed and how many are > still to go. > > BTW it would also be nice to have an option in mmbackup to generate > machine readable log file like JSON or CSV. > > But the right way to ask for a new feature or to look if there is > already a request open is the IBM IDEA portal: > > https://ideas.ibm.com > > Stephan > > On 9/6/23 11:02, Jonathan Buzzard wrote: > > > > Would it be possible to have the mmbackup output display the > percentage > > output progress when backing up files? > > > > So at the top we you see something like this > > > > Tue Sep? 
5 23:13:35 2023 mmbackup:changed=747204, > expired=427702, > > unsupported=0 for server [XXXX] > > > > Then after it does the expiration you see during the backup > lines like > > > > Wed Sep? 6 02:43:53 2023 mmbackup:Backing up files: 527024 > backed up, > > 426018 expired, 4408 failed. (Backup job exit with 4) > > > > It would IMHO be helpful if it looked like > > > > Wed Sep? 6 02:43:53 2023 mmbackup:Backing up files: 527024 > (70.5%) > > backed up, 426018 (100%) expired, 4408 failed. (Backup job > exit with 4) > > > > Just based on the number of files. Though as I look at it now > I am > > curious about the discrepancy in the number of files expired, > given that > > the expiration stage allegedly concluded with no errors? > > > > Tue Sep? 5 23:21:49 2023 mmbackup:Completed policy expiry run > with 0 > > policy errors, 0 files failed, 0 severe errors, returning rc=0. > > Tue Sep? 5 23:21:49 2023 mmbackup:Policy for expiry returned > 0 Highest > > TSM error 0 > > > > > > > > JAB. > > > > -- > Stephan Graf > Juelich Supercomputing Centre > > Phone:? +49-2461-61-6578 > Fax:? ? +49-2461-61-6656 > E-mail: st.graf at fz-juelich.de > WWW: http://www.fz-juelich.de/jsc/ > --------------------------------------------------------------------------------------------- > --------------------------------------------------------------------------------------------- > Forschungszentrum Juelich GmbH > 52425 Juelich > Sitz der Gesellschaft: Juelich > Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498 > Vorsitzender des Aufsichtsrats: MinDir Volker Rieke > Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt > (Vorsitzender), > Karsten Beneke (stellv. Vorsitzender), Dr. Astrid Lambrecht, > Prof. Dr. Frauke Melchior > --------------------------------------------------------------------------------------------- > --------------------------------------------------------------------------------------------- > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org -- Stephan Graf Juelich Supercomputing Centre Phone: +49-2461-61-6578 Fax: +49-2461-61-6656 E-mail: st.graf at fz-juelich.de WWW: http://www.fz-juelich.de/jsc/ --------------------------------------------------------------------------------------------- --------------------------------------------------------------------------------------------- Forschungszentrum Juelich GmbH 52425 Juelich Sitz der Gesellschaft: Juelich Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498 Vorsitzender des Aufsichtsrats: MinDir Volker Rieke Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender), Karsten Beneke (stellv. Vorsitzender), Dr. Astrid Lambrecht, Prof. Dr. Frauke Melchior --------------------------------------------------------------------------------------------- --------------------------------------------------------------------------------------------- -------------- next part -------------- A non-text attachment was scrubbed... 
Name: smime.p7s Type: application/pkcs7-signature Size: 5938 bytes Desc: S/MIME Cryptographic Signature URL: From skylar2 at uw.edu Wed Sep 6 15:34:33 2023 From: skylar2 at uw.edu (Skylar Thompson) Date: Wed, 6 Sep 2023 07:34:33 -0700 Subject: [gpfsug-discuss] mmbackup feature request In-Reply-To: References: Message-ID: We have a hacky way of doing this by getting the size of the update and expire lists, i.e.: wc -l /gpfs/fs1/.mmbackupCfg/updatedFiles/.list.1.node-name And also each index number of the chunked file lists that dsmc is processing, for instance: -filelist=/gpfs/fs1/.mmbackupCfg/mmbackupChanged.ix.13589.0424C05C.465.fs1 Which shows index 465. The final bit of information you need is the MaxFiles setting of mmbackup, which is calculated automatically by default but is visible in mmapplypolicy and tsapolicy command lines in /proc in the -B value (ours is 32768). Once you have all of that information, you can calculate the number of file lists that will be processed across all the backup nodes by dividing the length of the updatedFiles and expireFiles lists by the MaxFiles setting to get the total number of file lists that will be processed, and the number of file lists in by adding up the index numbers from all of your backup nodes. That said, if there were an easy-to-parse option in the logs, we would certainly use it. :) On Wed, Sep 06, 2023 at 10:02:17AM +0100, Jonathan Buzzard wrote: > > Would it be possible to have the mmbackup output display the percentage > output progress when backing up files? > > So at the top we you see something like this > > Tue Sep 5 23:13:35 2023 mmbackup:changed=747204, expired=427702, > unsupported=0 for server [XXXX] > > Then after it does the expiration you see during the backup lines like > > Wed Sep 6 02:43:53 2023 mmbackup:Backing up files: 527024 backed up, 426018 > expired, 4408 failed. (Backup job exit with 4) > > It would IMHO be helpful if it looked like > > Wed Sep 6 02:43:53 2023 mmbackup:Backing up files: 527024 (70.5%) backed > up, 426018 (100%) expired, 4408 failed. (Backup job exit with 4) > > Just based on the number of files. Though as I look at it now I am curious > about the discrepancy in the number of files expired, given that the > expiration stage allegedly concluded with no errors? > > Tue Sep 5 23:21:49 2023 mmbackup:Completed policy expiry run with 0 policy > errors, 0 files failed, 0 severe errors, returning rc=0. > Tue Sep 5 23:21:49 2023 mmbackup:Policy for expiry returned 0 Highest TSM > error 0 > > > > JAB. > > -- > Jonathan A. Buzzard Tel: +44141-5483420 > HPC System Administrator, ARCHIE-WeSt. > University of Strathclyde, John Anderson Building, Glasgow. 
G4 0NG > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > https://urldefense.com/v3/__http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org__;!!K-Hz7m0Vt54!kKHB7ax8P1MfIPjQ35h55FpbmMkyaHMoKFgvcA6DjcVkZpEfeSXgkvXueM9LKJ0wWDND1YATiCfht16ed_PWfVNKevmKpYE$ -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department (UW Medicine), System Administrator -- Foege Building S046, (206)-685-7354 -- Pronouns: He/Him/His From skylar2 at uw.edu Wed Sep 6 15:45:26 2023 From: skylar2 at uw.edu (Skylar Thompson) Date: Wed, 6 Sep 2023 07:45:26 -0700 Subject: [gpfsug-discuss] mmbackup feature request In-Reply-To: References: <4de6d5a1-0f29-25a0-94c7-3bd85613f2a7@fz-juelich.de> Message-ID: We use independent filesets so we can snapshot individually, and then a separate TSM node for each fileset. Finally we have a "canary" directory that's local to the node running the schedule that dsmc can backup as a virtual mount point that we use to track backup success/failure, since mmbackup-driven backups won't update the last backup time of the GPFS filesystem. All of the configuration is Puppet-managed, which sets up a separate dsmcad service for each fileset that connects to the TSM server using the fileset-specific node, which is associated with a schedule that runs the backup. Relevant details from one of our schedules: Node Name: GPFS-GS2-GS-VOL5 Schedule Name: DAILY_GPFS-GS2-GS-VOL5 Schedule Style: Classic Action: Incremental Options: -presched='/usr/local/sbin/backup-gpfs -s gpfs-gs2-gs-vol5' -dom=/mnt/canary.nobackup/gpfs-gs2-gs-vol5 Objects: Priority: 5 Next Execution: 13 Hours Duration: 5 Minutes Period: 1 Day Day of Week: Any The backup-gpfs script sets up the snapshot, backs up some metadata about the filesets so we can recover configuration easily (size, inode limit, junction, etc.), runs mmbackup, and will exit with success/failure depending on what the underlying commands return. It also logs mmbackup output to Splunk which makes log trawling a lot easier. If it runs into a problem, a trap removes the snapshot so we can run the backup again without manual intervention. If the /mnt/canary.nobackup/gpfs-gs2-gs-vol5 filespace associated with the GPFS-GS2-GS-VOL5 node gets more than a couple days behind, we'll notice that with our monitoring of the filespace last backup time (same as our non-GPFS filespaces) and we can go in and take a look. We use this methodology to backup five entire GPFS filesystems for our larger labs, along with a sixth that's divided into 76 filesets as a storage condo for our smaller labs. There's some quirks but scales better than trying to run mmbackup independently for everything. Hope that helps! On Wed, Sep 06, 2023 at 11:44:56AM +0200, Christian Petersson wrote: > Just a follow up question, how do you backup multiple filesets? > We have a 50 filesets to backup, at the moment do we have a text file that > contains all of them and we run a for loop. But that is not at all > scalable. > > Is it any other ways that are much better? > > /Christian > > ons 6 sep. 2023 kl. 11:35 skrev Marcus Koenig : > > > I'm using this one liner to get the progress > > > > grep 'mmbackup:Backup job finished'|cut -d ":" -f 6|awk '{print $1}'|awk > > '{s+=$1}END{print s}' > > > > That can be compared to the files identified during the scan. 
> > > > On Wed, 6 Sept 2023, 21:29 Stephan Graf, wrote: > > > >> Hi > >> > >> I think it should be possible because mmbackup know, how many files are > >> to be backed up, which have been already processed and how many are > >> still to go. > >> > >> BTW it would also be nice to have an option in mmbackup to generate > >> machine readable log file like JSON or CSV. > >> > >> But the right way to ask for a new feature or to look if there is > >> already a request open is the IBM IDEA portal: > >> > >> https://urldefense.com/v3/__https://ideas.ibm.com__;!!K-Hz7m0Vt54!mV9vXvf6GYeaY4hHi834eHy2L_41MW2qtO-ZhGwdc1U5YqN7WAiEI6GB6IH2aXbcUw_gfkBdDK9jsSAfPU3tTuUNUQmn$ > >> > >> Stephan > >> > >> On 9/6/23 11:02, Jonathan Buzzard wrote: > >> > > >> > Would it be possible to have the mmbackup output display the percentage > >> > output progress when backing up files? > >> > > >> > So at the top we you see something like this > >> > > >> > Tue Sep 5 23:13:35 2023 mmbackup:changed=747204, expired=427702, > >> > unsupported=0 for server [XXXX] > >> > > >> > Then after it does the expiration you see during the backup lines like > >> > > >> > Wed Sep 6 02:43:53 2023 mmbackup:Backing up files: 527024 backed up, > >> > 426018 expired, 4408 failed. (Backup job exit with 4) > >> > > >> > It would IMHO be helpful if it looked like > >> > > >> > Wed Sep 6 02:43:53 2023 mmbackup:Backing up files: 527024 (70.5%) > >> > backed up, 426018 (100%) expired, 4408 failed. (Backup job exit with 4) > >> > > >> > Just based on the number of files. Though as I look at it now I am > >> > curious about the discrepancy in the number of files expired, given > >> that > >> > the expiration stage allegedly concluded with no errors? > >> > > >> > Tue Sep 5 23:21:49 2023 mmbackup:Completed policy expiry run with 0 > >> > policy errors, 0 files failed, 0 severe errors, returning rc=0. > >> > Tue Sep 5 23:21:49 2023 mmbackup:Policy for expiry returned 0 Highest > >> > TSM error 0 > >> > > >> > > >> > > >> > JAB. > >> > > >> > >> -- > >> Stephan Graf > >> Juelich Supercomputing Centre > >> > >> Phone: +49-2461-61-6578 > >> Fax: +49-2461-61-6656 > >> E-mail: st.graf at fz-juelich.de > >> WWW: https://urldefense.com/v3/__http://www.fz-juelich.de/jsc/__;!!K-Hz7m0Vt54!mV9vXvf6GYeaY4hHi834eHy2L_41MW2qtO-ZhGwdc1U5YqN7WAiEI6GB6IH2aXbcUw_gfkBdDK9jsSAfPU3tTkQ48VFb$ > >> > >> --------------------------------------------------------------------------------------------- > >> > >> --------------------------------------------------------------------------------------------- > >> Forschungszentrum Juelich GmbH > >> 52425 Juelich > >> Sitz der Gesellschaft: Juelich > >> Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498 > >> Vorsitzender des Aufsichtsrats: MinDir Volker Rieke > >> Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender), > >> Karsten Beneke (stellv. Vorsitzender), Dr. Astrid Lambrecht, > >> Prof. Dr. 
Frauke Melchior > >> > >> --------------------------------------------------------------------------------------------- > >> > >> --------------------------------------------------------------------------------------------- > >> _______________________________________________ > >> gpfsug-discuss mailing list > >> gpfsug-discuss at gpfsug.org > >> https://urldefense.com/v3/__http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org__;!!K-Hz7m0Vt54!mV9vXvf6GYeaY4hHi834eHy2L_41MW2qtO-ZhGwdc1U5YqN7WAiEI6GB6IH2aXbcUw_gfkBdDK9jsSAfPU3tTgsSVVvv$ > >> > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at gpfsug.org > > https://urldefense.com/v3/__http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org__;!!K-Hz7m0Vt54!mV9vXvf6GYeaY4hHi834eHy2L_41MW2qtO-ZhGwdc1U5YqN7WAiEI6GB6IH2aXbcUw_gfkBdDK9jsSAfPU3tTgsSVVvv$ > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > https://urldefense.com/v3/__http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org__;!!K-Hz7m0Vt54!mV9vXvf6GYeaY4hHi834eHy2L_41MW2qtO-ZhGwdc1U5YqN7WAiEI6GB6IH2aXbcUw_gfkBdDK9jsSAfPU3tTgsSVVvv$ -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department (UW Medicine), System Administrator -- Foege Building S046, (206)-685-7354 -- Pronouns: He/Him/His From ewahl at osc.edu Wed Sep 6 18:14:53 2023 From: ewahl at osc.edu (Wahl, Edward) Date: Wed, 6 Sep 2023 17:14:53 +0000 Subject: [gpfsug-discuss] mmbackup feature request In-Reply-To: References: <4de6d5a1-0f29-25a0-94c7-3bd85613f2a7@fz-juelich.de> Message-ID: We have about 760-ish independent filesets. As was mentioned before in a reply, this allows for individual fileset snapshotting, and running on different TSM servers. We maintain a puppet-managed list that we use to divide up the filesets, . automation helps us round-robin new filesets across the 4 backup servers as they are added to attempt to balance things somewhat. We maintain 7 days of snapshots on the filesystem we backup, and no snapshots or backups on our scratch space. We hand out the mmbackups to 4 individual TSM backup clients which do both our daily mmbackup, and NetApp snappdiff backups for our user home directories as well. Those feed to another 4 TSM servers doing the tape migrations. We?re sitting on about ~20P of disk at this time and we?re (very) roughly 50% occupied. One of our challenges recently was re-balancing all this for remote Disaster Recovery/Replication. We ended up using colocation groups of the filesets in Spectrum Protect/TSM. While scaling backup infrastructure can be hard, balancing hundreds of Wildly differing filesets can be just as challenging. I?m happy to talk about these kinds of things here, or offline. Drop me a line if you have additional questions. Ed Wahl Ohio Supercomputer Center From: gpfsug-discuss On Behalf Of Christian Petersson Sent: Wednesday, September 6, 2023 5:45 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] mmbackup feature request Just a follow up question, how do you backup multiple filesets? We have a 50 filesets to backup, at the moment do we have a text file that contains all of them and we run a for loop. But that is not at all scalable.? Is it any other ways that Just a follow up question, how do you backup multiple filesets? We have a 50 filesets to backup, at the moment do we have a text file that contains all of them and we run a for loop. But that is not at all scalable. 
Is it any other ways that are much better? /Christian ons 6 sep. 2023 kl. 11:35 skrev Marcus Koenig >: I'm using this one liner to get the progress grep 'mmbackup:Backup job finished'|cut -d ":" -f 6|awk '{print $1}'|awk '{s+=$1}END{print s}' That can be compared to the files identified during the scan. On Wed, 6 Sept 2023, 21:29 Stephan Graf, > wrote: Hi I think it should be possible because mmbackup know, how many files are to be backed up, which have been already processed and how many are still to go. BTW it would also be nice to have an option in mmbackup to generate machine readable log file like JSON or CSV. But the right way to ask for a new feature or to look if there is already a request open is the IBM IDEA portal: https://ideas.ibm.com Stephan On 9/6/23 11:02, Jonathan Buzzard wrote: > > Would it be possible to have the mmbackup output display the percentage > output progress when backing up files? > > So at the top we you see something like this > > Tue Sep 5 23:13:35 2023 mmbackup:changed=747204, expired=427702, > unsupported=0 for server [XXXX] > > Then after it does the expiration you see during the backup lines like > > Wed Sep 6 02:43:53 2023 mmbackup:Backing up files: 527024 backed up, > 426018 expired, 4408 failed. (Backup job exit with 4) > > It would IMHO be helpful if it looked like > > Wed Sep 6 02:43:53 2023 mmbackup:Backing up files: 527024 (70.5%) > backed up, 426018 (100%) expired, 4408 failed. (Backup job exit with 4) > > Just based on the number of files. Though as I look at it now I am > curious about the discrepancy in the number of files expired, given that > the expiration stage allegedly concluded with no errors? > > Tue Sep 5 23:21:49 2023 mmbackup:Completed policy expiry run with 0 > policy errors, 0 files failed, 0 severe errors, returning rc=0. > Tue Sep 5 23:21:49 2023 mmbackup:Policy for expiry returned 0 Highest > TSM error 0 > > > > JAB. > -- Stephan Graf Juelich Supercomputing Centre Phone: +49-2461-61-6578 Fax: +49-2461-61-6656 E-mail: st.graf at fz-juelich.de WWW: http://www.fz-juelich.de/jsc/ --------------------------------------------------------------------------------------------- --------------------------------------------------------------------------------------------- Forschungszentrum Juelich GmbH 52425 Juelich Sitz der Gesellschaft: Juelich Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498 Vorsitzender des Aufsichtsrats: MinDir Volker Rieke Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender), Karsten Beneke (stellv. Vorsitzender), Dr. Astrid Lambrecht, Prof. Dr. Frauke Melchior --------------------------------------------------------------------------------------------- --------------------------------------------------------------------------------------------- _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From anacreo at gmail.com Wed Sep 6 18:59:24 2023 From: anacreo at gmail.com (Alec) Date: Wed, 6 Sep 2023 10:59:24 -0700 Subject: [gpfsug-discuss] mmbackup feature request In-Reply-To: References: <4de6d5a1-0f29-25a0-94c7-3bd85613f2a7@fz-juelich.de> Message-ID: I'll chime in with we don't use mmbackup at all... 
We use NetBackup accelerated backups to backup. It has worked well and we are able to do daily incrementals and weekend fulls with 100s of TBs in one filesystem and 100s of filesets. The biggest challenge we face is defining the numerous streams in NetBackup policy to keep things parallel. But we just do good record keeping in our change management. When we have to do a 'rescan' or BCP fail over it is a chore to reach equilibrium again. Our synthetic accelerated full backup runs at about 700GB/hr for our weekend fulls... so we finish well before most other smaller traditional fileserver clients. Best part is this is one place where we are nearly just a regular old commodity client with ridiculously impressive stats. Alec On Wed, Sep 6, 2023, 10:19 AM Wahl, Edward wrote: > > > We have about 760-ish independent filesets. As was mentioned before in > a reply, this allows for individual fileset snapshotting, and running on > different TSM servers. We maintain a puppet-managed list that we use to > divide up the filesets, . automation helps us round-robin new filesets > across the 4 backup servers as they are added to attempt to balance things > somewhat. We maintain 7 days of snapshots on the filesystem we backup, > and no snapshots or backups on our scratch space. > > > > We hand out the mmbackups to 4 individual TSM backup clients which do both > our daily mmbackup, and NetApp snappdiff backups for our user home > directories as well. Those feed to another 4 TSM servers doing the tape > migrations. > > We?re sitting on about ~20P of disk at this time and we?re (very) roughly > 50% occupied. > > > > One of our challenges recently was re-balancing all this for remote > Disaster Recovery/Replication. We ended up using colocation groups of the > filesets in Spectrum Protect/TSM. While scaling backup infrastructure can > be hard, balancing hundreds of Wildly differing filesets can be just as > challenging. > > > > I?m happy to talk about these kinds of things here, or offline. Drop me a > line if you have additional questions. > > > > Ed Wahl > > Ohio Supercomputer Center > > > > *From:* gpfsug-discuss *On Behalf Of *Christian > Petersson > *Sent:* Wednesday, September 6, 2023 5:45 AM > *To:* gpfsug main discussion list > *Subject:* Re: [gpfsug-discuss] mmbackup feature request > > > > Just a follow up question, how do you backup multiple filesets? We have a > 50 filesets to backup, at the moment do we have a text file that contains > all of them and we run a for loop. But that is not at all scalable. Is it > any other ways that > > Just a follow up question, how do you backup multiple filesets? > > We have a 50 filesets to backup, at the moment do we have a text file that > contains all of them and we run a for loop. But that is not at all > scalable. > > > > Is it any other ways that are much better? > > > > /Christian > > > > ons 6 sep. 2023 kl. 11:35 skrev Marcus Koenig : > > I'm using this one liner to get the progress > > > > grep 'mmbackup:Backup job finished'|cut -d ":" -f 6|awk '{print $1}'|awk > '{s+=$1}END{print s}' > > > > That can be compared to the files identified during the scan. > > > > On Wed, 6 Sept 2023, 21:29 Stephan Graf, wrote: > > Hi > > I think it should be possible because mmbackup know, how many files are > to be backed up, which have been already processed and how many are > still to go. > > BTW it would also be nice to have an option in mmbackup to generate > machine readable log file like JSON or CSV. 
> > But the right way to ask for a new feature or to look if there is > already a request open is the IBM IDEA portal: > > https://ideas.ibm.com > > > Stephan > > On 9/6/23 11:02, Jonathan Buzzard wrote: > > > > Would it be possible to have the mmbackup output display the percentage > > output progress when backing up files? > > > > So at the top we you see something like this > > > > Tue Sep 5 23:13:35 2023 mmbackup:changed=747204, expired=427702, > > unsupported=0 for server [XXXX] > > > > Then after it does the expiration you see during the backup lines like > > > > Wed Sep 6 02:43:53 2023 mmbackup:Backing up files: 527024 backed up, > > 426018 expired, 4408 failed. (Backup job exit with 4) > > > > It would IMHO be helpful if it looked like > > > > Wed Sep 6 02:43:53 2023 mmbackup:Backing up files: 527024 (70.5%) > > backed up, 426018 (100%) expired, 4408 failed. (Backup job exit with 4) > > > > Just based on the number of files. Though as I look at it now I am > > curious about the discrepancy in the number of files expired, given that > > the expiration stage allegedly concluded with no errors? > > > > Tue Sep 5 23:21:49 2023 mmbackup:Completed policy expiry run with 0 > > policy errors, 0 files failed, 0 severe errors, returning rc=0. > > Tue Sep 5 23:21:49 2023 mmbackup:Policy for expiry returned 0 Highest > > TSM error 0 > > > > > > > > JAB. > > > > -- > Stephan Graf > Juelich Supercomputing Centre > > Phone: +49-2461-61-6578 > Fax: +49-2461-61-6656 > E-mail: st.graf at fz-juelich.de > WWW: http://www.fz-juelich.de/jsc/ > > > --------------------------------------------------------------------------------------------- > > --------------------------------------------------------------------------------------------- > Forschungszentrum Juelich GmbH > 52425 Juelich > Sitz der Gesellschaft: Juelich > Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498 > Vorsitzender des Aufsichtsrats: MinDir Volker Rieke > Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender), > Karsten Beneke (stellv. Vorsitzender), Dr. Astrid Lambrecht, > Prof. Dr. Frauke Melchior > > --------------------------------------------------------------------------------------------- > > --------------------------------------------------------------------------------------------- > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From martin at uni-mainz.de Wed Sep 6 19:55:44 2023 From: martin at uni-mainz.de (Christoph Martin) Date: Wed, 6 Sep 2023 20:55:44 +0200 Subject: [gpfsug-discuss] frequent OOM killer due to high memory usage of mmfsd Message-ID: Hi all, on a three node GPFS cluster with CES enabled and AFM-DR mirroring to a second cluster we see frequent OOM killer events due to a constantly growing mmfsd. The machines have 256G memory. The pagepool is configured to 16G. The GPFS version is 5.1.6-1. After a restart mmfsd rapidly grows to about 100G usage and grows over some days up to 250G virtual and 220G physical memory usage. 
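One way to see how fast it grows, and to correlate the ramp with client
activity later, is to sample the daemon's memory at a fixed interval; a rough
sketch (interval and output path are assumptions, ps and mmdiag are standard):

OUT=/var/tmp/mmfsd-mem.log     # placeholder output file

# Sample every 10 minutes; stop with Ctrl-C or kill once enough is collected.
while true; do
    ts=$(date '+%Y-%m-%d %H:%M:%S')
    # Virtual and resident size of the daemon in kB, as the kernel sees it.
    echo "$ts mmfsd vsz/rss(kB): $(ps -C mmfsd -o vsz=,rss= | head -1)" >> "$OUT"
    # Scale's own breakdown (heap, pools, shared segment).
    echo "$ts $(/usr/lpp/mmfs/bin/mmdiag --memory | tr '\n' ' ')" >> "$OUT"
    sleep 600
done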
OOMkiller tries kill process like pmcollector or others and sometime kills mmfsd. Does anybody see a similar behavior? Any guess what could help with this problem? Regards Christoph Martin -- Christoph Martin Zentrum f?r Datenverarbeitung (ZDV) Leiter Unix & Cloud Johannes Gutenberg-Universit?t Mainz Anselm Franz von Bentzel-Weg 12, 55128 Mainz Tel: +49 6131 39 26337 martin at uni-mainz.de www.zdv.uni-mainz.de -------------- next part -------------- A non-text attachment was scrubbed... Name: OpenPGP_signature Type: application/pgp-signature Size: 833 bytes Desc: OpenPGP digital signature URL: From christian.petersson at isstech.io Wed Sep 6 21:33:31 2023 From: christian.petersson at isstech.io (Christian Petersson) Date: Wed, 6 Sep 2023 22:33:31 +0200 Subject: [gpfsug-discuss] frequent OOM killer due to high memory usage of mmfsd In-Reply-To: References: Message-ID: Hi, This is a settings, we had the exact same issue in the past and when we change the following parameters it has never killed the CES nodes anymore. maxFilesToCache=1000000 maxStatCache=100000 Thanks Christian On Wed, 6 Sept 2023 at 20:59, Christoph Martin wrote: > Hi all, > > on a three node GPFS cluster with CES enabled and AFM-DR mirroring to a > second cluster we see frequent OOM killer events due to a constantly > growing mmfsd. > The machines have 256G memory. The pagepool is configured to 16G. > The GPFS version is 5.1.6-1. > After a restart mmfsd rapidly grows to about 100G usage and grows over > some days up to 250G virtual and 220G physical memory usage. > OOMkiller tries kill process like pmcollector or others and sometime > kills mmfsd. > > Does anybody see a similar behavior? > Any guess what could help with this problem? > > Regards > Christoph Martin > > -- > Christoph Martin > Zentrum f?r Datenverarbeitung (ZDV) > Leiter Unix & Cloud > > Johannes Gutenberg-Universit?t Mainz > Anselm Franz von Bentzel-Weg 12, 55128 Mainz > Tel: +49 6131 39 26337 > martin at uni-mainz.de > www.zdv.uni-mainz.de > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org > -- Med V?nliga H?lsningar Christian Petersson E-Post: Christian.Petersson at isstech.io Mobil: 070-3251577 -------------- next part -------------- An HTML attachment was scrubbed... URL: From st.graf at fz-juelich.de Thu Sep 7 07:50:14 2023 From: st.graf at fz-juelich.de (Stephan Graf) Date: Thu, 7 Sep 2023 08:50:14 +0200 Subject: [gpfsug-discuss] frequent OOM killer due to high memory usage of mmfsd In-Reply-To: References: Message-ID: <057ef098-6734-30ab-fe3a-b732e40f688c@fz-juelich.de> Hi in the past we had issues with the mmdf heap memory. Due to special workload it increased and took GB of memory, but after usage it was not freed again. we had long discussions with IBM about it and it ends up in a Development User Story (261213) which was realized in 5.1.2: --- In this story, the InodeAllocSegment object will be allocated when accessed. For commands that iterates all InodeAllocSegment, we will release the object immediately after use. An undocumented configuration "!maxIAllocSegmentsToCache" is provided to control the upper limit of the count of InodeAllocSegment objects. When the count approaches the limit, a pre stealing thread will be started to steal and release some InodeAllocSegment objects. Its default value is 1000,000. --- since than we are fine so far. 
But this was on plain GPFS clients, with no CES node where services like
NFS come into play.

You can monitor the heap memory usage with "mmdiag --memory".

@IBM colleagues: if there is something wrong in my explanation, please
correct me.

Stephan

On 9/6/23 20:55, Christoph Martin wrote:
> Hi all,
>
> on a three node GPFS cluster with CES enabled and AFM-DR mirroring to a
> second cluster we see frequent OOM killer events due to a constantly
> growing mmfsd.
> The machines have 256G memory. The pagepool is configured to 16G.
> The GPFS version is 5.1.6-1.
> After a restart mmfsd rapidly grows to about 100G usage and grows over
> some days up to 250G virtual and 220G physical memory usage.
> OOMkiller tries kill process like pmcollector or others and sometime
> kills mmfsd.
>
> Does anybody see a similar behavior?
> Any guess what could help with this problem?
>
> Regards
> Christoph Martin
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at gpfsug.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org

-- 
Stephan Graf
Juelich Supercomputing Centre

Phone:  +49-2461-61-6578
Fax:    +49-2461-61-6656
E-mail: st.graf at fz-juelich.de
WWW:    http://www.fz-juelich.de/jsc/
---------------------------------------------------------------------------------------------
Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDir Volker Rieke
Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
Karsten Beneke (stellv. Vorsitzender), Dr. Astrid Lambrecht,
Prof. Dr. Frauke Melchior
---------------------------------------------------------------------------------------------
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 5938 bytes
Desc: S/MIME Cryptographic Signature
URL: 

From Achim.Rehor at de.ibm.com  Thu Sep  7 09:13:16 2023
From: Achim.Rehor at de.ibm.com (Achim Rehor)
Date: Thu, 7 Sep 2023 08:13:16 +0000
Subject: [gpfsug-discuss] frequent OOM killer due to high memory usage of mmfsd
In-Reply-To: <057ef098-6734-30ab-fe3a-b732e40f688c@fz-juelich.de>
References: <057ef098-6734-30ab-fe3a-b732e40f688c@fz-juelich.de>
Message-ID: <5f019fb538d38e14e6db554cfe0907e89a794285.camel@de.ibm.com>

Thanks Stephan, a valuable hint!

@Christoph: the memory footprint of mmfsd can get quite large on a CES
system, given the usual tuning recommendations for CES nodes.

mmfsd is using the pagepool (so 16 GB in your case) plus the caches for the
management of maxFilesToCache and maxStatCache outside the pagepool, which
is roughly 10 kB per FileToCache entry and roughly 0.5 kB per StatCache
entry. That sums up to about 40 GB for 4,000,000 maxFilesToCache entries
and 2 GB for the same number of StatCache entries.

In addition, if your node is a manager node it will also use some space for
token memory (depending on the maxFilesToCache settings across your full
cluster, the number of manager nodes, etc.).

So 256 GB should be largely sufficient for the named scenario.
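As a back-of-the-envelope check of those figures (a sketch; the entry counts
are illustrative and the per-entry costs are the rough values above):

# Sketch only; per-entry costs are the approximate figures from this thread,
# counting 1 kB as 1000 bytes.
maxFilesToCache=4000000        # illustrative count
maxStatCache=4000000           # assume the same count for the comparison

awk -v f="$maxFilesToCache" -v s="$maxStatCache" 'BEGIN {
    printf "file cache: ~%.0f GB (%d entries x ~10 kB each)\n",  f*10/1e6,  f
    printf "stat cache: ~%.1f GB (%d entries x ~0.5 kB each)\n", s*0.5/1e6, s
}'
# => ~40 GB + ~2 GB, on top of the 16 GB pagepool (plus token memory on
#    manager nodes).

# The cache sizes themselves are set with mmchconfig, e.g. the values
# Christian mentioned earlier (they take effect at the next GPFS restart):
#   mmchconfig maxFilesToCache=1000000,maxStatCache=100000 -N cesNodes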
If themmfsd mem footprint is constantly growing, i'd recommend opening a ticket and uploading a snap, so support can have a more detailed look -- Mit freundlichen Gr??en / Kind regards Achim Rehor Technical Support Specialist S?pectrum Scale and ESS (SME) Advisory Product Services Professional IBM Systems Storage Support - EMEA Achim.Rehor at de.ibm.com +49-170- 4521194 ? IBM Deutschland GmbH? Vorsitzender des Aufsichtsrats: Sebastian Krause Gesch?ftsf?hrung: Gregor Pillen (Vorsitzender), Nicole Reimer,? Gabriele Schwarenthorer,?Christine Rupp, Frank Theisen Sitz der Gesellschaft: Ehningen / Registergericht: AmtsgerichtStuttgart, HRB 14562 / WEEE-Reg.-Nr. DE 99369940 -----Original Message----- From: Stephan Graf Reply-To: gpfsug main discussion list To: gpfsug-discuss at gpfsug.org Subject: [EXTERNAL] Re: [gpfsug-discuss] frequent OOM killer due to high memory usage of mmfsd Date: Thu, 07 Sep 2023 08:50:14 +0200 Hi in the past we had issues with the mmdf heap memory. Due to special workload it increased and took GB of memory, but after usage it was not freed again. we had long discussions with IBM about it and it ends up in a Development User Story (261213) which was realized in 5.1.2: --- In this story, the InodeAllocSegment object will be allocated when accessed. For commands that iterates all InodeAllocSegment, we will release the object immediately after use. An undocumented configuration "!maxIAllocSegmentsToCache" is provided to control the upper limit of the count of InodeAllocSegment objects. When the count approaches the limit, a pre stealing thread will be started to steal and release some InodeAllocSegment objects. Its default value is 1000,000. --- since than we are fine so far. But this was on plain GPFS clients, no CES node where the service like NFS comes into play. You can monitor the heap memory usage by using "mmdiag --memory" @IBM colleagues: If there is something wrong in my explanation please correct me. Stephan On 9/6/23 20:55, Christoph Martin wrote: > Hi all, > > on a three node GPFS cluster with CES enabled and AFM-DR mirroring to > a > second cluster we see frequent OOM killer events due to a constantly > growing mmfsd. > The machines have 256G memory. The pagepool is configured to 16G. > The GPFS version is 5.1.6-1. > After a restart mmfsd rapidly grows to about 100G usage and grows > over > some days up to 250G virtual and 220G physical memory usage. > OOMkiller tries kill process like pmcollector or others and sometime > kills mmfsd. > > Does anybody see a similar behavior? > Any guess what could help with this problem? > > Regards > Christoph Martin > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org From Thomas.Bernecker at EMEA.NEC.COM Thu Sep 7 09:36:03 2023 From: Thomas.Bernecker at EMEA.NEC.COM (Thomas Bernecker) Date: Thu, 7 Sep 2023 08:36:03 +0000 Subject: [gpfsug-discuss] Spectrum Scale, InfiniBand and QoS Message-ID: Dear all, we recently found this presentation from 2021: https://www.spectrumscaleug.org/wp-content/uploads/2021/05/SSSD21DE-06-Improving-Spectrum-Scale-performance-using-RDMA.pdf On page 10 it explains by setting the RDMA device in a specific way one could employ QoS with InfiniBand, which is something we would like to achieve as well. 
Has anyone a working environment using QoS with InfiniBand outside of the ESS domain? If so, would you share your experience? Sorry for asking a rather broad question, but it seems that this is not well-known stuff ... -- Best regards / Mit freundlichem Gru? Thomas Bernecker ------------------------------------------------------------------------------------------------------------------ Manager System Integration and Support, HPCE Division Mobile: +49 (1522) 2851523, Fax: +49 (211) 5369-199, Home Office: +49 (38821) 65091 NEC Deutschland GmbH, Fritz-Vomfelde-Stra?e 14-16, 40547 D?sseldorf, Germany Gesch?ftsf?hrer: Christopher Richard Jackson - Handelsregister D?sseldorf HRB 57941 ------------------------------------------------------------------------------------------------------------------- -------------- next part -------------- An HTML attachment was scrubbed... URL: From luis.bolinches at fi.ibm.com Thu Sep 7 09:50:35 2023 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Thu, 7 Sep 2023 08:50:35 +0000 Subject: [gpfsug-discuss] Spectrum Scale, InfiniBand and QoS In-Reply-To: References: Message-ID: Hi Assuming you are talking about GPFS QoS and not any fabric QoS. It would work with any supported fabric, ESS or not. It is a filesystem feature well above all the fabrics and HW. Limits IOPS per class https://www.ibm.com/docs/en/storage-scale/5.1.8?topic=reference-mmchqos-command I recommend (depending on number of nodes and seconds) fine-stats so then you can visualize which PID on which client node is doing what. I believe I did something of that on SSUG London 18 or 19 -- Yst?v?llisin terveisin/Regards/Saludos/Salutations/Salutacions Luis Bolinches Executive IT Specialist IBM Storage Scale Server (formerly ESS) developer Phone: +358503112585 Ab IBM Finland Oy Toinen linja 7 00530 Helsinki Uusimaa - Finland Visitors entrance: Siltasaarenkatu 22 "If you always give you will always have" -- Anonymous https://www.credly.com/users/luis-bolinches/badges From: gpfsug-discuss On Behalf Of Thomas Bernecker Sent: Thursday, 7 September 2023 11.36 To: gpfsug-discuss at gpfsug.org Subject: [EXTERNAL] [gpfsug-discuss] Spectrum Scale, InfiniBand and QoS Dear all, we recently found this presentation from 2021: https:?//www.?spectrumscaleug.?org/wp-content/uploads/2021/05/SSSD21DE-06-Improving-Spectrum-Scale-performance-using-RDMA.?pdf On page 10 it explains by setting the RDMA device in a specific ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. Report Suspicious ? ZjQcmQRYFpfptBannerEnd Dear all, we recently found this presentation from 2021: https://www.spectrumscaleug.org/wp-content/uploads/2021/05/SSSD21DE-06-Improving-Spectrum-Scale-performance-using-RDMA.pdf On page 10 it explains by setting the RDMA device in a specific way one could employ QoS with InfiniBand, which is something we would like to achieve as well. Has anyone a working environment using QoS with InfiniBand outside of the ESS domain? If so, would you share your experience? Sorry for asking a rather broad question, but it seems that this is not well-known stuff ? -- Best regards / Mit freundlichem Gru? 
Thomas Bernecker ------------------------------------------------------------------------------------------------------------------ Manager System Integration and Support, HPCE Division Mobile: +49 (1522) 2851523, Fax: +49 (211) 5369-199, Home Office: +49 (38821) 65091 NEC Deutschland GmbH, Fritz-Vomfelde-Stra?e 14-16, 40547 D?sseldorf, Germany Gesch?ftsf?hrer: Christopher Richard Jackson ? Handelsregister D?sseldorf HRB 57941 ------------------------------------------------------------------------------------------------------------------- Unless otherwise stated above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland -------------- next part -------------- An HTML attachment was scrubbed... URL: From Thomas.Bernecker at EMEA.NEC.COM Thu Sep 7 10:02:02 2023 From: Thomas.Bernecker at EMEA.NEC.COM (Thomas Bernecker) Date: Thu, 7 Sep 2023 09:02:02 +0000 Subject: [gpfsug-discuss] Spectrum Scale, InfiniBand and QoS In-Reply-To: References: Message-ID: Hi Luis, Thanks for your quick response. I was referring to fabric QoS, not Spectrum Scale QoS (which I have some experience with). The reference to slide 10 of said presentation says to configure the RDMA device as follows to use QoS verbsPorts = /// ? List of RDMA ports to be used ? : RDMA device, required, e.g. mlx5_0, mlx5_1 ? : RDMA port on device, default 1, valid values are 1 or 2 ? : virtual fabric number, default 0, valid values are >= 0 Only verbsPorts using a common are connected ? : QoS level, default is 0, valid values defined in SM configuration The references to SM and verbsPorts being part of RDMA configuration indicates (to me) that this is fabric QoS ? -- Best regards / Mit freundlichem Gru? Thomas Bernecker ------------------------------------------------------------------------------------------------------------------ Manager System Integration and Support, HPCE Division Mobile: +49 (1522) 2851523, Fax: +49 (211) 5369-199, Home Office: +49 (38821) 65091 NEC Deutschland GmbH, Fritz-Vomfelde-Stra?e 14-16, 40547 D?sseldorf, Germany Gesch?ftsf?hrer: Christopher Richard Jackson ? Handelsregister D?sseldorf HRB 57941 ------------------------------------------------------------------------------------------------------------------- From: gpfsug-discuss On Behalf Of Luis Bolinches Sent: Thursday, September 7, 2023 10:51 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Spectrum Scale, InfiniBand and QoS Hi Assuming you are talking about GPFS QoS and not any fabric QoS. It would work with any supported fabric, ESS or not. It is a filesystem feature well above all the fabrics and HW. Limits IOPS per class https:?//www.?ibm.?com/docs/en/storage-scale/5.?1.?8?topic=reference-mmchqos-command ZjQcmQRYFpfptBannerStart Be Careful With This Message The sender's identity could not be verified and someone may be impersonating the sender. Report Suspicious ? ZjQcmQRYFpfptBannerEnd Hi Assuming you are talking about GPFS QoS and not any fabric QoS. It would work with any supported fabric, ESS or not. It is a filesystem feature well above all the fabrics and HW. Limits IOPS per class https://www.ibm.com/docs/en/storage-scale/5.1.8?topic=reference-mmchqos-command [ibm.com] I recommend (depending on number of nodes and seconds) fine-stats so then you can visualize which PID on which client node is doing what. 
I believe I did something of that on SSUG London 18 or 19 -- Yst?v?llisin terveisin/Regards/Saludos/Salutations/Salutacions Luis Bolinches Executive IT Specialist IBM Storage Scale Server (formerly ESS) developer Phone: +358503112585 Ab IBM Finland Oy Toinen linja 7 00530 Helsinki Uusimaa - Finland Visitors entrance: Siltasaarenkatu 22 "If you always give you will always have" -- Anonymous https://www.credly.com/users/luis-bolinches/badges [credly.com] From: gpfsug-discuss > On Behalf Of Thomas Bernecker Sent: Thursday, 7 September 2023 11.36 To: gpfsug-discuss at gpfsug.org Subject: [EXTERNAL] [gpfsug-discuss] Spectrum Scale, InfiniBand and QoS Dear all, we recently found this presentation from 2021: https:?//www.?spectrumscaleug.?org/wp-content/uploads/2021/05/SSSD21DE-06-Improving-Spectrum-Scale-performance-using-RDMA.?pdf On page 10 it explains by setting the RDMA device in a specific ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. Report Suspicious ? ZjQcmQRYFpfptBannerEnd Dear all, we recently found this presentation from 2021: https://www.spectrumscaleug.org/wp-content/uploads/2021/05/SSSD21DE-06-Improving-Spectrum-Scale-performance-using-RDMA.pdf [spectrumscaleug.org] On page 10 it explains by setting the RDMA device in a specific way one could employ QoS with InfiniBand, which is something we would like to achieve as well. Has anyone a working environment using QoS with InfiniBand outside of the ESS domain? If so, would you share your experience? Sorry for asking a rather broad question, but it seems that this is not well-known stuff ? -- Best regards / Mit freundlichem Gru? Thomas Bernecker ------------------------------------------------------------------------------------------------------------------ Manager System Integration and Support, HPCE Division Mobile: +49 (1522) 2851523, Fax: +49 (211) 5369-199, Home Office: +49 (38821) 65091 NEC Deutschland GmbH, Fritz-Vomfelde-Stra?e 14-16, 40547 D?sseldorf, Germany Gesch?ftsf?hrer: Christopher Richard Jackson ? Handelsregister D?sseldorf HRB 57941 ------------------------------------------------------------------------------------------------------------------- Unless otherwise stated above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland -------------- next part -------------- An HTML attachment was scrubbed... URL: From jfosburg at mdanderson.org Thu Sep 7 11:44:12 2023 From: jfosburg at mdanderson.org (Fosburgh,Jonathan) Date: Thu, 7 Sep 2023 10:44:12 +0000 Subject: [gpfsug-discuss] [EXTERNAL] mmbackup feature request In-Reply-To: References: Message-ID: You might be interested in looking at Ideas GPFS-I-975 and GPFS-I-259. -- Jonathan Fosburgh, MS, CAPM Principal Application Systems Analyst IT Engineering Storage Team The University of Texas MD Anderson Cancer Center (713) 745-9346 [Graphical user interface Description automatically generated with low confidence][A picture containing text, room, gambling house Description automatically generated][Graphical user interface Description automatically generated with medium confidence] From: gpfsug-discuss on behalf of Jonathan Buzzard Date: Wednesday, September 6, 2023 at 04:04 To: gpfsug main discussion list Subject: [EXTERNAL] [gpfsug-discuss] mmbackup feature request THIS EMAIL IS A PHISHING RISK Do you trust the sender? 
The email address is: gpfsug-discuss-bounces at gpfsug.org While this email has passed our filters, we need you to review with caution before taking any action. If the email looks at all suspicious, click the Report a Phish button. Would it be possible to have the mmbackup output display the percentage output progress when backing up files? So at the top we you see something like this Tue Sep 5 23:13:35 2023 mmbackup:changed=747204, expired=427702, unsupported=0 for server [XXXX] Then after it does the expiration you see during the backup lines like Wed Sep 6 02:43:53 2023 mmbackup:Backing up files: 527024 backed up, 426018 expired, 4408 failed. (Backup job exit with 4) It would IMHO be helpful if it looked like Wed Sep 6 02:43:53 2023 mmbackup:Backing up files: 527024 (70.5%) backed up, 426018 (100%) expired, 4408 failed. (Backup job exit with 4) Just based on the number of files. Though as I look at it now I am curious about the discrepancy in the number of files expired, given that the expiration stage allegedly concluded with no errors? Tue Sep 5 23:21:49 2023 mmbackup:Completed policy expiry run with 0 policy errors, 0 files failed, 0 severe errors, returning rc=0. Tue Sep 5 23:21:49 2023 mmbackup:Policy for expiry returned 0 Highest TSM error 0 JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org https://urldefense.com/v3/__http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org__;!!PfbeBCCAmug!j_DKizkmTItFp2H_1xckC8hR7VkDw_ec7pvv0K1d3xe-_qXr1n5W8NpNKOwhxWMn62Z0a3v2zoA2hL13tIYXypi_grSBHAgTO1w$ The information contained in this e-mail message may be privileged, confidential, and/or protected from disclosure. This e-mail message may contain protected health information (PHI); dissemination of PHI should comply with applicable federal and state laws. If you are not the intended recipient, or an authorized representative of the intended recipient, any further review, disclosure, use, dissemination, distribution, or copying of this message or any attachment (or the information contained therein) is strictly prohibited. If you think that you have received this e-mail message in error, please notify the sender by return e-mail and delete all references to it and its contents from your systems. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 82684 bytes Desc: image001.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.png Type: image/png Size: 101360 bytes Desc: image002.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image003.png Type: image/png Size: 7924 bytes Desc: image003.png URL: From tina.friedrich at it.ox.ac.uk Thu Sep 7 14:28:55 2023 From: tina.friedrich at it.ox.ac.uk (Tina Friedrich) Date: Thu, 7 Sep 2023 14:28:55 +0100 Subject: [gpfsug-discuss] Question regarding quorum nodes Message-ID: Hello All, I hope someone can answer this quickly! We have - it seems - just lost one of our NSDs.The other took over as it should - but unfortunately, the protocol nodes (i.e. I have a number of nodes which now have stale file handles. (The file system is accessible on the remaining NSD server and a number of other clients.) 
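(For context, this is roughly what I am looking at to judge the quorum situation, just the standard commands with the output trimmed:

    mmgetstate -aL                   # node states plus how many quorum nodes are needed/reached
    mmlscluster | grep -i quorum     # which nodes carry the quorum designation

so I can see how many of the quorum nodes are actually still up.)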
Unfortunately, in the 'home' cluster - the one that ones the disks - I only have, five servers; three of which are quorum nodes, and of course the failed NSD server is one of them. The question is - can I add quorum nodes with one node down, and can I remove quorum functionality from a failed node? Tina From danny.lang at crick.ac.uk Thu Sep 7 14:39:00 2023 From: danny.lang at crick.ac.uk (Danny Lang) Date: Thu, 7 Sep 2023 13:39:00 +0000 Subject: [gpfsug-discuss] Question regarding quorum nodes In-Reply-To: References: Message-ID: Hi Tina, The command you're looking for is: mmchnode https://www.ibm.com/docs/en/storage-scale/4.2.0?topic=commands-mmchnode-command This will allow you to add quorum nodes and to remove. ------ I would advise checking everything prior to running commands. ? Thanks Danny ________________________________ From: gpfsug-discuss on behalf of Tina Friedrich Sent: 07 September 2023 2:28 PM To: gpfsug-discuss at gpfsug.org Cc: theteam at arc.ox.ac.uk Subject: [gpfsug-discuss] Question regarding quorum nodes External Sender: Use caution. Hello All, I hope someone can answer this quickly! We have - it seems - just lost one of our NSDs.The other took over as it should - but unfortunately, the protocol nodes (i.e. I have a number of nodes which now have stale file handles. (The file system is accessible on the remaining NSD server and a number of other clients.) Unfortunately, in the 'home' cluster - the one that ones the disks - I only have, five servers; three of which are quorum nodes, and of course the failed NSD server is one of them. The question is - can I add quorum nodes with one node down, and can I remove quorum functionality from a failed node? Tina _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org https://eur03.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss_gpfsug.org&data=05%7C01%7C%7Cb25b485dd562417c674d08dbafa6d05a%7C4eed7807ebad415aa7a99170947f4eae%7C0%7C0%7C638296903233929525%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=P62lXeMqP%2F1GLce6mv5dGDHgUILHmpOkzY0F%2BU%2BVGYU%3D&reserved=0 The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT -------------- next part -------------- An HTML attachment was scrubbed... URL: From tina.friedrich at it.ox.ac.uk Thu Sep 7 14:45:19 2023 From: tina.friedrich at it.ox.ac.uk (Tina Friedrich) Date: Thu, 7 Sep 2023 14:45:19 +0100 Subject: [gpfsug-discuss] Question regarding quorum nodes In-Reply-To: References: Message-ID: <912a7b19-c4ad-5f07-6c69-b91a56be7c02@it.ox.ac.uk> Thanks - I know the command. Can I remove quorum functionality from a node that is down - that's really what I want to know, I think. Is that safe to do? Tina On 07/09/2023 14:39, Danny Lang wrote: > Hi Tina, > > The command you're looking for is: > > > *mmchnode* > > https://www.ibm.com/docs/en/storage-scale/4.2.0?topic=commands-mmchnode-command > > This will allow you to add quorum nodes and to remove. > > ------ > > I would advise checking everything prior to running commands. ? 
> > Thanks > Danny > > ------------------------------------------------------------------------ > *From:* gpfsug-discuss on behalf of > Tina Friedrich > *Sent:* 07 September 2023 2:28 PM > *To:* gpfsug-discuss at gpfsug.org > *Cc:* theteam at arc.ox.ac.uk > *Subject:* [gpfsug-discuss] Question regarding quorum nodes > > External Sender: Use caution. > > > Hello All, > > I hope someone can answer this quickly! > > We have - it seems - just lost one of our NSDs.The other took over as it > should - but unfortunately, the protocol nodes (i.e. I have a number of > nodes which now have stale file handles. (The file system is accessible > on the remaining NSD server and a number of other clients.) > > Unfortunately, in the 'home' cluster - the one that ones the disks - I > only have, five servers; three of which are quorum nodes, and of course > the failed NSD server is one of them. > > The question is - can I add quorum nodes with one node down, and can I > remove quorum functionality from a failed node? > > Tina > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org > > The Francis Crick Institute Limited is a registered charity in England > and Wales no. 1140062 and a company registered in England and Wales no. > 06885462, with its registered office at 1 Midland Road London NW1 1AT > From ott.oopkaup at ut.ee Thu Sep 7 14:51:35 2023 From: ott.oopkaup at ut.ee (Ott Oopkaup) Date: Thu, 7 Sep 2023 16:51:35 +0300 Subject: [gpfsug-discuss] Question regarding quorum nodes In-Reply-To: <912a7b19-c4ad-5f07-6c69-b91a56be7c02@it.ox.ac.uk> References: <912a7b19-c4ad-5f07-6c69-b91a56be7c02@it.ox.ac.uk> Message-ID: Hi Do you have quorum loss already, or just trying to remove a node that is currently offline. Over the years I have had little to no problems removing quorum from a defunct node. Just make sure mmgetstate for the node is down and it should be fine to remove, since it's already out of the cluster. Just make sure you don't induce quorum loss with the moves. I'd probably add the node first after making sure the defunct node is completely down. Best Ott Oopkaup University of Tartu, High Performance Computing Centre On 9/7/23 16:45, Tina Friedrich wrote: > Thanks - I know the command. > > Can I remove quorum functionality from a node that is down - that's > really what I want to know, I think. Is that safe to do? > > Tina > > On 07/09/2023 14:39, Danny Lang wrote: >> Hi Tina, >> >> The command you're looking for is: >> >> >> *mmchnode* >> >> https://www.ibm.com/docs/en/storage-scale/4.2.0?topic=commands-mmchnode-command >> >> >> >> This will allow you to add quorum nodes and to remove. >> >> ------ >> >> I would advise checking everything prior to running commands. ? >> >> Thanks >> Danny >> >> ------------------------------------------------------------------------ >> *From:* gpfsug-discuss on behalf >> of Tina Friedrich >> *Sent:* 07 September 2023 2:28 PM >> *To:* gpfsug-discuss at gpfsug.org >> *Cc:* theteam at arc.ox.ac.uk >> *Subject:* [gpfsug-discuss] Question regarding quorum nodes >> >> External Sender: Use caution. >> >> >> Hello All, >> >> I hope someone can answer this quickly! >> >> We have - it seems - just lost one of our NSDs.The other took over as it >> should - but unfortunately, the protocol nodes (i.e. I have a number of >> nodes which now have stale file handles. (The file system is accessible >> on the remaining NSD server and a number of other clients.) 
>> >> Unfortunately, in the 'home' cluster - the one that ones the disks - I >> only have, five servers; three of which are quorum nodes, and of course >> the failed NSD server is one of them. >> >> The question is - can I add quorum nodes with one node down, and can I >> remove quorum functionality from a failed node? >> >> Tina >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org >> >> >> The Francis Crick Institute Limited is a registered charity in >> England and Wales no. 1140062 and a company registered in England and >> Wales no. 06885462, with its registered office at 1 Midland Road >> London NW1 1AT >> > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org From danny.lang at crick.ac.uk Thu Sep 7 14:53:27 2023 From: danny.lang at crick.ac.uk (Danny Lang) Date: Thu, 7 Sep 2023 13:53:27 +0000 Subject: [gpfsug-discuss] Question regarding quorum nodes In-Reply-To: References: <912a7b19-c4ad-5f07-6c69-b91a56be7c02@it.ox.ac.uk> Message-ID: Hi, You should be able to remove a node from quorum and it should be fairly safe. This is assuming you're on CCR. You can find the type with: mmlscluster Thanks Danny ________________________________ From: Ott Oopkaup Sent: 07 September 2023 2:51 PM To: gpfsug main discussion list ; Tina Friedrich ; Danny Lang Cc: theteam at arc.ox.ac.uk Subject: Re: [gpfsug-discuss] Question regarding quorum nodes External Sender: Use caution. Hi Do you have quorum loss already, or just trying to remove a node that is currently offline. Over the years I have had little to no problems removing quorum from a defunct node. Just make sure mmgetstate for the node is down and it should be fine to remove, since it's already out of the cluster. Just make sure you don't induce quorum loss with the moves. I'd probably add the node first after making sure the defunct node is completely down. Best Ott Oopkaup University of Tartu, High Performance Computing Centre On 9/7/23 16:45, Tina Friedrich wrote: > Thanks - I know the command. > > Can I remove quorum functionality from a node that is down - that's > really what I want to know, I think. Is that safe to do? > > Tina > > On 07/09/2023 14:39, Danny Lang wrote: >> Hi Tina, >> >> The command you're looking for is: >> >> >> *mmchnode* >> >> https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.ibm.com%2Fdocs%2Fen%2Fstorage-scale%2F4.2.0%3Ftopic%3Dcommands-mmchnode-command&data=05%7C01%7C%7Ccdb5b906ab3b4d93d75f08dbafa98ff2%7C4eed7807ebad415aa7a99170947f4eae%7C0%7C0%7C638296915030390070%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=qS0JSTfvO01vihhdvIVXc2N4%2BwkXeCxINM%2F4yhd9s9A%3D&reserved=0 >> > >> >> >> This will allow you to add quorum nodes and to remove. >> >> ------ >> >> I would advise checking everything prior to running commands. ? >> >> Thanks >> Danny >> >> ------------------------------------------------------------------------ >> *From:* gpfsug-discuss on behalf >> of Tina Friedrich >> *Sent:* 07 September 2023 2:28 PM >> *To:* gpfsug-discuss at gpfsug.org >> *Cc:* theteam at arc.ox.ac.uk >> *Subject:* [gpfsug-discuss] Question regarding quorum nodes >> >> External Sender: Use caution. >> >> >> Hello All, >> >> I hope someone can answer this quickly! 
>> >> We have - it seems - just lost one of our NSDs.The other took over as it >> should - but unfortunately, the protocol nodes (i.e. I have a number of >> nodes which now have stale file handles. (The file system is accessible >> on the remaining NSD server and a number of other clients.) >> >> Unfortunately, in the 'home' cluster - the one that ones the disks - I >> only have, five servers; three of which are quorum nodes, and of course >> the failed NSD server is one of them. >> >> The question is - can I add quorum nodes with one node down, and can I >> remove quorum functionality from a failed node? >> >> Tina >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> https://eur03.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss_gpfsug.org&data=05%7C01%7C%7Ccdb5b906ab3b4d93d75f08dbafa98ff2%7C4eed7807ebad415aa7a99170947f4eae%7C0%7C0%7C638296915030390070%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=qMXTGt6c8Oc8voYd1MLVOfYXTzgob6YSDsYTrk%2BUIAM%3D&reserved=0 >> > >> >> The Francis Crick Institute Limited is a registered charity in >> England and Wales no. 1140062 and a company registered in England and >> Wales no. 06885462, with its registered office at 1 Midland Road >> London NW1 1AT >> > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > https://eur03.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss_gpfsug.org&data=05%7C01%7C%7Ccdb5b906ab3b4d93d75f08dbafa98ff2%7C4eed7807ebad415aa7a99170947f4eae%7C0%7C0%7C638296915030390070%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=qMXTGt6c8Oc8voYd1MLVOfYXTzgob6YSDsYTrk%2BUIAM%3D&reserved=0 The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.horton at icr.ac.uk Thu Sep 7 14:53:57 2023 From: robert.horton at icr.ac.uk (Robert Horton) Date: Thu, 7 Sep 2023 13:53:57 +0000 Subject: [gpfsug-discuss] Question regarding quorum nodes In-Reply-To: <912a7b19-c4ad-5f07-6c69-b91a56be7c02@it.ox.ac.uk> References: <912a7b19-c4ad-5f07-6c69-b91a56be7c02@it.ox.ac.uk> Message-ID: You can, although there's a curious restriction that you can only do one at a time (though I guess that's not an issue for you). I've only ever tried it on a cluster I was in the process of decommissioning anyway but it worked as expected. -----Original Message----- From: gpfsug-discuss On Behalf Of Tina Friedrich Sent: 07 September 2023 14:45 To: Danny Lang ; gpfsug-discuss at gpfsug.org Cc: theteam at arc.ox.ac.uk Subject: Re: [gpfsug-discuss] Question regarding quorum nodes CAUTION: This email originated from outside of the ICR. Do not click links or open attachments unless you recognize the sender's email address and know the content is safe. Thanks - I know the command. Can I remove quorum functionality from a node that is down - that's really what I want to know, I think. Is that safe to do? 
Tina On 07/09/2023 14:39, Danny Lang wrote: > Hi Tina, > > The command you're looking for is: > > > *mmchnode* > > https://www.ibm.com/docs/en/storage-scale/4.2.0?topic=commands-mmchnod > e-command > de-command> > > This will allow you to add quorum nodes and to remove. > > ------ > > I would advise checking everything prior to running commands. ? > > Thanks > Danny > > ---------------------------------------------------------------------- > -- > *From:* gpfsug-discuss on behalf > of Tina Friedrich > *Sent:* 07 September 2023 2:28 PM > *To:* gpfsug-discuss at gpfsug.org > *Cc:* theteam at arc.ox.ac.uk > *Subject:* [gpfsug-discuss] Question regarding quorum nodes > > External Sender: Use caution. > > > Hello All, > > I hope someone can answer this quickly! > > We have - it seems - just lost one of our NSDs.The other took over as > it should - but unfortunately, the protocol nodes (i.e. I have a > number of nodes which now have stale file handles. (The file system is > accessible on the remaining NSD server and a number of other clients.) > > Unfortunately, in the 'home' cluster - the one that ones the disks - I > only have, five servers; three of which are quorum nodes, and of > course the failed NSD server is one of them. > > The question is - can I add quorum nodes with one node down, and can I > remove quorum functionality from a failed node? > > Tina > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org > > > The Francis Crick Institute Limited is a registered charity in England > and Wales no. 1140062 and a company registered in England and Wales no. > 06885462, with its registered office at 1 Midland Road London NW1 1AT > _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP. This e-mail message is confidential and for use by the addressee only. If the message is received by anyone other than the addressee, please return the message to the sender by replying to it and then delete the message from your computer and network. From truongv at us.ibm.com Thu Sep 7 15:39:00 2023 From: truongv at us.ibm.com (Truong Vu) Date: Thu, 7 Sep 2023 14:39:00 +0000 Subject: [gpfsug-discuss] gpfsug-discuss Digest, Vol 138, Issue 10 In-Reply-To: References: Message-ID: <8911A7F1-5957-4E4D-9AEB-7CABCAF37BE9@us.ibm.com> From the description, I am assuming your cluster has 3 quorum nodes. One of the quorum nodes is down. To maintain node quorum, you should only add a non-quorum node to the cluster. Start gpfs on that node, then use mmchnode command to change that node to quorum node. Another option is to remove the quorum role on the failed node first, then add 1 quorum node. Regards, Tru. 
Vu ?On 9/7/23, 9:42 AM, "gpfsug-discuss on behalf of gpfsug-discuss-request at gpfsug.org " on behalf of gpfsug-discuss-request at gpfsug.org > wrote: Send gpfsug-discuss mailing list submissions to gpfsug-discuss at gpfsug.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at gpfsug.org You can reach the person managing the list at gpfsug-discuss-owner at gpfsug.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Question regarding quorum nodes (Tina Friedrich) 2. Re: Question regarding quorum nodes (Danny Lang) ---------------------------------------------------------------------- Message: 1 Date: Thu, 7 Sep 2023 14:28:55 +0100 From: Tina Friedrich > To: "gpfsug-discuss at gpfsug.org " > Cc: > Subject: [gpfsug-discuss] Question regarding quorum nodes Message-ID: > Content-Type: text/plain; charset="UTF-8"; format=flowed Hello All, I hope someone can answer this quickly! We have - it seems - just lost one of our NSDs.The other took over as it should - but unfortunately, the protocol nodes (i.e. I have a number of nodes which now have stale file handles. (The file system is accessible on the remaining NSD server and a number of other clients.) Unfortunately, in the 'home' cluster - the one that ones the disks - I only have, five servers; three of which are quorum nodes, and of course the failed NSD server is one of them. The question is - can I add quorum nodes with one node down, and can I remove quorum functionality from a failed node? Tina ------------------------------ Message: 2 Date: Thu, 7 Sep 2023 13:39:00 +0000 From: Danny Lang > To: "gpfsug-discuss at gpfsug.org " > Cc: "theteam at arc.ox.ac.uk " > Subject: Re: [gpfsug-discuss] Question regarding quorum nodes Message-ID: > Content-Type: text/plain; charset="utf-8" Hi Tina, The command you're looking for is: mmchnode https://www.ibm.com/docs/en/storage-scale/4.2.0?topic=commands-mmchnode-command This will allow you to add quorum nodes and to remove. ------ I would advise checking everything prior to running commands. ? Thanks Danny ________________________________ From: gpfsug-discuss > on behalf of Tina Friedrich > Sent: 07 September 2023 2:28 PM To: gpfsug-discuss at gpfsug.org > Cc: theteam at arc.ox.ac.uk > Subject: [gpfsug-discuss] Question regarding quorum nodes External Sender: Use caution. Hello All, I hope someone can answer this quickly! We have - it seems - just lost one of our NSDs.The other took over as it should - but unfortunately, the protocol nodes (i.e. I have a number of nodes which now have stale file handles. (The file system is accessible on the remaining NSD server and a number of other clients.) Unfortunately, in the 'home' cluster - the one that ones the disks - I only have, five servers; three of which are quorum nodes, and of course the failed NSD server is one of them. The question is - can I add quorum nodes with one node down, and can I remove quorum functionality from a failed node? 
Tina _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org https://eur03.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss_gpfsug.org&data=05%7C01%7C%7Cb25b485dd562417c674d08dbafa6d05a%7C4eed7807ebad415aa7a99170947f4eae%7C0%7C0%7C638296903233929525%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=P62lXeMqP%2F1GLce6mv5dGDHgUILHmpOkzY0F%2BU%2BVGYU%3D&reserved=0 > The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 1 Midland Road London NW1 1AT -------------- next part -------------- An HTML attachment was scrubbed... URL: > ------------------------------ Subject: Digest Footer _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org ------------------------------ End of gpfsug-discuss Digest, Vol 138, Issue 10 *********************************************** From ralph.wuerthner at de.ibm.com Thu Sep 7 16:13:49 2023 From: ralph.wuerthner at de.ibm.com (Ralph Wuerthner) Date: Thu, 7 Sep 2023 15:13:49 +0000 Subject: [gpfsug-discuss] Spectrum Scale, InfiniBand and QoS In-Reply-To: References: Message-ID: <714e14de4a099b0bb8e7eee369b29834dd9a8fb5.camel@de.ibm.com> Thomas, you are correct, in the verbsPorts configuration variable description verbsPorts = /// refers to the service level which is set for all RC queue pairs on the specified adapter port. This information is used by the subnet manager to control QoS. Unfortunately I'm not familiar with configuring QoS so I cannot provide additional information on this. If found the following white paper from Mellanox/Nvidia providing some more information: https://network.nvidia.com/pdf/whitepapers/deploying_qos_wp_10_19_2005.pdf . It is at least a starting point. Mit freundlichen Gr??en / Kind regards Ralph W?rthner IBM Storage Scale Development IBM Systems & Technology Group, Systems Software Development ________________________________ Mobile: +49 (0) 171 3089472 IBM Deutschland Research & Development GmbH [cid:4c6d823aedac83b4cc99633a0db02b7738e8aa77.camel at de.ibm.com-1] Email: ralph.wuerthner at de.ibm.com Wilhelm-Fay-Str. 32 65936 Frankfurt Germany ________________________________ IBM Data Privacy Statement IBM Deutschland Research & Development GmbH / Vorsitzende des Aufsichtsrats: Gregor Pillen Gesch?ftsf?hrung: David Faller Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 -----Original Message----- From: Thomas Bernecker > Reply-To: gpfsug main discussion list > To: gpfsug main discussion list > Subject: [EXTERNAL] Re: [gpfsug-discuss] Spectrum Scale, InfiniBand and QoS Date: 09/07/2023 11:02:02 AM Hi Luis, Thanks for your quick response. I was referring to fabric QoS, not Spectrum Scale QoS (which I have some experience with). The reference to slide 10 of said presentation says to configure the RDMA device as follows to use QoS verbsPorts ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. Report Suspicious ZjQcmQRYFpfptBannerEnd Hi Luis, Thanks for your quick response. I was referring to fabric QoS, not Spectrum Scale QoS (which I have some experience with). 
The reference to slide 10 of said presentation says to configure the RDMA device as follows to use QoS verbsPorts = /// ? List of RDMA ports to be used ? : RDMA device, required, e.g. mlx5_0, mlx5_1 ? : RDMA port on device, default 1, valid values are 1 or 2 ? : virtual fabric number, default 0, valid values are >= 0 Only verbsPorts using a common are connected ? : QoS level, default is 0, valid values defined in SM configuration The references to SM and verbsPorts being part of RDMA configuration indicates (to me) that this is fabric QoS ? -- Best regards / Mit freundlichem Gru? Thomas Bernecker ------------------------------------------------------------------------------------------------------------------ Manager System Integration and Support, HPCE Division Mobile: +49 (1522) 2851523, Fax: +49 (211) 5369-199, Home Office: +49 (38821) 65091 NEC Deutschland GmbH, Fritz-Vomfelde-Stra?e 14-16, 40547 D?sseldorf, Germany Gesch?ftsf?hrer: Christopher Richard Jackson ? Handelsregister D?sseldorf HRB 57941 ------------------------------------------------------------------------------------------------------------------- From: gpfsug-discuss On Behalf Of Luis Bolinches Sent: Thursday, September 7, 2023 10:51 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Spectrum Scale, InfiniBand and QoS Hi Assuming you are talking about GPFS QoS and not any fabric QoS. It would work with any supported fabric, ESS or not. It is a filesystem feature well above all the fabrics and HW. Limits IOPS per class https:?//www.?ibm.?com/docs/en/storage-scale/5.?1.?8?topic=reference-mmchqos-command ZjQcmQRYFpfptBannerStart Be Careful With This Message The sender's identity could not be verified and someone may be impersonating the sender. Report Suspicious ? ZjQcmQRYFpfptBannerEnd Hi Assuming you are talking about GPFS QoS and not any fabric QoS. It would work with any supported fabric, ESS or not. It is a filesystem feature well above all the fabrics and HW. Limits IOPS per class https://www.ibm.com/docs/en/storage-scale/5.1.8?topic=reference-mmchqos-command [ibm.com] I recommend (depending on number of nodes and seconds) fine-stats so then you can visualize which PID on which client node is doing what. I believe I did something of that on SSUG London 18 or 19 -- Yst?v?llisin terveisin/Regards/Saludos/Salutations/Salutacions Luis Bolinches Executive IT Specialist IBM Storage ScaleServer (formerly ESS) developer Phone: +358503112585 Ab IBM Finland Oy Toinen linja 7 00530 Helsinki Uusimaa - Finland Visitors entrance: Siltasaarenkatu 22 "If you always give you will always have" -- Anonymous https://www.credly.com/users/luis-bolinches/badges [credly.com] From: gpfsug-discuss >On Behalf Of Thomas Bernecker Sent: Thursday, 7 September 2023 11.36 To: gpfsug-discuss at gpfsug.org Subject: [EXTERNAL] [gpfsug-discuss] Spectrum Scale, InfiniBand and QoS Dear all, we recently found this presentation from 2021: https:?//www.?spectrumscaleug.?org/wp-content/uploads/2021/05/SSSD21DE-06-Improving-Spectrum-Scale-performance-using-RDMA.?pdf On page 10 it explains by setting the RDMA device in a specific ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. Report Suspicious ? 
ZjQcmQRYFpfptBannerEnd Dear all, we recently found this presentation from 2021:https://www.spectrumscaleug.org/wp-content/uploads/2021/05/SSSD21DE-06-Improving-Spectrum-Scale-performance-using-RDMA.pdf [spectrumscaleug.org] On page 10 it explains by setting the RDMA device in a specific way one could employ QoS with InfiniBand, which is something we would like to achieve as well. Has anyone a working environment using QoS with InfiniBand outside of the ESS domain? If so, would you share your experience? Sorry for asking a rather broad question, but it seems that this is not well-known stuff ? -- Best regards / Mit freundlichem Gru? Thomas Bernecker ------------------------------------------------------------------------------------------------------------------ Manager System Integration and Support, HPCE Division Mobile: +49 (1522) 2851523, Fax: +49 (211) 5369-199, Home Office: +49 (38821) 65091 NEC Deutschland GmbH, Fritz-Vomfelde-Stra?e 14-16, 40547 D?sseldorf, Germany Gesch?ftsf?hrer: Christopher Richard Jackson ? Handelsregister D?sseldorf HRB 57941 ------------------------------------------------------------------------------------------------------------------- Unless otherwise stated above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ATT00001.png Type: image/png Size: 70 bytes Desc: ATT00001.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ATT00002.png Type: image/png Size: 354 bytes Desc: ATT00002.png URL: From Thomas.Bernecker at EMEA.NEC.COM Fri Sep 8 09:52:02 2023 From: Thomas.Bernecker at EMEA.NEC.COM (Thomas Bernecker) Date: Fri, 8 Sep 2023 08:52:02 +0000 Subject: [gpfsug-discuss] Spectrum Scale, InfiniBand and QoS In-Reply-To: <714e14de4a099b0bb8e7eee369b29834dd9a8fb5.camel@de.ibm.com> References: <714e14de4a099b0bb8e7eee369b29834dd9a8fb5.camel@de.ibm.com> Message-ID: Dear Ralph, Thanks a lot for answering. We tested this yesterday in a test cluster from our customer and it works as it is supposed to be. We are using the rather outdated GPFS Version 5.1.1.4, so this appears to be working since quite some time. One can clearly see the traffic is separated by using perfquery -X -C ? -- Best regards / Mit freundlichem Gru? Thomas Bernecker ------------------------------------------------------------------------------------------------------------------ Manager System Integration and Support, HPCE Division Mobile: +49 (1522) 2851523, Fax: +49 (211) 5369-199, Home Office: +49 (38821) 65091 NEC Deutschland GmbH, Fritz-Vomfelde-Stra?e 14-16, 40547 D?sseldorf, Germany Gesch?ftsf?hrer: Christopher Richard Jackson ? 
Handelsregister D?sseldorf HRB 57941 ------------------------------------------------------------------------------------------------------------------- From: gpfsug-discuss On Behalf Of Ralph Wuerthner Sent: Thursday, September 7, 2023 5:14 PM To: gpfsug-discuss at gpfsug.org Subject: Re: [gpfsug-discuss] Spectrum Scale, InfiniBand and QoS Thomas, you are correct, in the verbsPorts configuration variable description verbsPorts = /// refers to the service level which is set for all RC queue pairs ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. Report Suspicious ? ZjQcmQRYFpfptBannerEnd Thomas, you are correct, in the verbsPorts configuration variable description verbsPorts = /// refers to the service level which is set for all RC queue pairs on the specified adapter port. This information is used by the subnet manager to control QoS. Unfortunately I'm not familiar with configuring QoS so I cannot provide additional information on this. If found the following white paper from Mellanox/Nvidia providing some more information: https://network.nvidia.com/pdf/whitepapers/deploying_qos_wp_10_19_2005.pdf [network.nvidia.com] . It is at least a starting point. Mit freundlichen Gr??en / Kind regards Ralph W?rthner IBM Storage Scale Development IBM Systems & Technology Group, Systems Software Development ________________________________ Mobile: +49 (0) 171 3089472 IBM Deutschland Research & Development GmbH [cid:image002.png at 01D9E241.EA4A26A0] Email: ralph.wuerthner at de.ibm.com Wilhelm-Fay-Str. 32 65936 Frankfurt Germany ________________________________ IBM Data Privacy Statement [ibm.com] IBM Deutschland Research & Development GmbH / Vorsitzende des Aufsichtsrats: Gregor Pillen Gesch?ftsf?hrung: David Faller Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 -----Original Message----- From: Thomas Bernecker > Reply-To: gpfsug main discussion list > To: gpfsug main discussion list > Subject: [EXTERNAL] Re: [gpfsug-discuss] Spectrum Scale, InfiniBand and QoS Date: 09/07/2023 11:02:02 AM Hi Luis, Thanks for your quick response. I was referring to fabric QoS, not Spectrum Scale QoS (which I have some experience with). The reference to slide 10 of said presentation says to configure the RDMA device as follows to use QoS verbsPorts ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. Report Suspicious ? ZjQcmQRYFpfptBannerEnd Hi Luis, Thanks for your quick response. I was referring to fabric QoS, not Spectrum Scale QoS (which I have some experience with). The reference to slide 10 of said presentation says to configure the RDMA device as follows to use QoS verbsPorts = /// ? List of RDMA ports to be used ? : RDMA device, required, e.g. mlx5_0, mlx5_1 ? : RDMA port on device, default 1, valid values are 1 or 2 ? : virtual fabric number, default 0, valid values are >= 0 Only verbsPorts using a common are connected ? : QoS level, default is 0, valid values defined in SM configuration The references to SM and verbsPorts being part of RDMA configuration indicates (to me) that this is fabric QoS ? -- Best regards / Mit freundlichem Gru? 
Thomas Bernecker ------------------------------------------------------------------------------------------------------------------ Manager System Integration and Support, HPCE Division Mobile: +49 (1522) 2851523, Fax: +49 (211) 5369-199, Home Office: +49 (38821) 65091 NEC Deutschland GmbH, Fritz-Vomfelde-Stra?e 14-16, 40547 D?sseldorf, Germany Gesch?ftsf?hrer: Christopher Richard Jackson ? Handelsregister D?sseldorf HRB 57941 ------------------------------------------------------------------------------------------------------------------- From: gpfsug-discuss >On Behalf Of Luis Bolinches Sent: Thursday, September 7, 2023 10:51 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Spectrum Scale, InfiniBand and QoS Hi Assuming you are talking about GPFS QoS and not any fabric QoS. It would work with any supported fabric, ESS or not. It is a filesystem feature well above all the fabrics and HW. Limits IOPS per class https:?//www.?ibm.?com/docs/en/storage-scale/5.?1.?8?topic=reference-mmchqos-command ZjQcmQRYFpfptBannerStart Be Careful With This Message The sender's identity could not be verified and someone may be impersonating the sender. Report Suspicious ? ZjQcmQRYFpfptBannerEnd Hi Assuming you are talking about GPFS QoS and not any fabric QoS. It would work with any supported fabric, ESS or not. It is a filesystem feature well above all the fabrics and HW. Limits IOPS per class https://www.ibm.com/docs/en/storage-scale/5.1.8?topic=reference-mmchqos-command [ibm.com] [ibm.com] I recommend (depending on number of nodes and seconds) fine-stats so then you can visualize which PID on which client node is doing what. I believe I did something of that on SSUG London 18 or 19 -- Yst?v?llisin terveisin/Regards/Saludos/Salutations/Salutacions Luis Bolinches Executive IT Specialist IBM Storage ScaleServer (formerly ESS) developer Phone: +358503112585 Ab IBM Finland Oy Toinen linja 7 00530 Helsinki Uusimaa - Finland Visitors entrance: Siltasaarenkatu 22 "If you always give you will always have" -- Anonymous https://www.credly.com/users/luis-bolinches/badges [credly.com] [credly.com] From: gpfsug-discuss >On Behalf Of Thomas Bernecker Sent: Thursday, 7 September 2023 11.36 To: gpfsug-discuss at gpfsug.org Subject: [EXTERNAL] [gpfsug-discuss] Spectrum Scale, InfiniBand and QoS Dear all, we recently found this presentation from 2021: https:?//www.?spectrumscaleug.?org/wp-content/uploads/2021/05/SSSD21DE-06-Improving-Spectrum-Scale-performance-using-RDMA.?pdf On page 10 it explains by setting the RDMA device in a specific ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. Report Suspicious ? ZjQcmQRYFpfptBannerEnd Dear all, we recently found this presentation from 2021:https://www.spectrumscaleug.org/wp-content/uploads/2021/05/SSSD21DE-06-Improving-Spectrum-Scale-performance-using-RDMA.pdf [spectrumscaleug.org] [spectrumscaleug.org] On page 10 it explains by setting the RDMA device in a specific way one could employ QoS with InfiniBand, which is something we would like to achieve as well. Has anyone a working environment using QoS with InfiniBand outside of the ESS domain? If so, would you share your experience? Sorry for asking a rather broad question, but it seems that this is not well-known stuff ? -- Best regards / Mit freundlichem Gru? 
Thomas Bernecker ------------------------------------------------------------------------------------------------------------------ Manager System Integration and Support, HPCE Division Mobile: +49 (1522) 2851523, Fax: +49 (211) 5369-199, Home Office: +49 (38821) 65091 NEC Deutschland GmbH, Fritz-Vomfelde-Stra?e 14-16, 40547 D?sseldorf, Germany Gesch?ftsf?hrer: Christopher Richard Jackson ? Handelsregister D?sseldorf HRB 57941 ------------------------------------------------------------------------------------------------------------------- Unless otherwise stated above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org [gpfsug.org] -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 70 bytes Desc: image001.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.png Type: image/png Size: 354 bytes Desc: image002.png URL: From ralph.wuerthner at de.ibm.com Fri Sep 8 10:18:00 2023 From: ralph.wuerthner at de.ibm.com (Ralph Wuerthner) Date: Fri, 8 Sep 2023 09:18:00 +0000 Subject: [gpfsug-discuss] Spectrum Scale, InfiniBand and QoS In-Reply-To: References: <714e14de4a099b0bb8e7eee369b29834dd9a8fb5.camel@de.ibm.com> Message-ID: <15a7778c3dc10758775ebdd69a94168d7c61bdbc.camel@de.ibm.com> Thomas, I'm happy that everything is working as expected! We are using the rather outdated GPFS Version 5.1.1.4, so this appears to be working since quite some time. Yes, support for setting the service level is available since 5.0.0 ptf 2. Mit freundlichen Gr??en / Kind regards Ralph W?rthner IBM Storage Scale Development IBM Systems & Technology Group, Systems Software Development ________________________________ Mobile: +49 (0) 171 3089472 IBM Deutschland Research & Development GmbH [cid:2b27ba0767fae0bc931a013bf0c1c559584bab44.camel at de.ibm.com-1] Email: ralph.wuerthner at de.ibm.com Wilhelm-Fay-Str. 32 65936 Frankfurt Germany ________________________________ IBM Data Privacy Statement IBM Deutschland Research & Development GmbH / Vorsitzende des Aufsichtsrats: Gregor Pillen Gesch?ftsf?hrung: David Faller Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 -----Original Message----- From: Thomas Bernecker > Reply-To: gpfsug main discussion list > To: gpfsug main discussion list > Subject: [EXTERNAL] Re: [gpfsug-discuss] Spectrum Scale, InfiniBand and QoS Date: 09/08/2023 10:52:02 AM Dear Ralph, Thanks a lot for answering. We tested this yesterday in a test cluster from our customer and it works as it is supposed to be. We are using the rather outdated GPFS Version 5.?1.?1.?4, so this appears to be working since quite some ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. Report Suspicious ZjQcmQRYFpfptBannerEnd Dear Ralph, Thanks a lot for answering. We tested this yesterday in a test cluster from our customer and it works as it is supposed to be. We are using the rather outdated GPFS Version 5.1.1.4, so this appears to be working since quite some time. One can clearly see the traffic is separated by using perfquery -X -C ? -- Best regards / Mit freundlichem Gru? 
Thomas Bernecker ------------------------------------------------------------------------------------------------------------------ Manager System Integration and Support, HPCE Division Mobile: +49 (1522) 2851523, Fax: +49 (211) 5369-199, Home Office: +49 (38821) 65091 NEC Deutschland GmbH, Fritz-Vomfelde-Stra?e 14-16, 40547 D?sseldorf, Germany Gesch?ftsf?hrer: Christopher Richard Jackson ? Handelsregister D?sseldorf HRB 57941 ------------------------------------------------------------------------------------------------------------------- From: gpfsug-discuss On Behalf Of Ralph Wuerthner Sent: Thursday, September 7, 2023 5:14 PM To: gpfsug-discuss at gpfsug.org Subject: Re: [gpfsug-discuss] Spectrum Scale, InfiniBand and QoS Thomas, you are correct, in the verbsPorts configuration variable description verbsPorts = /// refers to the service level which is set for all RC queue pairs ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. Report Suspicious ? ZjQcmQRYFpfptBannerEnd Thomas, you are correct, in the verbsPorts configuration variable description verbsPorts = /// refers to the service level which is set for all RC queue pairs on the specified adapter port. This information is used by the subnet manager to control QoS. Unfortunately I'm not familiar with configuring QoS so I cannot provide additional information on this. If found the following white paper from Mellanox/Nvidia providing some more information: https://network.nvidia.com/pdf/whitepapers/deploying_qos_wp_10_19_2005.pdf [network.nvidia.com] . It is at least a starting point. Mit freundlichen Gr??en / Kind regards Ralph W?rthner IBM Storage Scale Development IBM Systems & Technology Group, Systems Software Development ________________________________ Mobile: +49 (0) 171 3089472 IBM Deutschland Research & Development GmbH [cid:image002.png at 01D9E241.EA4A26A0] Email: ralph.wuerthner at de.ibm.com Wilhelm-Fay-Str. 32 65936 Frankfurt Germany ________________________________ IBM Data Privacy Statement [ibm.com] IBM Deutschland Research & Development GmbH / Vorsitzende des Aufsichtsrats: Gregor Pillen Gesch?ftsf?hrung: David Faller Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 -----Original Message----- From: Thomas Bernecker > Reply-To: gpfsug main discussion list > To: gpfsug main discussion list > Subject: [EXTERNAL] Re: [gpfsug-discuss] Spectrum Scale, InfiniBand and QoS Date: 09/07/2023 11:02:02 AM Hi Luis, Thanks for your quick response. I was referring to fabric QoS, not Spectrum Scale QoS (which I have some experience with). The reference to slide 10 of said presentation says to configure the RDMA device as follows to use QoS verbsPorts ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. Report Suspicious ? ZjQcmQRYFpfptBannerEnd Hi Luis, Thanks for your quick response. I was referring to fabric QoS, not Spectrum Scale QoS (which I have some experience with). The reference to slide 10 of said presentation says to configure the RDMA device as follows to use QoS verbsPorts = /// ? List of RDMA ports to be used ? : RDMA device, required, e.g. mlx5_0, mlx5_1 ? : RDMA port on device, default 1, valid values are 1 or 2 ? : virtual fabric number, default 0, valid values are >= 0 Only verbsPorts using a common are connected ? 
: QoS level, default is 0, valid values defined in SM configuration The references to SM and verbsPorts being part of RDMA configuration indicates (to me) that this is fabric QoS ? -- Best regards / Mit freundlichem Gru? Thomas Bernecker ------------------------------------------------------------------------------------------------------------------ Manager System Integration and Support, HPCE Division Mobile: +49 (1522) 2851523, Fax: +49 (211) 5369-199, Home Office: +49 (38821) 65091 NEC Deutschland GmbH, Fritz-Vomfelde-Stra?e 14-16, 40547 D?sseldorf, Germany Gesch?ftsf?hrer: Christopher Richard Jackson ? Handelsregister D?sseldorf HRB 57941 ------------------------------------------------------------------------------------------------------------------- From: gpfsug-discuss >On Behalf OfLuis Bolinches Sent: Thursday, September 7, 2023 10:51 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Spectrum Scale, InfiniBand and QoS Hi Assuming you are talking about GPFS QoS and not any fabric QoS. It would work with any supported fabric, ESS or not. It is a filesystem feature well above all the fabrics and HW. Limits IOPS per class https:?//www.?ibm.?com/docs/en/storage-scale/5.?1.?8?topic=reference-mmchqos-command ZjQcmQRYFpfptBannerStart Be Careful With This Message The sender's identity could not be verified and someone may be impersonating the sender. Report Suspicious ? ZjQcmQRYFpfptBannerEnd Hi Assuming you are talking about GPFS QoS and not any fabric QoS. It would work with any supported fabric, ESS or not. It is a filesystem feature well above all the fabrics and HW. Limits IOPS per class https://www.ibm.com/docs/en/storage-scale/5.1.8?topic=reference-mmchqos-command [ibm.com] [ibm.com] I recommend (depending on number of nodes and seconds) fine-stats so then you can visualize which PID on which client node is doing what. I believe I did something of that on SSUG London 18 or 19 -- Yst?v?llisin terveisin/Regards/Saludos/Salutations/Salutacions Luis Bolinches Executive IT Specialist IBM Storage ScaleServer (formerly ESS) developer Phone: +358503112585 Ab IBM Finland Oy Toinen linja 7 00530 Helsinki Uusimaa - Finland Visitors entrance: Siltasaarenkatu 22 "If you always give you will always have" -- Anonymous https://www.credly.com/users/luis-bolinches/badges [credly.com] [credly.com] From: gpfsug-discuss >On Behalf OfThomas Bernecker Sent: Thursday, 7 September 2023 11.36 To: gpfsug-discuss at gpfsug.org Subject: [EXTERNAL] [gpfsug-discuss] Spectrum Scale, InfiniBand and QoS Dear all, we recently found this presentation from 2021: https:?//www.?spectrumscaleug.?org/wp-content/uploads/2021/05/SSSD21DE-06-Improving-Spectrum-Scale-performance-using-RDMA.?pdf On page 10 it explains by setting the RDMA device in a specific ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. Report Suspicious ? ZjQcmQRYFpfptBannerEnd Dear all, we recently found this presentation from 2021:https://www.spectrumscaleug.org/wp-content/uploads/2021/05/SSSD21DE-06-Improving-Spectrum-Scale-performance-using-RDMA.pdf [spectrumscaleug.org] [spectrumscaleug.org] On page 10 it explains by setting the RDMA device in a specific way one could employ QoS with InfiniBand, which is something we would like to achieve as well. Has anyone a working environment using QoS with InfiniBand outside of the ESS domain? If so, would you share your experience? Sorry for asking a rather broad question, but it seems that this is not well-known stuff ? 
Unless otherwise stated above:
Oy IBM Finland Ab
PL 265, 00101 Helsinki, Finland
Business ID, Y-tunnus: 0195876-3
Registered in Finland

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org

From p.childs at qmul.ac.uk Fri Sep 8 20:05:52 2023
From: p.childs at qmul.ac.uk (Peter Childs)
Date: Fri, 8 Sep 2023 19:05:52 +0000
Subject: [gpfsug-discuss] [EXTERNAL] mmbackup feature request
In-Reply-To:
References:
Message-ID:

We have a python script that parses the mmbackup output log and outputs a number of bits of data, mostly the success/failure of the backup run and the date of the last backup, so we can monitor how long it has been since the last successful backup and alert us when it's more than so many hours since the last backup.

The script can also pick up the % of progress through the backup, but I've not got around to getting it working quite yet.

If people are interested I could look at checking it over and making it available.

Peter Childs
ITS Research Storage
Queen Mary University of London

________________________________________
From: gpfsug-discuss on behalf of Fosburgh,Jonathan
Sent: 07 September 2023 11:44 AM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] [EXTERNAL] mmbackup feature request
You might be interested in looking at Ideas GPFS-I-975 and GPFS-I-259.

--
Jonathan Fosburgh, MS, CAPM
Principal Application Systems Analyst
IT Engineering Storage Team
The University of Texas MD Anderson Cancer Center
(713) 745-9346

From: gpfsug-discuss on behalf of Jonathan Buzzard
Date: Wednesday, September 6, 2023 at 04:04
To: gpfsug main discussion list
Subject: [EXTERNAL] [gpfsug-discuss] mmbackup feature request

Would it be possible to have the mmbackup output display the percentage progress when backing up files?

So at the top you see something like this

Tue Sep 5 23:13:35 2023 mmbackup:changed=747204, expired=427702, unsupported=0 for server [XXXX]

Then after it does the expiration you see during the backup lines like

Wed Sep 6 02:43:53 2023 mmbackup:Backing up files: 527024 backed up, 426018 expired, 4408 failed. (Backup job exit with 4)

It would IMHO be helpful if it looked like

Wed Sep 6 02:43:53 2023 mmbackup:Backing up files: 527024 (70.5%) backed up, 426018 (100%) expired, 4408 failed. (Backup job exit with 4)

Just based on the number of files. Though as I look at it now I am curious about the discrepancy in the number of files expired, given that the expiration stage allegedly concluded with no errors?

Tue Sep 5 23:21:49 2023 mmbackup:Completed policy expiry run with 0 policy errors, 0 files failed, 0 severe errors, returning rc=0.
Tue Sep 5 23:21:49 2023 mmbackup:Policy for expiry returned 0 Highest TSM error 0

JAB.

--
Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
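For anyone wanting to approximate the percentage in the meantime, here is a minimal shell sketch of the kind of parsing such a script can do (assuming the mmbackup output has been captured to a file, here called mmbackup.log; the awk field index is taken from the sample lines above and may well differ between releases):

  log=mmbackup.log   # assumed name of the captured mmbackup output
  # last reported total of changed files, e.g. 747204
  total=$(grep -o 'mmbackup:changed=[0-9]*' "$log" | tail -1 | cut -d= -f2)
  # last reported count of files backed up so far, e.g. 527024 (field 9 of the progress line)
  backed=$(grep 'mmbackup:Backing up files:' "$log" | tail -1 | awk '{print $9}')
  awk -v b="$backed" -v t="$total" 'BEGIN { printf "%d of %d changed files backed up (%.1f%%)\n", b, t, 100*b/t }'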
From pradeep.srinivasagam at outlook.com Tue Sep 12 08:53:40 2023
From: pradeep.srinivasagam at outlook.com (Pradeep Srinivasagam)
Date: Tue, 12 Sep 2023 07:53:40 +0000
Subject: [gpfsug-discuss] Reg: Help/Process on Spectrum Scale Encryption key renewal using SKLM
Message-ID:

Dear All,

I'm Pradeep, and I manage Spectrum Scale in a stretched cluster environment for a financial institution. Prior to this, I was supporting GPFS protocol nodes in the Media & Entertainment industry using a tailored environment.

City: Basingstoke, UK
Country: United Kingdom

Here is my first post, for which I am asking for clarity.

Subject: Query on renewing the certificates for Spectrum Scale via SKLM.

Environment: Spectrum Scale Version: 5.1.1

We have 2 certificates present that seem to be authenticating to SKLM. One expires in October (REST) and one next year (KMIP).
[inline screenshot]

We are currently therefore seeing the rkmconf_certexp_warn event within the node health status.
[inline screenshot]

Query 1:

We want to update the REST certificate; we have a key group set up in SKLM where the keys for Scale are held - it is labelled as follows
[inline screenshot]

The key is stored in SKLM within these management groups. The question we have is, in terms of updating the key on the Spectrum Scale environment - basically - how do we do it?

So, we would like if possible a step-by-step guide on how to replace the key on the Spectrum Scale side and how that interacts with SKLM. As encryption is already up and running and we are just refreshing / renewing an existing deployment, I am really looking to know what I need to do and in what order, and where we drop between SKLM activities and Scale activities. Also, once we have the key in place, does it just propagate to all servers within Scale once one has picked it up?

Example:
1. Create a key within Scale
2. Add third party data to key, and then chain together using the Scale utility - example below
   [inline screenshot]
3. Register key - in effect, how do we get the server to "pick up" the new key?
4. Copy key over to SKLM server
5. Add key to SKLM within the existing group
6. Create a file, check it is encrypted

Query 2:

What is the KMIP certificate for, and how do we renew that certificate before it expires?
[inline screenshot]

Thanks in advance

Regards
Pradeep S

From wallyd at protonmail.com Tue Sep 12 13:57:08 2023
From: wallyd at protonmail.com (Wally Dietrich)
Date: Tue, 12 Sep 2023 12:57:08 +0000
Subject: [gpfsug-discuss] Question regarding quorum nodes
Message-ID:

Hi, Tina. I hope the issues have been resolved.
You said that "I have a number of nodes which now have stale file handles." In my experience, that happens pretty often when an NSD node goes down, even if another NSD node serves the needed data and metadata. I was almost always able to fix the stale file handle problem by running the mmmount command again on the clients that were getting the stale file system handles errors. I might have had to run mmumount first, but probably not. Regarding adding a quorum node, I'm not sure it will have much benefit because you will still have a single point of failure (because you only have one working NSD node). If you can get the failed NSD node back up quickly, you might not need to add another quorum node. Wally email: wallyd at protonmail.com ------- Original Message ------- On Thursday, September 7th, 2023 at 9:39 AM, gpfsug-discuss-request at gpfsug.org wrote: > Send gpfsug-discuss mailing list submissions to > gpfsug-discuss at gpfsug.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org > or, via email, send a message with subject or body 'help' to > gpfsug-discuss-request at gpfsug.org > > You can reach the person managing the list at > gpfsug-discuss-owner at gpfsug.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of gpfsug-discuss digest..." > > > Today's Topics: > > 1. Question regarding quorum nodes (Tina Friedrich) > 2. Re: Question regarding quorum nodes (Danny Lang) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Thu, 7 Sep 2023 14:28:55 +0100 > From: Tina Friedrich tina.friedrich at it.ox.ac.uk > > To: "gpfsug-discuss at gpfsug.org" gpfsug-discuss at gpfsug.org > > Cc: theteam at arc.ox.ac.uk > > Subject: [gpfsug-discuss] Question regarding quorum nodes > Message-ID: f0949731-e02b-99f1-a693-d67e2773fac4 at it.ox.ac.uk > > Content-Type: text/plain; charset="UTF-8"; format=flowed > > Hello All, > > I hope someone can answer this quickly! > > We have - it seems - just lost one of our NSDs.The other took over as it > should - but unfortunately, the protocol nodes (i.e. I have a number of > nodes which now have stale file handles. (The file system is accessible > on the remaining NSD server and a number of other clients.) > > Unfortunately, in the 'home' cluster - the one that ones the disks - I > only have, five servers; three of which are quorum nodes, and of course > the failed NSD server is one of them. > > The question is - can I add quorum nodes with one node down, and can I > remove quorum functionality from a failed node? > > Tina > > > > ------------------------------ > > Message: 2 > Date: Thu, 7 Sep 2023 13:39:00 +0000 > From: Danny Lang danny.lang at crick.ac.uk > > To: "gpfsug-discuss at gpfsug.org" gpfsug-discuss at gpfsug.org > > Cc: "theteam at arc.ox.ac.uk" theteam at arc.ox.ac.uk > > Subject: Re: [gpfsug-discuss] Question regarding quorum nodes > Message-ID: > DB9PR05MB8715AC665CCA2E428F5DCBF9C0EEA at DB9PR05MB8715.eurprd05.prod.outlook.com > > > Content-Type: text/plain; charset="utf-8" > > Hi Tina, > > The command you're looking for is: > > > mmchnode > > https://www.ibm.com/docs/en/storage-scale/4.2.0?topic=commands-mmchnode-command > > This will allow you to add quorum nodes and to remove. > > ------ > > I would advise checking everything prior to running commands. ? 
> ________________________________
> From: gpfsug-discuss gpfsug-discuss-bounces at gpfsug.org on behalf of Tina Friedrich tina.friedrich at it.ox.ac.uk
> Sent: 07 September 2023 2:28 PM
> To: gpfsug-discuss at gpfsug.org
> Cc: theteam at arc.ox.ac.uk
> Subject: [gpfsug-discuss] Question regarding quorum nodes
>
> Hello All,
>
> I hope someone can answer this quickly!
>
> We have - it seems - just lost one of our NSDs. The other took over as it
> should - but unfortunately, the protocol nodes (i.e. I have a number of
> nodes which now have stale file handles. (The file system is accessible
> on the remaining NSD server and a number of other clients.)
>
> Unfortunately, in the 'home' cluster - the one that owns the disks - I
> only have five servers, three of which are quorum nodes, and of course
> the failed NSD server is one of them.
>
> The question is - can I add quorum nodes with one node down, and can I
> remove quorum functionality from a failed node?
>
> Tina
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at gpfsug.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at gpfsug.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
>
> ------------------------------
>
> End of gpfsug-discuss Digest, Vol 138, Issue 10
> ***********************************************

From ivano.talamo at psi.ch Thu Sep 14 14:25:09 2023
From: ivano.talamo at psi.ch (Talamo Ivano Giuseppe)
Date: Thu, 14 Sep 2023 13:25:09 +0000
Subject: [gpfsug-discuss] Unexpected permissions with ACLs
Message-ID:

Hi all,

I am currently working with ACLs to find out a proper set that would fit our use case. And narrowing down I found out a very simple case that looks quite weird.

The use case is the following. I create a directory with 2770 mode and root:p15875 ownership, without applying any explicit ACLs.
The system returns this as the default ACLs generated by the permissions/mode:

#NFSv4 ACL
#owner:root
#group:p15875
special:owner@:rwxc:allow
 (X)READ/LIST (X)WRITE/CREATE (X)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL (X)READ_ATTR (X)READ_NAMED
 (-)DELETE (X)DELETE_CHILD (X)CHOWN (X)EXEC/SEARCH (X)WRITE_ACL (X)WRITE_ATTR (X)WRITE_NAMED

special:group@:rwxc:allow
 (X)READ/LIST (X)WRITE/CREATE (X)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL (X)READ_ATTR (X)READ_NAMED
 (-)DELETE (X)DELETE_CHILD (X)CHOWN (X)EXEC/SEARCH (X)WRITE_ACL (X)WRITE_ATTR (X)WRITE_NAMED

special:everyone@:----:allow
 (-)READ/LIST (-)WRITE/CREATE (-)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL (X)READ_ATTR (X)READ_NAMED
 (-)DELETE (-)DELETE_CHILD (-)CHOWN (-)EXEC/SEARCH (-)WRITE_ACL (-)WRITE_ATTR (-)WRITE_NAMED

If I touch a new file inside that dir with a user that is a member of that group, it gets created with 644. So far so good.

Now if via mmeditacl I add the following entry to the ACL of the dir, new files get created with 000 permissions. The new entry is the following:

special:group@:rwx-:allow:DirInherit:InheritOnly
 (X)READ/LIST (X)WRITE/CREATE (X)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL (X)READ_ATTR (X)READ_NAMED
 (-)DELETE (X)DELETE_CHILD (X)CHOWN (X)EXEC/SEARCH (X)WRITE_ACL (X)WRITE_ATTR (X)WRITE_NAMED

According to the manual, the DirInherit:InheritOnly should guarantee that the entry applies only to new subdirectories, but now it is also affecting new files in the main dir. Is this an expected behavior?

The filesystem version is 5.1.5.0 and it is configured with nfs4 ACLs only.

In general, I am struggling a lot with the NFSv4 ACLs and I also find the IBM documentation [1] quite poor in this context. So if someone can point me to better resources that would be very welcome.

Thanks,
Ivano

[1] https://www.ibm.com/docs/en/storage-scale/5.0.2?topic=administration-nfs-v4-acl-syntax

__________________________________________
Paul Scherrer Institut
Ivano Talamo
WHGA/038
Forschungsstrasse 111
5232 Villigen PSI
Schweiz

Phone: +41 56 310 47 11
E-Mail: ivano.talamo at psi.ch

From christof.schmitt at us.ibm.com Thu Sep 14 17:02:00 2023
From: christof.schmitt at us.ibm.com (Christof Schmitt)
Date: Thu, 14 Sep 2023 16:02:00 +0000
Subject: [gpfsug-discuss] Unexpected permissions with ACLs
In-Reply-To:
References:
Message-ID: <2c3f6eef655e3b53b52a0622d18520257253028b.camel@us.ibm.com>

On Thu, 2023-09-14 at 13:25 +0000, Talamo Ivano Giuseppe wrote:
> #NFSv4 ACL
> #owner:root
> #group:p15875
> special:owner@:rwxc:allow
> special:group@:rwxc:allow
> special:everyone@:----:allow
>
> If I touch a new file inside that dir with a user that is a member of
> that group, it gets created with 644. So far so good.

Yes. The rule for the modebits is quite simple: the read/write/execute permission bits from the special:owner/group/everyone entries map to the modebits shown.
Note that in this case, the modebits are just a limited view of the actual ACL; there are more permission bits that are not shown, and there could be additional ACL entries that are also not reflected in the modebits.

> Now if via mmeditacl I add the following entry to the ACL of the dir,
> new files get created with 000 permissions. The new entry is the
> following:
>
> special:group@:rwx-:allow:DirInherit:InheritOnly
>
> According to the manual, the DirInherit:InheritOnly should guarantee
> that the entry applies only to the new subdirectories but now it is
> also affecting new files in the main dir.
> Is this an expected behavior?

From an ACL perspective, yes. "InheritOnly" indicates that this entry does not grant any permissions on the directory. It is only copied as an entry to new files or subdirectories created in this directory. So if this is the only ACL entry, there are indeed no permissions on this directory.

You can remove the "InheritOnly" bit, then this would also grant permission on the directory. Or you can add another ACL entry that grants permissions on the directory.

> The filesystem version is 5.1.5.0 and is configured with nfs4 ACLs
> only.

This behavior is pretty much independent from the Scale version; it has been around for a long time.

> In general, am struggling a lot with the NFS4 ACLs and I also find
> the IBM documentation [1] quite poor in this context. So if someone
> can point me to better resources that would be very welcome.

Essentially there can be many entries in an ACL ("ACEs"). Each entry grants permissions for the specified principal (or denies them if it is a DENY ACE, but that is less commonly used). On creating files or subdirectories, the "inherit" flags specify which entries are copied to the new file/subdirectory. That is the very short version, but this could probably be documented in a better way.

Regards,
Christof

From ivano.talamo at psi.ch Thu Sep 14 20:17:50 2023
From: ivano.talamo at psi.ch (Talamo Ivano Giuseppe)
Date: Thu, 14 Sep 2023 19:17:50 +0000
Subject: [gpfsug-discuss] Unexpected permissions with ACLs
In-Reply-To: <2c3f6eef655e3b53b52a0622d18520257253028b.camel@us.ibm.com>
References: <2c3f6eef655e3b53b52a0622d18520257253028b.camel@us.ibm.com>
Message-ID:

________________________________
From: gpfsug-discuss on behalf of Christof Schmitt
Sent: 14 September 2023 18:02
To: gpfsug-discuss at gpfsug.org
Subject: Re: [gpfsug-discuss] Unexpected permissions with ACLs

> special:group@:rwx-:allow:DirInherit:InheritOnly
>
> According to the manual, the DirInherit:InheritOnly should guarantee
> that the entry applies only to the new subdirectories but now it is
> also affecting new files in the main dir.
> Is this an expected behavior?

From an ACL perspective, yes. "InheritOnly" indicates that this entry does not grant any permissions on the directory. It is only copied as an entry to new files or subdirectories created in this directory. So if this is the only ACL entry, there are indeed no permissions on this directory.

You can remove the "InheritOnly" bit, then this would also grant permission on the directory.
Or you can add another ACL entry that grants permissions on the directory.

I may have not been clear enough, but that ACE has only been added to the previous 3 ACEs, so the complete ACL for the directory is the following:

#NFSv4 ACL
#owner:root
#group:p15875
special:owner@:rwxc:allow
 (X)READ/LIST (X)WRITE/CREATE (X)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL (X)READ_ATTR (X)READ_NAMED
 (-)DELETE (X)DELETE_CHILD (X)CHOWN (X)EXEC/SEARCH (X)WRITE_ACL (X)WRITE_ATTR (X)WRITE_NAMED

special:group@:rwx-:allow
 (X)READ/LIST (X)WRITE/CREATE (X)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL (X)READ_ATTR (X)READ_NAMED
 (-)DELETE (X)DELETE_CHILD (X)CHOWN (X)EXEC/SEARCH (X)WRITE_ACL (X)WRITE_ATTR (X)WRITE_NAMED

special:everyone@:----:allow
 (-)READ/LIST (-)WRITE/CREATE (-)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL (X)READ_ATTR (X)READ_NAMED
 (-)DELETE (-)DELETE_CHILD (-)CHOWN (-)EXEC/SEARCH (-)WRITE_ACL (-)WRITE_ATTR (-)WRITE_NAMED

special:owner@:rwx-:allow:DirInherit:InheritOnly
 (X)READ/LIST (X)WRITE/CREATE (X)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL (X)READ_ATTR (X)READ_NAMED
 (-)DELETE (X)DELETE_CHILD (X)CHOWN (X)EXEC/SEARCH (X)WRITE_ACL (X)WRITE_ATTR (X)WRITE_NAMED

So I would expect that the first 3 would still produce a 644 mode on the file. Am I wrong?

Cheers,
Ivano

From scl at virginia.edu Thu Sep 14 22:13:26 2023
From: scl at virginia.edu (Losen, Stephen C (scl))
Date: Thu, 14 Sep 2023 21:13:26 +0000
Subject: [gpfsug-discuss] Unexpected permissions with ACLs
In-Reply-To:
References: <2c3f6eef655e3b53b52a0622d18520257253028b.camel@us.ibm.com>
Message-ID:

Hi, you have specified an inherited (inherit only) ACE in the parent directory's ACL, so that disables umask. New files will only get whatever ACEs they inherit. Since the only inherited ACE pertains to directories and not files, a new file inherits no ACEs, and hence ends up with zero permissions (which I suspect is no ACEs at all in its ACL). So you need to specify some inherited ACEs for files as well as directories. Or else use FileInherit:DirInherit:InheritOnly.

If the parent directory has no inherited ACEs in its ACL then the permissions on new files/directories are controlled by umask.

Steve Losen
Research Computing
University of Virginia
scl at virginia.edu  434-924-0640

From: gpfsug-discuss on behalf of Talamo Ivano Giuseppe
Reply-To: gpfsug main discussion list
Date: Thursday, September 14, 2023 at 3:19 PM
To: "gpfsug-discuss at gpfsug.org"
Subject: Re: [gpfsug-discuss] Unexpected permissions with ACLs

________________________________
From: gpfsug-discuss on behalf of Christof Schmitt
Sent: 14 September 2023 18:02
To: gpfsug-discuss at gpfsug.org
Subject: Re: [gpfsug-discuss] Unexpected permissions with ACLs

> special:group@:rwx-:allow:DirInherit:InheritOnly
>
> According to the manual, the DirInherit:InheritOnly should guarantee
> that the entry applies only to the new subdirectories but now it is
> also affecting new files in the main dir.
> Is this an expected behavior?

From an ACL perspective, yes. "InheritOnly" indicates that this entry does not grant any permissions on the directory. It is only copied as an entry to new files or subdirectories created in this directory. So if this is the only ACL entry, there are indeed no permissions on this directory.
You can remove the "InheritOnly" bit, then this would also grant permission on the directory. Or you can add another ACL entry that grants permissions on the directory. I may have not been clear enough, but that ACE has only been added to the previous 3 ACEs, so the complete one for the directory is the following: #NFSv4 ACL #owner:root #group:p15875 special:owner@:rwxc:allow (X)READ/LIST (X)WRITE/CREATE (X)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL (X)READ_ATTR (X)READ_NAMED (-)DELETE (X)DELETE_CHILD (X)CHOWN (X)EXEC/SEARCH (X)WRITE_ACL (X)WRITE_ATTR (X)WRITE_NAMED special:group@:rwx-:allow (X)READ/LIST (X)WRITE/CREATE (X)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL (X)READ_ATTR (X)READ_NAMED (-)DELETE (X)DELETE_CHILD (X)CHOWN (X)EXEC/SEARCH (X)WRITE_ACL (X)WRITE_ATTR (X)WRITE_NAMED special:everyone@:----:allow (-)READ/LIST (-)WRITE/CREATE (-)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL (X)READ_ATTR (X)READ_NAMED (-)DELETE (-)DELETE_CHILD (-)CHOWN (-)EXEC/SEARCH (-)WRITE_ACL (-)WRITE_ATTR (-)WRITE_NAMED special:owner@:rwx-:allow:DirInherit:InheritOnly (X)READ/LIST (X)WRITE/CREATE (X)APPEND/MKDIR (X)SYNCHRONIZE (X)READ_ACL (X)READ_ATTR (X)READ_NAMED (-)DELETE (X)DELETE_CHILD (X)CHOWN (X)EXEC/SEARCH (X)WRITE_ACL (X)WRITE_ATTR (X)WRITE_NAMED So I would expect that the first 3 would still produce a 644 mode on the file. Am I wrong? Cheers, Ivano -------------- next part -------------- An HTML attachment was scrubbed... URL: From christof.schmitt at us.ibm.com Fri Sep 15 00:44:29 2023 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Thu, 14 Sep 2023 23:44:29 +0000 Subject: [gpfsug-discuss] Unexpected permissions with ACLs In-Reply-To: References: <2c3f6eef655e3b53b52a0622d18520257253028b.camel@us.ibm.com> Message-ID: <9a4b37a45c0505d4ad1baceb43077bbf164f6ea8.camel@us.ibm.com> Yes. If there is one ACL entry that is inherited to a new file/directory, then only that inherited entry will be in the ACL of the new file/directory. The special case here is now that there is one inheritable entry, but that is flagged with DirInherit, so it does not apply to new files. Regards, Christof On Thu, 2023-09-14 at 21:13 +0000, Losen, Stephen C (scl) wrote: Hi, you have specified an inherited (inherit only) ACE in the parent directory?s ACL, so that disables umask. New files will only get whatever ACEs they inherit. Since the only inherited ACE pertains to directories and not files, a new file ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. Report Suspicious ZjQcmQRYFpfptBannerEnd Hi, you have specified an inherited (inherit only) ACE in the parent directory?s ACL, so that disables umask. New files will only get whatever ACEs they inherit. Since the only inherited ACE pertains to directories and not files, a new file inherits no ACEs, and hence ends up with zero permissions (which I suspect is no ACEs at all in its ACL). So you need to specify some inherited ACEs for files as well as directories. Or else use FileInherit:DirInherit:InheritOnly If the parent directory has no inherited ACEs in its ACL then the permissions on new files/directories are controlled by umask. 
From: gpfsug-discuss on behalf of Talamo Ivano Giuseppe
Reply-To: gpfsug main discussion list
Date: Thursday, September 14, 2023 at 3:19 PM
To: "gpfsug-discuss at gpfsug.org"
Subject: Re: [gpfsug-discuss] Unexpected permissions with ACLs

________________________________
From: gpfsug-discuss on behalf of Christof Schmitt
Sent: 14 September 2023 18:02
To: gpfsug-discuss at gpfsug.org
Subject: Re: [gpfsug-discuss] Unexpected permissions with ACLs

> special:group@:rwx-:allow:DirInherit:InheritOnly
>
> According to the manual, the DirInherit:InheritOnly should guarantee
> that the entry applies only to the new subdirectories but now it is
> also affecting new files in the main dir.
> Is this an expected behavior?

From an ACL perspective, yes. "InheritOnly" indicates that this entry does not grant any permissions on the directory. It is only copied as an entry to new files or subdirectories created in this directory. So if this is the only ACL entry, there are indeed no permissions on this directory.

You can remove the "InheritOnly" bit, then this would also grant permission on the directory. Or you can add another ACL entry that grants permissions on the directory.

I may have not been clear enough, but that ACE has only been added to the previous 3 ACEs, so the complete ACL for the directory is the following:

#NFSv4 ACL
#owner:root
#group:p15875
special:owner@:rwxc:allow
special:group@:rwx-:allow
special:everyone@:----:allow
special:owner@:rwx-:allow:DirInherit:InheritOnly

So I would expect that the first 3 would still produce a 644 mode on the file. Am I wrong?

Cheers,
Ivano

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org

From ivano.talamo at psi.ch Fri Sep 15 07:08:52 2023
From: ivano.talamo at psi.ch (Talamo Ivano Giuseppe)
Date: Fri, 15 Sep 2023 06:08:52 +0000
Subject: [gpfsug-discuss] Unexpected permissions with ACLs
In-Reply-To: <9a4b37a45c0505d4ad1baceb43077bbf164f6ea8.camel@us.ibm.com>
References: <2c3f6eef655e3b53b52a0622d18520257253028b.camel@us.ibm.com> <9a4b37a45c0505d4ad1baceb43077bbf164f6ea8.camel@us.ibm.com>
Message-ID:

Thank you both, now I have a clearer picture.

Actually, the ACL and umask interaction was unclear to me.
Reading around I also found some reference in the man page of the open syscall ("in the absence of a default ACL, the mode of the created file is (mode & ~umask)").

Cheers,
Ivano

__________________________________________
Paul Scherrer Institut
Ivano Talamo
WHGA/038
Forschungsstrasse 111
5232 Villigen PSI
Schweiz

Phone: +41 56 310 47 11
E-Mail: ivano.talamo at psi.ch

________________________________
From: gpfsug-discuss on behalf of Christof Schmitt
Sent: 15 September 2023 01:44
To: gpfsug-discuss at gpfsug.org
Subject: Re: [gpfsug-discuss] Unexpected permissions with ACLs

Yes. If there is one ACL entry that is inherited to a new file/directory, then only that inherited entry will be in the ACL of the new file/directory. The special case here is now that there is one inheritable entry, but that is flagged with DirInherit, so it does not apply to new files.

Regards,
Christof

On Thu, 2023-09-14 at 21:13 +0000, Losen, Stephen C (scl) wrote:

Hi, you have specified an inherited (inherit only) ACE in the parent directory's ACL, so that disables umask. New files will only get whatever ACEs they inherit. Since the only inherited ACE pertains to directories and not files, a new file inherits no ACEs, and hence ends up with zero permissions (which I suspect is no ACEs at all in its ACL). So you need to specify some inherited ACEs for files as well as directories. Or else use FileInherit:DirInherit:InheritOnly.

If the parent directory has no inherited ACEs in its ACL then the permissions on new files/directories are controlled by umask.

Steve Losen
Research Computing
University of Virginia
scl at virginia.edu  434-924-0640

From: gpfsug-discuss on behalf of Talamo Ivano Giuseppe
Reply-To: gpfsug main discussion list
Date: Thursday, September 14, 2023 at 3:19 PM
To: "gpfsug-discuss at gpfsug.org"
Subject: Re: [gpfsug-discuss] Unexpected permissions with ACLs

________________________________
From: gpfsug-discuss on behalf of Christof Schmitt
Sent: 14 September 2023 18:02
To: gpfsug-discuss at gpfsug.org
Subject: Re: [gpfsug-discuss] Unexpected permissions with ACLs

> special:group@:rwx-:allow:DirInherit:InheritOnly
>
> According to the manual, the DirInherit:InheritOnly should guarantee
> that the entry applies only to the new subdirectories but now it is
> also affecting new files in the main dir.
> Is this an expected behavior?

From an ACL perspective, yes. "InheritOnly" indicates that this entry does not grant any permissions on the directory. It is only copied as an entry to new files or subdirectories created in this directory. So if this is the only ACL entry, there are indeed no permissions on this directory.

You can remove the "InheritOnly" bit, then this would also grant permission on the directory.
I may have not been clear enough, but that ACE has only been added to the previous 3 ACEs, so the complete ACL for the directory is the following:

#NFSv4 ACL
#owner:root
#group:p15875
special:owner@:rwxc:allow
special:group@:rwx-:allow
special:everyone@:----:allow
special:owner@:rwx-:allow:DirInherit:InheritOnly

So I would expect that the first 3 would still produce a 644 mode on the file. Am I wrong?

Cheers,
Ivano

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org

From stefan.dietrich at desy.de Fri Sep 15 07:32:23 2023
From: stefan.dietrich at desy.de (Dietrich, Stefan)
Date: Fri, 15 Sep 2023 08:32:23 +0200 (CEST)
Subject: [gpfsug-discuss] Unexpected permissions with ACLs
In-Reply-To:
References: <2c3f6eef655e3b53b52a0622d18520257253028b.camel@us.ibm.com> <9a4b37a45c0505d4ad1baceb43077bbf164f6ea8.camel@us.ibm.com>
Message-ID: <1213633676.11756719.1694759543456.JavaMail.zimbra@desy.de>

Hi,

since Scale 5.1.3, there's also a new parameter for filesets to control the behaviour of umask & ACLs:

man mmchfileset|mmcrfileset

--allow-permission-inherit PermissionInheritMode
  inheritAclOnly
    Specifies that permissions of a newly created object will be inherited only from its parent's NFSv4 ACL.
  inheritAclAndAddMode
    Specifies that permissions of a newly created object will be inherited from its parent's NFSv4 ACL and the special entries OWNER, GROUP, and EVERYONE will use the provided mode and the umask from the open() or creat() call to set the new mode permissions.

With the default inheritAclOnly, umask is ignored and only the inherited ACLs will be applied to newly created files. With inheritAclAndAddMode the umask is used in addition to inherited ACLs for new files.

Regards,
Stefan

----- Original Message -----
> From: "Talamo Ivano Giuseppe"
> To: gpfsug-discuss at gpfsug.org
> Sent: Friday, September 15, 2023 8:08:52 AM
> Subject: Re: [gpfsug-discuss] Unexpected permissions with ACLs

> Thank you both, now I have a clearer picture.
> Actually, the ACL and umask interaction was unclear to me. Reading around I also
> found some reference in the man page of the open syscall ("in the absence of a
> default ACL, the mode of the created file is (mode & ~umask)").
>
> Cheers,
> Ivano
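For completeness, a sketch of how that could be enabled on an existing fileset (the file system name "gpfs0" and fileset name "somefileset" are placeholders):

  mmchfileset gpfs0 somefileset --allow-permission-inherit inheritAclAndAddMode

With that mode, per the excerpt above, the mode and umask from the creating open()/creat() call are applied to the special OWNER/GROUP/EVERYONE entries in addition to the inherited ACL entries.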
--
------------------------------------------------------------------------
Stefan Dietrich
Deutsches Elektronen-Synchrotron (IT-Systems)
Ein Forschungszentrum der Helmholtz-Gemeinschaft
Notkestr. 85
22607 Hamburg                    phone: +49-40-8998-4696
Germany                          e-mail: stefan.dietrich at desy.de
------------------------------------------------------------------------

From p.childs at qmul.ac.uk Fri Sep 22 12:07:14 2023
From: p.childs at qmul.ac.uk (Peter Childs)
Date: Fri, 22 Sep 2023 11:07:14 +0000
Subject: [gpfsug-discuss] Job Advert: Queen Mary University Of London
Message-ID:

We have a job opportunity here at Queen Mary University of London.

https://www.jobs.ac.uk/job/DCX262/research-storage-officer
https://www.qmul.ac.uk/jobs/vacancies/items/8941.html

The job is mainly working with Storage Scale and Storage Protect, hence why I'm posting it on the list.

Thanks

Peter Childs
ITS Research Storage

From BBrown at Jeskell.com Fri Sep 22 20:29:42 2023
From: BBrown at Jeskell.com (Bruce Brown)
Date: Fri, 22 Sep 2023 19:29:42 +0000
Subject: [gpfsug-discuss] Welcome to the "gpfsug-discuss" mailing list
In-Reply-To:
References:
Message-ID:

Hello Everyone,

I'm Bruce Brown, just joined the gpfsug. I install and support GPFS on a variety of platforms, but mostly IBM ESS appliances lately. I work for Jeskell Systems, LLC. We are an integrator and IBM reseller mostly focused on the US Gov market. I live in Upstate NY and actually enjoy all 4 seasons, including winter.

Thank you and have a great weekend!

Bruce

From: gpfsug-discuss On Behalf Of gpfsug-discuss-request at gpfsug.org
Sent: Friday, September 22, 2023 2:52 PM
To: Bruce Brown
Subject: Welcome to the "gpfsug-discuss" mailing list

Welcome to the gpfsug-discuss at gpfsug.org mailing list!

Hello and welcome. Please introduce yourself to the members with your first post. A quick hello with an overview of how you use GPFS, your company name, market sector and any other interesting information would be most welcomed. Please let us know which City and Country you live in. Many thanks.

GPFS UG Chair
From novosirj at rutgers.edu Fri Sep 22 21:13:49 2023
From: novosirj at rutgers.edu (Ryan Novosielski)
Date: Fri, 22 Sep 2023 20:13:49 +0000
Subject: [gpfsug-discuss] Welcome to the "gpfsug-discuss" mailing list
In-Reply-To:
References:
Message-ID: <46CAB64D-8283-4FB9-B1FC-501A416C9599@rutgers.edu>

Me too! Winter gets short shrift, including even almost no longer happening in my area (Newark, NJ).

Welcome!

--
#BlackLivesMatter
 ____
|| \\UTGERS,     |---------------------------*O*---------------------------
||_// the State  | Ryan Novosielski - novosirj at rutgers.edu
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\    of NJ  | Office of Advanced Research Computing - MSB A555B, Newark
     `'

On Sep 22, 2023, at 15:29, Bruce Brown wrote:

Hello Everyone,

I'm Bruce Brown, just joined the gpfsug. I install and support GPFS on a variety of platforms, but mostly IBM ESS appliances lately. I work for Jeskell Systems, LLC. We are an integrator and IBM reseller mostly focused on the US Gov market. I live in Upstate NY and actually enjoy all 4 seasons, including winter.

Thank you and have a great weekend!

Bruce

From: gpfsug-discuss On Behalf Of gpfsug-discuss-request at gpfsug.org
Sent: Friday, September 22, 2023 2:52 PM
To: Bruce Brown
Subject: Welcome to the "gpfsug-discuss" mailing list

Welcome to the gpfsug-discuss at gpfsug.org mailing list!

Hello and welcome. Please introduce yourself to the members with your first post. A quick hello with an overview of how you use GPFS, your company name, market sector and any other interesting information would be most welcomed. Please let us know which City and Country you live in. Many thanks.
GPFS UG Chair

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org

From jpr9c at virginia.edu Tue Sep 26 14:40:01 2023
From: jpr9c at virginia.edu (Ruffner, Scott (jpr9c))
Date: Tue, 26 Sep 2023 13:40:01 +0000
Subject: [gpfsug-discuss] Remove simulatedDead/simulatedFailing/deleting/draining pdisks
Message-ID:

I have a couple of disks I needed to replace - in order to do so, I wound up attempting a forced drain. I was able to replace the disks, and the new ones are in service; however, I'm left with two "ghost" entries for the old disks:

                                declustered
recovery group   pdisk          array         paths   capacity   free space   FRU (type)   state
--------------   ------------   -----------   -----   --------   ----------   ----------   ------------------------------------------------
rg_gpfs-io1-ic   e3s090#0009    DA1           0       12 TiB     3936 GiB     01LU841      simulatedDead/simulatedFailing/deleting/draining
rg_gpfs-io1-ic   e5s017         DA1           0       12 TiB     1608 GiB     01LU841      missing/draining
rg_gpfs-io1-ic   e5s053#0016    DA1           0       12 TiB     2856 GiB     01LU841      simulatedDead/simulatedFailing/deleting/draining

Just listing for this disk:

rg_gpfs-io1-ic   e3s090         DA1           2       12 TiB     4704 GiB     01LU841      ok
rg_gpfs-io1-ic   e3s090#0009    DA1           0       12 TiB     3936 GiB     01LU841      simulatedDead/simulatedFailing/deleting/draining

Any idea how I finish removing these?