From peter.chase at metoffice.gov.uk Wed Aug 2 10:09:49 2023
From: peter.chase at metoffice.gov.uk (Peter Chase)
Date: Wed, 2 Aug 2023 09:09:49 +0000
Subject: [gpfsug-discuss] Inode size, and system pool subblock
Message-ID:

Good Morning,

I have a question about inode size vs subblock size. Can anyone think of a reason that the chosen inode size of a scale filesystem should be smaller than the subblock size for the metadata pool?
I'm looking at an existing filesystem, the inode size is 2KiB, and the subblock is 4KiB.
It feels like I'm missing something. If I've understood the docs on blocks and subblocks correctly, it sounds like the subblock is the smallest atomic access size. Meaning with a 4K subblock, and a 2K inode, reading the inode would return its contents and 2K of empty subblock every time. So, in my head (and maybe only there), having a smaller inode size than the subblock size means there's a big wastage on disk usage, with no performance benefit to doing so.
I believe I'm correct in saying that inodes are not the only things to live on the metadata pool, so I assume that some other metadata might benefit from the larger block/subblock size. But looking at the number of inodes, the inode size, and the space consumed in the system pool, it really looks like the majority of space consumed is by inodes.

As I said, I feel like I'm missing something, so if anyone can tell me where I'm wrong it would be greatly appreciated!

Sincerely,

Pete Chase
UKMO
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From olaf.weiser at de.ibm.com Wed Aug 2 12:42:46 2023
From: olaf.weiser at de.ibm.com (Olaf Weiser)
Date: Wed, 2 Aug 2023 11:42:46 +0000
Subject: [gpfsug-discuss] Inode size, and system pool subblock
In-Reply-To: References: Message-ID:

Hallo Peter,

[1] [...] having a smaller inode size than the subblock size means there's a big wastage on disk usage, with no performance benefit to doing so[...]
in short - yes.

[2] [...] I believe I'm correct in saying that inodes are not the only things to live on the metadata pool, so I assume that some other metadata might benefit from the larger block/subblock size. But looking at the number of inodes, the inode size, and the space consumed in the system pool, it really looks like the majority of space consumed is by inodes.[...]
you may need to consider snapshots and directories, which all contribute to MD space.

predicting the space requirements of MD for directories is always hard, because the size of a directory depends on the length of the file names the users will create...

furthermore, using a less-than-4k inode size also does not make much sense when taking into account that NVMe and other modern block storage devices come with a hardware block size of 4k (even though GPFS can still deal with 512 bytes per sector).

hope this helps ..

________________________________
From: gpfsug-discuss on behalf of Peter Chase
Sent: Wednesday, 2 August 2023 11:09
To: gpfsug-discuss at gpfsug.org
Subject: [EXTERNAL] [gpfsug-discuss] Inode size, and system pool subblock

Good Morning,

I have a question about inode size vs subblock size. Can anyone think of a reason that the chosen inode size of a scale filesystem should be smaller than the subblock size for the metadata pool?
I'm looking at an existing filesystem, the inode size is 2KiB, and the subblock is 4KiB.
It feels like I'm missing something. If I've understood the docs on blocks and subblocks correctly, it sounds like the subblock is the smallest atomic access size. Meaning with a 4K subblock, and a 2K inode, reading the inode would return its contents and 2K of empty subblock every time. So, in my head (and maybe only there), having a smaller inode size than the subblock size means there's a big wastage on disk usage, with no performance benefit to doing so.
I believe I'm correct in saying that inodes are not the only things to live on the metadata pool, so I assume that some other metadata might benefit from the larger block/subblock size. But looking at the number of inodes, the inode size, and the space consumed in the system pool, it really looks like the majority of space consumed is by inodes.

As I said, I feel like I'm missing something, so if anyone can tell me where I'm wrong it would be greatly appreciated!

Sincerely,

Pete Chase
UKMO
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
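A quick way to check the numbers under discussion on a live filesystem is mmlsfs and mmdf; a minimal sketch, with gpfs0 standing in for the real device name:

    mmlsfs gpfs0 -i     # inode size in bytes
    mmlsfs gpfs0 -B     # block size (per pool)
    mmlsfs gpfs0 -f     # minimum fragment (subblock) size in bytes
    mmdf gpfs0          # used/free space per disk and per pool, including the system (metadata) pool

Comparing the allocated inode count times the inode size against what mmdf reports for the system pool gives a rough split between inode space and the other metadata (directories, indirect blocks, allocation maps) that lives there.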
From daniel.kidger at hpe.com Wed Aug 2 12:56:32 2023
From: daniel.kidger at hpe.com (Kidger, Daniel)
Date: Wed, 2 Aug 2023 11:56:32 +0000
Subject: [gpfsug-discuss] Inode size, and system pool subblock
In-Reply-To: References: Message-ID:

Peter,

"Meaning with a 4K subblock, and a 2K inode, reading the inode would return its contents and 2K of empty subblock every time"

I believe that a 2k inode *does* save space, hence more files in the filesystem for a given size of the system metadata pool. However, with modern 4k disk block sizes, you pay the price of a performance penalty. Hence, unless space constrained, you should always use 4k inodes.

Also remember that GPFS supports Data-on-Metadata (DoM in Lustre-speak), so 4k inodes can store small files (up to c. 3k), and so save significant space in the data pools (where the subblock size is at least 8kB, and in your case probably 128kB).

Daniel Kidger
HPC Storage Solutions Architect, EMEA
daniel.kidger at hpe.com
+44 (0)7818 522266
hpe.com

From: gpfsug-discuss On Behalf Of Peter Chase
Sent: 02 August 2023 10:10
To: gpfsug-discuss at gpfsug.org
Subject: [gpfsug-discuss] Inode size, and system pool subblock

Good Morning,

I have a question about inode size vs subblock size. Can anyone think of a reason that the chosen inode size of a scale filesystem should be smaller than the subblock size for the metadata pool?
I'm looking at an existing filesystem, the inode size is 2KiB, and the subblock is 4KiB.
It feels like I'm missing something. If I've understood the docs on blocks and subblocks correctly, it sounds like the subblock is the smallest atomic access size. Meaning with a 4K subblock, and a 2K inode, reading the inode would return its contents and 2K of empty subblock every time. So, in my head (and maybe only there), having a smaller inode size than the subblock size means there's a big wastage on disk usage, with no performance benefit to doing so.
I believe I'm correct in saying that inodes are not the only things to live on the metadata pool, so I assume that some other metadata might benefit from the larger block/subblock size.
But looking at the number of inodes, the inode size, and the space consumed in the system pool, it really looks like the majority of space consumed is by inodes. As I said, I feel like I'm missing something, so if anyone can tell me where I'm wrong it would be greatly appreciated! Sincerely, Pete Chase UKMO -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 2541 bytes Desc: image001.png URL: From ewahl at osc.edu Wed Aug 2 13:55:29 2023 From: ewahl at osc.edu (Wahl, Edward) Date: Wed, 2 Aug 2023 12:55:29 +0000 Subject: [gpfsug-discuss] Inode size, and system pool subblock In-Reply-To: References: Message-ID: >>[2] [...] I believe I'm correct in saying that inodes are not the only things to live on the metadata pool, so I assume that some other metadata might benefit from the larger block/subblock size. But looking at the number of inodes, the inode size, and the space consumed in the system pool, it really looks like the majority of space consumed is by inodes.[...] >you may need to consider snapshots and directories , which all contributes to MD space >predicting the space requirements for MD for directories is always hard, because the size of a directory is depending on the file's name length, the users will create... Unless you enable encryption. In which case NO metadata will be stored on MD disks/devices. Ed Wahl Ohio Supercomputer Center From: gpfsug-discuss On Behalf Of Olaf Weiser Sent: Wednesday, August 2, 2023 7:43 AM To: gpfsug-discuss at gpfsug.org Subject: Re: [gpfsug-discuss] Inode size, and system pool subblock Hallo Peter, [1] [.?.?.?] having a smaller inode size than the subblock size means there's a big wastage on disk usage, with no performance benefit to doing so[.?.?.?] in short - yes ? [2] [.?.?.?] I believe I'm correct in saying that inodes are not Hallo Peter, [1] [...] having a smaller inode size than the subblock size means there's a big wastage on disk usage, with no performance benefit to doing so[...] in short - yes ? [2] [...] I believe I'm correct in saying that inodes are not the only things to live on the metadata pool, so I assume that some other metadata might benefit from the larger block/subblock size. But looking at the number of inodes, the inode size, and the space consumed in the system pool, it really looks like the majority of space consumed is by inodes.[...] you may need to consider snapshots and directories , which all contributes to MD space predicting the space requirements for MD for directories is always hard, because the size of a directory is depending on the file's name length, the users will create... further more, using a less than 4k inode size makes also not much sense, when taking into account, that NVMEs and other modern block storage devices comes with a hardware block size of 4k (even though GPFS still can deal with 512 Bytes per sector) hope this helps .. ________________________________ Von: gpfsug-discuss > im Auftrag von Peter Chase > Gesendet: Mittwoch, 2. August 2023 11:09 An: gpfsug-discuss at gpfsug.org > Betreff: [EXTERNAL] [gpfsug-discuss] Inode size, and system pool subblock Good Morning, I have a question about inode size vs subblock size. Can anyone think of a reason that the chosen inode size of a scale filesystem should be smaller than the subblock size for the metadata pool? 
I'm looking at an existing filesystem, Good Morning, I have a question about inode size vs subblock size. Can anyone think of a reason that the chosen inode size of a scale filesystem should be smaller than the subblock size for the metadata pool? I'm looking at an existing filesystem, the inode size is 2KiB, and the subblock is 4KiB. It feels like I'm missing something. If I've understood the docs on blocks and subblocks correctly, it sounds like the subblock is the smallest atomic access size. Meaning with a 4K subblock, and a 2K inode, reading the inode would return its contents and 2K of empty subblock every time. So, in my head (and maybe only there), having a smaller inode size than the subblock size means there's a big wastage on disk usage, with no performance benefit to doing so. I believe I'm correct in saying that inodes are not the only things to live on the metadata pool, so I assume that some other metadata might benefit from the larger block/subblock size. But looking at the number of inodes, the inode size, and the space consumed in the system pool, it really looks like the majority of space consumed is by inodes. As I said, I feel like I'm missing something, so if anyone can tell me where I'm wrong it would be greatly appreciated! Sincerely, Pete Chase UKMO -------------- next part -------------- An HTML attachment was scrubbed... URL: From jan.heichler at gmx.net Wed Aug 2 14:09:29 2023 From: jan.heichler at gmx.net (Jan Heichler) Date: Wed, 2 Aug 2023 15:09:29 +0200 Subject: [gpfsug-discuss] Inode size, and system pool subblock In-Reply-To: References: Message-ID: <4FB56CDF-8532-4312-A64E-C4E12D63DEB7@gmx.net> > Am 02.08.2023 um 13:42 schrieb Olaf Weiser : > > Hallo Peter, > > [1] [...] having a smaller inode size than the subblock size means there's a big wastage on disk usage, with no performance benefit to doing so[...] > in short - yes ? > The expectation that there is a waste in space seems to come from the idea that inodes are stored as individual files - which then can?t be smaller than a subblock. Referring to: https://www.ibm.com/blogs/digitale-perspektive/wp-content/uploads/2020/03/04-SSSD20-SpectrumScale-Konzepte-Teil2-032020.pdf? 04-SSSD20-SpectrumScale-Konzepte-Teil2-032020 PDF-Dokument ? 1,2 MB Slide 21: ?Held in one invisible inode file?? -> I would understand from that that 2kiB inodes are just ligned up in a single file and worst case you lose 2kiB because you don?t completely match your 4kiB inodes Jan -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Vorschau.png Type: image/png Size: 193197 bytes Desc: not available URL: From olaf.weiser at de.ibm.com Wed Aug 2 14:44:17 2023 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Wed, 2 Aug 2023 13:44:17 +0000 Subject: [gpfsug-discuss] Inode size, and system pool subblock In-Reply-To: <4FB56CDF-8532-4312-A64E-C4E12D63DEB7@gmx.net> References: <4FB56CDF-8532-4312-A64E-C4E12D63DEB7@gmx.net> Message-ID: ok.. let me give some more context.. all inodes are in a (single) inode file.. so ... depending on the blocksize .. lets say ... 1MB ... you can have 256 inodes (in case of 4K inode size) ... in one block... in Peter's case .. 1 MB block would hold 512 inodes... the total number of allocated file system blocks of the inode file is a bit more complex ... off topic here to be clear .. the waste of space does not come from having small inode size/or mismatch of subblocksize ... itself.. 
in the worst case (which is negligible) there is an unused fragment. It is more that a small file's data can't (or is less likely to be) written into the inode (data in inode) .. (( please note the good remark from Ed - that is only possible at all if there is no encryption ))

so in Peter's case .. a file that has, let's say, 2.x KB ... can't be written into the inode... and so a file system block needs to be allocated.. if it is a new file.. a full block gets allocated first, and then, on close of the file.. the size will be truncated to the next matching subblock-size boundary.

so .. performance wise.. that adds latency, and space wise.. this could be avoided .. (if the file's data fits into the inode)

to be more accurate and correct, the best answer would have been .. it depends .. on the data structure...

________________________________
From: gpfsug-discuss on behalf of Jan Heichler
Sent: Wednesday, 2 August 2023 15:09
To: gpfsug main discussion list
Subject: [EXTERNAL] Re: [gpfsug-discuss] Inode size, and system pool subblock

On 02.08.2023 at 13:42, Olaf Weiser wrote:

Hallo Peter,

[1] [...] having a smaller inode size than the subblock size means there's a big wastage on disk usage, with no performance benefit to doing so[...]
in short - yes

The expectation that there is a waste in space seems to come from the idea that inodes are stored as individual files - which then can't be smaller than a subblock.

Referring to:
[Vorschau.png]
04-SSSD20-SpectrumScale-Konzepte-Teil2-032020
PDF document, 1.2 MB

Slide 21: "Held in one invisible inode file" -> I would understand from that that 2kiB inodes are just lined up in a single file, and worst case you lose 2kiB because you don't completely match your 4kiB inodes.

Jan
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Vorschau.png Type: image/png Size: 193197 bytes Desc: Vorschau.png URL: From anacreo at gmail.com Wed Aug 2 17:07:17 2023 From: anacreo at gmail.com (Alec) Date: Wed, 2 Aug 2023 09:07:17 -0700 Subject: [gpfsug-discuss] Inode size, and system pool subblock In-Reply-To: References: Message-ID: I think things are conflated here... The inode size is really just a call on how much functionality you need in an inode. I wouldn't even think about disk block size when setting this. Essentially the smaller the inode the less space I need for metadata but also the less capacity I have in my inode. The default is 4k and if you don't change it then GPFS will put up to a 3.8k file in the inode itself vs going to an indirect disk allocation. Someone mentioned encryption will bypass this feature, but it's actually encryption that perhaps requires larger inode sizes to store all the key meta info (you can have up to 8 keys per inode I believe). So essentially it you've got a smaller inode size your directories max size will max out sooner, your ACLs could be constrained, large file names can exhaust, you may not have enough space for Encryption details. But the upshot is you need to dedicate less space to metadata and can handle more file entries. So if you've got billions of files and are managing replicas then you should consider fine tuning inode size down. You can go from 3.5% of space going to inodes to 1% if you went from 4k to 512 bytes.. but there is a reason GPFS defaults to 4k... And doesn't expand on it too much. If you've guessed wrong you're kind of hosed. None of this has to do with hardware block sizes, subblock allocation and fragment sizes. And further compounded by 4k native block sizes vs emulated 512 block size some disk hardware does. For GPFS you generally will have a very large block size 256kb or 1MB and GPFS will divide those blocks into 32 fragments. So you may have your smallest unit being a 8kb or 32kb fragment. If you have a dedicated MD pool (highly recommended) you'd definitely specify a smaller block size than 1MB (128kb = 4kb fragments). The balance you're trying to strike here is the least amount of commands to retrieve your data efficiently. Think about the roundtrip on the bus being the same for a 4kb read vs a 1mb read so try to maximize this. Generally the goal of the file system is to ensure that the excess data that is read when trying to pull fragments is as useless as possible. I may also be confused but I wouldn't worry so much about inode size to block size.. just worry about getting large blocks working well for regular storage pool if your data is huge and using a smaller block size in MD if dedicate pool which is almost always recommended. Be very careful of specifying a small inode size because it's not just max filenames and max file counts in a directory.. it is much more.. and if you have a lot of small files don't underestimate the advantage of those files being stored directly in the inode. A 512 byte inode could only store about a 380byte file vs a 4k file storing 3800 byte file. These files tend to be shell scripts and config files which you really don't want to be waiting around for and occupying a huge 1mb read for and waisting a potentially larger 64kb fragment allocation on. Alec On Wed, Aug 2, 2023, 4:47 AM Olaf Weiser wrote: > Hallo Peter, > > [1] *[...] having a smaller inode size than the subblock size means* > * there's a big wastage on disk usage, with no performance benefit to > doing so[...] * > in short - yes ? > > > > [2] > *[...] 
I believe I'm correct in saying that inodes are not the only > things to live on the metadata pool, so I assume that some other metadata > might benefit from the larger block/subblock size. But looking at the > number of inodes, the inode size, and the space consumed in the system > pool, it really looks like the majority of space consumed is by > inodes.[...] * > you may need to consider snapshots and directories , which all contributes > to MD space > > predicting the space requirements for MD for directories is always hard, > because the size of a directory is depending on the file's name length, > the users will create... > > > further more, using a less than 4k inode size makes also not much sense, > when taking into account, that NVMEs and other modern block storage devices > comes with a hardware block size of 4k (even though GPFS still can deal > with 512 Bytes per sector) > > > hope this helps .. > > > > > > ------------------------------ > *Von:* gpfsug-discuss im Auftrag von > Peter Chase > *Gesendet:* Mittwoch, 2. August 2023 11:09 > *An:* gpfsug-discuss at gpfsug.org > *Betreff:* [EXTERNAL] [gpfsug-discuss] Inode size, and system pool > subblock > > Good Morning, I have a question about inode size vs subblock size. Can > anyone think of a reason that the chosen inode size of a scale filesystem > should be smaller than the subblock size for the metadata pool? I'm looking > at an existing filesystem, > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > Report Suspicious > > > > ZjQcmQRYFpfptBannerEnd > Good Morning, > > I have a question about inode size vs subblock size. Can anyone think of a > reason that the chosen inode size of a scale filesystem should be smaller > than the subblock size for the metadata pool? > I'm looking at an existing filesystem, the inode size is 2KiB, and the > subblock is 4KiB. > It feels like I'm missing something. If I've understood the docs on > blocks and subblocks correctly, it sounds like the subblock is the smallest > atomic access size. Meaning with a 4K subblock, and a 2K inode, reading the > inode would return its contents and 2K of empty subblock every time. So, in > my head (and maybe only there), having a smaller inode size than the > subblock size means there's a big wastage on disk usage, with no > performance benefit to doing so. > I believe I'm correct in saying that inodes are not the only things to > live on the metadata pool, so I assume that some other metadata might > benefit from the larger block/subblock size. But looking at the number of > inodes, the inode size, and the space consumed in the system pool, it > really looks like the majority of space consumed is by inodes. > > As I said, I feel like I'm missing something, so if anyone can tell me > where I'm wrong it would be greatly appreciated! > > Sincerely, > > Pete Chase > > UKMO > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ewahl at osc.edu Wed Aug 2 17:29:38 2023 From: ewahl at osc.edu (Wahl, Edward) Date: Wed, 2 Aug 2023 16:29:38 +0000 Subject: [gpfsug-discuss] Inode size, and system pool subblock In-Reply-To: References: Message-ID: >Someone mentioned encryption will bypass this feature, but it's actually encryption that perhaps requires larger inode sizes to store all the key meta info (you can have up to 8 keys per inode I believe). I believe that is incorrect. If encryption is used, the size of the inode makes no difference. This is due to the fact that Only data, NOT metadata is encrypted on the file system. So storing blocks in MD spaces is out. See the Scale documentation, and older GPFS documentation, for more information. (such as Encryption - IBM Documentation ) Until such time as they start encrypting the metadata, it?s pointless to size MD for small files. Ed Wahl Ohio Supercomputer Center From: gpfsug-discuss On Behalf Of Alec Sent: Wednesday, August 2, 2023 12:07 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Inode size, and system pool subblock I think things are conflated here.?.?. The inode size is really just a call on how much functionality you need in an inode. I wouldn't even think about disk block size when setting this. Essentially the smaller the inode the less space I I think things are conflated here... The inode size is really just a call on how much functionality you need in an inode. I wouldn't even think about disk block size when setting this. Essentially the smaller the inode the less space I need for metadata but also the less capacity I have in my inode. The default is 4k and if you don't change it then GPFS will put up to a 3.8k file in the inode itself vs going to an indirect disk allocation. Someone mentioned encryption will bypass this feature, but it's actually encryption that perhaps requires larger inode sizes to store all the key meta info (you can have up to 8 keys per inode I believe). So essentially it you've got a smaller inode size your directories max size will max out sooner, your ACLs could be constrained, large file names can exhaust, you may not have enough space for Encryption details. But the upshot is you need to dedicate less space to metadata and can handle more file entries. So if you've got billions of files and are managing replicas then you should consider fine tuning inode size down. You can go from 3.5% of space going to inodes to 1% if you went from 4k to 512 bytes.. but there is a reason GPFS defaults to 4k... And doesn't expand on it too much. If you've guessed wrong you're kind of hosed. None of this has to do with hardware block sizes, subblock allocation and fragment sizes. And further compounded by 4k native block sizes vs emulated 512 block size some disk hardware does. For GPFS you generally will have a very large block size 256kb or 1MB and GPFS will divide those blocks into 32 fragments. So you may have your smallest unit being a 8kb or 32kb fragment. If you have a dedicated MD pool (highly recommended) you'd definitely specify a smaller block size than 1MB (128kb = 4kb fragments). The balance you're trying to strike here is the least amount of commands to retrieve your data efficiently. Think about the roundtrip on the bus being the same for a 4kb read vs a 1mb read so try to maximize this. Generally the goal of the file system is to ensure that the excess data that is read when trying to pull fragments is as useless as possible. 
I may also be confused but I wouldn't worry so much about inode size to block size.. just worry about getting large blocks working well for regular storage pool if your data is huge and using a smaller block size in MD if dedicate pool which is almost always recommended. Be very careful of specifying a small inode size because it's not just max filenames and max file counts in a directory.. it is much more.. and if you have a lot of small files don't underestimate the advantage of those files being stored directly in the inode. A 512 byte inode could only store about a 380byte file vs a 4k file storing 3800 byte file. These files tend to be shell scripts and config files which you really don't want to be waiting around for and occupying a huge 1mb read for and waisting a potentially larger 64kb fragment allocation on. Alec On Wed, Aug 2, 2023, 4:47 AM Olaf Weiser > wrote: Hallo Peter, [1] [...] having a smaller inode size than the subblock size means there's a big wastage on disk usage, with no performance benefit to doing so[...] in short - yes ? [2] [...] I believe I'm correct in saying that inodes are not the only things to live on the metadata pool, so I assume that some other metadata might benefit from the larger block/subblock size. But looking at the number of inodes, the inode size, and the space consumed in the system pool, it really looks like the majority of space consumed is by inodes.[...] you may need to consider snapshots and directories , which all contributes to MD space predicting the space requirements for MD for directories is always hard, because the size of a directory is depending on the file's name length, the users will create... further more, using a less than 4k inode size makes also not much sense, when taking into account, that NVMEs and other modern block storage devices comes with a hardware block size of 4k (even though GPFS still can deal with 512 Bytes per sector) hope this helps .. ________________________________ Von: gpfsug-discuss > im Auftrag von Peter Chase > Gesendet: Mittwoch, 2. August 2023 11:09 An: gpfsug-discuss at gpfsug.org > Betreff: [EXTERNAL] [gpfsug-discuss] Inode size, and system pool subblock Good Morning, I have a question about inode size vs subblock size. Can anyone think of a reason that the chosen inode size of a scale filesystem should be smaller than the subblock size for the metadata pool? I'm looking at an existing filesystem, Good Morning, I have a question about inode size vs subblock size. Can anyone think of a reason that the chosen inode size of a scale filesystem should be smaller than the subblock size for the metadata pool? I'm looking at an existing filesystem, the inode size is 2KiB, and the subblock is 4KiB. It feels like I'm missing something. If I've understood the docs on blocks and subblocks correctly, it sounds like the subblock is the smallest atomic access size. Meaning with a 4K subblock, and a 2K inode, reading the inode would return its contents and 2K of empty subblock every time. So, in my head (and maybe only there), having a smaller inode size than the subblock size means there's a big wastage on disk usage, with no performance benefit to doing so. I believe I'm correct in saying that inodes are not the only things to live on the metadata pool, so I assume that some other metadata might benefit from the larger block/subblock size. But looking at the number of inodes, the inode size, and the space consumed in the system pool, it really looks like the majority of space consumed is by inodes. 
As I said, I feel like I'm missing something, so if anyone can tell me where I'm wrong it would be greatly appreciated! Sincerely, Pete Chase UKMO _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From anacreo at gmail.com Wed Aug 2 18:01:03 2023 From: anacreo at gmail.com (Alec) Date: Wed, 2 Aug 2023 10:01:03 -0700 Subject: [gpfsug-discuss] Inode size, and system pool subblock In-Reply-To: References: Message-ID: That part is true however the FEK (File Encryption Key) goes into the inode and they can be very large, and you can have up to 8 of them.. so may be good having one or 2 FEKs but if you go to rotate FEK's, and need an extra 2 to handle the change you could run out of room. In our FPO we don't encrypt .ksh, .sh, and other source type files. We have another policy in place that reencrypts unencrypted files that are larger 1mb by creating them under a different file name extension and then move them back over. That's how we get around this limitation. We explain that if a file is less than 1mb it shouldn't have data we are worried about encryption. Alec On Wed, Aug 2, 2023, 9:34 AM Wahl, Edward wrote: > > > >Someone mentioned encryption will bypass this feature, but it's actually > encryption that perhaps requires larger inode sizes to store all the key > meta info (you can have up to 8 keys per inode I believe). > > > > I believe that is incorrect. If encryption is used, the size of the inode > makes no difference. This is due to the fact that Only data, NOT metadata > is encrypted on the file system. So storing blocks in MD spaces is out. > See the Scale documentation, and older GPFS documentation, for more > information. (such as Encryption - IBM Documentation > > ) Until such time as they start encrypting the metadata, it?s pointless to > size MD for small files. > > > > Ed Wahl > > Ohio Supercomputer Center > > > > *From:* gpfsug-discuss *On Behalf Of * > Alec > *Sent:* Wednesday, August 2, 2023 12:07 PM > *To:* gpfsug main discussion list > *Subject:* Re: [gpfsug-discuss] Inode size, and system pool subblock > > > > I think things are conflated here. . . The inode size is really just a > call on how much functionality you need in an inode. I wouldn't even think > about disk block size when setting this. Essentially the smaller the inode > the less space I > > I think things are conflated here... > > > > The inode size is really just a call on how much functionality you need in > an inode. I wouldn't even think about disk block size when setting this. > Essentially the smaller the inode the less space I need for metadata but > also the less capacity I have in my inode. > > > > The default is 4k and if you don't change it then GPFS will put up to a > 3.8k file in the inode itself vs going to an indirect disk allocation. > Someone mentioned encryption will bypass this feature, but it's actually > encryption that perhaps requires larger inode sizes to store all the key > meta info (you can have up to 8 keys per inode I believe). > > > > So essentially it you've got a smaller inode size your directories max > size will max out sooner, your ACLs could be constrained, large file names > can exhaust, you may not have enough space for Encryption details. But the > upshot is you need to dedicate less space to metadata and can handle more > file entries. 
So if you've got billions of files and are managing replicas > then you should consider fine tuning inode size down. > > > > You can go from 3.5% of space going to inodes to 1% if you went from 4k to > 512 bytes.. but there is a reason GPFS defaults to 4k... And doesn't expand > on it too much. If you've guessed wrong you're kind of hosed. > > > > None of this has to do with hardware block sizes, subblock allocation and > fragment sizes. And further compounded by 4k native block sizes vs > emulated 512 block size some disk hardware does. > > > > For GPFS you generally will have a very large block size 256kb or 1MB and > GPFS will divide those blocks into 32 fragments. So you may have your > smallest unit being a 8kb or 32kb fragment. If you have a dedicated MD > pool (highly recommended) you'd definitely specify a smaller block size > than 1MB (128kb = 4kb fragments). > > > > The balance you're trying to strike here is the least amount of commands > to retrieve your data efficiently. Think about the roundtrip on the bus > being the same for a 4kb read vs a 1mb read so try to maximize this. > > > > Generally the goal of the file system is to ensure that the excess data > that is read when trying to pull fragments is as useless as possible. > > > > I may also be confused but I wouldn't worry so much about inode size to > block size.. just worry about getting large blocks working well for regular > storage pool if your data is huge and using a smaller block size in MD if > dedicate pool which is almost always recommended. > > > > Be very careful of specifying a small inode size because it's not just max > filenames and max file counts in a directory.. it is much more.. and if you > have a lot of small files don't underestimate the advantage of those files > being stored directly in the inode. A 512 byte inode could only store > about a 380byte file vs a 4k file storing 3800 byte file. These files tend > to be shell scripts and config files which you really don't want to be > waiting around for and occupying a huge 1mb read for and waisting a > potentially larger 64kb fragment allocation on. > > > > Alec > > > > > > > > On Wed, Aug 2, 2023, 4:47 AM Olaf Weiser wrote: > > Hallo Peter, > > > > [1] *[...] having a smaller inode size than the subblock size means** there's > a big wastage on disk usage, with no performance benefit to doing so[...] * > > in short - yes ? > > > > > > > > [2] *[...] I believe I'm correct in saying that inodes are not the only > things to live on the metadata pool, so I assume that some other metadata > might benefit from the larger block/subblock size. But looking at the > number of inodes, the inode size, and the space consumed in the system > pool, it really looks like the majority of space consumed is by > inodes.[...] * > > you may need to consider snapshots and directories , which all contributes > to MD space > > > > predicting the space requirements for MD for directories is always hard, > because the size of a directory is depending on the file's name length, > the users will create... > > > > > > further more, using a less than 4k inode size makes also not much sense, > when taking into account, that NVMEs and other modern block storage devices > comes with a hardware block size of 4k (even though GPFS still can deal > with 512 Bytes per sector) > > > > > > hope this helps .. > > > > > > > > > ------------------------------ > > *Von:* gpfsug-discuss im Auftrag von > Peter Chase > *Gesendet:* Mittwoch, 2. 
August 2023 11:09 > *An:* gpfsug-discuss at gpfsug.org > *Betreff:* [EXTERNAL] [gpfsug-discuss] Inode size, and system pool > subblock > > > > Good Morning, I have a question about inode size vs subblock size. Can > anyone think of a reason that the chosen inode size of a scale filesystem > should be smaller than the subblock size for the metadata pool? I'm looking > at an existing filesystem, > > Good Morning, > > > > I have a question about inode size vs subblock size. Can anyone think of a > reason that the chosen inode size of a scale filesystem should be smaller > than the subblock size for the metadata pool? > > I'm looking at an existing filesystem, the inode size is 2KiB, and the > subblock is 4KiB. > > It feels like I'm missing something. If I've understood the docs on blocks > and subblocks correctly, it sounds like the subblock is the smallest atomic > access size. Meaning with a 4K subblock, and a 2K inode, reading the inode > would return its contents and 2K of empty subblock every time. So, in my > head (and maybe only there), having a smaller inode size than the > subblock size means there's a big wastage on disk usage, with no > performance benefit to doing so. > > I believe I'm correct in saying that inodes are not the only things to > live on the metadata pool, so I assume that some other metadata might > benefit from the larger block/subblock size. But looking at the number of > inodes, the inode size, and the space consumed in the system pool, it > really looks like the majority of space consumed is by inodes. > > > > As I said, I feel like I'm missing something, so if anyone can tell me > where I'm wrong it would be greatly appreciated! > > > > Sincerely, > > > > Pete Chase > > UKMO > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From TROPPENS at de.ibm.com Mon Aug 14 13:10:08 2023 From: TROPPENS at de.ibm.com (Ulf Troppens) Date: Mon, 14 Aug 2023 12:10:08 +0000 Subject: [gpfsug-discuss] Save the date - Storage Scale User Meeting @ SC23 Message-ID: Greetings, IBM is organizing a Storage Scale User Meeting at SC23. We have an exciting agenda covering user stories, roadmap updates, insights into potential future product enhancements, plus access to IBM experts and your peers. We look forward to welcoming you to this event. The user meeting is followed by a Get Together to continue the discussion. Sunday, November 12th, 2023 - 12:00-18:00 Westin Denver Downtown Detailed agenda and registration link will be shared later on the event page: https://www.spectrumscaleug.org/event/storage-scale-user-meeting-sc23/ As always we are looking for customer and partner talks to share your experience. Please drop me a mail, if you are interested to speak. Best, Ulf Ulf Troppens Product Manager - IBM Storage for Data and AI, Data-Intensive Workflows IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Gregor Pillen / Gesch?ftsf?hrung: David Faller Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From amjadcsu at gmail.com Mon Aug 14 16:45:56 2023 From: amjadcsu at gmail.com (Amjad Syed) Date: Mon, 14 Aug 2023 16:45:56 +0100 Subject: [gpfsug-discuss] Vmtouch on GPFS is supported? Message-ID: Hi We are using GPFS to store a particular software GUI product that is accessed over VPN and nomachine software. It takes more then 1 min to load this software. We were planning to use vmtouch daemon to see if it can reduce loading time of this software/ https://github.com/hoytech/vmtouch Just wanted to check if any one used this and got some thoughts Amjad -------------- next part -------------- An HTML attachment was scrubbed... URL: From uwe.falke at kit.edu Tue Aug 15 07:25:15 2023 From: uwe.falke at kit.edu (Uwe Falke) Date: Tue, 15 Aug 2023 08:25:15 +0200 Subject: [gpfsug-discuss] Vmtouch on GPFS is supported? In-Reply-To: References: Message-ID: Hi, Amjad, vmtouch uses the OS filesystem caches, but GPFS uses its own caching (pagepool). I suppose, vmtouch won't help here. Uwe On 14.08.23 17:45, Amjad Syed wrote: > Hi > > We are using GPFS to store a particular software GUI product that is > accessed over VPN and nomachine software. > > It takes more then 1 min to load this software. We were planning to > use vmtouch daemon to see if it can reduce loading time of this software/ > https://github.com/hoytech/vmtouch > Just wanted to check if any one used this and got some thoughts > > Amjad > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org -- Karlsruhe Institute of Technology (KIT) Steinbuch Centre for Computing (SCC) Scientific Data Management (SDM) Uwe Falke Hermann-von-Helmholtz-Platz 1, Building 442, Room 187 D-76344 Eggenstein-Leopoldshafen Tel: +49 721 608 28024 Email: uwe.falke at kit.edu www.scc.kit.edu Registered office: Kaiserstra?e 12, 76131 Karlsruhe, Germany KIT ? The Research University in the Helmholtz Association -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5814 bytes Desc: S/MIME Cryptographic Signature URL: From uwe.falke at kit.edu Tue Aug 15 07:32:06 2023 From: uwe.falke at kit.edu (Uwe Falke) Date: Tue, 15 Aug 2023 08:32:06 +0200 Subject: [gpfsug-discuss] Vmtouch on GPFS is supported? In-Reply-To: References: Message-ID: <82878d70-bdd1-eccc-2328-95a801bcaf86@kit.edu> second point: while there is probably no vmtouch4gpfs, you might check and tune your gpfs parameters (pagepool size, maxFilesToCache).? But first you should identify where the bottleneck is. Is your GPFS cluster spanning the VPN? Suppose not. So how do you know that it is really GPFS which is delaying your loading? nomachine is an remote desktop app, how do you load software efficiently through nomachine? Uwe On 14.08.23 17:45, Amjad Syed wrote: > Hi > > We are using GPFS to store a particular software GUI product that is > accessed over VPN and nomachine software. > > It takes more then 1 min to load this software. 
We were planning to > use vmtouch daemon to see if it can reduce loading time of this software/ > https://github.com/hoytech/vmtouch > Just wanted to check if any one used this and got some thoughts > > Amjad > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org -- Karlsruhe Institute of Technology (KIT) Steinbuch Centre for Computing (SCC) Scientific Data Management (SDM) Uwe Falke Hermann-von-Helmholtz-Platz 1, Building 442, Room 187 D-76344 Eggenstein-Leopoldshafen Tel: +49 721 608 28024 Email: uwe.falke at kit.edu www.scc.kit.edu Registered office: Kaiserstra?e 12, 76131 Karlsruhe, Germany KIT ? The Research University in the Helmholtz Association -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5814 bytes Desc: S/MIME Cryptographic Signature URL: From jonathan.buzzard at strath.ac.uk Tue Aug 15 08:26:35 2023 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Tue, 15 Aug 2023 08:26:35 +0100 Subject: [gpfsug-discuss] Vmtouch on GPFS is supported? In-Reply-To: <82878d70-bdd1-eccc-2328-95a801bcaf86@kit.edu> References: <82878d70-bdd1-eccc-2328-95a801bcaf86@kit.edu> Message-ID: On 15/08/2023 07:32, Uwe Falke wrote: > > second point: > > while there is probably no vmtouch4gpfs, you might check and tune your > gpfs parameters (pagepool size, maxFilesToCache).? But first you should > identify where the bottleneck is. Is your GPFS cluster spanning the VPN? > Suppose not. So how do you know that it is really GPFS which is delaying > your loading? > > nomachine is an remote desktop app, how do you load software efficiently > through nomachine? I would say if it takes longer launching through nomachine than launching locally, then the problem is nomachine. This should IMHO be the first thing you test. If it takes a long time launching locally the application is a steaming pile and you need to resolve that first. We use thinlinc which is a similar Linux remote desktop solution extensively to provide a Linux desktop to our users with through VirtualGL 3D visualization capabilities and we do not have an issue with excessive launch times. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From uwe.falke at kit.edu Tue Aug 15 18:03:23 2023 From: uwe.falke at kit.edu (Uwe Falke) Date: Tue, 15 Aug 2023 19:03:23 +0200 Subject: [gpfsug-discuss] unlocking mmafmctl prefetch Message-ID: <64cf409d-9038-dc77-2165-942613e00411@kit.edu> Dear all, we had to kill a running mmafmctl prefetch ... including the children on another node. The processes appear all gone # mmdsh -N ALL 'ps -aef | egrep -i "(afm|prefetch)" | grep -v grep | wc -l' | sed -e 's/^[^:]*:/xxx:/' xxx:? 0 xxx:? 0 xxx:? 0 xxx:? 0 xxx:? 0 xxx:? 0 xxx:? 0 xxx:? 0 xxx:? 0 xxx:? 0 xxx:? 0 xxx:? 0 xxx:? 0 xxx:? 0 xxx:? 0 xxx:? 0 xxx:? 0 xxx:? 0 xxx:? 0 xxx:? 0 xxx:? 0 xxx:? 0 xxx:? 0 xxx:? 0 xxx:? 0 xxx:? 0 xxx:? 0 But trying to start another prefetch tells me it is locked (Cannot initiate prefetch for fileset root.? Recovery or another instance of prefetch may be in progress.). Any suggestion how to remove that? I had recycled mmfsd on the node i run the command and I moved the SG Mgr for the SG in question, had not helped. 
Thanks in advance Uwe -- Karlsruhe Institute of Technology (KIT) Steinbuch Centre for Computing (SCC) Scientific Data Management (SDM) Uwe Falke Hermann-von-Helmholtz-Platz 1, Building 442, Room 187 D-76344 Eggenstein-Leopoldshafen Tel: +49 721 608 28024 Email: uwe.falke at kit.edu www.scc.kit.edu Registered office: Kaiserstra?e 12, 76131 Karlsruhe, Germany KIT ? The Research University in the Helmholtz Association -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5814 bytes Desc: S/MIME Cryptographic Signature URL: From olaf.weiser at de.ibm.com Tue Aug 15 18:13:00 2023 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Tue, 15 Aug 2023 17:13:00 +0000 Subject: [gpfsug-discuss] unlocking mmafmctl prefetch In-Reply-To: <64cf409d-9038-dc77-2165-942613e00411@kit.edu> References: <64cf409d-9038-dc77-2165-942613e00411@kit.edu> Message-ID: Hi Uwe, does this show show smth ? mmcommon showLocks ________________________________ Von: gpfsug-discuss im Auftrag von Uwe Falke Gesendet: Dienstag, 15. August 2023 19:03 An: gpfsug-discuss at gpfsug.org Betreff: [EXTERNAL] [gpfsug-discuss] unlocking mmafmctl prefetch Dear all, we had to kill a running mmafmctl prefetch ... including the children on another node. The processes appear all gone # mmdsh -N ALL 'ps -aef | egrep -i "(afm|prefetch)" | grep -v grep | wc -l' | sed -e 's/^[^:]*:/xxx:/' xxx: 0 xxx: 0 xxx: 0 xxx: 0 xxx: 0 xxx: 0 xxx: 0 xxx: 0 xxx: 0 xxx: 0 xxx: 0 xxx: 0 xxx: 0 xxx: 0 xxx: 0 xxx: 0 xxx: 0 xxx: 0 xxx: 0 xxx: 0 xxx: 0 xxx: 0 xxx: 0 xxx: 0 xxx: 0 xxx: 0 xxx: 0 But trying to start another prefetch tells me it is locked (Cannot initiate prefetch for fileset root. Recovery or another instance of prefetch may be in progress.). Any suggestion how to remove that? I had recycled mmfsd on the node i run the command and I moved the SG Mgr for the SG in question, had not helped. Thanks in advance Uwe -- Karlsruhe Institute of Technology (KIT) Steinbuch Centre for Computing (SCC) Scientific Data Management (SDM) Uwe Falke Hermann-von-Helmholtz-Platz 1, Building 442, Room 187 D-76344 Eggenstein-Leopoldshafen Tel: +49 721 608 28024 Email: uwe.falke at kit.edu www.scc.kit.edu Registered office: Kaiserstra?e 12, 76131 Karlsruhe, Germany KIT ? The Research University in the Helmholtz Association -------------- next part -------------- An HTML attachment was scrubbed... URL: From uwe.falke at kit.edu Tue Aug 15 18:23:15 2023 From: uwe.falke at kit.edu (Uwe Falke) Date: Tue, 15 Aug 2023 19:23:15 +0200 Subject: [gpfsug-discuss] unlocking mmafmctl prefetch In-Reply-To: References: <64cf409d-9038-dc77-2165-942613e00411@kit.edu> Message-ID: Hi, Olaf, nope ... except "No lock found." Thx -- Karlsruhe Institute of Technology (KIT) Steinbuch Centre for Computing (SCC) Scientific Data Management (SDM) Uwe Falke Hermann-von-Helmholtz-Platz 1, Building 442, Room 187 D-76344 Eggenstein-Leopoldshafen Tel: +49 721 608 28024 Email: uwe.falke at kit.edu www.scc.kit.edu Registered office: Kaiserstra?e 12, 76131 Karlsruhe, Germany KIT ? The Research University in the Helmholtz Association -------------- next part -------------- A non-text attachment was scrubbed... 
Name: smime.p7s Type: application/pkcs7-signature Size: 5814 bytes Desc: S/MIME Cryptographic Signature URL: From olaf.weiser at de.ibm.com Tue Aug 15 19:07:32 2023 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Tue, 15 Aug 2023 18:07:32 +0000 Subject: [gpfsug-discuss] unlocking mmafmctl prefetch In-Reply-To: References: <64cf409d-9038-dc77-2165-942613e00411@kit.edu> Message-ID: if mmfsdm recycle didn't clean it up , then a.) report a SF ticket/open a case I expect, support 'll need some more data to analyze b.) let's meet tomorrow directly on the system and check (if so ..reach out to me directly) ________________________________ Von: gpfsug-discuss im Auftrag von Uwe Falke Gesendet: Dienstag, 15. August 2023 19:23 An: gpfsug-discuss at gpfsug.org Betreff: [EXTERNAL] Re: [gpfsug-discuss] unlocking mmafmctl prefetch Hi, Olaf, nope ... except "No lock found." Thx -- Karlsruhe Institute of Technology (KIT) Steinbuch Centre for Computing (SCC) Scientific Data Management (SDM) Uwe Falke Hermann-von-Helmholtz-Platz 1, Building 442, Room 187 D-76344 Eggenstein-Leopoldshafen Tel: +49 721 608 28024 Email: uwe.falke at kit.edu www.scc.kit.edu Registered office: Kaiserstra?e 12, 76131 Karlsruhe, Germany KIT ? The Research University in the Helmholtz Association -------------- next part -------------- An HTML attachment was scrubbed... URL: From vpuvvada at in.ibm.com Wed Aug 16 11:22:46 2023 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Wed, 16 Aug 2023 10:22:46 +0000 Subject: [gpfsug-discuss] unlocking mmafmctl prefetch In-Reply-To: References: <64cf409d-9038-dc77-2165-942613e00411@kit.edu> Message-ID: AFM prefetch is not allowed if the recovery is running. What is the caching mode? Check the fileset cache state using the command below mmafmctl device getState -j fileset You could also try commands mmafmctl stop/start. mmafmctl device stop -j fileset mmafmctl device start -j fileset ~Venkat (vpuvvada at in.ibm.com) ________________________________ From: gpfsug-discuss on behalf of Olaf Weiser Sent: Tuesday, August 15, 2023 11:37 PM To: gpfsug main discussion list Subject: [EXTERNAL] Re: [gpfsug-discuss] unlocking mmafmctl prefetch if mmfsdm recycle didn't clean it up , then a.?) report a SF ticket/open a case I expect, support 'll need some more data to analyze b.?) let's meet tomorrow directly on the system and check (if so ..?reach out to me directly) Von: gpfsug-discuss ZjQcmQRYFpfptBannerStart This Message Is From an External Sender This message came from outside your organization. Report Suspicious ZjQcmQRYFpfptBannerEnd if mmfsdm recycle didn't clean it up , then a.) report a SF ticket/open a case I expect, support 'll need some more data to analyze b.) let's meet tomorrow directly on the system and check (if so ..reach out to me directly) ________________________________ Von: gpfsug-discuss im Auftrag von Uwe Falke Gesendet: Dienstag, 15. August 2023 19:23 An: gpfsug-discuss at gpfsug.org Betreff: [EXTERNAL] Re: [gpfsug-discuss] unlocking mmafmctl prefetch Hi, Olaf, nope ... except "No lock found." Thx -- Karlsruhe Institute of Technology (KIT) Steinbuch Centre for Computing (SCC) Scientific Data Management (SDM) Uwe Falke Hermann-von-Helmholtz-Platz 1, Building 442, Room 187 D-76344 Eggenstein-Leopoldshafen Tel: +49 721 608 28024 Email: uwe.falke at kit.edu www.scc.kit.edu Registered office: Kaiserstra?e 12, 76131 Karlsruhe, Germany KIT ? The Research University in the Helmholtz Association -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From uwe.falke at kit.edu Wed Aug 16 14:46:20 2023 From: uwe.falke at kit.edu (Uwe Falke) Date: Wed, 16 Aug 2023 15:46:20 +0200 Subject: [gpfsug-discuss] unlocking mmafmctl prefetch In-Reply-To: References: <64cf409d-9038-dc77-2165-942613e00411@kit.edu> Message-ID: <8a37a76e-bf16-502d-35b4-e7f13564e810@kit.edu> Thx, Venkat, the stop / start obviously cleaned up things, prefetch is running now. Thx again Uwe On 16.08.23 12:22, Venkateswara R Puvvada wrote: > AFM prefetch is not allowed if the recovery is running.? What is the > caching mode? Check the fileset cache state using the command below > > mmafmctl device getState -j fileset > > You could also try commands mmafmctl stop/start. > > mmafmctl device stop -j fileset > mmafmctl device start -j fileset > > > ~Venkat (vpuvvada at in.ibm.com) > ------------------------------------------------------------------------ > *From:* gpfsug-discuss on behalf > of Olaf Weiser > *Sent:* Tuesday, August 15, 2023 11:37 PM > *To:* gpfsug main discussion list > *Subject:* [EXTERNAL] Re: [gpfsug-discuss] unlocking mmafmctl prefetch > if mmfsdm recycle didn't clean it up , then a.?) report a SF > ticket/open a case I expect, support 'll need some more data to > analyze b.?) let's meet tomorrow directly on the system and check (if > so ..?reach out to me directly) Von: gpfsug-discuss > ZjQcmQRYFpfptBannerStart > This Message Is From an External Sender > This message came from outside your organization. > Report?Suspicious > > ZjQcmQRYFpfptBannerEnd > if mmfsdm recycle didn't clean it up , then > a.) report a SF ticket/open a case > I expect, support 'll need some more data to analyze > b.) let's meet tomorrow directly on the system and check (if so > ..reach out to me directly) > > ------------------------------------------------------------------------ > *Von:* gpfsug-discuss im Auftrag > von Uwe Falke > *Gesendet:* Dienstag, 15. August 2023 19:23 > *An:* gpfsug-discuss at gpfsug.org > *Betreff:* [EXTERNAL] Re: [gpfsug-discuss] unlocking mmafmctl prefetch > Hi, Olaf, > > nope ... except "No lock found." > > Thx > > -- > Karlsruhe Institute of Technology (KIT) > Steinbuch Centre for Computing (SCC) > Scientific Data Management (SDM) > > Uwe Falke > > Hermann-von-Helmholtz-Platz 1, Building 442, Room 187 > D-76344 Eggenstein-Leopoldshafen > > Tel: +49 721 608 28024 > Email: uwe.falke at kit.edu > www.scc.kit.edu > > Registered office: > Kaiserstra?e 12, 76131 Karlsruhe, Germany > > KIT ? The Research University in the Helmholtz Association > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org -- Karlsruhe Institute of Technology (KIT) Steinbuch Centre for Computing (SCC) Scientific Data Management (SDM) Uwe Falke Hermann-von-Helmholtz-Platz 1, Building 442, Room 187 D-76344 Eggenstein-Leopoldshafen Tel: +49 721 608 28024 Email:uwe.falke at kit.edu www.scc.kit.edu Registered office: Kaiserstra?e 12, 76131 Karlsruhe, Germany KIT ? The Research University in the Helmholtz Association -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5814 bytes Desc: S/MIME Cryptographic Signature URL: From anacreo at gmail.com Wed Aug 16 20:28:29 2023 From: anacreo at gmail.com (Alec) Date: Wed, 16 Aug 2023 12:28:29 -0700 Subject: [gpfsug-discuss] Vmtouch on GPFS is supported? 
In-Reply-To: References: <82878d70-bdd1-eccc-2328-95a801bcaf86@kit.edu> Message-ID: You should keep in mind that GPFS primarily ships out configured for large sequential read/write as it's you know a multimedia file system... Not knowing your application, but assuming it maybe has a ton of little library files and such it's trying to keep track of I'd make sure to do the following optimizations: - Increase page pool maybe to 8gb or more. - Change maxfiles pagepool tracks from 4k to like 10k or 40k. - If you're not doing a separate meta device, i would ensure you are doing that. Then I'd also ensure it was on SSD or similar media and pinned to SSD if over SAN. For data like this on our AIX environment we will keep a jfs2 volume for it.. because it will just stay in RAM because we have 100gb free. And so for large sorts and quick load of random I/O this outperforms GPFS. GPFS secret benefit is that it prevents memory cache thrashing by not caching large data.... But this may be holding your app back. I couldn't find/remember the setting but I believe there is somewhere where it decides anything over like 1mb isn't worth caching, maybe that needs to be tuned for your instance. Alec On Tue, Aug 15, 2023, 12:29 AM Jonathan Buzzard < jonathan.buzzard at strath.ac.uk> wrote: > On 15/08/2023 07:32, Uwe Falke wrote: > > > > second point: > > > > while there is probably no vmtouch4gpfs, you might check and tune your > > gpfs parameters (pagepool size, maxFilesToCache). But first you should > > identify where the bottleneck is. Is your GPFS cluster spanning the VPN? > > Suppose not. So how do you know that it is really GPFS which is delaying > > your loading? > > > > nomachine is an remote desktop app, how do you load software efficiently > > through nomachine? > > I would say if it takes longer launching through nomachine than > launching locally, then the problem is nomachine. This should IMHO be > the first thing you test. > > If it takes a long time launching locally the application is a steaming > pile and you need to resolve that first. > > We use thinlinc which is a similar Linux remote desktop solution > extensively to provide a Linux desktop to our users with through > VirtualGL 3D visualization capabilities and we do not have an issue with > excessive launch times. > > > JAB. > > -- > Jonathan A. Buzzard Tel: +44141-5483420 > HPC System Administrator, ARCHIE-WeSt. > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From anacreo at gmail.com Wed Aug 16 20:32:34 2023 From: anacreo at gmail.com (Alec) Date: Wed, 16 Aug 2023 12:32:34 -0700 Subject: [gpfsug-discuss] RKM resilience questions testing and best practice Message-ID: Hello we are using a remote key server with GPFS I have two questions: First question: How can we verify that a key server is up and running when there are multiple key servers in an rkm pool serving a single key. The scenario is after maintenance or periodically we want to verify that all member of the pool are in service. Second question is: Is there any documentation or diagram officially from IBM that recommends having 2 keys from independent RKM environments for high availability as best practice that I could refer to? 
Alec -------------- next part -------------- An HTML attachment was scrubbed... URL: From ewahl at osc.edu Wed Aug 16 21:56:53 2023 From: ewahl at osc.edu (Wahl, Edward) Date: Wed, 16 Aug 2023 20:56:53 +0000 Subject: [gpfsug-discuss] RKM resilience questions testing and best practice In-Reply-To: References: Message-ID: > How can we verify that a key server is up and running when there are multiple key servers in an rkm pool serving a single key. Pretty simple. -Grab a compute node/client (and mark it offline if needed) unmount all encrypted File Systems. -Hack the RKM.conf to point to JUST the server you want to test (and maybe a backup) -Clear all keys: ?/usr/lpp/mmfs/bin/tsctl encKeyCachePurge all ? -Reload the RKM.conf: ?/usr/lpp/mmfs/bin/tsloadikm run? (this is a great command if you need to load new Certificates too) -Attempt to mount the encrypted FS, and then cat a few files. If you?ve not setup a 2nd server in your test you will see quarantine messages in the logs for a bad KMIP server. If it works, you can clear keys again and see how many were retrieved. >Is there any documentation or diagram officially from IBM that recommends having 2 keys from independent RKM environments for high availability as best practice that I could refer to? I am not an IBM-er? but I?m also not 100% sure what you are asking here. Two un-related SKLM setups? How would you sync the keys? How would this be better than multiple replicated servers? Ed Wahl Ohio Supercomputer Center From: gpfsug-discuss On Behalf Of Alec Sent: Wednesday, August 16, 2023 3:33 PM To: gpfsug main discussion list Subject: [gpfsug-discuss] RKM resilience questions testing and best practice Hello we are using a remote key server with GPFS I have two questions: First question: How can we verify that a key server is up and running when there are multiple key servers in an rkm pool serving a single key. The scenario is after maintenance Hello we are using a remote key server with GPFS I have two questions: First question: How can we verify that a key server is up and running when there are multiple key servers in an rkm pool serving a single key. The scenario is after maintenance or periodically we want to verify that all member of the pool are in service. Second question is: Is there any documentation or diagram officially from IBM that recommends having 2 keys from independent RKM environments for high availability as best practice that I could refer to? Alec -------------- next part -------------- An HTML attachment was scrubbed... URL: From anacreo at gmail.com Wed Aug 16 22:22:55 2023 From: anacreo at gmail.com (Alec) Date: Wed, 16 Aug 2023 14:22:55 -0700 Subject: [gpfsug-discuss] RKM resilience questions testing and best practice In-Reply-To: References: Message-ID: Ed Thanks for the response, I wasn't aware of those two commands. I will see if that unlocks a solution. I kind of need the test to work in a production environment. So can't just be adding spare nodes onto the cluster and forgetting with file systems. Unfortunately the logs don't indicate when a node has returned to health. Only that it's in trouble but as we patch often we see these regularly. For the second question, we would add a 2nd MEK key to each file so that two independent keys from two different RKM pools would be able to unlock any file. This would give us two whole independent paths to encrypt and decrypt a file. So I'm looking for a best practice example from IBM to indicate this so we don't have a dependency on a single RKM environment. 
Alec On Wed, Aug 16, 2023, 2:02 PM Wahl, Edward wrote: > > How can we verify that a key server is up and running when there are > multiple key servers in an rkm pool serving a single key. > > > > Pretty simple. > > -Grab a compute node/client (and mark it offline if needed) unmount all > encrypted File Systems. > > -Hack the RKM.conf to point to JUST the server you want to test (and maybe > a backup) > > -Clear all keys: ?/usr/lpp/mmfs/bin/tsctl encKeyCachePurge all ? > > -Reload the RKM.conf: ?/usr/lpp/mmfs/bin/tsloadikm run? (this is a > great command if you need to load new Certificates too) > > -Attempt to mount the encrypted FS, and then cat a few files. > > > > If you?ve not setup a 2nd server in your test you will see quarantine > messages in the logs for a bad KMIP server. If it works, you can clear > keys again and see how many were retrieved. > > > > >Is there any documentation or diagram officially from IBM that recommends > having 2 keys from independent RKM environments for high availability as > best practice that I could refer to? > > > > I am not an IBM-er? but I?m also not 100% sure what you are asking here. > Two un-related SKLM setups? How would you sync the keys? How would this > be better than multiple replicated servers? > > > > Ed Wahl > > Ohio Supercomputer Center > > > > *From:* gpfsug-discuss *On Behalf Of * > Alec > *Sent:* Wednesday, August 16, 2023 3:33 PM > *To:* gpfsug main discussion list > *Subject:* [gpfsug-discuss] RKM resilience questions testing and best > practice > > > > Hello we are using a remote key server with GPFS I have two questions: > First question: How can we verify that a key server is up and running when > there are multiple key servers in an rkm pool serving a single key. The > scenario is after maintenance > > Hello we are using a remote key server with GPFS I have two questions: > > > > First question: > > How can we verify that a key server is up and running when there are > multiple key servers in an rkm pool serving a single key. > > > > The scenario is after maintenance or periodically we want to verify that > all member of the pool are in service. > > > > Second question is: > > Is there any documentation or diagram officially from IBM that recommends > having 2 keys from independent RKM environments for high availability as > best practice that I could refer to? > > > > Alec > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert.stephenson at imperial.ac.uk Thu Aug 17 13:59:02 2023 From: robert.stephenson at imperial.ac.uk (Stephenson, Robert f) Date: Thu, 17 Aug 2023 12:59:02 +0000 Subject: [gpfsug-discuss] Hello from a new member Message-ID: Hi, my name is Rob Stephenson. We use GPFS for Academic group shares. OS is RHEL 8.7 and backend SAN attached storage is IBM V5000. We use TSM to backup GPFS data. We are currently migrating from and older GPFS instance to a new instance running: 5.1.6.0 We are using RSYNC to transfer the data. Organisation name: Imperial College London Sector: Education City / Country: London ; UK Regards Rob Rob Stephenson ICT Datacentre Services Imperial College London +44 (0)795 4176319 www.imperial.ac.uk/admin-services/ict/ -------------- next part -------------- An HTML attachment was scrubbed... 
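Ed's recipe above, spelled out as a rough shell sequence for one client node. The device name encfs, the test file and the RKM.conf path are placeholders (RKM.conf normally sits under /var/mmfs/etc in a regular setup), and tsctl/tsloadikm are the low-level commands Ed quotes rather than documented administration commands, so treat this as a sketch, not a supported procedure:

# unmount the encrypted file system on the test node and keep a copy of RKM.conf
mmumount encfs
cp -p /var/mmfs/etc/RKM.conf /var/mmfs/etc/RKM.conf.orig

# edit RKM.conf so only the key server under test is listed, then purge and reload
vi /var/mmfs/etc/RKM.conf
/usr/lpp/mmfs/bin/tsctl encKeyCachePurge all
/usr/lpp/mmfs/bin/tsloadikm run

# remount and force a key fetch by reading an encrypted file
mmmount encfs
cat /gpfs/encfs/testfile > /dev/null && echo "key fetched from server under test"

# put the original configuration back and reload it
cp -p /var/mmfs/etc/RKM.conf.orig /var/mmfs/etc/RKM.conf
/usr/lpp/mmfs/bin/tsloadikm run
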
URL: From janfrode at tanso.net Thu Aug 17 16:08:29 2023 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Thu, 17 Aug 2023 17:08:29 +0200 Subject: [gpfsug-discuss] RKM resilience questions testing and best practice In-Reply-To: References: Message-ID: Your second KMIP server don?t need to have an active replication relationship with the first one ? it just needs to contain the same MEK. So you could do a one time replication / copying between them, and they would not have to see each other anymore. I don?t think having them host different keys will work, as you won?t be able to fetch the second key from the one server your client is connected to, and then will be unable to encrypt with that key. >From what I?ve seen of KMIP setups with Scale, it?s a stupidly trivial service. It?s just a server that will tell you the key when asked + some access control to make sure no one else gets it. Also MEKs never changes? unless you actively change them in the file system policy, and then you could just post the new key to all/both your independent key servers when you do the change. -jf ons. 16. aug. 2023 kl. 23:25 skrev Alec : > Ed > Thanks for the response, I wasn't aware of those two commands. I will > see if that unlocks a solution. I kind of need the test to work in a > production environment. So can't just be adding spare nodes onto the > cluster and forgetting with file systems. > > Unfortunately the logs don't indicate when a node has returned to health. > Only that it's in trouble but as we patch often we see these regularly. > > > For the second question, we would add a 2nd MEK key to each file so that > two independent keys from two different RKM pools would be able to unlock > any file. This would give us two whole independent paths to encrypt and > decrypt a file. > > So I'm looking for a best practice example from IBM to indicate this so we > don't have a dependency on a single RKM environment. > > Alec > > > > On Wed, Aug 16, 2023, 2:02 PM Wahl, Edward wrote: > >> > How can we verify that a key server is up and running when there are >> multiple key servers in an rkm pool serving a single key. >> >> >> >> Pretty simple. >> >> -Grab a compute node/client (and mark it offline if needed) unmount all >> encrypted File Systems. >> >> -Hack the RKM.conf to point to JUST the server you want to test (and >> maybe a backup) >> >> -Clear all keys: ?/usr/lpp/mmfs/bin/tsctl encKeyCachePurge all ? >> >> -Reload the RKM.conf: ?/usr/lpp/mmfs/bin/tsloadikm run? (this is a >> great command if you need to load new Certificates too) >> >> -Attempt to mount the encrypted FS, and then cat a few files. >> >> >> >> If you?ve not setup a 2nd server in your test you will see quarantine >> messages in the logs for a bad KMIP server. If it works, you can clear >> keys again and see how many were retrieved. >> >> >> >> >Is there any documentation or diagram officially from IBM that >> recommends having 2 keys from independent RKM environments for high >> availability as best practice that I could refer to? >> >> >> >> I am not an IBM-er? but I?m also not 100% sure what you are asking here. >> Two un-related SKLM setups? How would you sync the keys? How would this >> be better than multiple replicated servers? 
>> >> >> >> Ed Wahl >> >> Ohio Supercomputer Center >> >> >> >> *From:* gpfsug-discuss *On Behalf Of >> *Alec >> *Sent:* Wednesday, August 16, 2023 3:33 PM >> *To:* gpfsug main discussion list >> *Subject:* [gpfsug-discuss] RKM resilience questions testing and best >> practice >> >> >> >> Hello we are using a remote key server with GPFS I have two questions: >> First question: How can we verify that a key server is up and running when >> there are multiple key servers in an rkm pool serving a single key. The >> scenario is after maintenance >> >> Hello we are using a remote key server with GPFS I have two questions: >> >> >> >> First question: >> >> How can we verify that a key server is up and running when there are >> multiple key servers in an rkm pool serving a single key. >> >> >> >> The scenario is after maintenance or periodically we want to verify that >> all member of the pool are in service. >> >> >> >> Second question is: >> >> Is there any documentation or diagram officially from IBM that recommends >> having 2 keys from independent RKM environments for high availability as >> best practice that I could refer to? >> >> >> >> Alec >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org >> > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From anacreo at gmail.com Thu Aug 17 16:52:08 2023 From: anacreo at gmail.com (Alec) Date: Thu, 17 Aug 2023 08:52:08 -0700 Subject: [gpfsug-discuss] RKM resilience questions testing and best practice In-Reply-To: References: Message-ID: Yesterday I proposed treating the replicated key servers as 2 different sets of servers. And having scale address two of the RKM servers by one rkmid/tenant/devicegrp/client name, and having a second rkmid/tenant/devicegrp/client name for the 2nd set of servers. So define the same cluster of key management servers in two separate stanzas of RKM.conf, an upper and lower half. If we do that and key management team takes one set offline, everything should work but scale would think one set of keys are offline and scream. I think we need an IBM ticket to help vet all that out. Alec On Thu, Aug 17, 2023, 8:11 AM Jan-Frode Myklebust wrote: > > Your second KMIP server don?t need to have an active replication > relationship with the first one ? it just needs to contain the same MEK. So > you could do a one time replication / copying between them, and they would > not have to see each other anymore. > > I don?t think having them host different keys will work, as you won?t be > able to fetch the second key from the one server your client is connected > to, and then will be unable to encrypt with that key. > > From what I?ve seen of KMIP setups with Scale, it?s a stupidly trivial > service. It?s just a server that will tell you the key when asked + some > access control to make sure no one else gets it. Also MEKs never changes? > unless you actively change them in the file system policy, and then you > could just post the new key to all/both your independent key servers when > you do the change. > > > -jf > > ons. 16. aug. 2023 kl. 23:25 skrev Alec : > >> Ed >> Thanks for the response, I wasn't aware of those two commands. I will >> see if that unlocks a solution. 
I kind of need the test to work in a >> production environment. So can't just be adding spare nodes onto the >> cluster and forgetting with file systems. >> >> Unfortunately the logs don't indicate when a node has returned to >> health. Only that it's in trouble but as we patch often we see these >> regularly. >> >> >> For the second question, we would add a 2nd MEK key to each file so that >> two independent keys from two different RKM pools would be able to unlock >> any file. This would give us two whole independent paths to encrypt and >> decrypt a file. >> >> So I'm looking for a best practice example from IBM to indicate this so >> we don't have a dependency on a single RKM environment. >> >> Alec >> >> >> >> On Wed, Aug 16, 2023, 2:02 PM Wahl, Edward wrote: >> >>> > How can we verify that a key server is up and running when there are >>> multiple key servers in an rkm pool serving a single key. >>> >>> >>> >>> Pretty simple. >>> >>> -Grab a compute node/client (and mark it offline if needed) unmount all >>> encrypted File Systems. >>> >>> -Hack the RKM.conf to point to JUST the server you want to test (and >>> maybe a backup) >>> >>> -Clear all keys: ?/usr/lpp/mmfs/bin/tsctl encKeyCachePurge all ? >>> >>> -Reload the RKM.conf: ?/usr/lpp/mmfs/bin/tsloadikm run? (this is a >>> great command if you need to load new Certificates too) >>> >>> -Attempt to mount the encrypted FS, and then cat a few files. >>> >>> >>> >>> If you?ve not setup a 2nd server in your test you will see quarantine >>> messages in the logs for a bad KMIP server. If it works, you can clear >>> keys again and see how many were retrieved. >>> >>> >>> >>> >Is there any documentation or diagram officially from IBM that >>> recommends having 2 keys from independent RKM environments for high >>> availability as best practice that I could refer to? >>> >>> >>> >>> I am not an IBM-er? but I?m also not 100% sure what you are asking >>> here. Two un-related SKLM setups? How would you sync the keys? How >>> would this be better than multiple replicated servers? >>> >>> >>> >>> Ed Wahl >>> >>> Ohio Supercomputer Center >>> >>> >>> >>> *From:* gpfsug-discuss *On Behalf >>> Of *Alec >>> *Sent:* Wednesday, August 16, 2023 3:33 PM >>> *To:* gpfsug main discussion list >>> *Subject:* [gpfsug-discuss] RKM resilience questions testing and best >>> practice >>> >>> >>> >>> Hello we are using a remote key server with GPFS I have two questions: >>> First question: How can we verify that a key server is up and running when >>> there are multiple key servers in an rkm pool serving a single key. The >>> scenario is after maintenance >>> >>> Hello we are using a remote key server with GPFS I have two questions: >>> >>> >>> >>> First question: >>> >>> How can we verify that a key server is up and running when there are >>> multiple key servers in an rkm pool serving a single key. >>> >>> >>> >>> The scenario is after maintenance or periodically we want to verify that >>> all member of the pool are in service. >>> >>> >>> >>> Second question is: >>> >>> Is there any documentation or diagram officially from IBM that >>> recommends having 2 keys from independent RKM environments for high >>> availability as best practice that I could refer to? 
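To make the proposal above concrete, a rough illustration of one replicated key-server pool split into two RKM.conf stanzas, plus a policy that wraps every new file with one MEK from each. The host names, key UUIDs, labels and keystore paths are invented, the attribute names are from memory of the encryption regular-setup documentation, and my recollection is that listing two encryption specs in SET ENCRYPTION produces two independent wrappings of the FEK (so either MEK alone can unwrap the file) -- all of this should be verified with IBM before relying on it:

rkmA {
  type = ISKLM
  kmipServerUri  = tls://keysrv1.example.com:5696
  kmipServerUri2 = tls://keysrv2.example.com:5696
  keyStore = /var/mmfs/etc/RKMcerts/keystore.A.p12
  passphrase = changeMe
  clientCertLabel = scaleClientA
  tenantName = GPFS_TENANT
}
rkmB {
  type = ISKLM
  kmipServerUri  = tls://keysrv3.example.com:5696
  kmipServerUri2 = tls://keysrv4.example.com:5696
  keyStore = /var/mmfs/etc/RKMcerts/keystore.B.p12
  passphrase = changeMe
  clientCertLabel = scaleClientB
  tenantName = GPFS_TENANT
}

And the matching encryption rules, which go in a policy file installed with mmchpolicy, not in RKM.conf:

RULE 'encSpecA' ENCRYPTION 'EA' IS ALGO 'DEFAULTNISTSP800131A' KEYS('KEY-1111aaaa:rkmA')
RULE 'encSpecB' ENCRYPTION 'EB' IS ALGO 'DEFAULTNISTSP800131A' KEYS('KEY-2222bbbb:rkmB')
RULE 'encryptAll' SET ENCRYPTION 'EA','EB' WHERE NAME LIKE '%'
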
>>> >>> >>> >>> Alec >>> >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org >>> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org >> > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From uwe.falke at kit.edu Thu Aug 17 16:54:29 2023 From: uwe.falke at kit.edu (Uwe Falke) Date: Thu, 17 Aug 2023 17:54:29 +0200 Subject: [gpfsug-discuss] GPL compilation failure Message-ID: <0eadd571-5863-945f-8148-99186611bd8e@kit.edu> Hi, just to let you know: building the 5.1.7.1 GPL layer for Linux Kernel 4.18.0-477.21.1.el8_8.x86_64 failed for me: [...] In file included from /usr/lpp/mmfs/src/gpl-linux/cfiles.c:87, from /usr/lpp/mmfs/src/gpl-linux/cfiles_cust.c:54: /usr/lpp/mmfs/src/gpl-linux/cxiCache.c: In function 'pcache_nfs_have_rdirplus': /usr/lpp/mmfs/src/gpl-linux/cxiCache.c:4011:31: error: 'NFS_INO_ADVISE_RDPLUS' undeclared (first use in this function); did you mean 'NFS_INO_ODIRECT'? && test_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(inodeP)->flags) ^~~~~~~~~~~~~~~~~~~~~ NFS_INO_ODIRECT /usr/lpp/mmfs/src/gpl-linux/cxiCache.c:4011:31: note: each undeclared identifier is reported only once for each function it appears in [...] If anyone managed to build that, a message would be nice, else you should expect problems. I opened a case with IBM but it is caught in the call entry ... (SF TS013920636, in case any supporter wants to look after it). Uwe -- Karlsruhe Institute of Technology (KIT) Steinbuch Centre for Computing (SCC) Scientific Data Management (SDM) Uwe Falke Hermann-von-Helmholtz-Platz 1, Building 442, Room 187 D-76344 Eggenstein-Leopoldshafen Tel: +49 721 608 28024 Email:uwe.falke at kit.edu www.scc.kit.edu Registered office: Kaiserstra?e 12, 76131 Karlsruhe, Germany KIT ? The Research University in the Helmholtz Association -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5814 bytes Desc: S/MIME Cryptographic Signature URL: From Renar.Grunenberg at huk-coburg.de Thu Aug 17 17:04:03 2023 From: Renar.Grunenberg at huk-coburg.de (Grunenberg, Renar) Date: Thu, 17 Aug 2023 16:04:03 +0000 Subject: [gpfsug-discuss] GPL compilation failure In-Reply-To: <0eadd571-5863-945f-8148-99186611bd8e@kit.edu> References: <0eadd571-5863-945f-8148-99186611bd8e@kit.edu> Message-ID: Hallo Uwe, Kernel-Level 4.18.0-477 is only supported with scale 5.1.8.x. We had the same issue and must updated tot he last level. Renar Grunenberg Abteilung Informatik - Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. 
Helen Reck, Dr. J?rg Rheinl?nder, Thomas Sehn, Daniel Thomas. ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ________________________________ Von: gpfsug-discuss Im Auftrag von Uwe Falke Gesendet: Donnerstag, 17. August 2023 17:54 An: gpfsug-discuss at gpfsug.org Betreff: [gpfsug-discuss] GPL compilation failure Hi, just to let you know: building the 5.1.7.1 GPL layer for Linux Kernel 4.18.0-477.21.1.el8_8.x86_64 failed for me: [...] In file included from /usr/lpp/mmfs/src/gpl-linux/cfiles.c:87, from /usr/lpp/mmfs/src/gpl-linux/cfiles_cust.c:54: /usr/lpp/mmfs/src/gpl-linux/cxiCache.c: In function 'pcache_nfs_have_rdirplus': /usr/lpp/mmfs/src/gpl-linux/cxiCache.c:4011:31: error: 'NFS_INO_ADVISE_RDPLUS' undeclared (first use in this function); did you mean 'NFS_INO_ODIRECT'? && test_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(inodeP)->flags) ^~~~~~~~~~~~~~~~~~~~~ NFS_INO_ODIRECT /usr/lpp/mmfs/src/gpl-linux/cxiCache.c:4011:31: note: each undeclared identifier is reported only once for each function it appears in [...] If anyone managed to build that, a message would be nice, else you should expect problems. I opened a case with IBM but it is caught in the call entry ... (SF TS013920636, in case any supporter wants to look after it). Uwe -- Karlsruhe Institute of Technology (KIT) Steinbuch Centre for Computing (SCC) Scientific Data Management (SDM) Uwe Falke Hermann-von-Helmholtz-Platz 1, Building 442, Room 187 D-76344 Eggenstein-Leopoldshafen Tel: +49 721 608 28024 Email: uwe.falke at kit.edu www.scc.kit.edu Registered office: Kaiserstra?e 12, 76131 Karlsruhe, Germany KIT ? The Research University in the Helmholtz Association -------------- next part -------------- An HTML attachment was scrubbed... URL: From ben.nickell at inl.gov Thu Aug 17 17:16:44 2023 From: ben.nickell at inl.gov (Ben G. Nickell) Date: Thu, 17 Aug 2023 16:16:44 +0000 Subject: [gpfsug-discuss] [EXTERNAL] GPL compilation failure In-Reply-To: <0eadd571-5863-945f-8148-99186611bd8e@kit.edu> References: <0eadd571-5863-945f-8148-99186611bd8e@kit.edu> Message-ID: I had to upgrade to 5.1.8.0 or 5.1.8.1 to get it to build on that kernel, I think. Then it worked. From: gpfsug-discuss on behalf of Uwe Falke Date: Thursday, August 17, 2023 at 9:58 AM To: gpfsug-discuss at gpfsug.org Subject: [EXTERNAL] [gpfsug-discuss] GPL compilation failure Hi, just to let you know: building the 5.1.7.1 GPL layer for Linux Kernel 4.18.0-477.21.1.el8_8.x86_64 failed for me: [...] In file included from /usr/lpp/mmfs/src/gpl-linux/cfiles.c:87, from /usr/lpp/mmfs/src/gpl-linux/cfiles_cust.c:54: /usr/lpp/mmfs/src/gpl-linux/cxiCache.c: In function 'pcache_nfs_have_rdirplus': /usr/lpp/mmfs/src/gpl-linux/cxiCache.c:4011:31: error: 'NFS_INO_ADVISE_RDPLUS' undeclared (first use in this function); did you mean 'NFS_INO_ODIRECT'? 
&& test_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(inodeP)->flags) ^~~~~~~~~~~~~~~~~~~~~ NFS_INO_ODIRECT /usr/lpp/mmfs/src/gpl-linux/cxiCache.c:4011:31: note: each undeclared identifier is reported only once for each function it appears in [...] If anyone managed to build that, a message would be nice, else you should expect problems. I opened a case with IBM but it is caught in the call entry ... (SF TS013920636, in case any supporter wants to look after it). Uwe -- Karlsruhe Institute of Technology (KIT) Steinbuch Centre for Computing (SCC) Scientific Data Management (SDM) Uwe Falke Hermann-von-Helmholtz-Platz 1, Building 442, Room 187 D-76344 Eggenstein-Leopoldshafen Tel: +49 721 608 28024 Email: uwe.falke at kit.edu www.scc.kit.edu Registered office: Kaiserstra?e 12, 76131 Karlsruhe, Germany KIT ? The Research University in the Helmholtz Association -------------- next part -------------- An HTML attachment was scrubbed... URL: From novosirj at rutgers.edu Thu Aug 17 17:38:32 2023 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Thu, 17 Aug 2023 16:38:32 +0000 Subject: [gpfsug-discuss] GPL compilation failure In-Reply-To: <0eadd571-5863-945f-8148-99186611bd8e@kit.edu> References: <0eadd571-5863-945f-8148-99186611bd8e@kit.edu> Message-ID: <21B0BB52-C8CA-4A27-9385-84D4DFA5871A@rutgers.edu> Not supported on that GPFS version; probably need a 5.1.8.x ? definitely supported on 5.1.8.1. Always check here before upgrading the kernel: https://www.ibm.com/docs/en/storage-scale?topic=STXKQY/gpfsclustersfaq.html -- #BlackLivesMatter ____ || \\UTGERS, |---------------------------*O*--------------------------- ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\ of NJ | Office of Advanced Research Computing - MSB A555B, Newark `' On Aug 17, 2023, at 11:54, Uwe Falke wrote: Hi, just to let you know: building the 5.1.7.1 GPL layer for Linux Kernel 4.18.0-477.21.1.el8_8.x86_64 failed for me: [...] In file included from /usr/lpp/mmfs/src/gpl-linux/cfiles.c:87, from /usr/lpp/mmfs/src/gpl-linux/cfiles_cust.c:54: /usr/lpp/mmfs/src/gpl-linux/cxiCache.c: In function 'pcache_nfs_have_rdirplus': /usr/lpp/mmfs/src/gpl-linux/cxiCache.c:4011:31: error: 'NFS_INO_ADVISE_RDPLUS' undeclared (first use in this function); did you mean 'NFS_INO_ODIRECT'? && test_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(inodeP)->flags) ^~~~~~~~~~~~~~~~~~~~~ NFS_INO_ODIRECT /usr/lpp/mmfs/src/gpl-linux/cxiCache.c:4011:31: note: each undeclared identifier is reported only once for each function it appears in [...] If anyone managed to build that, a message would be nice, else you should expect problems. I opened a case with IBM but it is caught in the call entry ... (SF TS013920636, in case any supporter wants to look after it). Uwe -- Karlsruhe Institute of Technology (KIT) Steinbuch Centre for Computing (SCC) Scientific Data Management (SDM) Uwe Falke Hermann-von-Helmholtz-Platz 1, Building 442, Room 187 D-76344 Eggenstein-Leopoldshafen Tel: +49 721 608 28024 Email: uwe.falke at kit.edu www.scc.kit.edu Registered office: Kaiserstra?e 12, 76131 Karlsruhe, Germany KIT ? The Research University in the Helmholtz Association _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org -------------- next part -------------- An HTML attachment was scrubbed... 
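For anyone checking their own combination against that FAQ table, the two inputs it needs can be pulled straight from the node; the package name assumes an RPM-based install:

uname -r                             # running kernel, e.g. 4.18.0-477.21.1.el8_8.x86_64
rpm -q gpfs.base                     # installed Scale level
/usr/lpp/mmfs/bin/mmdiag --version   # version the running daemon reports
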
URL: From novosirj at rutgers.edu Thu Aug 17 17:53:41 2023 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Thu, 17 Aug 2023 16:53:41 +0000 Subject: [gpfsug-discuss] GPL compilation failure In-Reply-To: <21B0BB52-C8CA-4A27-9385-84D4DFA5871A@rutgers.edu> References: <0eadd571-5863-945f-8148-99186611bd8e@kit.edu> <21B0BB52-C8CA-4A27-9385-84D4DFA5871A@rutgers.edu> Message-ID: <6E40DCAC-4EB1-4414-A300-86F446A934B8@rutgers.edu> Sorry, one more thing: while you can often upgrade from say 3.10.0-1160.92.1 to 3.10.0-1160.95.1, like the kind of upgrade you?d see within an RHEL point release ? and you?ll often see an older version get updated to list support for those kinds of version increments ? you should definitely expect to potentially need to upgrade GPFS if you?re going between RHEL point releases, like 8.7 to 8.8 ? those ones that change from, say, 4.18.0-425.x to 4.18.0-477.x. Sometimes unsupported versions will work, but often times they will not, and if you have a support contract, that difference is sort of meaningless since you wouldn?t want to run an unsupported version anyway. On Aug 17, 2023, at 12:38, Ryan Novosielski wrote: Not supported on that GPFS version; probably need a 5.1.8.x ? definitely supported on 5.1.8.1. Always check here before upgrading the kernel: https://www.ibm.com/docs/en/storage-scale?topic=STXKQY/gpfsclustersfaq.html -- #BlackLivesMatter ____ || \\UTGERS, |---------------------------*O*--------------------------- ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\ of NJ | Office of Advanced Research Computing - MSB A555B, Newark `' On Aug 17, 2023, at 11:54, Uwe Falke wrote: Hi, just to let you know: building the 5.1.7.1 GPL layer for Linux Kernel 4.18.0-477.21.1.el8_8.x86_64 failed for me: [...] In file included from /usr/lpp/mmfs/src/gpl-linux/cfiles.c:87, from /usr/lpp/mmfs/src/gpl-linux/cfiles_cust.c:54: /usr/lpp/mmfs/src/gpl-linux/cxiCache.c: In function 'pcache_nfs_have_rdirplus': /usr/lpp/mmfs/src/gpl-linux/cxiCache.c:4011:31: error: 'NFS_INO_ADVISE_RDPLUS' undeclared (first use in this function); did you mean 'NFS_INO_ODIRECT'? && test_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(inodeP)->flags) ^~~~~~~~~~~~~~~~~~~~~ NFS_INO_ODIRECT /usr/lpp/mmfs/src/gpl-linux/cxiCache.c:4011:31: note: each undeclared identifier is reported only once for each function it appears in [...] If anyone managed to build that, a message would be nice, else you should expect problems. I opened a case with IBM but it is caught in the call entry ... (SF TS013920636, in case any supporter wants to look after it). Uwe -- Karlsruhe Institute of Technology (KIT) Steinbuch Centre for Computing (SCC) Scientific Data Management (SDM) Uwe Falke Hermann-von-Helmholtz-Platz 1, Building 442, Room 187 D-76344 Eggenstein-Leopoldshafen Tel: +49 721 608 28024 Email: uwe.falke at kit.edu www.scc.kit.edu Registered office: Kaiserstra?e 12, 76131 Karlsruhe, Germany KIT ? The Research University in the Helmholtz Association _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org -------------- next part -------------- An HTML attachment was scrubbed... 
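Once a supported combination is in place (5.1.8.x for this kernel, per the replies above), the portability layer still has to be rebuilt on every node that received the new kernel. A minimal sketch, assuming the GPL layer is built locally rather than shipped as a pre-built gplbin package:

mmshutdown                       # stop GPFS on the node first
# ... install/upgrade the Scale packages and reboot into the new kernel ...
/usr/lpp/mmfs/bin/mmbuildgpl     # rebuild the GPL layer against the running kernel
mmstartup
mmgetstate                       # node should come back as 'active'
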
URL: From uwe.falke at kit.edu Thu Aug 17 18:58:15 2023 From: uwe.falke at kit.edu (Uwe Falke) Date: Thu, 17 Aug 2023 19:58:15 +0200 Subject: [gpfsug-discuss] GPL compilation failure In-Reply-To: <6E40DCAC-4EB1-4414-A300-86F446A934B8@rutgers.edu> References: <0eadd571-5863-945f-8148-99186611bd8e@kit.edu> <21B0BB52-C8CA-4A27-9385-84D4DFA5871A@rutgers.edu> <6E40DCAC-4EB1-4414-A300-86F446A934B8@rutgers.edu> Message-ID: thanks, ryan. I would have loved to stay with RHEL 8.7, however RHEL chose to not patch the RHEL8.7 Kernel line it seems ... Uwe On 17.08.23 18:53, Ryan Novosielski wrote: > Sorry, one more thing: while you can often upgrade from say > 3.10.0-1160.92.1 to 3.10.0-1160.95.1, like the kind of upgrade you?d > see within an RHEL point release ? and you?ll often see an older > version get updated to list support for those kinds of version > increments ? you should definitely expect to potentially need to > upgrade GPFS if you?re going between RHEL point releases, like 8.7 to > 8.8 ? those ones that change from, say, 4.18.0-425.x to 4.18.0-477.x. > > Sometimes unsupported versions will work, but often times they will > not, and if you have a support contract, that difference is sort of > meaningless since you wouldn?t want to run an unsupported version anyway. > >> On Aug 17, 2023, at 12:38, Ryan Novosielski wrote: >> >> Not supported on that GPFS version; probably need a 5.1.8.x ? >> definitely supported on 5.1.8.1. >> >> Always check here before upgrading the kernel: >> >> https://www.ibm.com/docs/en/storage-scale?topic=STXKQY/gpfsclustersfaq.html >> >> -- >> #BlackLivesMatter >> ____ >> ||?\\UTGERS, |---------------------------*O*--------------------------- >> ||_// the State?| ? ? ? ? Ryan Novosielski - novosirj at rutgers.edu >> || \\ University | Sr. Technologist -?973/972.0922 (2x0922) ~*~ >> RBHS?Campus >> || ?\\ ? ?of NJ?| Office of Advanced?Research Computing - MSB >> A555B,?Newark >> ? ? ?`' >> >>> On Aug 17, 2023, at 11:54, Uwe Falke wrote: >>> >>> Hi, just to let you know: >>> >>> building the 5.1.7.1 GPL layer for Linux Kernel >>> 4.18.0-477.21.1.el8_8.x86_64 failed for me: >>> >>> [...] >>> >>> In file included from /usr/lpp/mmfs/src/gpl-linux/cfiles.c:87, >>> from /usr/lpp/mmfs/src/gpl-linux/cfiles_cust.c:54: >>> /usr/lpp/mmfs/src/gpl-linux/cxiCache.c: In function 'pcache_nfs_have_rdirplus': >>> /usr/lpp/mmfs/src/gpl-linux/cxiCache.c:4011:31: error: 'NFS_INO_ADVISE_RDPLUS' undeclared (first use in this function); did you mean 'NFS_INO_ODIRECT'? >>> && test_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(inodeP)->flags) >>> ^~~~~~~~~~~~~~~~~~~~~ >>> NFS_INO_ODIRECT >>> /usr/lpp/mmfs/src/gpl-linux/cxiCache.c:4011:31: note: each undeclared identifier is reported only once for each function it appears in >>> [...] >>> >>> If anyone managed to build that, a message would be nice, else you >>> should expect problems. >>> >>> I opened a case with IBM but it is caught in the call entry ... (SF >>> TS013920636, in case any supporter wants to look after it). >>> >>> Uwe >>> >>> -- >>> Karlsruhe Institute of Technology (KIT) >>> Steinbuch Centre for Computing (SCC) >>> Scientific Data Management (SDM) >>> >>> Uwe Falke >>> >>> Hermann-von-Helmholtz-Platz 1, Building 442, Room 187 >>> D-76344 Eggenstein-Leopoldshafen >>> >>> Tel: +49 721 608 28024 >>> Email:uwe.falke at kit.edu >>> www.scc.kit.edu >>> >>> Registered office: >>> Kaiserstra?e 12, 76131 Karlsruhe, Germany >>> >>> KIT ? 
The Research University in the Helmholtz Association >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org >> > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org -- Karlsruhe Institute of Technology (KIT) Steinbuch Centre for Computing (SCC) Scientific Data Management (SDM) Uwe Falke Hermann-von-Helmholtz-Platz 1, Building 442, Room 187 D-76344 Eggenstein-Leopoldshafen Tel: +49 721 608 28024 Email:uwe.falke at kit.edu www.scc.kit.edu Registered office: Kaiserstra?e 12, 76131 Karlsruhe, Germany KIT ? The Research University in the Helmholtz Association -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5814 bytes Desc: S/MIME Cryptographic Signature URL: From christof.schmitt at us.ibm.com Thu Aug 17 19:08:22 2023 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Thu, 17 Aug 2023 18:08:22 +0000 Subject: [gpfsug-discuss] GPL compilation failure In-Reply-To: References: <0eadd571-5863-945f-8148-99186611bd8e@kit.edu> <21B0BB52-C8CA-4A27-9385-84D4DFA5871A@rutgers.edu> <6E40DCAC-4EB1-4414-A300-86F446A934B8@rutgers.edu> Message-ID: <95a6175d548915d6b1be28056722b0c8288204c0.camel@us.ibm.com> I cannot speak for Redhat, but the RHEL release cycles are documented: https://access.redhat.com/support/policy/updates/errata#RHEL8_and_9_Life_Cycle The rule of thumb for Scale is that a new RHEL minor release will likely be supported in the Scale release or PTF following the RHEL release. There might be exceptions dending on the timing of the release dates, possible problems found in testing, etc. Regards, Christof On Thu, 2023-08-17 at 19:58 +0200, Uwe Falke wrote: thanks, ryan. I would have loved to stay with RHEL 8.7, however RHEL chose to not patch the RHEL8.7 Kernel line it seems ... Uwe On 17.08.23 18:53, Ryan Novosielski wrote: Sorry, one more thing: while you can often upgrade from say 3.10.0-1160.92.1 to 3.10.0-1160.95.1, like the kind of upgrade you?d see within an RHEL point release ? and you?ll often see an older version get updated to list support for those kinds of version increments ? you should definitely expect to potentially need to upgrade GPFS if you?re going between RHEL point releases, like 8.7 to 8.8 ? those ones that change from, say, 4.18.0-425.x to 4.18.0-477.x. Sometimes unsupported versions will work, but often times they will not, and if you have a support contract, that difference is sort of meaningless since you wouldn?t want to run an unsupported version anyway. On Aug 17, 2023, at 12:38, Ryan Novosielski wrote: Not supported on that GPFS version; probably need a 5.1.8.x ? definitely supported on 5.1.8.1. Always check here before upgrading the kernel: https://www.ibm.com/docs/en/storage-scale?topic=STXKQY/gpfsclustersfaq.html -- #BlackLivesMatter ____ || \\UTGERS, |---------------------------*O*--------------------------- ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. 
Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\ of NJ | Office of Advanced Research Computing - MSB A555B, Newark `' On Aug 17, 2023, at 11:54, Uwe Falke wrote: Hi, just to let you know: building the 5.1.7.1 GPL layer for Linux Kernel 4.18.0-477.21.1.el8_8.x86_64 failed for me: [...] In file included from /usr/lpp/mmfs/src/gpl-linux/cfiles.c:87, from /usr/lpp/mmfs/src/gpl-linux/cfiles_cust.c:54: /usr/lpp/mmfs/src/gpl-linux/cxiCache.c: In function 'pcache_nfs_have_rdirplus': /usr/lpp/mmfs/src/gpl-linux/cxiCache.c:4011:31: error: 'NFS_INO_ADVISE_RDPLUS' undeclared (first use in this function); did you mean 'NFS_INO_ODIRECT'? && test_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(inodeP)->flags) ^~~~~~~~~~~~~~~~~~~~~ NFS_INO_ODIRECT /usr/lpp/mmfs/src/gpl-linux/cxiCache.c:4011:31: note: each undeclared identifier is reported only once for each function it appears in [...] If anyone managed to build that, a message would be nice, else you should expect problems. I opened a case with IBM but it is caught in the call entry ... (SF TS013920636, in case any supporter wants to look after it). Uwe -- Karlsruhe Institute of Technology (KIT) Steinbuch Centre for Computing (SCC) Scientific Data Management (SDM) Uwe Falke Hermann-von-Helmholtz-Platz 1, Building 442, Room 187 D-76344 Eggenstein-Leopoldshafen Tel: +49 721 608 28024 Email: uwe.falke at kit.edu www.scc.kit.edu Registered office: Kaiserstra?e 12, 76131 Karlsruhe, Germany KIT ? The Research University in the Helmholtz Association _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org -- Karlsruhe Institute of Technology (KIT) Steinbuch Centre for Computing (SCC) Scientific Data Management (SDM) Uwe Falke Hermann-von-Helmholtz-Platz 1, Building 442, Room 187 D-76344 Eggenstein-Leopoldshafen Tel: +49 721 608 28024 Email: uwe.falke at kit.edu www.scc.kit.edu Registered office: Kaiserstra?e 12, 76131 Karlsruhe, Germany KIT ? The Research University in the Helmholtz Association _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From uwe.falke at kit.edu Thu Aug 17 19:38:54 2023 From: uwe.falke at kit.edu (Uwe Falke) Date: Thu, 17 Aug 2023 20:38:54 +0200 Subject: [gpfsug-discuss] GPL compilation failure In-Reply-To: <95a6175d548915d6b1be28056722b0c8288204c0.camel@us.ibm.com> References: <0eadd571-5863-945f-8148-99186611bd8e@kit.edu> <21B0BB52-C8CA-4A27-9385-84D4DFA5871A@rutgers.edu> <6E40DCAC-4EB1-4414-A300-86F446A934B8@rutgers.edu> <95a6175d548915d6b1be28056722b0c8288204c0.camel@us.ibm.com> Message-ID: thanks for that, so it is clear RHEL 8.7 won't get any more updates. Uwe On 17.08.23 20:08, Christof Schmitt wrote: > I cannot speak for Redhat, but the RHEL release cycles are documented: > https://access.redhat.com/support/policy/updates/errata#RHEL8_and_9_Life_Cycle > > The rule of thumb for Scale is that a new RHEL minor release will > likely be supported in the Scale release or PTF following the RHEL > release. There might be exceptions dending on the timing of the > release dates, possible problems found in testing, etc. 
> > Regards, > > Christof > > On Thu, 2023-08-17 at 19:58 +0200, Uwe Falke wrote: >> >> thanks, ryan. >> >> I would have loved to stay with RHEL 8.7, however RHEL chose to not >> patch the RHEL8.7 Kernel line it seems ... >> >> Uwe >> >> On 17.08.23 18:53, Ryan Novosielski wrote: >>> Sorry, one more thing: while you can often upgrade from say >>> 3.10.0-1160.92.1 to 3.10.0-1160.95.1, like the kind of upgrade you?d >>> see within an RHEL point release ? and you?ll often see an older >>> version get updated to list support for those kinds of version >>> increments ? you should definitely expect to potentially need to >>> upgrade GPFS if you?re going between RHEL point releases, like 8.7 >>> to 8.8 ? those ones that change from, say, 4.18.0-425.x to >>> 4.18.0-477.x. >>> >>> Sometimes unsupported versions will work, but often times they will >>> not, and if you have a support contract, that difference is sort of >>> meaningless since you wouldn?t want to run an unsupported version >>> anyway. >>> >>>> On Aug 17, 2023, at 12:38, Ryan Novosielski >>>> wrote: >>>> >>>> Not supported on that GPFS version; probably need a 5.1.8.x ? >>>> definitely supported on 5.1.8.1. >>>> >>>> Always check here before upgrading the kernel: >>>> >>>> https://www.ibm.com/docs/en/storage-scale?topic=STXKQY/gpfsclustersfaq.html >>>> >>>> -- >>>> #BlackLivesMatter >>>> ____ >>>> ||?\\UTGERS, |---------------------------*O*--------------------------- >>>> ||_// the State ?| ? ? ? ? Ryan Novosielski - novosirj at rutgers.edu >>>> || \\ University | Sr. Technologist -?973/972.0922 (2x0922) ~*~ >>>> RBHS?Campus >>>> || ?\\ ? ?of NJ ?| Office of Advanced?Research Computing - MSB >>>> A555B,?Newark >>>> ? ? ?`' >>>> >>>>> On Aug 17, 2023, at 11:54, Uwe Falke wrote: >>>>> >>>>> Hi, just to let you know: >>>>> >>>>> building the 5.1.7.1 GPL layer for Linux Kernel >>>>> 4.18.0-477.21.1.el8_8.x86_64 failed for me: >>>>> >>>>> [...] >>>>> >>>>> In file included from /usr/lpp/mmfs/src/gpl-linux/cfiles.c:87, >>>>> from /usr/lpp/mmfs/src/gpl-linux/cfiles_cust.c:54: >>>>> /usr/lpp/mmfs/src/gpl-linux/cxiCache.c: In function 'pcache_nfs_have_rdirplus': >>>>> /usr/lpp/mmfs/src/gpl-linux/cxiCache.c:4011:31: error: 'NFS_INO_ADVISE_RDPLUS' undeclared (first use in this function); did you mean 'NFS_INO_ODIRECT'? >>>>> && test_bit(NFS_INO_ADVISE_RDPLUS, &NFS_I(inodeP)->flags) >>>>> ^~~~~~~~~~~~~~~~~~~~~ >>>>> NFS_INO_ODIRECT >>>>> /usr/lpp/mmfs/src/gpl-linux/cxiCache.c:4011:31: note: each undeclared identifier is reported only once for each function it appears in >>>>> [...] >>>>> >>>>> If anyone managed to build that, a message would be nice, else you >>>>> should expect problems. >>>>> >>>>> I opened a case with IBM but it is caught in the call entry ... >>>>> (SF TS013920636, in case any supporter wants to look after it). >>>>> >>>>> Uwe >>>>> >>>>> -- >>>>> Karlsruhe Institute of Technology (KIT) >>>>> Steinbuch Centre for Computing (SCC) >>>>> Scientific Data Management (SDM) >>>>> Uwe Falke >>>>> Hermann-von-Helmholtz-Platz 1, Building 442, Room 187 >>>>> D-76344 Eggenstein-Leopoldshafen >>>>> Tel: +49 721 608 28024 >>>>> Email: >>>>> uwe.falke at kit.edu >>>>> www.scc.kit.edu >>>>> Registered office: >>>>> Kaiserstra?e 12, 76131 Karlsruhe, Germany >>>>> KIT ? 
The Research University in the Helmholtz Association >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at gpfsug.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org >>>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org >> -- >> Karlsruhe Institute of Technology (KIT) >> Steinbuch Centre for Computing (SCC) >> Scientific Data Management (SDM) >> Uwe Falke >> Hermann-von-Helmholtz-Platz 1, Building 442, Room 187 >> D-76344 Eggenstein-Leopoldshafen >> Tel: +49 721 608 28024 >> Email: >> uwe.falke at kit.edu >> www.scc.kit.edu >> Registered office: >> Kaiserstra?e 12, 76131 Karlsruhe, Germany >> KIT ? The Research University in the Helmholtz Association >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org >> > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org -- Karlsruhe Institute of Technology (KIT) Steinbuch Centre for Computing (SCC) Scientific Data Management (SDM) Uwe Falke Hermann-von-Helmholtz-Platz 1, Building 442, Room 187 D-76344 Eggenstein-Leopoldshafen Tel: +49 721 608 28024 Email:uwe.falke at kit.edu www.scc.kit.edu Registered office: Kaiserstra?e 12, 76131 Karlsruhe, Germany KIT ? The Research University in the Helmholtz Association -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5814 bytes Desc: S/MIME Cryptographic Signature URL: From jonathan.buzzard at strath.ac.uk Thu Aug 17 22:24:30 2023 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Thu, 17 Aug 2023 22:24:30 +0100 Subject: [gpfsug-discuss] GPL compilation failure In-Reply-To: References: <0eadd571-5863-945f-8148-99186611bd8e@kit.edu> <21B0BB52-C8CA-4A27-9385-84D4DFA5871A@rutgers.edu> <6E40DCAC-4EB1-4414-A300-86F446A934B8@rutgers.edu> <95a6175d548915d6b1be28056722b0c8288204c0.camel@us.ibm.com> Message-ID: <66d72b9d-f1df-95b5-5cd3-a94adf6a1be9@strath.ac.uk> On 17/08/2023 19:38, Uwe Falke wrote: > > thanks for that, so it is clear RHEL 8.7 won't get any more updates. > Sure it's not an Extended Update Support (EUS) release. On the other hand 8.6 and 8.8 are. I believe that Scale is only supported on RHEL releases that are supported by Redhat. So right now in the 8.x version that is 8.6 with EUS and 8.8 which in due course will also be an EUS release. Given that Scale tends to be supported on new RHEL point releases in the Scale release or PTF following the RHEL release then as far as I can see the only way to be fully supported at all times is to stick to EUS releases. I would note that the elephant in the room is unless you have budget for genuine RHEL rather than a rebuild you need a plan to move to an alternative distribution. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. 
G4 0NG From janfrode at tanso.net Fri Aug 18 08:19:11 2023 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Fri, 18 Aug 2023 09:19:11 +0200 Subject: [gpfsug-discuss] RKM resilience questions testing and best practice In-Reply-To: References: Message-ID: If a key server go offline, scale will just go to the next one in the list -- and give a warning/error about it in mmhealth. Nothing should happen to the file system access. Also, you can tune how often scale needs to refresh the keys from the key server with encryptionKeyCacheExpiration. Setting it to 0 means that your nodes will only need to fetch the key when they mount the file system, or when you change policy. -jf On Thu, Aug 17, 2023 at 5:54?PM Alec wrote: > Yesterday I proposed treating the replicated key servers as 2 different > sets of servers. And having scale address two of the RKM servers by one > rkmid/tenant/devicegrp/client name, and having a second > rkmid/tenant/devicegrp/client name for the 2nd set of servers. > > So define the same cluster of key management servers in two separate > stanzas of RKM.conf, an upper and lower half. > > If we do that and key management team takes one set offline, everything > should work but scale would think one set of keys are offline and scream. > > I think we need an IBM ticket to help vet all that out. > > Alec > > On Thu, Aug 17, 2023, 8:11 AM Jan-Frode Myklebust > wrote: > >> >> Your second KMIP server don?t need to have an active replication >> relationship with the first one ? it just needs to contain the same MEK. So >> you could do a one time replication / copying between them, and they would >> not have to see each other anymore. >> >> I don?t think having them host different keys will work, as you won?t be >> able to fetch the second key from the one server your client is connected >> to, and then will be unable to encrypt with that key. >> >> From what I?ve seen of KMIP setups with Scale, it?s a stupidly trivial >> service. It?s just a server that will tell you the key when asked + some >> access control to make sure no one else gets it. Also MEKs never changes? >> unless you actively change them in the file system policy, and then you >> could just post the new key to all/both your independent key servers when >> you do the change. >> >> >> -jf >> >> ons. 16. aug. 2023 kl. 23:25 skrev Alec : >> >>> Ed >>> Thanks for the response, I wasn't aware of those two commands. I will >>> see if that unlocks a solution. I kind of need the test to work in a >>> production environment. So can't just be adding spare nodes onto the >>> cluster and forgetting with file systems. >>> >>> Unfortunately the logs don't indicate when a node has returned to >>> health. Only that it's in trouble but as we patch often we see these >>> regularly. >>> >>> >>> For the second question, we would add a 2nd MEK key to each file so that >>> two independent keys from two different RKM pools would be able to unlock >>> any file. This would give us two whole independent paths to encrypt and >>> decrypt a file. >>> >>> So I'm looking for a best practice example from IBM to indicate this so >>> we don't have a dependency on a single RKM environment. >>> >>> Alec >>> >>> >>> >>> On Wed, Aug 16, 2023, 2:02 PM Wahl, Edward wrote: >>> >>>> > How can we verify that a key server is up and running when there are >>>> multiple key servers in an rkm pool serving a single key. >>>> >>>> >>>> >>>> Pretty simple. 
>>>> >>>> -Grab a compute node/client (and mark it offline if needed) unmount all >>>> encrypted File Systems. >>>> >>>> -Hack the RKM.conf to point to JUST the server you want to test (and >>>> maybe a backup) >>>> >>>> -Clear all keys: ?/usr/lpp/mmfs/bin/tsctl encKeyCachePurge all ? >>>> >>>> -Reload the RKM.conf: ?/usr/lpp/mmfs/bin/tsloadikm run? (this is a >>>> great command if you need to load new Certificates too) >>>> >>>> -Attempt to mount the encrypted FS, and then cat a few files. >>>> >>>> >>>> >>>> If you?ve not setup a 2nd server in your test you will see quarantine >>>> messages in the logs for a bad KMIP server. If it works, you can clear >>>> keys again and see how many were retrieved. >>>> >>>> >>>> >>>> >Is there any documentation or diagram officially from IBM that >>>> recommends having 2 keys from independent RKM environments for high >>>> availability as best practice that I could refer to? >>>> >>>> >>>> >>>> I am not an IBM-er? but I?m also not 100% sure what you are asking >>>> here. Two un-related SKLM setups? How would you sync the keys? How >>>> would this be better than multiple replicated servers? >>>> >>>> >>>> >>>> Ed Wahl >>>> >>>> Ohio Supercomputer Center >>>> >>>> >>>> >>>> *From:* gpfsug-discuss *On Behalf >>>> Of *Alec >>>> *Sent:* Wednesday, August 16, 2023 3:33 PM >>>> *To:* gpfsug main discussion list >>>> *Subject:* [gpfsug-discuss] RKM resilience questions testing and best >>>> practice >>>> >>>> >>>> >>>> Hello we are using a remote key server with GPFS I have two questions: >>>> First question: How can we verify that a key server is up and running when >>>> there are multiple key servers in an rkm pool serving a single key. The >>>> scenario is after maintenance >>>> >>>> Hello we are using a remote key server with GPFS I have two questions: >>>> >>>> >>>> >>>> First question: >>>> >>>> How can we verify that a key server is up and running when there are >>>> multiple key servers in an rkm pool serving a single key. >>>> >>>> >>>> >>>> The scenario is after maintenance or periodically we want to verify >>>> that all member of the pool are in service. >>>> >>>> >>>> >>>> Second question is: >>>> >>>> Is there any documentation or diagram officially from IBM that >>>> recommends having 2 keys from independent RKM environments for high >>>> availability as best practice that I could refer to? >>>> >>>> >>>> >>>> Alec >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at gpfsug.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org >>>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org >>> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org >> > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org > -------------- next part -------------- An HTML attachment was scrubbed... 
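A hedged sketch of both points: pinning the key cache so MEKs are only fetched at mount time or on a policy change, and a crude out-of-band check of each server in the pool. The probe below only proves that the TLS port answers -- it says nothing about whether that server actually holds the MEK -- and the host names and port are placeholders:

# cluster-wide: never expire cached MEKs (fetch only at mount / policy change)
mmchconfig encryptionKeyCacheExpiration=0

# poor man's probe of every KMIP endpoint in the pool
for kms in keysrv1.example.com keysrv2.example.com keysrv3.example.com; do
  if echo | openssl s_client -connect ${kms}:5696 >/dev/null 2>&1; then
    echo "${kms}:5696 answers TLS"
  else
    echo "${kms}:5696 NOT reachable"
  fi
done
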
URL: From anacreo at gmail.com Fri Aug 18 09:51:13 2023 From: anacreo at gmail.com (Alec) Date: Fri, 18 Aug 2023 01:51:13 -0700 Subject: [gpfsug-discuss] RKM resilience questions testing and best practice In-Reply-To: References: Message-ID: Okay so how do you know the backup key servers are actually functioning until you try to fail to them? We need a way to know they are actually working. Setting encryptionKeyCacheExpiration to 0 would actually help in that we shouldn't go down once we are up. But it would suck if we bounce and then find out none of the key servers are working, then we have the same disaster but just a different time to experience it. Spectrum Scale honestly needs an option to probe and complain about the backup RKM servers. Or if we could run a command to validate that all keys are visible on all key servers that could work as well. Alec On Fri, Aug 18, 2023, 12:22 AM Jan-Frode Myklebust wrote: > If a key server go offline, scale will just go to the next one in the list > -- and give a warning/error about it in mmhealth. Nothing should happen to > the file system access. Also, you can tune how often scale needs to refresh > the keys from the key server with encryptionKeyCacheExpiration. Setting it > to 0 means that your nodes will only need to fetch the key when they mount > the file system, or when you change policy. > > > -jf > > On Thu, Aug 17, 2023 at 5:54?PM Alec wrote: > >> Yesterday I proposed treating the replicated key servers as 2 different >> sets of servers. And having scale address two of the RKM servers by one >> rkmid/tenant/devicegrp/client name, and having a second >> rkmid/tenant/devicegrp/client name for the 2nd set of servers. >> >> So define the same cluster of key management servers in two separate >> stanzas of RKM.conf, an upper and lower half. >> >> If we do that and key management team takes one set offline, everything >> should work but scale would think one set of keys are offline and scream. >> >> I think we need an IBM ticket to help vet all that out. >> >> Alec >> >> On Thu, Aug 17, 2023, 8:11 AM Jan-Frode Myklebust >> wrote: >> >>> >>> Your second KMIP server don?t need to have an active replication >>> relationship with the first one ? it just needs to contain the same MEK. So >>> you could do a one time replication / copying between them, and they would >>> not have to see each other anymore. >>> >>> I don?t think having them host different keys will work, as you won?t be >>> able to fetch the second key from the one server your client is connected >>> to, and then will be unable to encrypt with that key. >>> >>> From what I?ve seen of KMIP setups with Scale, it?s a stupidly trivial >>> service. It?s just a server that will tell you the key when asked + some >>> access control to make sure no one else gets it. Also MEKs never changes? >>> unless you actively change them in the file system policy, and then you >>> could just post the new key to all/both your independent key servers when >>> you do the change. >>> >>> >>> -jf >>> >>> ons. 16. aug. 2023 kl. 23:25 skrev Alec : >>> >>>> Ed >>>> Thanks for the response, I wasn't aware of those two commands. I >>>> will see if that unlocks a solution. I kind of need the test to work in a >>>> production environment. So can't just be adding spare nodes onto the >>>> cluster and forgetting with file systems. >>>> >>>> Unfortunately the logs don't indicate when a node has returned to >>>> health. Only that it's in trouble but as we patch often we see these >>>> regularly. 
>>>> >>>> >>>> For the second question, we would add a 2nd MEK key to each file so >>>> that two independent keys from two different RKM pools would be able to >>>> unlock any file. This would give us two whole independent paths to encrypt >>>> and decrypt a file. >>>> >>>> So I'm looking for a best practice example from IBM to indicate this so >>>> we don't have a dependency on a single RKM environment. >>>> >>>> Alec >>>> >>>> >>>> >>>> On Wed, Aug 16, 2023, 2:02 PM Wahl, Edward wrote: >>>> >>>>> > How can we verify that a key server is up and running when there are >>>>> multiple key servers in an rkm pool serving a single key. >>>>> >>>>> >>>>> >>>>> Pretty simple. >>>>> >>>>> -Grab a compute node/client (and mark it offline if needed) unmount >>>>> all encrypted File Systems. >>>>> >>>>> -Hack the RKM.conf to point to JUST the server you want to test (and >>>>> maybe a backup) >>>>> >>>>> -Clear all keys: ?/usr/lpp/mmfs/bin/tsctl encKeyCachePurge all ? >>>>> >>>>> -Reload the RKM.conf: ?/usr/lpp/mmfs/bin/tsloadikm run? (this is a >>>>> great command if you need to load new Certificates too) >>>>> >>>>> -Attempt to mount the encrypted FS, and then cat a few files. >>>>> >>>>> >>>>> >>>>> If you?ve not setup a 2nd server in your test you will see quarantine >>>>> messages in the logs for a bad KMIP server. If it works, you can clear >>>>> keys again and see how many were retrieved. >>>>> >>>>> >>>>> >>>>> >Is there any documentation or diagram officially from IBM that >>>>> recommends having 2 keys from independent RKM environments for high >>>>> availability as best practice that I could refer to? >>>>> >>>>> >>>>> >>>>> I am not an IBM-er? but I?m also not 100% sure what you are asking >>>>> here. Two un-related SKLM setups? How would you sync the keys? How >>>>> would this be better than multiple replicated servers? >>>>> >>>>> >>>>> >>>>> Ed Wahl >>>>> >>>>> Ohio Supercomputer Center >>>>> >>>>> >>>>> >>>>> *From:* gpfsug-discuss *On Behalf >>>>> Of *Alec >>>>> *Sent:* Wednesday, August 16, 2023 3:33 PM >>>>> *To:* gpfsug main discussion list >>>>> *Subject:* [gpfsug-discuss] RKM resilience questions testing and best >>>>> practice >>>>> >>>>> >>>>> >>>>> Hello we are using a remote key server with GPFS I have two questions: >>>>> First question: How can we verify that a key server is up and running when >>>>> there are multiple key servers in an rkm pool serving a single key. The >>>>> scenario is after maintenance >>>>> >>>>> Hello we are using a remote key server with GPFS I have two questions: >>>>> >>>>> >>>>> >>>>> First question: >>>>> >>>>> How can we verify that a key server is up and running when there are >>>>> multiple key servers in an rkm pool serving a single key. >>>>> >>>>> >>>>> >>>>> The scenario is after maintenance or periodically we want to verify >>>>> that all member of the pool are in service. >>>>> >>>>> >>>>> >>>>> Second question is: >>>>> >>>>> Is there any documentation or diagram officially from IBM that >>>>> recommends having 2 keys from independent RKM environments for high >>>>> availability as best practice that I could refer to? 
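As an illustration of the two-MEK idea above, an encryption policy can name more than one MEK in the same rule, so each file's FEK is wrapped by keys coming from two different RKM back ends. The key IDs and RKM stanza names below are placeholders, and the exact rule syntax should be checked against the Scale encryption policy documentation:

    /* Hypothetical fragment: wrap every FEK with one MEK from RKM_PROD
       and one from an independent RKM_DR back end. */
    RULE 'encTwoMEK' ENCRYPTION 'E1' IS
         ALGO 'DEFAULTNISTSP800131A'
         KEYS('KEY-prod-1:RKM_PROD', 'KEY-dr-1:RKM_DR')
    RULE 'encAll' SET ENCRYPTION 'E1'
         WHERE NAME LIKE '%'

With two MEKs per file, losing one whole RKM environment should still leave files readable, which is the availability argument being made here.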
>>>>> >>>>> >>>>> >>>>> Alec >>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at gpfsug.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org >>>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at gpfsug.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org >>>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org >>> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org >> > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From anacreo at gmail.com Fri Aug 18 10:09:38 2023 From: anacreo at gmail.com (Alec) Date: Fri, 18 Aug 2023 02:09:38 -0700 Subject: [gpfsug-discuss] RKM resilience questions testing and best practice In-Reply-To: References: Message-ID: Hmm.. IBM mentions in 5.1.2 documentation that for performance we could just rotate the order of the keys to load balance the keys.. however because of server maintenance I would imagine all the nodes end up on the same server eventually. But I think I see a solution. If I just define 4 additional RKM configs and each one with one key server and don't do anything else with it. I am guessing that GPFS is going to monitor and complain about them if they go down. And that is easy to test... So RKM.conf with RKM_PROD { kmipServerUri1 = node1 kmipServerUri2 = node2 kmipServerUri3 = node3 kmipServerUri4 = node4 } RKM_PROD_T1 { kmipServerUri = node1 } RKM_PROD_T2 { kmipServerUri = node2 } RKM_PROD_T3 { kmipServerUri = node3 } RKM_PROD_T4 { kmipServerUri = node4 } I could then define 4 files with a key from each test RKM_PROD_T? group to monitor the availability of the individual key servers. Call it Alec's trust but verify HA. On Fri, Aug 18, 2023, 1:51 AM Alec wrote: > Okay so how do you know the backup key servers are actually functioning > until you try to fail to them? We need a way to know they are actually > working. > > Setting encryptionKeyCacheExpiration to 0 would actually help in that we > shouldn't go down once we are up. But it would suck if we bounce and then > find out none of the key servers are working, then we have the same > disaster but just a different time to experience it. > > Spectrum Scale honestly needs an option to probe and complain about the > backup RKM servers. Or if we could run a command to validate that all > keys are visible on all key servers that could work as well. > > Alec > > On Fri, Aug 18, 2023, 12:22 AM Jan-Frode Myklebust > wrote: > >> If a key server go offline, scale will just go to the next one in the >> list -- and give a warning/error about it in mmhealth. Nothing should >> happen to the file system access. Also, you can tune how often scale needs >> to refresh the keys from the key server with encryptionKeyCacheExpiration. >> Setting it to 0 means that your nodes will only need to fetch the key when >> they mount the file system, or when you change policy. 
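For the encryptionKeyCacheExpiration tuning mentioned just above, a minimal sketch (whether a change needs a daemon restart should be checked in the mmchconfig documentation):

    mmlsconfig encryptionKeyCacheExpiration      # current key-cache refresh interval, in seconds
    mmchconfig encryptionKeyCacheExpiration=0    # 0 = fetch MEKs only at mount time or on policy change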
>> >> >> -jf >> >> On Thu, Aug 17, 2023 at 5:54?PM Alec wrote: >> >>> Yesterday I proposed treating the replicated key servers as 2 different >>> sets of servers. And having scale address two of the RKM servers by one >>> rkmid/tenant/devicegrp/client name, and having a second >>> rkmid/tenant/devicegrp/client name for the 2nd set of servers. >>> >>> So define the same cluster of key management servers in two separate >>> stanzas of RKM.conf, an upper and lower half. >>> >>> If we do that and key management team takes one set offline, everything >>> should work but scale would think one set of keys are offline and scream. >>> >>> I think we need an IBM ticket to help vet all that out. >>> >>> Alec >>> >>> On Thu, Aug 17, 2023, 8:11 AM Jan-Frode Myklebust >>> wrote: >>> >>>> >>>> Your second KMIP server don?t need to have an active replication >>>> relationship with the first one ? it just needs to contain the same MEK. So >>>> you could do a one time replication / copying between them, and they would >>>> not have to see each other anymore. >>>> >>>> I don?t think having them host different keys will work, as you won?t >>>> be able to fetch the second key from the one server your client is >>>> connected to, and then will be unable to encrypt with that key. >>>> >>>> From what I?ve seen of KMIP setups with Scale, it?s a stupidly trivial >>>> service. It?s just a server that will tell you the key when asked + some >>>> access control to make sure no one else gets it. Also MEKs never changes? >>>> unless you actively change them in the file system policy, and then you >>>> could just post the new key to all/both your independent key servers when >>>> you do the change. >>>> >>>> >>>> -jf >>>> >>>> ons. 16. aug. 2023 kl. 23:25 skrev Alec : >>>> >>>>> Ed >>>>> Thanks for the response, I wasn't aware of those two commands. I >>>>> will see if that unlocks a solution. I kind of need the test to work in a >>>>> production environment. So can't just be adding spare nodes onto the >>>>> cluster and forgetting with file systems. >>>>> >>>>> Unfortunately the logs don't indicate when a node has returned to >>>>> health. Only that it's in trouble but as we patch often we see these >>>>> regularly. >>>>> >>>>> >>>>> For the second question, we would add a 2nd MEK key to each file so >>>>> that two independent keys from two different RKM pools would be able to >>>>> unlock any file. This would give us two whole independent paths to encrypt >>>>> and decrypt a file. >>>>> >>>>> So I'm looking for a best practice example from IBM to indicate this >>>>> so we don't have a dependency on a single RKM environment. >>>>> >>>>> Alec >>>>> >>>>> >>>>> >>>>> On Wed, Aug 16, 2023, 2:02 PM Wahl, Edward wrote: >>>>> >>>>>> > How can we verify that a key server is up and running when there >>>>>> are multiple key servers in an rkm pool serving a single key. >>>>>> >>>>>> >>>>>> >>>>>> Pretty simple. >>>>>> >>>>>> -Grab a compute node/client (and mark it offline if needed) unmount >>>>>> all encrypted File Systems. >>>>>> >>>>>> -Hack the RKM.conf to point to JUST the server you want to test (and >>>>>> maybe a backup) >>>>>> >>>>>> -Clear all keys: ?/usr/lpp/mmfs/bin/tsctl encKeyCachePurge all ? >>>>>> >>>>>> -Reload the RKM.conf: ?/usr/lpp/mmfs/bin/tsloadikm run? (this is a >>>>>> great command if you need to load new Certificates too) >>>>>> >>>>>> -Attempt to mount the encrypted FS, and then cat a few files. 
>>>>>> >>>>>> >>>>>> >>>>>> If you?ve not setup a 2nd server in your test you will see >>>>>> quarantine messages in the logs for a bad KMIP server. If it works, you >>>>>> can clear keys again and see how many were retrieved. >>>>>> >>>>>> >>>>>> >>>>>> >Is there any documentation or diagram officially from IBM that >>>>>> recommends having 2 keys from independent RKM environments for high >>>>>> availability as best practice that I could refer to? >>>>>> >>>>>> >>>>>> >>>>>> I am not an IBM-er? but I?m also not 100% sure what you are asking >>>>>> here. Two un-related SKLM setups? How would you sync the keys? How >>>>>> would this be better than multiple replicated servers? >>>>>> >>>>>> >>>>>> >>>>>> Ed Wahl >>>>>> >>>>>> Ohio Supercomputer Center >>>>>> >>>>>> >>>>>> >>>>>> *From:* gpfsug-discuss *On >>>>>> Behalf Of *Alec >>>>>> *Sent:* Wednesday, August 16, 2023 3:33 PM >>>>>> *To:* gpfsug main discussion list >>>>>> *Subject:* [gpfsug-discuss] RKM resilience questions testing and >>>>>> best practice >>>>>> >>>>>> >>>>>> >>>>>> Hello we are using a remote key server with GPFS I have two >>>>>> questions: First question: How can we verify that a key server is up and >>>>>> running when there are multiple key servers in an rkm pool serving a single >>>>>> key. The scenario is after maintenance >>>>>> >>>>>> Hello we are using a remote key server with GPFS I have two questions: >>>>>> >>>>>> >>>>>> >>>>>> First question: >>>>>> >>>>>> How can we verify that a key server is up and running when there are >>>>>> multiple key servers in an rkm pool serving a single key. >>>>>> >>>>>> >>>>>> >>>>>> The scenario is after maintenance or periodically we want to verify >>>>>> that all member of the pool are in service. >>>>>> >>>>>> >>>>>> >>>>>> Second question is: >>>>>> >>>>>> Is there any documentation or diagram officially from IBM that >>>>>> recommends having 2 keys from independent RKM environments for high >>>>>> availability as best practice that I could refer to? >>>>>> >>>>>> >>>>>> >>>>>> Alec >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> gpfsug-discuss mailing list >>>>>> gpfsug-discuss at gpfsug.org >>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org >>>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at gpfsug.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org >>>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at gpfsug.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org >>>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org >>> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at gpfsug.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org >> > -------------- next part -------------- An HTML attachment was scrubbed... 
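Laid out as an actual RKM.conf, the "trust but verify" layout sketched above might look roughly like the following. The server names, the tls://...:5696 URIs, and the stanza names are placeholders; each stanza also needs the usual client keystore, certificate, and tenant fields, which are omitted here, and the exact field names should be checked against the RKM.conf documentation.

    RKM_PROD {
        kmipServerUri1 = tls://node1:5696
        kmipServerUri2 = tls://node2:5696
        kmipServerUri3 = tls://node3:5696
        kmipServerUri4 = tls://node4:5696
        ...
    }
    RKM_PROD_T1 { kmipServerUri = tls://node1:5696  ... }
    RKM_PROD_T2 { kmipServerUri = tls://node2:5696  ... }
    RKM_PROD_T3 { kmipServerUri = tls://node3:5696  ... }
    RKM_PROD_T4 { kmipServerUri = tls://node4:5696  ... }

One small canary file encrypted with a key served only through each RKM_PROD_T? stanza then gives mmhealth something to complain about when that individual key server stops answering.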
URL: From janfrode at tanso.net Fri Aug 18 14:01:33 2023 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Fri, 18 Aug 2023 15:01:33 +0200 Subject: [gpfsug-discuss] RKM resilience questions testing and best practice In-Reply-To: References: Message-ID: Maybe give a vote for this one: https://ideas.ibm.com/ideas/GPFS-I-652 Encryption - tool to check health status of all configured encryption > servers > > When Encryption is configured on a file system. the key server must be > available to allow user file access. When the key server fails, data access > is lost. We need a tools that can be run to check key server health, check > retrieval of keys, and communication health. This should be independent of > mmfsd. Inclusion in mmhealth would be ideal. > Planned for future release... -jf On Fri, Aug 18, 2023 at 11:11?AM Alec wrote: > Hmm.. IBM mentions in 5.1.2 documentation that for performance we could > just rotate the order of the keys to load balance the keys.. however > because of server maintenance I would imagine all the nodes end up on the > same server eventually. > > But I think I see a solution. If I just define 4 additional RKM configs > and each one with one key server and don't do anything else with it. I am > guessing that GPFS is going to monitor and complain about them if they go > down. And that is easy to test... > > > So RKM.conf with > RKM_PROD { > kmipServerUri1 = node1 > kmipServerUri2 = node2 > kmipServerUri3 = node3 > kmipServerUri4 = node4 > } > RKM_PROD_T1 { > kmipServerUri = node1 > } > RKM_PROD_T2 { > kmipServerUri = node2 > } > RKM_PROD_T3 { > kmipServerUri = node3 > } > RKM_PROD_T4 { > kmipServerUri = node4 > } > > I could then define 4 files with a key from each test RKM_PROD_T? group to > monitor the availability of the individual key servers. > > Call it Alec's trust but verify HA. > > On Fri, Aug 18, 2023, 1:51 AM Alec wrote: > >> Okay so how do you know the backup key servers are actually functioning >> until you try to fail to them? We need a way to know they are actually >> working. >> >> Setting encryptionKeyCacheExpiration to 0 would actually help in that we >> shouldn't go down once we are up. But it would suck if we bounce and then >> find out none of the key servers are working, then we have the same >> disaster but just a different time to experience it. >> >> Spectrum Scale honestly needs an option to probe and complain about the >> backup RKM servers. Or if we could run a command to validate that all >> keys are visible on all key servers that could work as well. >> >> Alec >> >> On Fri, Aug 18, 2023, 12:22 AM Jan-Frode Myklebust >> wrote: >> >>> If a key server go offline, scale will just go to the next one in the >>> list -- and give a warning/error about it in mmhealth. Nothing should >>> happen to the file system access. Also, you can tune how often scale needs >>> to refresh the keys from the key server with encryptionKeyCacheExpiration. >>> Setting it to 0 means that your nodes will only need to fetch the key when >>> they mount the file system, or when you change policy. >>> >>> >>> -jf >>> >>> On Thu, Aug 17, 2023 at 5:54?PM Alec wrote: >>> >>>> Yesterday I proposed treating the replicated key servers as 2 different >>>> sets of servers. And having scale address two of the RKM servers by one >>>> rkmid/tenant/devicegrp/client name, and having a second >>>> rkmid/tenant/devicegrp/client name for the 2nd set of servers. 
>>>> >>>> So define the same cluster of key management servers in two separate >>>> stanzas of RKM.conf, an upper and lower half. >>>> >>>> If we do that and key management team takes one set offline, everything >>>> should work but scale would think one set of keys are offline and scream. >>>> >>>> I think we need an IBM ticket to help vet all that out. >>>> >>>> Alec >>>> >>>> On Thu, Aug 17, 2023, 8:11 AM Jan-Frode Myklebust >>>> wrote: >>>> >>>>> >>>>> Your second KMIP server don?t need to have an active replication >>>>> relationship with the first one ? it just needs to contain the same MEK. So >>>>> you could do a one time replication / copying between them, and they would >>>>> not have to see each other anymore. >>>>> >>>>> I don?t think having them host different keys will work, as you won?t >>>>> be able to fetch the second key from the one server your client is >>>>> connected to, and then will be unable to encrypt with that key. >>>>> >>>>> From what I?ve seen of KMIP setups with Scale, it?s a stupidly trivial >>>>> service. It?s just a server that will tell you the key when asked + some >>>>> access control to make sure no one else gets it. Also MEKs never changes? >>>>> unless you actively change them in the file system policy, and then you >>>>> could just post the new key to all/both your independent key servers when >>>>> you do the change. >>>>> >>>>> >>>>> -jf >>>>> >>>>> ons. 16. aug. 2023 kl. 23:25 skrev Alec : >>>>> >>>>>> Ed >>>>>> Thanks for the response, I wasn't aware of those two commands. I >>>>>> will see if that unlocks a solution. I kind of need the test to work in a >>>>>> production environment. So can't just be adding spare nodes onto the >>>>>> cluster and forgetting with file systems. >>>>>> >>>>>> Unfortunately the logs don't indicate when a node has returned to >>>>>> health. Only that it's in trouble but as we patch often we see these >>>>>> regularly. >>>>>> >>>>>> >>>>>> For the second question, we would add a 2nd MEK key to each file so >>>>>> that two independent keys from two different RKM pools would be able to >>>>>> unlock any file. This would give us two whole independent paths to encrypt >>>>>> and decrypt a file. >>>>>> >>>>>> So I'm looking for a best practice example from IBM to indicate this >>>>>> so we don't have a dependency on a single RKM environment. >>>>>> >>>>>> Alec >>>>>> >>>>>> >>>>>> >>>>>> On Wed, Aug 16, 2023, 2:02 PM Wahl, Edward wrote: >>>>>> >>>>>>> > How can we verify that a key server is up and running when there >>>>>>> are multiple key servers in an rkm pool serving a single key. >>>>>>> >>>>>>> >>>>>>> >>>>>>> Pretty simple. >>>>>>> >>>>>>> -Grab a compute node/client (and mark it offline if needed) unmount >>>>>>> all encrypted File Systems. >>>>>>> >>>>>>> -Hack the RKM.conf to point to JUST the server you want to test (and >>>>>>> maybe a backup) >>>>>>> >>>>>>> -Clear all keys: ?/usr/lpp/mmfs/bin/tsctl encKeyCachePurge all ? >>>>>>> >>>>>>> -Reload the RKM.conf: ?/usr/lpp/mmfs/bin/tsloadikm run? (this is >>>>>>> a great command if you need to load new Certificates too) >>>>>>> >>>>>>> -Attempt to mount the encrypted FS, and then cat a few files. >>>>>>> >>>>>>> >>>>>>> >>>>>>> If you?ve not setup a 2nd server in your test you will see >>>>>>> quarantine messages in the logs for a bad KMIP server. If it works, you >>>>>>> can clear keys again and see how many were retrieved. 
>>>>>>> >>>>>>> >>>>>>> >>>>>>> >Is there any documentation or diagram officially from IBM that >>>>>>> recommends having 2 keys from independent RKM environments for high >>>>>>> availability as best practice that I could refer to? >>>>>>> >>>>>>> >>>>>>> >>>>>>> I am not an IBM-er? but I?m also not 100% sure what you are asking >>>>>>> here. Two un-related SKLM setups? How would you sync the keys? How >>>>>>> would this be better than multiple replicated servers? >>>>>>> >>>>>>> >>>>>>> >>>>>>> Ed Wahl >>>>>>> >>>>>>> Ohio Supercomputer Center >>>>>>> >>>>>>> >>>>>>> >>>>>>> *From:* gpfsug-discuss *On >>>>>>> Behalf Of *Alec >>>>>>> *Sent:* Wednesday, August 16, 2023 3:33 PM >>>>>>> *To:* gpfsug main discussion list >>>>>>> *Subject:* [gpfsug-discuss] RKM resilience questions testing and >>>>>>> best practice >>>>>>> >>>>>>> >>>>>>> >>>>>>> Hello we are using a remote key server with GPFS I have two >>>>>>> questions: First question: How can we verify that a key server is up and >>>>>>> running when there are multiple key servers in an rkm pool serving a single >>>>>>> key. The scenario is after maintenance >>>>>>> >>>>>>> Hello we are using a remote key server with GPFS I have two >>>>>>> questions: >>>>>>> >>>>>>> >>>>>>> >>>>>>> First question: >>>>>>> >>>>>>> How can we verify that a key server is up and running when there are >>>>>>> multiple key servers in an rkm pool serving a single key. >>>>>>> >>>>>>> >>>>>>> >>>>>>> The scenario is after maintenance or periodically we want to verify >>>>>>> that all member of the pool are in service. >>>>>>> >>>>>>> >>>>>>> >>>>>>> Second question is: >>>>>>> >>>>>>> Is there any documentation or diagram officially from IBM that >>>>>>> recommends having 2 keys from independent RKM environments for high >>>>>>> availability as best practice that I could refer to? >>>>>>> >>>>>>> >>>>>>> >>>>>>> Alec >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> gpfsug-discuss mailing list >>>>>>> gpfsug-discuss at gpfsug.org >>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org >>>>>>> >>>>>> _______________________________________________ >>>>>> gpfsug-discuss mailing list >>>>>> gpfsug-discuss at gpfsug.org >>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org >>>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at gpfsug.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org >>>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at gpfsug.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org >>>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at gpfsug.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org >>> >> _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.kidger at hpe.com Mon Aug 21 18:50:02 2023 From: daniel.kidger at hpe.com (Kidger, Daniel) Date: Mon, 21 Aug 2023 17:50:02 +0000 Subject: [gpfsug-discuss] Joining RDMA over different networks? Message-ID: I know in the Lustre world that LNET routers are used to provide RDMA over heterogeneous networks. Is there an equivalent for Storage Scale? 
eg if an ESS uses Infiniband to connect directly to Cluster A, could that InfiniBand RDMA fabric be "routed" to ClusterB that has RoCE connecting all its nodes together and hence the filesystem mounted? ps. The same question would apply to other usually incompatible RDMA networks like Omnipath, Slingshot, Cornelis, ... ? Daniel Daniel Kidger HPC Storage Solutions Architect, EMEA daniel.kidger at hpe.com +44 (0)7818 522266 hpe.com [cid:image001.png at 01D9D45F.FC6CCA30] -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 2541 bytes Desc: image001.png URL: From novosirj at rutgers.edu Mon Aug 21 19:07:03 2023 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Mon, 21 Aug 2023 18:07:03 +0000 Subject: [gpfsug-discuss] Joining RDMA over different networks? In-Reply-To: References: Message-ID: <9AE82616-B931-478A-92DB-0E484DB93B60@rutgers.edu> If I understand what you?re asking correctly, we used to have a cluster that did this. GPFS was on Infininiband, some of the compute nodes were too, and the rest were on Omnipath. There were routers in between with both types. Sent from my iPhone On Aug 21, 2023, at 13:55, Kidger, Daniel wrote: ? I know in the Lustre world that LNET routers are used to provide RDMA over heterogeneous networks. Is there an equivalent for Storage Scale? eg if an ESS uses Infiniband to connect directly to Cluster A, could that InfiniBand RDMA fabric be ?routed? to ClusterB that has RoCE connecting all its nodes together and hence the filesystem mounted? ps. The same question would apply to other usually incompatible RDMA networks like Omnipath, Slingshot, Cornelis, ? ? Daniel Daniel Kidger HPC Storage Solutions Architect, EMEA daniel.kidger at hpe.com +44 (0)7818 522266 hpe.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 2541 bytes Desc: image001.png URL: From novosirj at rutgers.edu Mon Aug 21 19:07:03 2023 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Mon, 21 Aug 2023 18:07:03 +0000 Subject: [gpfsug-discuss] Joining RDMA over different networks? In-Reply-To: References: Message-ID: <9AE82616-B931-478A-92DB-0E484DB93B60@rutgers.edu> If I understand what you?re asking correctly, we used to have a cluster that did this. GPFS was on Infininiband, some of the compute nodes were too, and the rest were on Omnipath. There were routers in between with both types. Sent from my iPhone On Aug 21, 2023, at 13:55, Kidger, Daniel wrote: ? I know in the Lustre world that LNET routers are used to provide RDMA over heterogeneous networks. Is there an equivalent for Storage Scale? eg if an ESS uses Infiniband to connect directly to Cluster A, could that InfiniBand RDMA fabric be ?routed? to ClusterB that has RoCE connecting all its nodes together and hence the filesystem mounted? ps. The same question would apply to other usually incompatible RDMA networks like Omnipath, Slingshot, Cornelis, ? ? 
Daniel Daniel Kidger HPC Storage Solutions Architect, EMEA daniel.kidger at hpe.com +44 (0)7818 522266 hpe.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 2541 bytes Desc: image001.png URL: From daniel.kidger at hpe.com Mon Aug 21 19:43:03 2023 From: daniel.kidger at hpe.com (Kidger, Daniel) Date: Mon, 21 Aug 2023 18:43:03 +0000 Subject: [gpfsug-discuss] Joining RDMA over different networks? In-Reply-To: <9AE82616-B931-478A-92DB-0E484DB93B60@rutgers.edu> References: <9AE82616-B931-478A-92DB-0E484DB93B60@rutgers.edu> Message-ID: Ryan, This sounds very interesting. Do you have more details or references of how they connected together, and what any pain points were? Daniel From: gpfsug-discuss On Behalf Of Ryan Novosielski Sent: 21 August 2023 19:07 To: gpfsug main discussion list Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Joining RDMA over different networks? If I understand what you?re asking correctly, we used to have a cluster that did this. GPFS was on Infininiband, some of the compute nodes were too, and the rest were on Omnipath. There were routers in between with both types. Sent from my iPhone On Aug 21, 2023, at 13:55, Kidger, Daniel > wrote: ? I know in the Lustre world that LNET routers are used to provide RDMA over heterogeneous networks. Is there an equivalent for Storage Scale? eg if an ESS uses Infiniband to connect directly to Cluster A, could that InfiniBand RDMA fabric be ?routed? to ClusterB that has RoCE connecting all its nodes together and hence the filesystem mounted? ps. The same question would apply to other usually incompatible RDMA networks like Omnipath, Slingshot, Cornelis, ? ? Daniel Daniel Kidger HPC Storage Solutions Architect, EMEA daniel.kidger at hpe.com +44 (0)7818 522266 hpe.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.kidger at hpe.com Mon Aug 21 19:43:03 2023 From: daniel.kidger at hpe.com (Kidger, Daniel) Date: Mon, 21 Aug 2023 18:43:03 +0000 Subject: [gpfsug-discuss] Joining RDMA over different networks? In-Reply-To: <9AE82616-B931-478A-92DB-0E484DB93B60@rutgers.edu> References: <9AE82616-B931-478A-92DB-0E484DB93B60@rutgers.edu> Message-ID: Ryan, This sounds very interesting. Do you have more details or references of how they connected together, and what any pain points were? Daniel From: gpfsug-discuss On Behalf Of Ryan Novosielski Sent: 21 August 2023 19:07 To: gpfsug main discussion list Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Joining RDMA over different networks? If I understand what you?re asking correctly, we used to have a cluster that did this. GPFS was on Infininiband, some of the compute nodes were too, and the rest were on Omnipath. There were routers in between with both types. Sent from my iPhone On Aug 21, 2023, at 13:55, Kidger, Daniel > wrote: ? I know in the Lustre world that LNET routers are used to provide RDMA over heterogeneous networks. Is there an equivalent for Storage Scale? 
eg if an ESS uses Infiniband to connect directly to Cluster A, could that InfiniBand RDMA fabric be ?routed? to ClusterB that has RoCE connecting all its nodes together and hence the filesystem mounted? ps. The same question would apply to other usually incompatible RDMA networks like Omnipath, Slingshot, Cornelis, ? ? Daniel Daniel Kidger HPC Storage Solutions Architect, EMEA daniel.kidger at hpe.com +44 (0)7818 522266 hpe.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From ewahl at osc.edu Mon Aug 21 21:03:31 2023 From: ewahl at osc.edu (Wahl, Edward) Date: Mon, 21 Aug 2023 20:03:31 +0000 Subject: [gpfsug-discuss] Joining RDMA over different networks? In-Reply-To: References: Message-ID: I believe the new nVidia name for this type of product for IB->Ethernet is ?skyway?. Older types of this will surely get discussed on the list. Gateway Systems and Routers | NVIDIA Ed Wahl Ohio Supercomputer Center From: gpfsug-discuss On Behalf Of Kidger, Daniel Sent: Monday, August 21, 2023 1:50 PM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Joining RDMA over different networks? I know in the Lustre world that LNET routers are used to provide RDMA over heterogeneous networks. Is there an equivalent for Storage Scale? eg if an ESS uses Infiniband to connect directly to Cluster A, could that InfiniBand RDMA fabric be I know in the Lustre world that LNET routers are used to provide RDMA over heterogeneous networks. Is there an equivalent for Storage Scale? eg if an ESS uses Infiniband to connect directly to Cluster A, could that InfiniBand RDMA fabric be ?routed? to ClusterB that has RoCE connecting all its nodes together and hence the filesystem mounted? ps. The same question would apply to other usually incompatible RDMA networks like Omnipath, Slingshot, Cornelis, ? ? Daniel Daniel Kidger HPC Storage Solutions Architect, EMEA daniel.kidger at hpe.com +44 (0)7818 522266 hpe.com [cid:image001.png at 01D9D448.FE55CCF0] -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 2541 bytes Desc: image001.png URL: From ewahl at osc.edu Mon Aug 21 21:03:31 2023 From: ewahl at osc.edu (Wahl, Edward) Date: Mon, 21 Aug 2023 20:03:31 +0000 Subject: [gpfsug-discuss] Joining RDMA over different networks? In-Reply-To: References: Message-ID: I believe the new nVidia name for this type of product for IB->Ethernet is ?skyway?. Older types of this will surely get discussed on the list. Gateway Systems and Routers | NVIDIA Ed Wahl Ohio Supercomputer Center From: gpfsug-discuss On Behalf Of Kidger, Daniel Sent: Monday, August 21, 2023 1:50 PM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Joining RDMA over different networks? I know in the Lustre world that LNET routers are used to provide RDMA over heterogeneous networks. Is there an equivalent for Storage Scale? eg if an ESS uses Infiniband to connect directly to Cluster A, could that InfiniBand RDMA fabric be I know in the Lustre world that LNET routers are used to provide RDMA over heterogeneous networks. Is there an equivalent for Storage Scale? eg if an ESS uses Infiniband to connect directly to Cluster A, could that InfiniBand RDMA fabric be ?routed? 
to ClusterB that has RoCE connecting all its nodes together and hence the filesystem mounted? ps. The same question would apply to other usually incompatible RDMA networks like Omnipath, Slingshot, Cornelis, ? ? Daniel Daniel Kidger HPC Storage Solutions Architect, EMEA daniel.kidger at hpe.com +44 (0)7818 522266 hpe.com [cid:image001.png at 01D9D448.FE55CCF0] -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 2541 bytes Desc: image001.png URL: From novosirj at rutgers.edu Tue Aug 22 00:27:15 2023 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Mon, 21 Aug 2023 23:27:15 +0000 Subject: [gpfsug-discuss] Joining RDMA over different networks? In-Reply-To: References: <9AE82616-B931-478A-92DB-0E484DB93B60@rutgers.edu> Message-ID: <6555765F-3FA5-4EC6-B1D5-1F2E2E023541@rutgers.edu> I still have the guide from that system, and I saved some of the routing scripts and what not. But really, it wasn?t much more complicated than Ethernet routing. The routing nodes, I guess obviously, had both Omnipath and Infiniband interfaces. Compute knifes themselves I believe used a supervisord script, if I?m remembering that name right, to try to balance out which routing nide ione would use as a gateway. There were two as it was configured when I got to it, but a larger number was possible. It seems to me that there was probably a better way to do that, but it did work. The read/write rates were not as fast as our fully Inifniband clusters, but they were fast enough. The cluster was Caliburn, which was in the top 500 for some time, so there may be some papers and whatnot written on it before we inherited it. If there?s something specific you want to know, I could probably dig it up. Sent from my iPhone On Aug 21, 2023, at 14:48, Kidger, Daniel wrote: ? Ryan, This sounds very interesting. Do you have more details or references of how they connected together, and what any pain points were? Daniel From: gpfsug-discuss On Behalf Of Ryan Novosielski Sent: 21 August 2023 19:07 To: gpfsug main discussion list Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Joining RDMA over different networks? If I understand what you?re asking correctly, we used to have a cluster that did this. GPFS was on Infininiband, some of the compute nodes were too, and the rest were on Omnipath. There were routers in between with both types. Sent from my iPhone On Aug 21, 2023, at 13:55, Kidger, Daniel > wrote: ? I know in the Lustre world that LNET routers are used to provide RDMA over heterogeneous networks. Is there an equivalent for Storage Scale? eg if an ESS uses Infiniband to connect directly to Cluster A, could that InfiniBand RDMA fabric be ?routed? to ClusterB that has RoCE connecting all its nodes together and hence the filesystem mounted? ps. The same question would apply to other usually incompatible RDMA networks like Omnipath, Slingshot, Cornelis, ? ? 
Daniel Daniel Kidger HPC Storage Solutions Architect, EMEA daniel.kidger at hpe.com +44 (0)7818 522266 hpe.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org -------------- next part -------------- An HTML attachment was scrubbed... 
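For what it is worth, the "routing nodes" pattern described above boils down to ordinary IP routing of the daemon traffic between the two fabrics; a minimal sketch, with all interface names, subnets, and addresses purely hypothetical:

    # On a dual-homed routing node (one InfiniBand and one Omni-Path interface):
    sysctl -w net.ipv4.ip_forward=1

    # On an Omni-Path-only compute node: reach the InfiniBand data subnet
    # via one of the routing nodes (10.20.0.1 here).
    ip route add 10.10.0.0/16 via 10.20.0.1

Plain IP forwarding of this kind carries the TCP/IP path only; verbs RDMA does not cross the fabric boundary through it.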
URL: From jonathan.buzzard at strath.ac.uk Tue Aug 22 10:28:38 2023 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Tue, 22 Aug 2023 10:28:38 +0100 Subject: [gpfsug-discuss] Joining RDMA over different networks? In-Reply-To: <6555765F-3FA5-4EC6-B1D5-1F2E2E023541@rutgers.edu> References: <9AE82616-B931-478A-92DB-0E484DB93B60@rutgers.edu> <6555765F-3FA5-4EC6-B1D5-1F2E2E023541@rutgers.edu> Message-ID: <7b5a4cd7-943f-e67e-95bc-735c0bfefe3a@strath.ac.uk> On 22/08/2023 00:27, Ryan Novosielski wrote: > I still have the guide from that system, and I saved some of the routing > scripts and what not. But really, it wasn?t much more complicated than > Ethernet routing. > > The routing nodes, I guess obviously, had both Omnipath and Infiniband > interfaces. Compute knifes themselves I believe used a supervisord > script, if I?m remembering that name right, to try to balance out which > routing nide ione would use as a gateway. There were two as it was > configured when I got to it, but a larger number was possible. > Having done it in a limited fashion previously I would recommend that you have two routing nodes and use keepalived on at least the Ethernet side with VRRP to try and maintain some redundancy. Otherwise you get in a situation where you are entirely dependent on a single node which you can't reboot without a GPFS shutdown. Cyber security makes that an untenable position these days. In our situation our DSS-G nodes where both Ethernet and Infiniband connected and we had a bunch of nodes that where using Infiniband for the data traffic and Ethernet for the management interface at 1Gbps. Everything else was on 10Gbps or better Ethernet. We therefore needed the Ethernet only connected nodes to be able to talk to the Infiniband connected nodes data interface. Due to the way routing works on Linux when the Infiniband nodes attempted to connect to the Ethernet connected only nodes it went via the 1Gbps Ethernet interface. So after a while and issues with a single gateway machine we switched to making it redundant. Basically the Ethernet only connected nodes had a custom route to reach the Infiniband network, and the DSS-G nodes where doing the forwarding and then had keepalived running VRRP to move the IP address around on the Ethernet side so there was redundancy in the gateway. The amount of traffic transiting the gateway was actually tiny because all the filesystem data was coming from the DSS-G nodes that were Infiniband connected :-) I have no idea if you can do the equivalent of VRRP on Infiniband and Omnipath. In the end the Infiniband nodes (a bunch of C6220's used to support undergraduate/MSc projects and classes) had to be upgraded to 10Gbps Ethernet as RedHat dropped support for the Intel Truescale Infiniband adapters in RHEL8. We don't let the student's run multinode jobs anyway so the loss of the Infiniband was not an issue. Though with the enforced move away from RHEL means we will get it back JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From daniel.kidger at hpe.com Tue Aug 22 10:51:07 2023 From: daniel.kidger at hpe.com (Kidger, Daniel) Date: Tue, 22 Aug 2023 09:51:07 +0000 Subject: [gpfsug-discuss] Joining RDMA over different networks? 
In-Reply-To: <7b5a4cd7-943f-e67e-95bc-735c0bfefe3a@strath.ac.uk> References: <9AE82616-B931-478A-92DB-0E484DB93B60@rutgers.edu> <6555765F-3FA5-4EC6-B1D5-1F2E2E023541@rutgers.edu> <7b5a4cd7-943f-e67e-95bc-735c0bfefe3a@strath.ac.uk> Message-ID: Jonathan, Thank you for the great answer! Just to be clear though - are you talking about TCP/IP mounting of the filesystem(s) rather than RDMA ? I think routing of RDMA is perhaps something only Lustre can do? Daniel Daniel Kidger HPC Storage Solutions Architect, EMEA daniel.kidger at hpe.com +44 (0)7818 522266?? hpe.com -----Original Message----- From: gpfsug-discuss On Behalf Of Jonathan Buzzard Sent: 22 August 2023 10:29 To: gpfsug-discuss at gpfsug.org Subject: Re: [gpfsug-discuss] Joining RDMA over different networks? On 22/08/2023 00:27, Ryan Novosielski wrote: > I still have the guide from that system, and I saved some of the > routing scripts and what not. But really, it wasn?t much more > complicated than Ethernet routing. > > The routing nodes, I guess obviously, had both Omnipath and Infiniband > interfaces. Compute knifes themselves I believe used a supervisord > script, if I?m remembering that name right, to try to balance out > which routing nide ione would use as a gateway. There were two as it > was configured when I got to it, but a larger number was possible. > Having done it in a limited fashion previously I would recommend that you have two routing nodes and use keepalived on at least the Ethernet side with VRRP to try and maintain some redundancy. Otherwise you get in a situation where you are entirely dependent on a single node which you can't reboot without a GPFS shutdown. Cyber security makes that an untenable position these days. In our situation our DSS-G nodes where both Ethernet and Infiniband connected and we had a bunch of nodes that where using Infiniband for the data traffic and Ethernet for the management interface at 1Gbps. Everything else was on 10Gbps or better Ethernet. We therefore needed the Ethernet only connected nodes to be able to talk to the Infiniband connected nodes data interface. Due to the way routing works on Linux when the Infiniband nodes attempted to connect to the Ethernet connected only nodes it went via the 1Gbps Ethernet interface. So after a while and issues with a single gateway machine we switched to making it redundant. Basically the Ethernet only connected nodes had a custom route to reach the Infiniband network, and the DSS-G nodes where doing the forwarding and then had keepalived running VRRP to move the IP address around on the Ethernet side so there was redundancy in the gateway. The amount of traffic transiting the gateway was actually tiny because all the filesystem data was coming from the DSS-G nodes that were Infiniband connected :-) I have no idea if you can do the equivalent of VRRP on Infiniband and Omnipath. In the end the Infiniband nodes (a bunch of C6220's used to support undergraduate/MSc projects and classes) had to be upgraded to 10Gbps Ethernet as RedHat dropped support for the Intel Truescale Infiniband adapters in RHEL8. We don't let the student's run multinode jobs anyway so the loss of the Infiniband was not an issue. Though with the enforced move away from RHEL means we will get it back JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. 
G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org From jonathan.buzzard at strath.ac.uk Tue Aug 22 11:20:38 2023 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Tue, 22 Aug 2023 11:20:38 +0100 Subject: [gpfsug-discuss] Joining RDMA over different networks? In-Reply-To: References: <9AE82616-B931-478A-92DB-0E484DB93B60@rutgers.edu> <6555765F-3FA5-4EC6-B1D5-1F2E2E023541@rutgers.edu> <7b5a4cd7-943f-e67e-95bc-735c0bfefe3a@strath.ac.uk> Message-ID: On 22/08/2023 10:51, Kidger, Daniel wrote: > > Jonathan, > > Thank you for the great answer! > Just to be clear though - are you talking about TCP/IP mounting of the filesystem(s) rather than RDMA ? > Yes for a few reasons. Firstly a bunch of our Ethernet adaptors don't support RDMA. Second there a lot of ducks to be got in line and kept in line for RDMA to work and that's too much effort IMHO. Thirdly the nodes can peg the 10Gbps interface they have which is a hard QOS that we are happy with. Though if specifying today we would have 25Gbps to the compute nodes and 100 possibly 200Gbps on the DSS-G nodes. Basically we don't want one node to go nuts and monopolize the file system :-) The DSS-G nodes don't have an issue keeping up so I am not sure there is much performance benefit from RDMA to be had. That said you are supposed to be able to do IPoIB over the RDMA hardware's network, and I had presumed that the same could be said of TCP/IP over RDMA on Ethernet. > I think routing of RDMA is perhaps something only Lustre can do? > Possibly, something else is that we have our DSS-G nodes doing MLAG's over a pair of switches. I need to be able to do firmware updates on the network switches the DSS-G nodes are connected to without shutting down the cluster. I don't think you can do that with RDMA reading the switch manuals so another reason not to do it IMHO. In the 2020's the mantra is patch baby patch and everything is focused on making that quick and easy to achieve. Your expensive HPC system is for jack if hackers have taken it over because you didn't path it in a timely fashion. Also I would have a *lot* of explaining to do which I would rather not. Also in our experience storage is rarely the bottle neck and when it is aka Gromacs is creating a ~1TB temp file at 10Gbps (yeah that's a real thing we have observed on a fairly regular basis) that's an intended QOS so everyone else can get work done and I don't get a bunch of tickets from users complaining about the file system performing badly. We have seen enough simultaneous Gromacs that without the 10Gbps hard QOS the filesystem would have been brought to it's knees. We can't do the temp files locally on the node because we only spec'ed them with 1TB local disks and the Gromacs temp files regularly exceed the available local space. Also getting users to do it would be a nightmare :-) JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From anacreo at gmail.com Tue Aug 22 11:52:29 2023 From: anacreo at gmail.com (Alec) Date: Tue, 22 Aug 2023 03:52:29 -0700 Subject: [gpfsug-discuss] Joining RDMA over different networks? 
In-Reply-To: References: <9AE82616-B931-478A-92DB-0E484DB93B60@rutgers.edu> <6555765F-3FA5-4EC6-B1D5-1F2E2E023541@rutgers.edu> <7b5a4cd7-943f-e67e-95bc-735c0bfefe3a@strath.ac.uk> Message-ID: I wouldn't want to use GPFS if I didn't want my nodes to be able to go nuts, why bother to be frank. I had tested a configuration with a single x86 box and 4 x 100Gbe adapters talking to an ESS, that thing did amazing performance in excess of 25 GB/s over Ethernet. If you have a node that needs that performance build to it. Spend more time configuring QoS to fair share your bandwidth than baking bottlenecks into your configuration. The reasoning of holding end nodes to a smaller bandwidth than the backend doesn't make sense. You want to clear "the work" as efficiently as possible, more than keep IT from having any constraints popping up. That's what leads to just endless dithering and diluting of infrastructure until no one can figure out how to get real performance. So yeah 95% of the workloads don't care about their performance and can live on dithered and diluted infrastructure that costs a zillion times more money than what the 5% of workload that does care about bandwidth needs to spend to actually deliver. Build your infrastructure storage as high bandwidth as possible per node because compared to all the other costs it's a drop in the bucket... Don't cheap out on "cables". The real joke is the masses are running what big iron pioneered, can't even fathom how much workload that last 5% of the data center is doing, and then trying to dictate how to "engineer" platforms by not engineering. Just god help you if you have a SharePoint list with 5000+ entries, you'll likely break the internets with that high volume workload. Alec On Tue, Aug 22, 2023, 3:23 AM Jonathan Buzzard < jonathan.buzzard at strath.ac.uk> wrote: > On 22/08/2023 10:51, Kidger, Daniel wrote: > > > > Jonathan, > > > > Thank you for the great answer! > > Just to be clear though - are you talking about TCP/IP mounting of the > filesystem(s) rather than RDMA ? > > > > Yes for a few reasons. Firstly a bunch of our Ethernet adaptors don't > support RDMA. Second there a lot of ducks to be got in line and kept in > line for RDMA to work and that's too much effort IMHO. Thirdly the nodes > can peg the 10Gbps interface they have which is a hard QOS that we are > happy with. Though if specifying today we would have 25Gbps to the > compute nodes and 100 possibly 200Gbps on the DSS-G nodes. Basically we > don't want one node to go nuts and monopolize the file system :-) The > DSS-G nodes don't have an issue keeping up so I am not sure there is > much performance benefit from RDMA to be had. > > That said you are supposed to be able to do IPoIB over the RDMA > hardware's network, and I had presumed that the same could be said of > TCP/IP over RDMA on Ethernet. > > > I think routing of RDMA is perhaps something only Lustre can do? > > > > Possibly, something else is that we have our DSS-G nodes doing MLAG's > over a pair of switches. I need to be able to do firmware updates on the > network switches the DSS-G nodes are connected to without shutting down > the cluster. I don't think you can do that with RDMA reading the switch > manuals so another reason not to do it IMHO. In the 2020's the mantra is > patch baby patch and everything is focused on making that quick and easy > to achieve. Your expensive HPC system is for jack if hackers have taken > it over because you didn't path it in a timely fashion. 
Also I would > have a *lot* of explaining to do which I would rather not. > > Also in our experience storage is rarely the bottle neck and when it is > aka Gromacs is creating a ~1TB temp file at 10Gbps (yeah that's a real > thing we have observed on a fairly regular basis) that's an intended QOS > so everyone else can get work done and I don't get a bunch of tickets > from users complaining about the file system performing badly. We have > seen enough simultaneous Gromacs that without the 10Gbps hard QOS the > filesystem would have been brought to it's knees. > > We can't do the temp files locally on the node because we only spec'ed > them with 1TB local disks and the Gromacs temp files regularly exceed > the available local space. Also getting users to do it would be a > nightmare :-) > > > JAB. > > -- > Jonathan A. Buzzard Tel: +44141-5483420 > HPC System Administrator, ARCHIE-WeSt. > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org >
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From Walter.Sklenka at EDV-Design.at  Thu Aug 24 12:37:45 2023
From: Walter.Sklenka at EDV-Design.at (Walter Sklenka)
Date: Thu, 24 Aug 2023 11:37:45 +0000
Subject: [gpfsug-discuss] FW: ESS 3500-C5 : rg has resigned permanently
In-Reply-To: <7204a493499543e1a5a9fa0fa8bf41bb@Mail.EDVDesign.cloudia>
References: <7204a493499543e1a5a9fa0fa8bf41bb@Mail.EDVDesign.cloudia>
Message-ID: <87c3706ea7fa410295dae9a24dd38db6@Mail.EDVDesign.cloudia>

Mit freundlichen Grüßen
Walter Sklenka
Technical Consultant

EDV-Design Informationstechnologie GmbH
Giefinggasse 6/1/2, A-1210 Wien
Tel: +43 1 29 22 165-31
Fax: +43 1 29 22 165-90
E-Mail: sklenka at edv-design.at
Internet: www.edv-design.at

From: Walter Sklenka
Sent: Donnerstag, 24. August 2023 12:02
To: 'gpfsug-discuss-request at gpfsug.org'
Subject: FW: ESS 3500-C5 : rg has resigned permanently

Hi!
Does someone eventually have experience with ESS 3500 (no hybrid config, only NLSAS with 5 enclosures)?

We have issues with a shared recoverygroup. After creating it we made a test of setting only one node active (maybe not an optimal idea).
But since then the recoverygroup is down.
We have created a PMR but do not get any response until now.

The rg has no vdisks of any filesystem

[gpfsadmin at hgess02-m ~]$ ^C
[gpfsadmin at hgess02-m ~]$ sudo mmvdisk rg change --rg ess3500_hgess02_n1_hs_hgess02_n2_hs --restart
mmvdisk:
mmvdisk:
mmvdisk: Unable to reset server list for recovery group 'ess3500_hgess02_n1_hs_hgess02_n2_hs'.
mmvdisk: Command failed. Examine previous error messages to determine cause.

We also tried
2023-08-21_16:57:26.174+0200: [I] Command: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l root hgess02-n2-hs.invalid
2023-08-21_16:57:26.201+0200: [I] Recovery group ess3500_hgess02_n1_hs_hgess02_n2_hs has resigned permanently
2023-08-21_16:57:26.201+0200: [E] Command: err 2: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l root hgess02-n2-hs.invalid
2023-08-21_16:57:26.201+0200: Specified entity, such as a disk or file system, does not exist.
2023-08-21_16:57:26.207+0200: [I] Command: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG001 hgess02-n2-hs.invalid.
2023-08-21_16:57:26.207+0200: [E] Command: err 212: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG001 hgess02-n2-hs.invalid
2023-08-21_16:57:26.207+0200: The current file system manager failed and no new manager will be appointed. This may cause nodes mounting the file system to experience mount failures.
2023-08-21_16:57:26.213+0200: [I] Command: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG002 hgess02-n2-hs.invalid
2023-08-21_16:57:26.213+0200: [E] Command: err 212: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG002 hgess02-n2-hs.invalid
2023-08-21_16:57:26.213+0200: The current file system manager failed and no new manager will be appointed. This may cause nodes mounting the file system to experience mount failures.

For us it is crucial to know what we can do if this happens again (it has no vdisks yet, so it is not critical).

Do you know: is there a non-documented way to "vary on", or activate a recoverygroup again?
The doc:
https://www.ibm.com/docs/en/ess/6.1.6_lts?topic=rgi-recovery-group-issues-shared-recovery-groups-in-ess
tells to mmshutdown and mmstartup, but the RGCM does say nothing.
When trying to execute any vdisk command it only says "rg down", no idea how we could recover from that without deleting the rg (I hope it will never happen when we have vdisks on it).

Have a nice day
Walter

Mit freundlichen Grüßen
Walter Sklenka
Technical Consultant

EDV-Design Informationstechnologie GmbH
Giefinggasse 6/1/2, A-1210 Wien
Tel: +43 1 29 22 165-31
Fax: +43 1 29 22 165-90
E-Mail: sklenka at edv-design.at
Internet: www.edv-design.at
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From janfrode at tanso.net  Thu Aug 24 13:50:36 2023
From: janfrode at tanso.net (Jan-Frode Myklebust)
Date: Thu, 24 Aug 2023 14:50:36 +0200
Subject: [gpfsug-discuss] FW: ESS 3500-C5 : rg has resigned permanently
In-Reply-To: <87c3706ea7fa410295dae9a24dd38db6@Mail.EDVDesign.cloudia>
References: <7204a493499543e1a5a9fa0fa8bf41bb@Mail.EDVDesign.cloudia> <87c3706ea7fa410295dae9a24dd38db6@Mail.EDVDesign.cloudia>
Message-ID: 

It does sound like "mmvdisk rg change --restart" is the "varyon" command you're looking for.. but it's not clear why it's failing. I would start by looking at if there are any lower level issues with your cluster. Are your nodes healthy on a GPFS-level? "mmnetverify -N all" says network is OK ? "mmhealth node show -N all" not indicating any issues ? Check mmfs.log.latest ?

On Thu, Aug 24, 2023 at 1:41 PM Walter Sklenka wrote:

> Hi !
>
> Does someone eventually have experience with ESS 3500 ( no hybrid config,
> only NLSAS with 5 enclosures )
>
> We have issues with a shared recoverygroup. After creating it we made a
> test of setting only one node active (mybe not an optimal idea)
>
> But since then the recoverygroup is down
>
> We have created a PMR but do not get any response until now.
> > > > The rg has no vdisks of any filesystem > > [gpfsadmin at hgess02-m ~]$ ^C > [gpfsadmin at hgess02-m ~]$ sudo mmvdisk rg change --rg > ess3500_hgess02_n1_hs_hgess02_n2_hs --restart > mmvdisk: > mmvdisk: > mmvdisk: Unable to reset server list for recovery group > 'ess3500_hgess02_n1_hs_hgess02_n2_hs'. > mmvdisk: Command failed. Examine previous error messages to determine > cause. > > > > We also tried > > 2023-08-21_16:57:26.174+0200: [I] Command: tsrecgroupserver > ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l root hgess02-n2-hs.invalid > 2023-08-21_16:57:26.201+0200: [I] Recovery group > ess3500_hgess02_n1_hs_hgess02_n2_hs has resigned permanently > 2023-08-21_16:57:26.201+0200: [E] Command: err 2: tsrecgroupserver > ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l root hgess02-n2-hs.invalid > 2023-08-21_16:57:26.201+0200: Specified entity, such as a disk or file > system, does not exist. > 2023-08-21_16:57:26.207+0200: [I] Command: tsrecgroupserver > ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG001 hgess02-n2-hs.invalid. > 2023-08-21_16:57:26.207+0200: [E] Command: err 212: tsrecgroupserver > ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG001 hgess02-n2-hs.invalid > 2023-08-21_16:57:26.207+0200: The current file system manager failed and > no new manager will be appointed. This may cause nodes mounting the file > system to experience mount failures. > 2023-08-21_16:57:26.213+0200: [I] Command: tsrecgroupserver > ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG002 hgess02-n2-hs.invalid > 2023-08-21_16:57:26.213+0200: [E] Command: err 212: tsrecgroupserver > ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG002 hgess02-n2-hs.invalid > 2023-08-21_16:57:26.213+0200: The current file system manager failed and > no new manager will be appointed. This may cause nodes mounting the file > system to experience mount failures. > > > > > > For us it is crucial to know what we can do if theis happens again ( it > has no vdisks yet so it is not critical ). > > > > Do you know: is there a non documented way to ?vary on?, or activate a > recoverygroup again? > > The doc : > > > https://www.ibm.com/docs/en/ess/6.1.6_lts?topic=rgi-recovery-group-issues-shared-recovery-groups-in-ess > > tells to mmshutdown and mmstartup, but the RGCM does say nothing > > When trying to execute any vdisk command it only says ?rg down?, no idea > how we could recover from that without deleting the rg ( I hope it will > never happen, when we have vdisks on it > > > > > > > > Have a nice day > > Walter > > > > > > > > > > Mit freundlichen Gr??en > *Walter Sklenka* > *Technical Consultant* > > > > EDV-Design Informationstechnologie GmbH > Giefinggasse 6/1/2, A-1210 Wien > Tel: +43 1 29 22 165-31 > Fax: +43 1 29 22 165-90 > E-Mail: sklenka at edv-design.at > Internet: www.edv-design.at > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Thu Aug 24 16:26:34 2023 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Thu, 24 Aug 2023 16:26:34 +0100 Subject: [gpfsug-discuss] Joining RDMA over different networks? 
In-Reply-To: 
References: <9AE82616-B931-478A-92DB-0E484DB93B60@rutgers.edu> <6555765F-3FA5-4EC6-B1D5-1F2E2E023541@rutgers.edu> <7b5a4cd7-943f-e67e-95bc-735c0bfefe3a@strath.ac.uk>
Message-ID: 

On 22/08/2023 11:52, Alec wrote:

> I wouldn't want to use GPFS if I didn't want my nodes to be able to go
> nuts, why bother to be frank.
>

Because there are multiple users of the system. Do you want to be the one explaining to 50 other users that they can't use the system today because John from Chemistry is pounding the filesystem to death for his jobs? Didn't think so.

There is not an infinite amount of money available and it is not possible with a reasonable amount of money to make a file system that all the nodes can max out their network connection at once.

> I had tested a configuration with a single x86 box and 4 x 100Gbe
> adapters talking to an ESS, that thing did amazing performance in excess
> of 25 GB/s over Ethernet. If you have a node that needs that
> performance build to it. Spend more time configuring QoS to fair share
> your bandwidth than baking bottlenecks into your configuration.
>

There are finite budgets and compromises have to be made. The compromises we made back in 2017 when the specification was written and put out to tender have held up really well.

> The reasoning of holding end nodes to a smaller bandwidth than the
> backend doesn't make sense. You want to clear "the work" as efficiently
> as possible, more than keep IT from having any constraints popping up.
> That's what leads to just endless dithering and diluting of
> infrastructure until no one can figure out how to get real performance.
>

It does because a small number of jobs can hold the system to ransom for lots of other users. I have to balance things across a large number of nodes. There is only a finite amount of bandwidth to the storage and it has to be shared out fairly. I could attempt to do it with QOS on the switches, or I could go "sod that for a lark", 10Gbps is all you get, and let's keep it simple. Though like I said today it would be 25Gbps, but this was a specification written six years ago when 25Gbps Ethernet was rather exotic and too expensive.

> So yeah 95% of the workloads don't care about their performance and can
> live on dithered and diluted infrastructure that costs a zillion times
> more money than what the 5% of workload that does care about bandwidth
> needs to spend to actually deliver.
>

They do care about performance, they just don't need to max out the allotted performance per node. However if performance of the file system is bad the performance of their jobs will also be bad and the total FLOPS I get from the system will plummet through the floor.

Note it is more like 0.1% of jobs that peg the 10Gbps network interface for any period of time at all.

> Build your infrastructure storage as high bandwidth as possible per node
> because compared to all the other costs it's a drop in the bucket...
> Don't cheap out on "cables".

No it's not. The Omnipath network (which by the way is reserved deliberately for MPI) cost a *LOT* of money. We are having serious conversations that with current core counts per node an Infiniband/Omnipath network doesn't make sense any more, and that 25Gbps Ethernet will do just fine for a standard compute node.

Around 85% of our jobs run on 40 cores (aka one node) or less. If you go to 128 cores a node it's more like 95% of all jobs. If you go to 192 cores it's about 98% of all jobs. The maximum job size we allow currently is 400 cores.
Better to ditch the expensive interconnect and use the hundreds of thousands of dollars saved to buy more compute nodes is the current thinking. The 2% of users can just have longer runtimes, but hey, there will be a lot more FLOPS available in total and they rarely have just one job in the queue, so it will all balance out in the wash and be positive for most users.

In consultation the users are on board with this direction of travel. From our perspective, if a user absolutely needs more than 192 cores on a modern system it would not be unreasonable to direct them to a national facility that can handle the really huge jobs. We are an institutional HPC facility after all. We don't claim to be able to handle a 1000 core job for example.

JAB.

-- 
Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG

From anacreo at gmail.com  Thu Aug 24 17:46:04 2023
From: anacreo at gmail.com (Alec)
Date: Thu, 24 Aug 2023 09:46:04 -0700
Subject: [gpfsug-discuss] Joining RDMA over different networks?
In-Reply-To: 
References: <9AE82616-B931-478A-92DB-0E484DB93B60@rutgers.edu> <6555765F-3FA5-4EC6-B1D5-1F2E2E023541@rutgers.edu> <7b5a4cd7-943f-e67e-95bc-735c0bfefe3a@strath.ac.uk>
Message-ID: 

So why not use the built-in QoS features of Spectrum Scale to adjust the performance of a particular fileset, that way you can ensure you have appropriate bandwidth?
https://www.ibm.com/docs/en/storage-scale/5.1.1?topic=reference-mmqos-command

What you're saying is that you don't want to build a system to meet John's demands because you're worried about Tom not having bandwidth for his process. When in fact there is a way to guarantee a minimum quality of service for every user and still allow the system to perform exceptionally well for those that need / want it. You can also set hard caps if you want. I haven't tested it but you should also be able to set a maxbps for a node so that it won't exceed a certain limit if you really need to.

Not sure if you're using LSF, but you can even tie LSF queues to Spectrum Scale QOS. I didn't really try it but thought that has some great possibilities.

I would say don't hurt John to keep Tom happy.. make both of them happy. In this scenario you don't have to intimately know the CPU vs IO characteristics of a job. You just need to know that reserving 1GB/s of I/O per filesystem is fair, and letting jobs consume max I/O when available is efficient. In Linux you have other mechanisms such as cgroups to refine workload distribution within the node.

Another way to think about it is that in a system that is trying to get work done any unused capacity is costing someone somewhere something. At the same time if a system can't perform reliably and predictably that is a problem, but QOS is there to solve that problem.
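For what it's worth, a minimal sketch of the long-standing coarse QoS knobs, untested here; the filesystem name "gpfs0" and the IOPS number are placeholders, and the per-fileset / user-defined classes are what the mmqos page linked above adds on top of this:

# enable QoS on the filesystem: throttle the "maintenance" class on all pools,
# leave normal user I/O (the "other" class) unlimited
mmchqos gpfs0 --enable pool=*,maintenance=10000IOPS,other=unlimited

# watch what each class is actually consuming
mmlsqos gpfs0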
Alec

On Thu, Aug 24, 2023, 8:28 AM Jonathan Buzzard < jonathan.buzzard at strath.ac.uk> wrote:

> On 22/08/2023 11:52, Alec wrote:
>
> > I wouldn't want to use GPFS if I didn't want my nodes to be able to go
> > nuts, why bother to be frank.
> >
>
> Because there are multiple users to the system. Do you want to be the
> one explaining to 50 other users that they can't use the system today
> because John from Chemistry is pounding the filesystem to death for his
> jobs? Didn't think so.
>
> There is not an infinite amount of money available and it is not
> possible with a reasonable amount of money to make a file system that
> all the nodes can max out their network connection at once.
>
> > I had tested a configuration with a single x86 box and 4 x 100Gbe
> > adapters talking to an ESS, that thing did amazing performance in excess
> > of 25 GB/s over Ethernet. If you have a node that needs that
> > performance build to it. Spend more time configuring QoS to fair share
> > your bandwidth than baking bottlenecks into your configuration.
> >
>
> There are finite budgets and compromises have to be made. The
> compromises we made back in 2017 when the specification was written and
> put out to tender have held up really well.
>
> > The reasoning of holding end nodes to a smaller bandwidth than the
> > backend doesn't make sense. You want to clear "the work" as efficiently
> > as possible, more than keep IT from having any constraints popping up.
> > That's what leads to just endless dithering and diluting of
> > infrastructure until no one can figure out how to get real performance.
> >
>
> It does because a small number of jobs can hold the system to ransom for
> lots of other users. I have to balance things across a large number of
> nodes. There is only a finite amount of bandwidth to the storage and it
> has to be shared out fairly. I could attempt to do it with QOS on the
> switches or I could go sod that for a lark 10Gbps is all you get and
> lets keep it simple. Though like I said today it would be 25Gbps, but
> this was a specification written six years ago when 25Gbps Ethernet was
> rather exotic and too expensive.
>
> > So yeah 95% of the workloads don't care about their performance and can
> > live on dithered and diluted infrastructure that costs a zillion times
> > more money than what the 5% of workload that does care about bandwidth
> > needs to spend to actually deliver.
> >
>
> They do care about performance, they just don't need to max out the
> allotted performance per node. However if performance of the file system
> is bad the performance of their jobs will also be bad and the total
> FLOPS I get from the system will plummet through the floor.
>
> Note it is more like 0.1% of jobs that peg the 10Gbps network interface
> for any period of time at all.
>
> > Build your infrastructure storage as high bandwidth as possible per node
> > because compared to all the other costs it's a drop in the bucket...
> > Don't cheap out on "cables".
>
> No it's not. The Omnipath network (which by the way is reserved
> deliberately for MPI) cost a *LOT* of money. We are having serious
> conversations that with current core counts per node an
> Infiniband/Omnipath network doesn't make sense any more, and that 25Gbps
> Ethernet will do just fine for a standard compute node.
>
> Around 85% of our jobs run on 40 cores (aka one node) or less. If you go
> to 128 cores a node it's more like 95% of all jobs. If you go to 192
> cores it's about 98% of all jobs. The maximum job size we allow
> currently is 400 cores.
> We are an institutional HPC facility after all. We don't claim to be able
> to handle a 1000 core job for example.
>
> JAB.
>
> --
> Jonathan A. Buzzard                         Tel: +44141-5483420
> HPC System Administrator, ARCHIE-WeSt.
> University of Strathclyde, John Anderson Building, Glasgow. G4 0NG
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at gpfsug.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From Walter.Sklenka at EDV-Design.at  Thu Aug 24 19:18:30 2023
From: Walter.Sklenka at EDV-Design.at (Walter Sklenka)
Date: Thu, 24 Aug 2023 18:18:30 +0000
Subject: [gpfsug-discuss] FW: ESS 3500-C5 : rg has resigned permanently
In-Reply-To: 
References: <7204a493499543e1a5a9fa0fa8bf41bb@Mail.EDVDesign.cloudia> <87c3706ea7fa410295dae9a24dd38db6@Mail.EDVDesign.cloudia>
Message-ID: 

Hi Jan-Frode!
We did the "switch" with mmvdisk rg change --rg ess3500_ess_n1_hs_ess_n2_hs --active ess-n2-hs.
Both nodes were up and we did not see any anomalies. And the rg was successfully created with the log groups.
Maybe the method to switch the rg (with --active) is a bad idea? (Because the manual says:
https://www.ibm.com/docs/en/ess/6.1.6_lts?topic=command-mmvdisk-recoverygroup

"For a shared recovery group, the mmvdisk recoverygroup change --active Node command means to make the specified node the server for all four user log groups and the root log group. The specified node therefore temporarily becomes the sole active server for the entire shared recovery group, leaving the other server idle. This should only be done in unusual maintenance situations, since it is normally considered an error condition for one of the servers of a shared recovery group to be idle. If the keyword DEFAULT is used instead of a server name, it restores the normal default balance of log groups, making each of the two servers responsible for two user log groups.")
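So if I read the manual right, the switch and the way back would be roughly this (an untested sketch, with our own recovery group and node names):

# move all log groups to one server, e.g. for maintenance
sudo mmvdisk rg change --rg ess3500_ess_n1_hs_ess_n2_hs --active ess-n2-hs

# afterwards restore the normal balance of two log groups per server
sudo mmvdisk rg change --rg ess3500_ess_n1_hs_ess_n2_hs --active DEFAULT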
August 2023 14:51 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] FW: ESS 3500-C5 : rg has resigned permanently It does sound like "mmvdisk rg change --restart" is the "varyon" command you're looking for.. but it's not clear why it's failing. I would start by looking at if there are any lower level issues with your cluster. Are your nodes healthy on a GPFS-level? "mmnetverify -N all" says network is OK ? "mmhealth node show -N all" not indicating any issues ? Check mmfs.log.latest ? On Thu, Aug 24, 2023 at 1:41?PM Walter Sklenka > wrote: Hi ! Does someone eventually have experience with ESS 3500 ( no hybrid config, only NLSAS with 5 enclosures ) We have issues with a shared recoverygroup. After creating it we made a test of setting only one node active (mybe not an optimal idea) But since then the recoverygroup is down We have created a PMR but do not get any response until now. The rg has no vdisks of any filesystem [gpfsadmin at hgess02-m ~]$ ^C [gpfsadmin at hgess02-m ~]$ sudo mmvdisk rg change --rg ess3500_hgess02_n1_hs_hgess02_n2_hs --restart mmvdisk: mmvdisk: mmvdisk: Unable to reset server list for recovery group 'ess3500_hgess02_n1_hs_hgess02_n2_hs'. mmvdisk: Command failed. Examine previous error messages to determine cause. We also tried 2023-08-21_16:57:26.174+0200: [I] Command: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l root hgess02-n2-hs.invalid 2023-08-21_16:57:26.201+0200: [I] Recovery group ess3500_hgess02_n1_hs_hgess02_n2_hs has resigned permanently 2023-08-21_16:57:26.201+0200: [E] Command: err 2: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l root hgess02-n2-hs.invalid 2023-08-21_16:57:26.201+0200: Specified entity, such as a disk or file system, does not exist. 2023-08-21_16:57:26.207+0200: [I] Command: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG001 hgess02-n2-hs.invalid. 2023-08-21_16:57:26.207+0200: [E] Command: err 212: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG001 hgess02-n2-hs.invalid 2023-08-21_16:57:26.207+0200: The current file system manager failed and no new manager will be appointed. This may cause nodes mounting the file system to experience mount failures. 2023-08-21_16:57:26.213+0200: [I] Command: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG002 hgess02-n2-hs.invalid 2023-08-21_16:57:26.213+0200: [E] Command: err 212: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG002 hgess02-n2-hs.invalid 2023-08-21_16:57:26.213+0200: The current file system manager failed and no new manager will be appointed. This may cause nodes mounting the file system to experience mount failures. For us it is crucial to know what we can do if theis happens again ( it has no vdisks yet so it is not critical ). Do you know: is there a non documented way to ?vary on?, or activate a recoverygroup again? 
The doc : https://www.ibm.com/docs/en/ess/6.1.6_lts?topic=rgi-recovery-group-issues-shared-recovery-groups-in-ess tells to mmshutdown and mmstartup, but the RGCM does say nothing When trying to execute any vdisk command it only says ?rg down?, no idea how we could recover from that without deleting the rg ( I hope it will never happen, when we have vdisks on it Have a nice day Walter Mit freundlichen Gr??en Walter Sklenka Technical Consultant EDV-Design Informationstechnologie GmbH Giefinggasse 6/1/2, A-1210 Wien Tel: +43 1 29 22 165-31 Fax: +43 1 29 22 165-90 E-Mail: sklenka at edv-design.at Internet: www.edv-design.at _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Thu Aug 24 19:42:33 2023 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Thu, 24 Aug 2023 19:42:33 +0100 Subject: [gpfsug-discuss] Joining RDMA over different networks? In-Reply-To: References: <9AE82616-B931-478A-92DB-0E484DB93B60@rutgers.edu> <6555765F-3FA5-4EC6-B1D5-1F2E2E023541@rutgers.edu> <7b5a4cd7-943f-e67e-95bc-735c0bfefe3a@strath.ac.uk> Message-ID: On 24/08/2023 17:46, Alec wrote: > So why not use the built in QOS features of Spectrum Scale to adjust the > performance of a particular fileset, that way you can ensure you have > appropriate bandwidth? > Because all the users files are in the same fileset would be the simple answer. Way way to much administration overhead for that to change. There is huge amounts of KISS involved in the cluster design. Also it's only a subset of John's jobs that peg the network. Oh and at tender we didn't know we would get GPFS so we had to account for that in the system design. As a side note is that GPU nodes get 40Gbps network connections, so I am bandwidth limiting by node type. The flip side is that the high speed network (Omnipath in this case) has been reserved for MPI (or similar) traffic. Basically we observed that core counts where growing at faster rate than Infiniband/Omnipath bandwidth. We went from 12 cores a node to 40 cores, but from 40Gbps Infiniband to 100Gbps Omnipath. So rather than mixing both storage and MPI on the same fabric we moved the storage out onto 10Gbps Ethernet which for >99% of users is adequate and freed up capacity on the low latency, high speed network for the MPI traffic. I stand by that design choice 110%. Then because low latency/high speed network is only for MPI traffic we don't need to equip all nodes with Omnipath (as the tender turned out) which saved $$$$ which could be spent otherwise. A login node for example does just fine with plain Ethernet. As does a large memory (3TB RAM) node which doesn't run multinode jobs. Same for GPU nodes, and worked again in our favour when we added a whole bunch of refurb ethernet only connected standard nodes last year as we had capacity problems. Most of our jobs run on a single node so topology aware scheduling in Slurm to the rescue. Cheap addition if your storage is commodity Ethernet, would have been horrendously expensive for Omnipath. There are also other considerations. Running GPFS is already enough of a minority sport that running it over the likes of Omnipath or Infiniband or even with RDMA is just asking for problems and fails the KISS test IMHO. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. 
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG

From janfrode at tanso.net  Thu Aug 24 20:56:16 2023
From: janfrode at tanso.net (Jan-Frode Myklebust)
Date: Thu, 24 Aug 2023 21:56:16 +0200
Subject: [gpfsug-discuss] FW: ESS 3500-C5 : rg has resigned permanently
In-Reply-To: 
References: <7204a493499543e1a5a9fa0fa8bf41bb@Mail.EDVDesign.cloudia> <87c3706ea7fa410295dae9a24dd38db6@Mail.EDVDesign.cloudia>
Message-ID: 

mmvdisk rg change --active is a very common operation. It should be perfectly safe. mmvdisk rg change --restart is an option I didn't know about, so likely not something that's commonly used.

I wouldn't be too worried about losing the RGs. I don't think that's something that can happen without support being able to help getting it back online. Once I've had a situation similar to your RG not wanting to become active again during an upgrade (around 5 years ago), and I believe we solved it by rebooting the io-nodes - must have been some stuck process I was unable to understand, or was it a CCR issue caused by some nodes being way back-level..? Don't remember.

-jf

tor. 24. aug. 2023 kl. 20:22 skrev Walter Sklenka < Walter.Sklenka at edv-design.at>:

> Hi Jan-Frode!
>
> We did the "switch" with mmvdisk rg change --rg ess3500_ess_n1_hs_ess_n2_hs
> --active ess-n2-hs.
>
> Both nodes were up and we did not see any anomalies. And the rg was
> successfully created with the log groups
>
> Maybe the method to switch the rg (with --active) is a bad idea? (because
> manuals says:
>
> https://www.ibm.com/docs/en/ess/6.1.6_lts?topic=command-mmvdisk-recoverygroup
>
> For a shared recovery group, the mmvdisk recoverygroup change --active
> Node command means to make the specified node the server for all four
> user log groups and the root log group. The specified node therefore
> temporarily becomes the sole active server for the entire shared recovery
> group, leaving the other server idle. This should only be done in unusual
> maintenance situations, since it is normally considered an error condition
> for one of the servers of a shared recovery group to be idle. If the
> keyword DEFAULT is used instead of a server name, it restores the normal
> default balance of log groups, making each of the two servers responsible
> for two user log groups.
>
> this was the state before we tried to restart , no log are seen, we got
> "unable to reset server list"
>
> ~]$ sudo mmvdisk server list --rg ess3500_ess_n1_hs_ess_n2_hs
>
> node
> number server active remarks
> ------ -------------------------------- ------- -------
> 98 ess-n1-hs yes configured
> 99 ess-n2-hs yes configured
>
> ~]$ sudo mmvdisk recoverygroup list --rg ess3500_ess_n1_hs_ess_n2_hs
>
> needs user
> recovery group node class active current or master server service vdisks remarks
> ----------------------------------- ---------- ------- -------------------------------- ------- ------ -------
> ess3500_ess_n1_hs_ess_n2_hs ess3500_mmvdisk_ess_n1_hs_ess_n2_hs no - unknown 0
>
> ~]$ ^C
> ~]$ sudo mmvdisk rg change --rg ess3500_ess_n1_hs_ess_n2_hs --restart
> mmvdisk:
> mmvdisk:
> mmvdisk: Unable to reset server list for recovery group
> 'ess3500_ess_n1_hs_ess_n2_hs'.
> mmvdisk: Command failed. Examine previous error messages to determine cause.
> > > > > > Well, in the logs we did not find anything > > And finally we had to delete the rg , because we urgently needed new space > > With the new one we tested again and we did mmshutdown -startup , and > also with --active flag, and all went ok. And now we have data on the rg > > But we are in concern that this might happen sometimes again and we might > not be able to reenable the rg leading to a disaster > > > > So if you have any idea I would appreciate very much ? > > > > Best regards > > Walter > > *From:* gpfsug-discuss *On Behalf Of *Jan-Frode > Myklebust > *Sent:* Donnerstag, 24. August 2023 14:51 > *To:* gpfsug main discussion list > *Subject:* Re: [gpfsug-discuss] FW: ESS 3500-C5 : rg has resigned > permanently > > > > It does sound like "mmvdisk rg change --restart" is the "varyon" command > you're looking for.. but it's not clear why it's failing. I would start by > looking at if there are any lower level issues with your cluster. Are your > nodes healthy on a GPFS-level? "mmnetverify -N all" says network is OK ? > "mmhealth node show -N all" not indicating any issues ? Check > mmfs.log.latest ? > > > > On Thu, Aug 24, 2023 at 1:41?PM Walter Sklenka < > Walter.Sklenka at edv-design.at> wrote: > > > > Hi ! > > Does someone eventually have experience with ESS 3500 ( no hybrid config, > only NLSAS with 5 enclosures ) > > > > We have issues with a shared recoverygroup. After creating it we made a > test of setting only one node active (mybe not an optimal idea) > > But since then the recoverygroup is down > > We have created a PMR but do not get any response until now. > > > > The rg has no vdisks of any filesystem > > [gpfsadmin at hgess02-m ~]$ ^C > [gpfsadmin at hgess02-m ~]$ sudo mmvdisk rg change --rg > ess3500_hgess02_n1_hs_hgess02_n2_hs --restart > mmvdisk: > mmvdisk: > mmvdisk: Unable to reset server list for recovery group > 'ess3500_hgess02_n1_hs_hgess02_n2_hs'. > mmvdisk: Command failed. Examine previous error messages to determine > cause. > > > > We also tried > > 2023-08-21_16:57:26.174+0200: [I] Command: tsrecgroupserver > ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l root hgess02-n2-hs.invalid > 2023-08-21_16:57:26.201+0200: [I] Recovery group > ess3500_hgess02_n1_hs_hgess02_n2_hs has resigned permanently > 2023-08-21_16:57:26.201+0200: [E] Command: err 2: tsrecgroupserver > ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l root hgess02-n2-hs.invalid > 2023-08-21_16:57:26.201+0200: Specified entity, such as a disk or file > system, does not exist. > 2023-08-21_16:57:26.207+0200: [I] Command: tsrecgroupserver > ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG001 hgess02-n2-hs.invalid. > 2023-08-21_16:57:26.207+0200: [E] Command: err 212: tsrecgroupserver > ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG001 hgess02-n2-hs.invalid > 2023-08-21_16:57:26.207+0200: The current file system manager failed and > no new manager will be appointed. This may cause nodes mounting the file > system to experience mount failures. > 2023-08-21_16:57:26.213+0200: [I] Command: tsrecgroupserver > ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG002 hgess02-n2-hs.invalid > 2023-08-21_16:57:26.213+0200: [E] Command: err 212: tsrecgroupserver > ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG002 hgess02-n2-hs.invalid > 2023-08-21_16:57:26.213+0200: The current file system manager failed and > no new manager will be appointed. This may cause nodes mounting the file > system to experience mount failures. 
> > > > > > For us it is crucial to know what we can do if theis happens again ( it > has no vdisks yet so it is not critical ). > > > > Do you know: is there a non documented way to ?vary on?, or activate a > recoverygroup again? > > The doc : > > > https://www.ibm.com/docs/en/ess/6.1.6_lts?topic=rgi-recovery-group-issues-shared-recovery-groups-in-ess > > tells to mmshutdown and mmstartup, but the RGCM does say nothing > > When trying to execute any vdisk command it only says ?rg down?, no idea > how we could recover from that without deleting the rg ( I hope it will > never happen, when we have vdisks on it > > > > > > > > Have a nice day > > Walter > > > > > > > > > > Mit freundlichen Gr??en > *Walter Sklenka* > *Technical Consultant* > > > > EDV-Design Informationstechnologie GmbH > Giefinggasse 6 > /1/2, > A-1210 Wien > Tel: +43 1 29 22 165-31 > Fax: +43 1 29 22 165-90 > E-Mail: sklenka at edv-design.at > Internet: www.edv-design.at > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at gpfsug.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Walter.Sklenka at EDV-Design.at Fri Aug 25 11:33:45 2023 From: Walter.Sklenka at EDV-Design.at (Walter Sklenka) Date: Fri, 25 Aug 2023 10:33:45 +0000 Subject: [gpfsug-discuss] FW: ESS 3500-C5 : rg has resigned permanently In-Reply-To: References: <7204a493499543e1a5a9fa0fa8bf41bb@Mail.EDVDesign.cloudia> <87c3706ea7fa410295dae9a24dd38db6@Mail.EDVDesign.cloudia> Message-ID: Hi! Yes, thank you very much Finally after recreating and yet data on it, we realized we never rebooted the IO nodes !!! This is the answer, or at least a calming, feasible try to explain ? Have a nice weekend From: gpfsug-discuss On Behalf Of Jan-Frode Myklebust Sent: Donnerstag, 24. August 2023 21:56 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] FW: ESS 3500-C5 : rg has resigned permanently mmvdisk rg change ?active is a very common operation. It should be perfectly safe. mmvdisk rg change ?restart is an option I didn?t know about, so likely not something that?s commonly used. I wouldn?t be too worried about losing the RGs. I don?t think that?s something that can happen without support being able to help getting it back online. Once I?ve had a situation similar to your RG not wanting to become active again during an upgrade (around 5 years ago), and I believe we solved it by rebooting the io-nodes ? must have been some stuck process I was unable to understand? or was it a CCR issue caused by some nodes being way back-level..? Don?t remember. -jf tor. 24. aug. 2023 kl. 20:22 skrev Walter Sklenka >: Hi Jan-Frode! We did the ?switch? with mmvdisk rg change ?rg ess3500_ess_n1_hs_ess_n2_hs ?active ess-n2-hs ? Both nodes were up and we did not see any anomalies. And the rg was successfully created with the log groups Maybe the method to switch the rg (with ?active) is a bad idea? (because manuals says: https://www.ibm.com/docs/en/ess/6.1.6_lts?topic=command-mmvdisk-recoverygroup For a shared recovery group, the mmvdisk recoverygroup change --active Node command means to make the specified node the server for all four user log groups and the root log group. 
The specified node therefore temporarily becomes the sole active server for the entire shared recovery group, leaving the other server idle. This should only be done in unusual maintenance situations, since it is normally considered an error condition for one of the servers of a shared recovery group to be idle. If the keyword DEFAULT is used instead of a server name, it restores the normal default balance of log groups, making each of the two servers responsible for two user log groups. this was the state before we tried to restart , no log are seen, we got ?unable to reset server list? ~]$ sudo mmvdisk server list --rg ess3500_ess_n1_hs_ess_n2_hs node number server active remarks ------ -------------------------------- ------- ------- 98 ess-n1-hs yes configured 99 ess-n2-hs yes configured ~]$ sudo mmvdisk recoverygroup list --rg ess3500_ess_n1_hs_ess_n2_hs needs user recovery group node class active current or master server service vdisks remarks ----------------------------------- ---------- ------- -------------------------------- ------- ------ ------- ess3500_ess_n1_hs_ess_n2_hs ess3500_mmvdisk_ess_n1_hs_ess_n2_hs no - unknown 0 ~]$ ^C ~]$ sudo mmvdisk rg change --rg ess3500_ess_n1_hs_ess_n2_hs --restart mmvdisk: mmvdisk: mmvdisk: Unable to reset server list for recovery group 'ess3500_ess_n1_hs_ess_n2_hs'. mmvdisk: Command failed. Examine previous error messages to determine cause. Well, in the logs we did not find anything And finally we had to delete the rg , because we urgently needed new space With the new one we tested again and we did mmshutdown -startup , and also with --active flag, and all went ok. And now we have data on the rg But we are in concern that this might happen sometimes again and we might not be able to reenable the rg leading to a disaster So if you have any idea I would appreciate very much ? Best regards Walter From: gpfsug-discuss > On Behalf Of Jan-Frode Myklebust Sent: Donnerstag, 24. August 2023 14:51 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] FW: ESS 3500-C5 : rg has resigned permanently It does sound like "mmvdisk rg change --restart" is the "varyon" command you're looking for.. but it's not clear why it's failing. I would start by looking at if there are any lower level issues with your cluster. Are your nodes healthy on a GPFS-level? "mmnetverify -N all" says network is OK ? "mmhealth node show -N all" not indicating any issues ? Check mmfs.log.latest ? On Thu, Aug 24, 2023 at 1:41?PM Walter Sklenka > wrote: Hi ! Does someone eventually have experience with ESS 3500 ( no hybrid config, only NLSAS with 5 enclosures ) We have issues with a shared recoverygroup. After creating it we made a test of setting only one node active (mybe not an optimal idea) But since then the recoverygroup is down We have created a PMR but do not get any response until now. The rg has no vdisks of any filesystem [gpfsadmin at hgess02-m ~]$ ^C [gpfsadmin at hgess02-m ~]$ sudo mmvdisk rg change --rg ess3500_hgess02_n1_hs_hgess02_n2_hs --restart mmvdisk: mmvdisk: mmvdisk: Unable to reset server list for recovery group 'ess3500_hgess02_n1_hs_hgess02_n2_hs'. mmvdisk: Command failed. Examine previous error messages to determine cause. 
We also tried 2023-08-21_16:57:26.174+0200: [I] Command: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l root hgess02-n2-hs.invalid 2023-08-21_16:57:26.201+0200: [I] Recovery group ess3500_hgess02_n1_hs_hgess02_n2_hs has resigned permanently 2023-08-21_16:57:26.201+0200: [E] Command: err 2: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l root hgess02-n2-hs.invalid 2023-08-21_16:57:26.201+0200: Specified entity, such as a disk or file system, does not exist. 2023-08-21_16:57:26.207+0200: [I] Command: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG001 hgess02-n2-hs.invalid. 2023-08-21_16:57:26.207+0200: [E] Command: err 212: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG001 hgess02-n2-hs.invalid 2023-08-21_16:57:26.207+0200: The current file system manager failed and no new manager will be appointed. This may cause nodes mounting the file system to experience mount failures. 2023-08-21_16:57:26.213+0200: [I] Command: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG002 hgess02-n2-hs.invalid 2023-08-21_16:57:26.213+0200: [E] Command: err 212: tsrecgroupserver ess3500_hgess02_n1_hs_hgess02_n2_hs -f -l LG002 hgess02-n2-hs.invalid 2023-08-21_16:57:26.213+0200: The current file system manager failed and no new manager will be appointed. This may cause nodes mounting the file system to experience mount failures. For us it is crucial to know what we can do if theis happens again ( it has no vdisks yet so it is not critical ). Do you know: is there a non documented way to ?vary on?, or activate a recoverygroup again? The doc : https://www.ibm.com/docs/en/ess/6.1.6_lts?topic=rgi-recovery-group-issues-shared-recovery-groups-in-ess tells to mmshutdown and mmstartup, but the RGCM does say nothing When trying to execute any vdisk command it only says ?rg down?, no idea how we could recover from that without deleting the rg ( I hope it will never happen, when we have vdisks on it Have a nice day Walter Mit freundlichen Gr??en Walter Sklenka Technical Consultant EDV-Design Informationstechnologie GmbH Giefinggasse 6/1/2, A-1210 Wien Tel: +43 1 29 22 165-31 Fax: +43 1 29 22 165-90 E-Mail: sklenka at edv-design.at Internet: www.edv-design.at _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From leonardo.sala at psi.ch Fri Aug 25 15:45:27 2023 From: leonardo.sala at psi.ch (Leonardo Sala) Date: Fri, 25 Aug 2023 16:45:27 +0200 Subject: [gpfsug-discuss] How to properly debug CES / Ganesha? Message-ID: <3a2dbed3-3f88-97a2-e588-a5300f74d32a@psi.ch> Hallo, since some time we do have seemingly random issues with a particular customer accessing data over Ganesha / CES (5.1.8). What happens is that the CES server owning their IP gets a very high cpu load, and every operation on the NFS clients become sluggish. It does seem not related to throughput, and looking at the metrics [*] I do not see a correlation with e.g. increased NFS ops. I see no events in GPFS, and nothing suspicious in the ganesha and gpfs log files. What would be a good procedure to identify the misbehaving client (I suspect NFS, as it seems there is only 1 idle SMB client)? 
I have put now LOGLEVEL=INFO in ganesha to see if I catch anything interesting, but I would be curious on how this kind of apparently random issues could be better debugged and restricted to a client Thanks a lot! regards leo [*] for i in read write; do for j in ops queue lat req err; do mmperfmon query "ces-server|NFSIO|/export/path|NFSv41|nfs_${i}_$j" 2023-08-25-14:40:00 2023-08-25-15:05:00 -b60; done; done -- Paul Scherrer Institut Dr. Leonardo Sala Group Leader Data Analysis and Research Infrastructure Group Leader Data Curation a.i. Deputy Department Head Science IT Infrastructure and Services department Science IT Infrastructure and Services department (AWI) WHGA/036 Forschungstrasse 111 5232 Villigen PSI Switzerland Phone: +41 56 310 3369 leonardo.sala at psi.ch www.psi.ch -------------- next part -------------- An HTML attachment was scrubbed... URL: From jjdoherty at yahoo.com Fri Aug 25 20:26:40 2023 From: jjdoherty at yahoo.com (Jim Doherty) Date: Fri, 25 Aug 2023 19:26:40 +0000 (UTC) Subject: [gpfsug-discuss] How to properly debug CES / Ganesha? In-Reply-To: <3a2dbed3-3f88-97a2-e588-a5300f74d32a@psi.ch> References: <3a2dbed3-3f88-97a2-e588-a5300f74d32a@psi.ch> Message-ID: <1162230369.305181.1692991600343@mail.yahoo.com> See??https://ganltc.github.io/performance-monitoring-of-nfs-ganesha.html On Friday, August 25, 2023 at 10:48:02 AM EDT, Leonardo Sala wrote: Hallo, since some time we do have seemingly random issues with a particular customer accessing data over Ganesha / CES (5.1.8). What happens is that the CES server owning their IP gets a very high cpu load, and every operation on the NFS clients become sluggish. It does seem not related to throughput, and looking at the metrics [*] I do not see a correlation with e.g. increased NFS ops. I see no events in GPFS, and nothing suspicious in the ganesha and gpfs log files. What would be a good procedure to identify the misbehaving client (I suspect NFS, as it seems there is only 1 idle SMB client)? I have put now LOGLEVEL=INFO in ganesha to see if I catch anything interesting, but I would be curious on how this kind of apparently random issues could be better debugged and restricted to a client Thanks a lot! regards leo [*] for i in read write; do for j in ops queue lat req err; do mmperfmon query "ces-server|NFSIO|/export/path|NFSv41|nfs_${i}_$j" 2023-08-25-14:40:00 2023-08-25-15:05:00 -b60; done; done -- Paul Scherrer Institut Dr. Leonardo Sala Group Leader Data Analysis and Research Infrastructure Group Leader Data Curation a.i. Deputy Department Head Science IT Infrastructure and Services department Science IT Infrastructure and Services department (AWI) WHGA/036 Forschungstrasse 111 5232 Villigen PSI Switzerland Phone: +41 56 310 3369 leonardo.sala at psi.ch www.psi.ch _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at gpfsug.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org -------------- next part -------------- An HTML attachment was scrubbed... URL: From helge.hauglin at usit.uio.no Tue Aug 29 10:15:09 2023 From: helge.hauglin at usit.uio.no (Helge Hauglin) Date: Tue, 29 Aug 2023 11:15:09 +0200 Subject: [gpfsug-discuss] How to properly debug CES / Ganesha? 
In-Reply-To: <3a2dbed3-3f88-97a2-e588-a5300f74d32a@psi.ch> (Leonardo Sala's message of "Fri, 25 Aug 2023 16:45:27 +0200")
References: <3a2dbed3-3f88-97a2-e588-a5300f74d32a@psi.ch>
Message-ID: 

Hi,

To identify which address sends most packets to and from a protocol node, I use a variation of this:

| tcpdump -c 20000 -i <interface> 2>/dev/null | grep IP | cut -d' ' -f3 | sort | uniq -c | sort -nr | head -10

(Collect 20,000 packets, pick out sender address and port, sort and count those, make a top 10 list.)

You could limit to only NFS traffic by adding "port nfs" at the end of the "tcpdump" command, but then you would not see e.g. SMB clients with a lot of traffic, if there are any of those.
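With the NFS filter that would look roughly like this (untested as typed here; the interface name is a placeholder, and "port nfs" is just the service name for port 2049 from /etc/services):

| tcpdump -c 20000 -i <interface> port nfs 2>/dev/null | grep IP | cut -d' ' -f3 | sort | uniq -c | sort -nr | head -10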
> Hallo,
>
> since some time we do have seemingly random issues with a particular
> customer accessing data over Ganesha / CES (5.1.8). What happens is
> that the CES server owning their IP gets a very high cpu load, and
> every operation on the NFS clients become sluggish. It does seem not
> related to throughput, and looking at the metrics [*] I do not see a
> correlation with e.g. increased NFS ops. I see no events in GPFS, and
> nothing suspicious in the ganesha and gpfs log files.
>
> What would be a good procedure to identify the misbehaving client (I
> suspect NFS, as it seems there is only 1 idle SMB client)? I have put
> now LOGLEVEL=INFO in ganesha to see if I catch anything interesting,
> but I would be curious on how this kind of apparently random issues
> could be better debugged and restricted to a client
>
> Thanks a lot!
>
> regards
>
> leo
>
> [*]
>
> for i in read write; do for j in ops queue lat req err; do mmperfmon
> query "ces-server|NFSIO|/export/path|NFSv41|nfs_${i}_$j"
> 2023-08-25-14:40:00 2023-08-25-15:05:00 -b60; done; done

-- 
Regards, Helge Hauglin

----------------------------------------------------------------
Mr. Helge Hauglin, Senior Engineer
System administrator
Center for Information Technology, University of Oslo, Norway