From kenneth.waegeman at ugent.be Mon Sep 3 16:06:28 2018 From: kenneth.waegeman at ugent.be (Kenneth Waegeman) Date: Mon, 3 Sep 2018 17:06:28 +0200 Subject: [gpfsug-discuss] system.log pool on client nodes for HAWC In-Reply-To: References: Message-ID: <85d5591a-cf74-f55e-1802-e3e14983abbf@ugent.be> Thank you Vasily and Simon for the clarification! I was looking further into it, and I got stuck with more questions :) - In https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_hawc_tuning.htm I read: ??? HAWC does not change the following behaviors: ??????? write behavior of small files when the data is placed in the inode itself ??????? write behavior of directory blocks or other metadata I wondered why? Is the metadata not logged in the (same) recovery logs? (It seemed by reading https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.ins.doc/bl1ins_logfile.htm it does ) - Would there be a way to estimate how much of the write requests on a running cluster would benefit from enabling HAWC ? Thanks again! Kenneth On 31/08/18 19:49, Vasily Tarasov wrote: > That is correct. The blocks of each recovery log are striped across > the devices in the system.log pool (if it is defined). As a result, > even when all clients have a local device in the system.log pool, many > writes to the recovery log will go to remote devices. For a client > that lacks a local device in the system.log pool, log writes will > always be remote. > Notice, that typically in such a setup you would enable log > replication for HA. Otherwise, if a single client fails (and its > recover log is lost) the whole cluster fails as there is no log? to > recover FS to consistent state. Therefore, at least one remote write > is essential. > HTH, > -- > Vasily Tarasov, > Research Staff Member, > Storage Systems Research, > IBM Research - Almaden > > ----- Original message ----- > From: Kenneth Waegeman > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > Cc: > Subject: [gpfsug-discuss] system.log pool on client nodes for HAWC > Date: Tue, Aug 28, 2018 5:31 AM > Hi all, > > I was looking into HAWC , using the 'distributed fast storage in > client > nodes' method ( > https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_hawc_using.htm > > ) > > This is achieved by putting? a local device on the clients in the > system.log pool. Reading another article > (https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_syslogpool.htm > > ) this would now be used for ALL File system recovery logs. > > Does this mean that if you have a (small) subset of clients with fast > local devices added in the system.log pool, all other clients will use > these too instead of the central system pool? > > Thank you! > > Kenneth > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 

From oehmes at gmail.com Mon Sep 3 16:32:11 2018
From: oehmes at gmail.com (Sven Oehme)
Date: Mon, 3 Sep 2018 08:32:11 -0700
Subject: [gpfsug-discuss] system.log pool on client nodes for HAWC
In-Reply-To: <85d5591a-cf74-f55e-1802-e3e14983abbf@ugent.be>
References: <85d5591a-cf74-f55e-1802-e3e14983abbf@ugent.be>
Message-ID: 

Hi Ken,

What the document is saying (or trying to say) is that the behavior of data-in-inode and of metadata operations is not changed when HAWC is enabled: if the data fits into the inode it will be placed there directly, instead of writing the data i/o into a data recovery log record (which is what HAWC uses) and later destaging it to wherever the data blocks of a given file will eventually be written. That also means that if all your application does is create small files that fit into the inode, HAWC will not be able to improve performance.

It's unfortunately not so simple to say whether HAWC will help or not, but here are a couple of thoughts on where it will and will not help.

Where it won't help:
1. if your storage device has a very large, or even better a log-structured, write cache
2. if the majority of your files are very small
3. if your files will almost always be accessed sequentially
4. if your storage is primarily flash based

Where it most likely will help:
1. the majority of your storage is direct-attached HDD (e.g. FPO) with a small SSD pool for metadata and HAWC
2. your ratio of clients to storage devices is very high (think hundreds of clients and only 1 storage array)
3. your workload is primarily virtual machines or databases

As always there are lots of exceptions and corner cases, but this is the best list I could come up with.

On how to find out if HAWC could help, there are 2 ways of doing this. First, look at mmfsadm dump iocounters; you see the average size of i/os and you can check whether a lot of small write operations are being done. A more involved but more accurate way would be to take a trace with trace level trace=io; that will generate a very lightweight trace of only the most relevant io layers of GPFS, and you can then post-process the operations' performance. The data is not the simplest to understand for somebody with little knowledge of filesystems, but if you stare at it for a while it might make some sense to you.

Sven

On Mon, Sep 3, 2018 at 4:06 PM Kenneth Waegeman wrote:

> Thank you Vasily and Simon for the clarification!
>
> I was looking further into it, and I got stuck with more questions :)
>
> - In
> https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_hawc_tuning.htm
> I read:
> HAWC does not change the following behaviors:
> write behavior of small files when the data is placed in the inode itself
> write behavior of directory blocks or other metadata
>
> I wondered why? Is the metadata not logged in the (same) recovery logs?
> (It seemed by reading
> https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.ins.doc/bl1ins_logfile.htm
> it does )
>
> - Would there be a way to estimate how much of the write requests on a
> running cluster would benefit from enabling HAWC ?
>
> Thanks again!
>
> Kenneth
>
> On 31/08/18 19:49, Vasily Tarasov wrote:
>
> That is correct. The blocks of each recovery log are striped across the
> devices in the system.log pool (if it is defined). As a result, even when
> all clients have a local device in the system.log pool, many writes to the
> recovery log will go to remote devices.
For a client that lacks a local > device in the system.log pool, log writes will always be remote. > > Notice, that typically in such a setup you would enable log replication > for HA. Otherwise, if a single client fails (and its recover log is lost) > the whole cluster fails as there is no log to recover FS to consistent > state. Therefore, at least one remote write is essential. > > HTH, > -- > Vasily Tarasov, > Research Staff Member, > Storage Systems Research, > IBM Research - Almaden > > > > ----- Original message ----- > From: Kenneth Waegeman > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > > Cc: > Subject: [gpfsug-discuss] system.log pool on client nodes for HAWC > Date: Tue, Aug 28, 2018 5:31 AM > > Hi all, > > I was looking into HAWC , using the 'distributed fast storage in client > nodes' method ( > > https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_hawc_using.htm > > ) > > This is achieved by putting a local device on the clients in the > system.log pool. Reading another article > ( > https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_syslogpool.htm > > ) this would now be used for ALL File system recovery logs. > > Does this mean that if you have a (small) subset of clients with fast > local devices added in the system.log pool, all other clients will use > these too instead of the central system pool? > > Thank you! > > Kenneth > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.orghttp://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Tue Sep 4 09:44:59 2018 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 4 Sep 2018 08:44:59 +0000 Subject: [gpfsug-discuss] CES file authentication - bind account deleted? Message-ID: Hi all, I don't like using long subject lines as a rule so it probably doesn't make sense, but consider: FILE access configuration : AD PARAMETERS VALUES ------------------------------------------------- ENABLE_NFS_KERBEROS true SERVERS domaincontroller.ic.ac.uk USER_NAME joebloggs at IC.AC.UK NETBIOS_NAME store IDMAP_ROLE master IDMAP_RANGE 10000000-299999999 IDMAP_RANGE_SIZE 1000000 UNIXMAP_DOMAINS IC(500 - 2000000) LDAPMAP_DOMAINS none If "joebloggs" was to leave the organization and that account deleted from Active Directory, what is the impact on file authentication in CES? Thanks Richard -------------- next part -------------- An HTML attachment was scrubbed... URL: From z.han at imperial.ac.uk Tue Sep 4 14:03:52 2018 From: z.han at imperial.ac.uk (z.han at imperial.ac.uk) Date: Tue, 4 Sep 2018 14:03:52 +0100 (BST) Subject: [gpfsug-discuss] CES file authentication - bind account deleted? In-Reply-To: References: Message-ID: Files owned by "joebloggs" will be owned by the user's uid and gid. Assuming those ids aren't recycled, then there shouldn't be any impact on file authentication, right? It's a different matter if the ids are recycled by AD. 
Kind regards,

Zong-Pei

--------------------------------------------
Zong-Pei Han (BSc MSc PhD)
UK MED-BIO Data Systems Administrator
Room 126, Sir Alexander Fleming Building
South Kensington Campus
Imperial College London, SW7 2AZ
--------------------------------------------

On Tue, 4 Sep 2018, Sobey, Richard A wrote:

> Date: Tue, 4 Sep 2018 08:44:59 +0000
> From: "Sobey, Richard A"
> Reply-To: gpfsug main discussion list
> To: "'gpfsug-discuss at spectrumscale.org'"
> Subject: [gpfsug-discuss] CES file authentication - bind account deleted?
>
> Hi all,
>
> I don't like using long subject lines as a rule so it probably doesn't make sense, but consider:
>
> FILE access configuration : AD
> PARAMETERS               VALUES
> -------------------------------------------------
> ENABLE_NFS_KERBEROS      true
> SERVERS                  domaincontroller.ic.ac.uk
> USER_NAME                joebloggs at IC.AC.UK
> NETBIOS_NAME             store
> IDMAP_ROLE               master
> IDMAP_RANGE              10000000-299999999
> IDMAP_RANGE_SIZE         1000000
> UNIXMAP_DOMAINS          IC(500 - 2000000)
> LDAPMAP_DOMAINS          none
>
> If "joebloggs" was to leave the organization and that account deleted from Active Directory, what is the impact on file
> authentication in CES?
>
> Thanks
>
> Richard
>

From abeattie at au1.ibm.com Tue Sep 4 14:17:56 2018
From: abeattie at au1.ibm.com (Andrew Beattie)
Date: Tue, 4 Sep 2018 13:17:56 +0000
Subject: [gpfsug-discuss] CES file authentication - bind account deleted?
In-Reply-To: 
References: 
Message-ID: 

An HTML attachment was scrubbed...
URL: 

From rohwedder at de.ibm.com Tue Sep 4 14:40:33 2018
From: rohwedder at de.ibm.com (Markus Rohwedder)
Date: Tue, 4 Sep 2018 15:40:33 +0200
Subject: [gpfsug-discuss] CES file authentication - bind account deleted?
In-Reply-To: 
References: 
Message-ID: 

Hello,

the user name should not matter for operations beyond the domain join.

From the mmuserauth man page:

--user-name userName
....
In case of --type ad with --data-access-method file, the specified user name is used to join the cluster to the AD domain. It results in creating a machine account for the cluster based on the --netbios-name specified in the command. After successful configuration, the cluster connects with its machine account, and not the user used during the domain join.

So after the domain join the specified user name has no role to play in communication with the AD domain controller and can even be deleted from the AD server. The cluster can still keep using AD for authentication via the machine account created.

Mit freundlichen Grüßen / Kind regards

Dr. Markus Rohwedder
Spectrum Scale GUI Development
Phone: +49 7034 6430190, IBM Deutschland Research & Development
E-Mail: rohwedder at de.ibm.com
Am Weiher 24, 65451 Kelsterbach, Germany

From: "Andrew Beattie"
To: gpfsug-discuss at spectrumscale.org
Cc: gpfsug-discuss at spectrumscale.org
Date: 04.09.2018 15:18
Subject: Re: [gpfsug-discuss] CES file authentication - bind account deleted?
Sent by: gpfsug-discuss-bounces at spectrumscale.org

Hi Richard,

If you are setting up Protocol authentication against the active directory, would you not choose to use a service account that isn't going to get deleted?
If you choose to use an user account of a Sys Admin who has Domain admin privileges and they leave the company and their account is deleted, you would almost certainly have issues with the Scale cluster trying to validate users permissions and having scale get an error from AD when the credentials that it uses are no longer valid. Andrew Beattie Software Defined Storage - IT Specialist Phone: 614-2133-7927 E-mail: abeattie at au1.ibm.com ----- Original message ----- From: "Sobey, Richard A" Sent by: gpfsug-discuss-bounces at spectrumscale.org To: "'gpfsug-discuss at spectrumscale.org'" Cc: Subject: [gpfsug-discuss] CES file authentication - bind account deleted? Date: Tue, Sep 4, 2018 8:45 AM Hi all, I don?t like using long subject lines as a rule so it probably doesn?t make sense, but consider: FILE access configuration : AD PARAMETERS VALUES ------------------------------------------------- ENABLE_NFS_KERBEROS true SERVERS domaincontroller.ic.ac.uk USER_NAME joebloggs at IC.AC.UK NETBIOS_NAME store IDMAP_ROLE master IDMAP_RANGE 10000000-299999999 IDMAP_RANGE_SIZE 1000000 UNIXMAP_DOMAINS IC(500 - 2000000) LDAPMAP_DOMAINS none If ?joebloggs? was to leave the organization and that account deleted from Active Directory, what is the impact on file authentication in CES? Thanks Richard _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1A629793.gif Type: image/gif Size: 4659 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From r.sobey at imperial.ac.uk Tue Sep 4 14:44:28 2018 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 4 Sep 2018 13:44:28 +0000 Subject: [gpfsug-discuss] CES file authentication - bind account deleted? In-Reply-To: References: Message-ID: Ah, thanks Markus, that?s what I was looking for. Andrew yes, the service account has been created now, I am more interested in the ?what if? we didn?t change things. I suppose this is the result of ~4 years of technical debt on our part! Thanks, Richard From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Markus Rohwedder Sent: 04 September 2018 14:41 To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org Subject: Re: [gpfsug-discuss] CES file authentication - bind account deleted? Hello. the user name should not matter for operations beyon domain join. mmuserauth man page: --user-name userName .... In case of --type ad with --data-access-method file, the specified username is used to join the cluster to AD domain. It results in creating a machine account for the cluster based on the --netbios-name specified in the command. After successful configuration, the cluster connects with its machine account, and not the user used during the domain join. 
So the specified username after domain join has no role to play in communication with the AD domain controller and can be even deleted from the AD server. The cluster can still keep using AD for authentication via the machine account created. Mit freundlichen Gr??en / Kind regards Dr. Markus Rohwedder Spectrum Scale GUI Development ________________________________ Phone: +49 7034 6430190 IBM Deutschland Research & Development [cid:image002.png at 01D4445D.C716BB30] E-Mail: rohwedder at de.ibm.com Am Weiher 24 65451 Kelsterbach Germany ________________________________ [Inactive hide details for "Andrew Beattie" ---04.09.2018 15:18:43---Hi Richard,]"Andrew Beattie" ---04.09.2018 15:18:43---Hi Richard, From: "Andrew Beattie" > To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Date: 04.09.2018 15:18 Subject: Re: [gpfsug-discuss] CES file authentication - bind account deleted? Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi Richard, If you are setting up Protocol authentication against the active directory, would you not choose to use a service account that isn't going to get deleted? If you choose to use an user account of a Sys Admin who has Domain admin privileges and they leave the company and their account is deleted, you would almost certainly have issues with the Scale cluster trying to validate users permissions and having scale get an error from AD when the credentials that it uses are no longer valid. Andrew Beattie Software Defined Storage - IT Specialist Phone: 614-2133-7927 E-mail: abeattie at au1.ibm.com ----- Original message ----- From: "Sobey, Richard A" > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: "'gpfsug-discuss at spectrumscale.org'" > Cc: Subject: [gpfsug-discuss] CES file authentication - bind account deleted? Date: Tue, Sep 4, 2018 8:45 AM Hi all, I don?t like using long subject lines as a rule so it probably doesn?t make sense, but consider: FILE access configuration : AD PARAMETERS VALUES ------------------------------------------------- ENABLE_NFS_KERBEROS true SERVERS domaincontroller.ic.ac.uk USER_NAME joebloggs at IC.AC.UK NETBIOS_NAME store IDMAP_ROLE master IDMAP_RANGE 10000000-299999999 IDMAP_RANGE_SIZE 1000000 UNIXMAP_DOMAINS IC(500 - 2000000) LDAPMAP_DOMAINS none If ?joebloggs? was to leave the organization and that account deleted from Active Directory, what is the impact on file authentication in CES? Thanks Richard _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 166 bytes Desc: image001.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.png Type: image/png Size: 4659 bytes Desc: image002.png URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: image003.gif Type: image/gif Size: 105 bytes Desc: image003.gif URL: From vtarasov at us.ibm.com Tue Sep 4 16:57:37 2018 From: vtarasov at us.ibm.com (Vasily Tarasov) Date: Tue, 4 Sep 2018 15:57:37 +0000 Subject: [gpfsug-discuss] system.log pool on client nodes for HAWC In-Reply-To: References: , <85d5591a-cf74-f55e-1802-e3e14983abbf@ugent.be> Message-ID: An HTML attachment was scrubbed... URL: From stijn.deweirdt at ugent.be Tue Sep 4 20:23:35 2018 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Tue, 4 Sep 2018 21:23:35 +0200 Subject: [gpfsug-discuss] system.log pool on client nodes for HAWC In-Reply-To: References: <85d5591a-cf74-f55e-1802-e3e14983abbf@ugent.be> Message-ID: <29b9209e-d17b-f109-983a-c14c6e0966ef@ugent.be> hi vasily, sven, and is there any advantage in moving the system.log pool to faster storage (like nvdimm) or increasing its default size when HAWC is not used (ie write-cache-threshold kept to 0). (i remember the (very creative) logtip placement on the gss boxes ;) thanks a lot for the detailed answer stijn On 09/04/2018 05:57 PM, Vasily Tarasov wrote: > Let me add just one more item to Sven's detailed reply: HAWC is especially > helpful to decrease the latencies of small synchronous I/Os that come in > *bursts*. If your workload contains a sustained high rate of writes, the > recovery log will get full very quickly, and HAWC won't help much (or can even > decrease performance). Making the recovery log larger allows to adsorb longer > I/O bursts. The specific amount of improvements depends on the workload (how > long/high are bursts, e.g.) and hardware. > Best, > Vasily > -- > Vasily Tarasov, > Research Staff Member, > Storage Systems Research, > IBM Research - Almaden > > ----- Original message ----- > From: Sven Oehme > To: gpfsug main discussion list > Cc: Vasily Tarasov > Subject: Re: [gpfsug-discuss] system.log pool on client nodes for HAWC > Date: Mon, Sep 3, 2018 8:32 AM > Hi Ken, > what the documents is saying (or try to) is that the behavior of data in > inode or metadata operations are not changed if HAWC is enabled, means if > the data fits into the inode it will be placed there directly instead of > writing the data i/o into a data recovery log record (which is what HAWC > uses) and then later destage it where ever the data blocks of a given file > eventually will be written. that also means if all your application does is > creating small files that fit into the inode, HAWC will not be able to > improve performance. > its unfortunate not so simple to say if HAWC will help or not, but here are > a couple of thoughts where HAWC will not help and help : > on the where it won't help : > 1. if you have storage device which has very large or even better are log > structured write cache. > 2. if majority of your files are very small > 3. if your files will almost always be accesses sequentially > 4. your storage is primarily flash based > where it most likely will help : > 1. your majority of storage is direct attached HDD (e.g. FPO) with a small > SSD pool for metadata and HAWC > 2. your ratio of clients to storage devices is very high (think hundreds of > clients and only 1 storage array) > 3. your workload is primarily virtual machines or databases > as always there are lots of exceptions and corner cases, but is the best > list i could come up with. 
> on how to find out if HAWC could help, there are 2 ways of doing this > first, look at mmfsadm dump iocounters , you see the average size of i/os > and you could check if there is a lot of small write operations done. > a more involved but more accurate way would be to take a trace with trace > level trace=io , that will generate a very lightweight trace of only the > most relevant io layers of GPFS, you could then post process the operations > performance, but the data is not the simplest to understand for somebody > with low knowledge of filesystems, but if you stare at it for a while it > might make some sense to you. > Sven > On Mon, Sep 3, 2018 at 4:06 PM Kenneth Waegeman > wrote: > > Thank you Vasily and Simon for the clarification! > > I was looking further into it, and I got stuck with more questions :) > > > - In > https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_hawc_tuning.htm > I read: > HAWC does not change the following behaviors: > write behavior of small files when the data is placed in the > inode itself > write behavior of directory blocks or other metadata > > I wondered why? Is the metadata not logged in the (same) recovery logs? > (It seemed by reading > https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.ins.doc/bl1ins_logfile.htm > it does ) > > > - Would there be a way to estimate how much of the write requests on a > running cluster would benefit from enabling HAWC ? > > > Thanks again! > > > Kenneth > On 31/08/18 19:49, Vasily Tarasov wrote: >> That is correct. The blocks of each recovery log are striped across >> the devices in the system.log pool (if it is defined). As a result, >> even when all clients have a local device in the system.log pool, many >> writes to the recovery log will go to remote devices. For a client >> that lacks a local device in the system.log pool, log writes will >> always be remote. >> Notice, that typically in such a setup you would enable log >> replication for HA. Otherwise, if a single client fails (and its >> recover log is lost) the whole cluster fails as there is no log to >> recover FS to consistent state. Therefore, at least one remote write >> is essential. >> HTH, >> -- >> Vasily Tarasov, >> Research Staff Member, >> Storage Systems Research, >> IBM Research - Almaden >> >> ----- Original message ----- >> From: Kenneth Waegeman >> >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> >> To: gpfsug main discussion list >> >> Cc: >> Subject: [gpfsug-discuss] system.log pool on client nodes for HAWC >> Date: Tue, Aug 28, 2018 5:31 AM >> Hi all, >> >> I was looking into HAWC , using the 'distributed fast storage in >> client >> nodes' method ( >> https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_hawc_using.htm >> >> ) >> >> This is achieved by putting a local device on the clients in the >> system.log pool. Reading another article >> (https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_syslogpool.htm >> >> ) this would now be used for ALL File system recovery logs. >> >> Does this mean that if you have a (small) subset of clients with fast >> local devices added in the system.log pool, all other clients will use >> these too instead of the central system pool? >> >> Thank you! 
>>
>> Kenneth
>>
>> _______________________________________________
>> gpfsug-discuss mailing list
>> gpfsug-discuss at spectrumscale.org
>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>>
>> _______________________________________________
>> gpfsug-discuss mailing list
>> gpfsug-discuss at spectrumscale.org
>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss

From anobre at br.ibm.com Tue Sep 4 20:40:06 2018
From: anobre at br.ibm.com (Anderson Ferreira Nobre)
Date: Tue, 4 Sep 2018 19:40:06 +0000
Subject: [gpfsug-discuss] Top files on GPFS filesystem
In-Reply-To: 
References: 
Message-ID: 

An HTML attachment was scrubbed...
URL: 

From vtarasov at us.ibm.com Wed Sep 5 01:19:15 2018
From: vtarasov at us.ibm.com (Vasily Tarasov)
Date: Wed, 5 Sep 2018 00:19:15 +0000
Subject: [gpfsug-discuss] system.log pool on client nodes for HAWC
In-Reply-To: <29b9209e-d17b-f109-983a-c14c6e0966ef@ugent.be>
References: <29b9209e-d17b-f109-983a-c14c6e0966ef@ugent.be>, <85d5591a-cf74-f55e-1802-e3e14983abbf@ugent.be>
Message-ID: 

An HTML attachment was scrubbed...
URL: 

From sven.siebler at urz.uni-heidelberg.de Wed Sep 5 08:13:47 2018
From: sven.siebler at urz.uni-heidelberg.de (Sven Siebler)
Date: Wed, 5 Sep 2018 09:13:47 +0200
Subject: [gpfsug-discuss] Getting inode information with REST API V2
Message-ID: <0975dcd6-a665-31f8-6070-a73b414f3d25@urz.uni-heidelberg.de>

Hi all,

I just started to use the REST API for our monitoring, and my question is about how I can get information on allocated inodes with REST API V2.

Up to now I use "mmlsfileset" directly, which gives me information on maximum and allocated inodes (mmdf for total/free/allocated inodes of the filesystem).

If I use the REST API V2 with "filesystems/{filesystemName}/filesets?fields=:all:", I get all information except the allocated inodes.

In the documentation
(https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adm_apiv2getfilesystemfilesets.htm)
I found:

> "inodeSpace": "Inodes"
> The number of inodes that are allocated for use by the fileset.

but to me inodeSpace looks more like the ID of the inode space, rather than the number of allocated inodes.

In the documentation example the API can give output like this:

"filesetName" : "root",
"filesystemName" : "gpfs0",
"usage" : {
    "allocatedInodes" : 100000,
    "inodeSpaceFreeInodes" : 95962,
    "inodeSpaceUsedInodes" : 4038,
    "usedBytes" : 0,
    "usedInodes" : 4038
}

but I could not retrieve such usage fields in my queries.

The only way for me to get inode information with REST is to use V1:

https://REST_API_host:port/scalemgmt/v1/filesets?filesystemName=FileSystemName

which gives exactly the information of "mmlsfileset". But because V1 is deprecated I want to use V2 for rewriting our tools...

Thanks,

Sven

--
Sven Siebler
Servicebereich Future IT - Research & Education (FIRE)

Tel. +49 6221 54 20032
sven.siebler at urz.uni-heidelberg.de
Universität Heidelberg
Universitätsrechenzentrum (URZ)
Im Neuenheimer Feld 293, D-69120 Heidelberg
http://www.urz.uni-heidelberg.de
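For reference, a minimal sketch in Python of the kind of v2 call being discussed (the GUI host name, credentials and the top-level "filesets" wrapper key are illustrative assumptions, not details taken from the posts in this thread):

import requests

# Hypothetical GUI/REST endpoint and credentials -- adjust to your environment.
BASE = "https://gui-node.example.com:443/scalemgmt/v2"
AUTH = ("admin", "secret")

# Request all fields so that the "usage" block (allocatedInodes etc.) is included.
resp = requests.get(BASE + "/filesystems/gpfs0/filesets",
                    params={"fields": ":all:"},
                    auth=AUTH,
                    verify=False)  # only acceptable for self-signed test certificates
resp.raise_for_status()

for fset in resp.json().get("filesets", []):  # assumed wrapper key in the response
    usage = fset.get("usage", {})
    print(fset.get("filesetName"),
          "allocatedInodes:", usage.get("allocatedInodes"),
          "usedInodes:", usage.get("usedInodes"))

Whether the "usage" block is actually populated depends on the data sources discussed further down in the thread (mmlsfileset output and the Zimon fileset metrics).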
Up to now i use "mmlsfileset" directly, which gives me information on maximum and allocated inodes (mmdf for total/free/allocated inodes of the filesystem) If i use the REST API V2 with "filesystems//filesets?fields=:all:", i get all information except the allocated inodes. On the documentation (https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adm_apiv2getfilesystemfilesets.htm) i found: > "inodeSpace": "Inodes" > The number of inodes that are allocated for use by the fileset. but for me the inodeSpace looks more like the ID of the inodespace, instead of the number of allocated inodes. In the documentation example the API can give output like this: "filesetName" : "root", ?????? "filesystemName" : "gpfs0", ?????? "usage" : { ? ? ?? ??? "allocatedInodes" : 100000, ????? ? ?? "inodeSpaceFreeInodes" : 95962, ?????????? "inodeSpaceUsedInodes" : 4038, ????? ? ?? "usedBytes" : 0, ?????? ? ? "usedInodes" : 4038 } but i could not retrieve such usage-fields in my queries. The only way for me to get inode information with REST is the usage of V1: https://REST_API_host:port/scalemgmt/v1/filesets?filesystemName=FileSystemName which gives exact the information of "mmlsfileset". But because V1 is deprecated i want to use V2 for rewriting our tools... Thanks, Sven -- Sven Siebler Servicebereich Future IT - Research & Education (FIRE) Tel. +49 6221 54 20032 sven.siebler at urz.uni-heidelberg.de Universit?t Heidelberg Universit?tsrechenzentrum (URZ) Im Neuenheimer Feld 293, D-69120 Heidelberg http://www.urz.uni-heidelberg.de -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5437 bytes Desc: S/MIME Cryptographic Signature URL: From andreas.koeninger at de.ibm.com Wed Sep 5 10:13:19 2018 From: andreas.koeninger at de.ibm.com (Andreas Koeninger) Date: Wed, 5 Sep 2018 09:13:19 +0000 Subject: [gpfsug-discuss] Getting inode information with REST API V2 In-Reply-To: <0975dcd6-a665-31f8-6070-a73b414f3d25@urz.uni-heidelberg.de> References: <0975dcd6-a665-31f8-6070-a73b414f3d25@urz.uni-heidelberg.de> Message-ID: An HTML attachment was scrubbed... URL: From sven.siebler at urz.uni-heidelberg.de Wed Sep 5 12:44:32 2018 From: sven.siebler at urz.uni-heidelberg.de (Sven Siebler) Date: Wed, 5 Sep 2018 13:44:32 +0200 Subject: [gpfsug-discuss] Getting inode information with REST API V2 In-Reply-To: References: <0975dcd6-a665-31f8-6070-a73b414f3d25@urz.uni-heidelberg.de> Message-ID: Hi Andreas, i've forgotten to mention that we are currently using ISS v4.2.1, not v5.0.0. Invastigating the command i got the following: # /usr/lpp/mmfs/gui/cli/runtask FILESETS --debug debug: locale=en_US debug: Running 'mmlsfileset 'lsdf02' -di -Y ' on node localhost debug: Raising event: inode_normal debug: Running 'mmsysmonc event 'filesystem' 'inode_normal' 'lsdf02/sd17e005' 'lsdf02/sd17e005,' ' on node localhost debug: Raising event: inode_normal debug: Running 'mmsysmonc event 'filesystem' 'inode_normal' 'lsdf02/sd17g004' 'lsdf02/sd17g004,' ' on node localhost [...] debug: perf: Executing mmhealth node show --verbose -N 'llsdf02e4' -Y? took 1330ms [...] debug: Inserting 0 new informational HealthEvents for node llsdf02e4 debug: perf: processInfoEvents() with 2 events took 5ms debug: perf: Parsing 23 state rows took 9ms debug: Deleted 0 orphaned states. debug: Loaded list of state changing HealthEvent objects. 
Size: 4 debug: Inserting 0 new state changing HealthEvents in the history table for node llsdf02e4 debug: perf: processStateChangingEvents() with 3 events took 2ms debug: perf: pool-90578-thread-1 - Processing 5 eventlog rows of node llsdf02e4 took 10ms in total debug: Deleted 0 orphaned states from history. debug: Loaded list of state changing HealthEvent objects. Size: 281 debug: Inserting 0 new state changing HealthEvents for node llsdf02e4 debug: perf: Processing 23 state rows took 59ms in total The command takes very long due to the -di option. I tried also your posted zimon command: #? echo "get -a metrics max(gpfs_fset_maxInodes),max(gpfs_fset_freeInodes),max(gpfs_fset_allocInodes) from gpfs_fs_name=lsdf02 group_by gpfs_fset_name last 13 bucket_size 300" | /opt/IBM/zimon/zc 127.0.0.1 Error: No data available for query: 6396075 In the Admin GUI i noticed that the Information in "Files -> Filesets -> -> Details" shows inconsistent inode information, e.g. ? in Overview: ????? Inodes: 76M ????? Max Inodes: 315M ? in Properties: ? ?? Inodes:??? ??? 1 ???? Max inodes:??? ??? 314572800 thanks, Sven On 05.09.2018 11:13, Andreas Koeninger wrote: > Hi Sven, > the REST API v2 provides similar information to what v1 provided. See > an example from my system below: > /scalemgmt/v2/filesystems/gpfs0/filesets?fields=:all: > [...] > ??? "filesetName" : "fset1", > ??? "filesystemName" : "gpfs0", > ??? "usage" : { > ????? "allocatedInodes" : 51232, > ????? "inodeSpaceFreeInodes" : 51231, > ????? "inodeSpaceUsedInodes" : 1, > ????? "usedBytes" : 0, > ????? "usedInodes" : 1 > ??? } > ? } ], > *In 5.0.0 there are two sources for the inode information: the first > one is mmlsfileset and the second one is the data collected by Zimon.* > Depending on the availability of the data either one is used. > > To debug what's happening on your system you can *execute the FILESETS > task on the GUI node* manually with the --debug flag. The output is > then showing the exact queries that are used to retrieve the data: > *[root at os-11 ~]# /usr/lpp/mmfs/gui/cli/runtask FILESETS --debug* > debug: locale=en_US > debug: Running 'mmlsfileset 'gpfs0' -Y ' on node localhost > debug: Running zimon query: 'get -ja metrics > max(gpfs_fset_maxInodes),max(gpfs_fset_freeInodes),max(gpfs_fset_allocInodes),max(gpfs_rq_blk_current),max(gpfs_rq_file_current) > from gpfs_fs_name=gpfs0 group_by gpfs_fset_name last 13 bucket_size 300' > debug: Running 'mmlsfileset 'objfs' -Y ' on node localhost > debug: Running zimon query: 'get -ja metrics > max(gpfs_fset_maxInodes),max(gpfs_fset_freeInodes),max(gpfs_fset_allocInodes),max(gpfs_rq_blk_current),max(gpfs_rq_file_current) > from gpfs_fs_name=objfs group_by gpfs_fset_name last 13 bucket_size 300' > EFSSG1000I The command completed successfully. 
> *As a start I suggest running the displayed Zimon queries manually to > see what's returned there, e.g.:* > /(Removed -j for better readability)/ > > *[root at os-11 ~]# echo "get -a metrics > max(gpfs_fset_maxInodes),max(gpfs_fset_freeInodes),max(gpfs_fset_allocInodes),max(gpfs_rq_blk_current),max(gpfs_rq_file_current) > from gpfs_fs_name=gpfs0 group_by gpfs_fset_name last 13 bucket_size > 300" | /opt/IBM/zimon/zc 127.0.0.1* > 1: > ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|.audit_log|gpfs_fset_maxInodes > 2: ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|fset1|gpfs_fset_maxInodes > 3: ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|root|gpfs_fset_maxInodes > 4: > ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|.audit_log|gpfs_fset_freeInodes > 5: ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|fset1|gpfs_fset_freeInodes > 6: ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|root|gpfs_fset_freeInodes > 7: > ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|.audit_log|gpfs_fset_allocInodes > 8: ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|fset1|gpfs_fset_allocInodes > 9: ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|root|gpfs_fset_allocInodes > Row?? ?Timestamp?? ??? ?max(gpfs_fset_maxInodes) > ?max(gpfs_fset_maxInodes)?? ?max(gpfs_fset_maxInodes) > ?max(gpfs_fset_freeInodes)?? ?max(gpfs_fset_freeInodes) > ?max(gpfs_fset_freeInodes)?? ?max(gpfs_fset_allocInodes) > ?max(gpfs_fset_allocInodes)?? ?max(gpfs_fset_allocInodes) > 1?? ?2018-09-05 10:10:00?? ?100000?? ?620640?? ?65792 ?65795?? > ?51231?? ?61749?? ?65824?? ?51232?? ?65792 > 2?? ?2018-09-05 10:15:00?? ?100000?? ?620640?? ?65792 ?65795?? > ?51231?? ?61749?? ?65824?? ?51232?? ?65792 > 3?? ?2018-09-05 10:20:00?? ?100000?? ?620640?? ?65792 ?65795?? > ?51231?? ?61749?? ?65824?? ?51232?? ?65792 > 4?? ?2018-09-05 10:25:00?? ?100000?? ?620640?? ?65792 ?65795?? > ?51231?? ?61749?? ?65824?? ?51232?? ?65792 > 5?? ?2018-09-05 10:30:00?? ?100000?? ?620640?? ?65792 ?65795?? > ?51231?? ?61749?? ?65824?? ?51232?? ?65792 > 6?? ?2018-09-05 10:35:00?? ?100000?? ?620640?? ?65792 ?65795?? > ?51231?? ?61749?? ?65824?? ?51232?? ?65792 > 7?? ?2018-09-05 10:40:00?? ?100000?? ?620640?? ?65792 ?65795?? > ?51231?? ?61749?? ?65824?? ?51232?? ?65792 > 8?? ?2018-09-05 10:45:00?? ?100000?? ?620640?? ?65792 ?65795?? > ?51231?? ?61749?? ?65824?? ?51232?? ?65792 > 9?? ?2018-09-05 10:50:00?? ?100000?? ?620640?? ?65792 ?65795?? > ?51231?? ?61749?? ?65824?? ?51232?? ?65792 > 10?? ?2018-09-05 10:55:00?? ?100000?? ?620640?? ?65792 ?65795?? > ?51231?? ?61749?? ?65824?? ?51232?? ?65792 > 11?? ?2018-09-05 11:00:00?? ?100000?? ?620640?? ?65792 ?65795?? > ?51231?? ?61749?? ?65824?? ?51232?? ?65792 > 12?? ?2018-09-05 11:05:00?? ?100000?? ?620640?? ?65792 ?65795?? > ?51231?? ?61749?? ?65824?? ?51232?? ?65792 > 13?? ?2018-09-05 11:10:00?? ?100000?? ?620640?? ?65792 ?65795?? > ?51231?? ?61749?? ?65824?? ?51232?? ?65792 > . 
> > Mit freundlichen Gr??en / Kind regards > > Andreas Koeninger > Scrum Master and Software Developer / Spectrum Scale GUI and REST API > IBM Systems &Technology Group, Integrated Systems Development / M069 > ------------------------------------------------------------------------------------------------------------------------------------------- > IBM Deutschland > Am Weiher 24 > 65451 Kelsterbach > Phone: +49-7034-643-0867 > Mobile: +49-7034-643-0867 > E-Mail: andreas.koeninger at de.ibm.com > ------------------------------------------------------------------------------------------------------------------------------------------- > IBM Deutschland Research & Development GmbH / Vorsitzende des > Aufsichtsrats: Martina Koederitz > Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / > Registergericht: Amtsgericht Stuttgart, HRB 243294 > > ----- Original message ----- > From: Sven Siebler > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug-discuss at spectrumscale.org > Cc: > Subject: [gpfsug-discuss] Getting inode information with REST API V2 > Date: Wed, Sep 5, 2018 9:37 AM > Hi all, > > i just started to use the REST API for our monitoring and my > question is > concerning about how can i get information about allocated inodes with > REST API V2 ? > > Up to now i use "mmlsfileset" directly, which gives me information on > maximum and allocated inodes (mmdf for total/free/allocated inodes of > the filesystem) > > If i use the REST API V2 with > "filesystems//filesets?fields=:all:", i get all > information except the allocated inodes. > > On the documentation > (https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adm_apiv2getfilesystemfilesets.htm) > i found: > > ?> "inodeSpace": "Inodes" > ?> The number of inodes that are allocated for use by the fileset. > > but for me the inodeSpace looks more like the ID of the inodespace, > instead of the number of allocated inodes. > > In the documentation example the API can give output like this: > > "filesetName" : "root", > ??????? "filesystemName" : "gpfs0", > ??????? "usage" : { > ?? ? ?? ??? "allocatedInodes" : 100000, > ?????? ? ?? "inodeSpaceFreeInodes" : 95962, > ??????????? "inodeSpaceUsedInodes" : 4038, > ?????? ? ?? "usedBytes" : 0, > ??????? ? ? "usedInodes" : 4038 > } > > but i could not retrieve such usage-fields in my queries. > > The only way for me to get inode information with REST is the > usage of V1: > > https://REST_API_host:port/scalemgmt/v1/filesets?filesystemName=FileSystemName > > which gives exact the information of "mmlsfileset". > > But because V1 is deprecated i want to use V2 for rewriting our > tools... > > Thanks, > > Sven > > > -- > Sven Siebler > Servicebereich Future IT - Research & Education (FIRE) > > Tel. +49 6221 54 20032 > sven.siebler at urz.uni-heidelberg.de > Universit?t Heidelberg > Universit?tsrechenzentrum (URZ) > Im Neuenheimer Feld 293, D-69120 Heidelberg > http://www.urz.uni-heidelberg.de > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -- Sven Siebler Servicebereich Future IT - Research & Education (FIRE) Tel. +49 6221 54 20032 sven.siebler at urz.uni-heidelberg.de Universit?t Heidelberg Universit?tsrechenzentrum (URZ) Im Neuenheimer Feld 293, D-69120 Heidelberg http://www.urz.uni-heidelberg.de -------------- next part -------------- A non-text attachment was scrubbed... 
Name: smime.p7s Type: application/pkcs7-signature Size: 5437 bytes Desc: S/MIME Cryptographic Signature URL: From andreas.koeninger at de.ibm.com Wed Sep 5 14:42:00 2018 From: andreas.koeninger at de.ibm.com (Andreas Koeninger) Date: Wed, 5 Sep 2018 13:42:00 +0000 Subject: [gpfsug-discuss] Getting inode information with REST API V2 In-Reply-To: References: , <0975dcd6-a665-31f8-6070-a73b414f3d25@urz.uni-heidelberg.de> Message-ID: An HTML attachment was scrubbed... URL: From sven.siebler at urz.uni-heidelberg.de Wed Sep 5 15:17:03 2018 From: sven.siebler at urz.uni-heidelberg.de (Sven Siebler) Date: Wed, 5 Sep 2018 16:17:03 +0200 Subject: [gpfsug-discuss] Getting inode information with REST API V2 In-Reply-To: References: Message-ID: Hi Andreas, you are right ... our Storage Cluster is on v4.2.1 at the moment, while the CES/GUI Nodes running on 4.2.3.6. The GPFSFilesetQuota Sensor is enabled and restricted to the GUI Node, due to the performance impact: { ??????? name = "GPFSFilesetQuota" ??????? period = 3600 ??????? restrict = "llsdf02e4" }, { ??????? name = "GPFSDiskCap" ??????? period = 10800 ??????? restrict = "llsdf02e4" }, thanks, Sven On 05.09.2018 15:42, gpfsug-discuss-request at spectrumscale.org wrote: > Send gpfsug-discuss mailing list submissions to > gpfsug-discuss at spectrumscale.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > or, via email, send a message with subject or body 'help' to > gpfsug-discuss-request at spectrumscale.org > > You can reach the person managing the list at > gpfsug-discuss-owner at spectrumscale.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of gpfsug-discuss digest..." > > > Today's Topics: > > 1. Re: Getting inode information with REST API V2 (Sven Siebler) > 2. Re: Getting inode information with REST API V2 (Andreas Koeninger) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Wed, 5 Sep 2018 13:44:32 +0200 > From: Sven Siebler > To: Andreas Koeninger > Cc: gpfsug-discuss at spectrumscale.org > Subject: Re: [gpfsug-discuss] Getting inode information with REST API > V2 > Message-ID: > > Content-Type: text/plain; charset="utf-8"; Format="flowed" > > Hi Andreas, > > i've forgotten to mention that we are currently using ISS v4.2.1, not > v5.0.0. > > Invastigating the command i got the following: > > # /usr/lpp/mmfs/gui/cli/runtask FILESETS --debug > debug: locale=en_US > debug: Running 'mmlsfileset 'lsdf02' -di -Y ' on node localhost > > debug: Raising event: inode_normal > debug: Running 'mmsysmonc event 'filesystem' 'inode_normal' > 'lsdf02/sd17e005' 'lsdf02/sd17e005,' ' on node localhost > debug: Raising event: inode_normal > debug: Running 'mmsysmonc event 'filesystem' 'inode_normal' > 'lsdf02/sd17g004' 'lsdf02/sd17g004,' ' on node localhost > [...] > debug: perf: Executing mmhealth node show --verbose -N 'llsdf02e4' -Y? > took 1330ms > [...] > debug: Inserting 0 new informational HealthEvents for node llsdf02e4 > debug: perf: processInfoEvents() with 2 events took 5ms > debug: perf: Parsing 23 state rows took 9ms > debug: Deleted 0 orphaned states. > debug: Loaded list of state changing HealthEvent objects. 
Size: 4 > debug: Inserting 0 new state changing HealthEvents in the history table > for node llsdf02e4 > debug: perf: processStateChangingEvents() with 3 events took 2ms > debug: perf: pool-90578-thread-1 - Processing 5 eventlog rows of node > llsdf02e4 took 10ms in total > debug: Deleted 0 orphaned states from history. > debug: Loaded list of state changing HealthEvent objects. Size: 281 > debug: Inserting 0 new state changing HealthEvents for node llsdf02e4 > debug: perf: Processing 23 state rows took 59ms in total > > The command takes very long due to the -di option. > > I tried also your posted zimon command: > > #? echo "get -a metrics > max(gpfs_fset_maxInodes),max(gpfs_fset_freeInodes),max(gpfs_fset_allocInodes) > from gpfs_fs_name=lsdf02 group_by gpfs_fset_name last 13 bucket_size > 300" | /opt/IBM/zimon/zc 127.0.0.1 > > Error: No data available for query: 6396075 > > In the Admin GUI i noticed that the Information in "Files -> Filesets -> > -> Details" shows inconsistent inode information, e.g. > > ? in Overview: > ????? Inodes: 76M > ????? Max Inodes: 315M > > ? in Properties: > ? ?? Inodes:??? ??? 1 > ???? Max inodes:??? ??? 314572800 > > thanks, > Sven > > > > On 05.09.2018 11:13, Andreas Koeninger wrote: >> Hi Sven, >> the REST API v2 provides similar information to what v1 provided. See >> an example from my system below: >> /scalemgmt/v2/filesystems/gpfs0/filesets?fields=:all: >> [...] >> ??? "filesetName" : "fset1", >> ??? "filesystemName" : "gpfs0", >> ??? "usage" : { >> ????? "allocatedInodes" : 51232, >> ????? "inodeSpaceFreeInodes" : 51231, >> ????? "inodeSpaceUsedInodes" : 1, >> ????? "usedBytes" : 0, >> ????? "usedInodes" : 1 >> ??? } >> ? } ], >> *In 5.0.0 there are two sources for the inode information: the first >> one is mmlsfileset and the second one is the data collected by Zimon.* >> Depending on the availability of the data either one is used. >> >> To debug what's happening on your system you can *execute the FILESETS >> task on the GUI node* manually with the --debug flag. The output is >> then showing the exact queries that are used to retrieve the data: >> *[root at os-11 ~]# /usr/lpp/mmfs/gui/cli/runtask FILESETS --debug* >> debug: locale=en_US >> debug: Running 'mmlsfileset 'gpfs0' -Y ' on node localhost >> debug: Running zimon query: 'get -ja metrics >> max(gpfs_fset_maxInodes),max(gpfs_fset_freeInodes),max(gpfs_fset_allocInodes),max(gpfs_rq_blk_current),max(gpfs_rq_file_current) >> from gpfs_fs_name=gpfs0 group_by gpfs_fset_name last 13 bucket_size 300' >> debug: Running 'mmlsfileset 'objfs' -Y ' on node localhost >> debug: Running zimon query: 'get -ja metrics >> max(gpfs_fset_maxInodes),max(gpfs_fset_freeInodes),max(gpfs_fset_allocInodes),max(gpfs_rq_blk_current),max(gpfs_rq_file_current) >> from gpfs_fs_name=objfs group_by gpfs_fset_name last 13 bucket_size 300' >> EFSSG1000I The command completed successfully. 
>> *As a start I suggest running the displayed Zimon queries manually to >> see what's returned there, e.g.:* >> /(Removed -j for better readability)/ >> >> *[root at os-11 ~]# echo "get -a metrics >> max(gpfs_fset_maxInodes),max(gpfs_fset_freeInodes),max(gpfs_fset_allocInodes),max(gpfs_rq_blk_current),max(gpfs_rq_file_current) >> from gpfs_fs_name=gpfs0 group_by gpfs_fset_name last 13 bucket_size >> 300" | /opt/IBM/zimon/zc 127.0.0.1* >> 1: >> ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|.audit_log|gpfs_fset_maxInodes >> 2: ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|fset1|gpfs_fset_maxInodes >> 3: ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|root|gpfs_fset_maxInodes >> 4: >> ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|.audit_log|gpfs_fset_freeInodes >> 5: ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|fset1|gpfs_fset_freeInodes >> 6: ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|root|gpfs_fset_freeInodes >> 7: >> ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|.audit_log|gpfs_fset_allocInodes >> 8: ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|fset1|gpfs_fset_allocInodes >> 9: ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|root|gpfs_fset_allocInodes >> Row?? ?Timestamp?? ??? ?max(gpfs_fset_maxInodes) >> ?max(gpfs_fset_maxInodes)?? ?max(gpfs_fset_maxInodes) >> ?max(gpfs_fset_freeInodes)?? ?max(gpfs_fset_freeInodes) >> ?max(gpfs_fset_freeInodes)?? ?max(gpfs_fset_allocInodes) >> ?max(gpfs_fset_allocInodes)?? ?max(gpfs_fset_allocInodes) >> 1?? ?2018-09-05 10:10:00?? ?100000?? ?620640?? ?65792 ?65795?? >> ?51231?? ?61749?? ?65824?? ?51232?? ?65792 >> 2?? ?2018-09-05 10:15:00?? ?100000?? ?620640?? ?65792 ?65795?? >> ?51231?? ?61749?? ?65824?? ?51232?? ?65792 >> 3?? ?2018-09-05 10:20:00?? ?100000?? ?620640?? ?65792 ?65795?? >> ?51231?? ?61749?? ?65824?? ?51232?? ?65792 >> 4?? ?2018-09-05 10:25:00?? ?100000?? ?620640?? ?65792 ?65795?? >> ?51231?? ?61749?? ?65824?? ?51232?? ?65792 >> 5?? ?2018-09-05 10:30:00?? ?100000?? ?620640?? ?65792 ?65795?? >> ?51231?? ?61749?? ?65824?? ?51232?? ?65792 >> 6?? ?2018-09-05 10:35:00?? ?100000?? ?620640?? ?65792 ?65795?? >> ?51231?? ?61749?? ?65824?? ?51232?? ?65792 >> 7?? ?2018-09-05 10:40:00?? ?100000?? ?620640?? ?65792 ?65795?? >> ?51231?? ?61749?? ?65824?? ?51232?? ?65792 >> 8?? ?2018-09-05 10:45:00?? ?100000?? ?620640?? ?65792 ?65795?? >> ?51231?? ?61749?? ?65824?? ?51232?? ?65792 >> 9?? ?2018-09-05 10:50:00?? ?100000?? ?620640?? ?65792 ?65795?? >> ?51231?? ?61749?? ?65824?? ?51232?? ?65792 >> 10?? ?2018-09-05 10:55:00?? ?100000?? ?620640?? ?65792 ?65795?? >> ?51231?? ?61749?? ?65824?? ?51232?? ?65792 >> 11?? ?2018-09-05 11:00:00?? ?100000?? ?620640?? ?65792 ?65795?? >> ?51231?? ?61749?? ?65824?? ?51232?? ?65792 >> 12?? ?2018-09-05 11:05:00?? ?100000?? ?620640?? ?65792 ?65795?? >> ?51231?? ?61749?? ?65824?? ?51232?? ?65792 >> 13?? ?2018-09-05 11:10:00?? ?100000?? ?620640?? ?65792 ?65795?? >> ?51231?? ?61749?? ?65824?? ?51232?? ?65792 >> . 
>> >> Mit freundlichen Gr??en / Kind regards >> >> Andreas Koeninger >> Scrum Master and Software Developer / Spectrum Scale GUI and REST API >> IBM Systems &Technology Group, Integrated Systems Development / M069 >> ------------------------------------------------------------------------------------------------------------------------------------------- >> IBM Deutschland >> Am Weiher 24 >> 65451 Kelsterbach >> Phone: +49-7034-643-0867 >> Mobile: +49-7034-643-0867 >> E-Mail: andreas.koeninger at de.ibm.com >> ------------------------------------------------------------------------------------------------------------------------------------------- >> IBM Deutschland Research & Development GmbH / Vorsitzende des >> Aufsichtsrats: Martina Koederitz >> Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / >> Registergericht: Amtsgericht Stuttgart, HRB 243294 >> >> ----- Original message ----- >> From: Sven Siebler >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> To: gpfsug-discuss at spectrumscale.org >> Cc: >> Subject: [gpfsug-discuss] Getting inode information with REST API V2 >> Date: Wed, Sep 5, 2018 9:37 AM >> Hi all, >> >> i just started to use the REST API for our monitoring and my >> question is >> concerning about how can i get information about allocated inodes with >> REST API V2 ? >> >> Up to now i use "mmlsfileset" directly, which gives me information on >> maximum and allocated inodes (mmdf for total/free/allocated inodes of >> the filesystem) >> >> If i use the REST API V2 with >> "filesystems//filesets?fields=:all:", i get all >> information except the allocated inodes. >> >> On the documentation >> (https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adm_apiv2getfilesystemfilesets.htm) >> i found: >> >> ?> "inodeSpace": "Inodes" >> ?> The number of inodes that are allocated for use by the fileset. >> >> but for me the inodeSpace looks more like the ID of the inodespace, >> instead of the number of allocated inodes. >> >> In the documentation example the API can give output like this: >> >> "filesetName" : "root", >> ??????? "filesystemName" : "gpfs0", >> ??????? "usage" : { >> ?? ? ?? ??? "allocatedInodes" : 100000, >> ?????? ? ?? "inodeSpaceFreeInodes" : 95962, >> ??????????? "inodeSpaceUsedInodes" : 4038, >> ?????? ? ?? "usedBytes" : 0, >> ??????? ? ? "usedInodes" : 4038 >> } >> >> but i could not retrieve such usage-fields in my queries. >> >> The only way for me to get inode information with REST is the >> usage of V1: >> >> https://REST_API_host:port/scalemgmt/v1/filesets?filesystemName=FileSystemName >> >> which gives exact the information of "mmlsfileset". >> >> But because V1 is deprecated i want to use V2 for rewriting our >> tools... >> >> Thanks, >> >> Sven >> >> >> -- >> Sven Siebler >> Servicebereich Future IT - Research & Education (FIRE) >> >> Tel. +49 6221 54 20032 >> sven.siebler at urz.uni-heidelberg.de >> Universit?t Heidelberg >> Universit?tsrechenzentrum (URZ) >> Im Neuenheimer Feld 293, D-69120 Heidelberg >> http://www.urz.uni-heidelberg.de >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> -- Sven Siebler Servicebereich Future IT - Research & Education (FIRE) Tel. 
+49 6221 54 20032
sven.siebler at urz.uni-heidelberg.de
Universität Heidelberg
Universitätsrechenzentrum (URZ)
Im Neuenheimer Feld 293, D-69120 Heidelberg
http://www.urz.uni-heidelberg.de

From olaf.weiser at de.ibm.com Wed Sep 5 16:01:23 2018
From: olaf.weiser at de.ibm.com (Olaf Weiser)
Date: Wed, 5 Sep 2018 17:01:23 +0200
Subject: [gpfsug-discuss] Top files on GPFS filesystem
In-Reply-To: 
References: 
Message-ID: 

An HTML attachment was scrubbed...
URL: 

From anobre at br.ibm.com Wed Sep 5 16:14:24 2018
From: anobre at br.ibm.com (Anderson Ferreira Nobre)
Date: Wed, 5 Sep 2018 15:14:24 +0000
Subject: [gpfsug-discuss] Top files on GPFS filesystem
In-Reply-To: 
References: 
Message-ID: 

An HTML attachment was scrubbed...
URL: 

From Kevin.Buterbaugh at Vanderbilt.Edu Wed Sep 5 16:34:51 2018
From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L)
Date: Wed, 5 Sep 2018 15:34:51 +0000
Subject: [gpfsug-discuss] RAID type for system pool
Message-ID: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu>

Hi All,

We are in the process of finalizing the purchase of some new storage arrays (so no sales people who might be monitoring this list need contact me) to life-cycle some older hardware. One of the things we are considering is the purchase of some new SSD's for our "/home" filesystem, and I have a question or two related to that.

Currently, the existing home filesystem has its metadata on SSD's - two RAID 1 mirrors and metadata replication set to two. However, the filesystem itself is old enough that it uses 512 byte inodes. We have analyzed our users' files and know that if we create a new filesystem with 4K inodes, a very significant portion of the files would now have their _data_ stored in the inode as well, due to the files being 3.5K or smaller (currently all data is on spinning HD RAID 1 mirrors). Of course, if we increase the size of the inodes by a factor of 8 then we also need 8 times as much space to store those inodes. Given that Enterprise class SSDs are still very expensive and our budget is not unlimited, we're trying to get the best bang for the buck.

We have always - even back in the day when our metadata was on spinning disk and not SSD - used RAID 1 mirrors and metadata replication of two. However, we are wondering if it might be possible to switch to RAID 5? Specifically, what we are considering doing is buying 8 new SSDs and creating two 3+1P RAID 5 LUNs (metadata replication would stay at two). That would give us 50% more usable space than if we configured those same 8 drives as four RAID 1 mirrors. Unfortunately, unless I'm misunderstanding something, that would mean that the RAID stripe size and the GPFS block size could not match. Therefore, even though we don't need the space, would we be much better off to buy 10 SSDs and create two 4+1P RAID 5 LUNs? I've searched the mailing list archives and scanned the DeveloperWorks wiki and even glanced at the GPFS documentation and haven't found anything that says "bad idea, Kevin"... ;-)

Expanding on this further ... if we just present those two RAID 5 LUNs to GPFS as NSDs then we can only have two NSD servers as primary for them. So another thing we're considering is to take those RAID 5 LUNs and further sub-divide them into a total of 8 logical volumes, each of which could be a GPFS NSD, and therefore would allow us to have each of our 8 NSD servers be primary for one of them. Even worse idea?!? Good idea? Anybody have any better ideas??? ;-)

Oh, and currently we're on GPFS 4.2.3-10, but are also planning on moving to GPFS 5.0.1-x before creating the new filesystem.

Thanks much...

Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633
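As a rough illustration of the stripe-alignment concern above, here is a small sketch (the 256 KiB per-disk segment size is an assumed example value, not a number from the post):

# Illustrative arithmetic only: can a full RAID 5 stripe line up with a
# power-of-two GPFS block size?
KiB = 1024
segment = 256 * KiB                 # assumed per-disk strip size on the RAID controller
for data_disks in (3, 4):           # 3+1P vs. 4+1P RAID 5
    stripe = data_disks * segment   # size of one full-stripe write
    power_of_two = (stripe & (stripe - 1)) == 0
    print("%d+1P: full stripe = %d KiB (%s)" % (
        data_disks, stripe // KiB,
        "power of two, can match a GPFS block size" if power_of_two
        else "not a power of two, cannot match"))

With 3 data disks the full stripe is 768 KiB, which no power-of-two GPFS block size can equal; with 4 data disks it is 1 MiB, which can.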
filesystem and I have a question or two related to that. Currently, the existing home filesystem has it?s metadata on SSD?s ? two RAID 1 mirrors and metadata replication set to two. However, the filesystem itself is old enough that it uses 512 byte inodes. We have analyzed our users files and know that if we create a new filesystem with 4K inodes that a very significant portion of the files would now have their _data_ stored in the inode as well due to the files being 3.5K or smaller (currently all data is on spinning HD RAID 1 mirrors). Of course, if we increase the size of the inodes by a factor of 8 then we also need 8 times as much space to store those inodes. Given that Enterprise class SSDs are still very expensive and our budget is not unlimited, we?re trying to get the best bang for the buck. We have always - even back in the day when our metadata was on spinning disk and not SSD - used RAID 1 mirrors and metadata replication of two. However, we are wondering if it might be possible to switch to RAID 5? Specifically, what we are considering doing is buying 8 new SSDs and creating two 3+1P RAID 5 LUNs (metadata replication would stay at two). That would give us 50% more usable space than if we configured those same 8 drives as four RAID 1 mirrors. Unfortunately, unless I?m misunderstanding something, mean that the RAID stripe size and the GPFS block size could not match. Therefore, even though we don?t need the space, would we be much better off to buy 10 SSDs and create two 4+1P RAID 5 LUNs? I?ve searched the mailing list archives and scanned the DeveloperWorks wiki and even glanced at the GPFS documentation and haven?t found anything that says ?bad idea, Kevin?? ;-) Expanding on this further ? if we just present those two RAID 5 LUNs to GPFS as NSDs then we can only have two NSD servers as primary for them. So another thing we?re considering is to take those RAID 5 LUNs and further sub-divide them into a total of 8 logical volumes, each of which could be a GPFS NSD and therefore would allow us to have each of our 8 NSD servers be primary for one of them. Even worse idea?!? Good idea? Anybody have any better ideas??? ;-) Oh, and currently we?re on GPFS 4.2.3-10, but are also planning on moving to GPFS 5.0.1-x before creating the new filesystem. Thanks much? ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From anobre at br.ibm.com Wed Sep 5 17:50:45 2018 From: anobre at br.ibm.com (Anderson Ferreira Nobre) Date: Wed, 5 Sep 2018 16:50:45 +0000 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> Message-ID: An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Wed Sep 5 18:20:00 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 5 Sep 2018 13:20:00 -0400 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> Message-ID: It's good to try to reason and think this out... 
But there's a good likelihood that we don't understand ALL the details, some of which may negatively impact performance - so no matter what scheme you come up with - test, test, and re-test before deploying and depending on it in production. Having said that, I'm pretty sure that old "spinning" RAID 5 implementations had horrible performance for GPFS metadata/system pool. Why? Among other things, the large stripe size vs the almost random small writes directed to system pool. That random-small-writes pattern won't change when we go to SSD RAID 5 - so you'd have to see if the SSD implementation is somehow smarter than an old fashioned RAID 5 implementation which I believe requires several physical reads and writes, for each "small" logical write. (Top decent google result I found quickly http://rickardnobel.se/raid-5-write-penalty/ But you will probably want to do more research!) Consider GPFS small write performance for: inode updates, log writes, small files (possibly in inode), directory updates, allocation map updates, index of indirect blocks. From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 09/05/2018 11:36 AM Subject: [gpfsug-discuss] RAID type for system pool Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, We are in the process of finalizing the purchase of some new storage arrays (so no sales people who might be monitoring this list need contact me) to life-cycle some older hardware. One of the things we are considering is the purchase of some new SSD?s for our ?/home? filesystem and I have a question or two related to that. Currently, the existing home filesystem has it?s metadata on SSD?s ? two RAID 1 mirrors and metadata replication set to two. However, the filesystem itself is old enough that it uses 512 byte inodes. We have analyzed our users files and know that if we create a new filesystem with 4K inodes that a very significant portion of the files would now have their _data_ stored in the inode as well due to the files being 3.5K or smaller (currently all data is on spinning HD RAID 1 mirrors). Of course, if we increase the size of the inodes by a factor of 8 then we also need 8 times as much space to store those inodes. Given that Enterprise class SSDs are still very expensive and our budget is not unlimited, we?re trying to get the best bang for the buck. We have always - even back in the day when our metadata was on spinning disk and not SSD - used RAID 1 mirrors and metadata replication of two. However, we are wondering if it might be possible to switch to RAID 5? Specifically, what we are considering doing is buying 8 new SSDs and creating two 3+1P RAID 5 LUNs (metadata replication would stay at two). That would give us 50% more usable space than if we configured those same 8 drives as four RAID 1 mirrors. Unfortunately, unless I?m misunderstanding something, mean that the RAID stripe size and the GPFS block size could not match. Therefore, even though we don?t need the space, would we be much better off to buy 10 SSDs and create two 4+1P RAID 5 LUNs? I?ve searched the mailing list archives and scanned the DeveloperWorks wiki and even glanced at the GPFS documentation and haven?t found anything that says ?bad idea, Kevin?? ;-) Expanding on this further ? if we just present those two RAID 5 LUNs to GPFS as NSDs then we can only have two NSD servers as primary for them. 
So another thing we?re considering is to take those RAID 5 LUNs and further sub-divide them into a total of 8 logical volumes, each of which could be a GPFS NSD and therefore would allow us to have each of our 8 NSD servers be primary for one of them. Even worse idea?!? Good idea? Anybody have any better ideas??? ;-) Oh, and currently we?re on GPFS 4.2.3-10, but are also planning on moving to GPFS 5.0.1-x before creating the new filesystem. Thanks much? ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Wed Sep 5 18:33:03 2018 From: bbanister at jumptrading.com (Bryan Banister) Date: Wed, 5 Sep 2018 17:33:03 +0000 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> Message-ID: <4bebae105b37448eab6226a68a23b47d@jumptrading.com> I agree with Anderson on his thoughts, mainly that if you want to go with RAID5 then you should analyze your current workload to see if it is mostly read operations or if you have more of a heavy write situation. Read-modify-write penalties and write amplification wearing problems on SSDs will become an issue for performance and life of the SSDs if you have a heavy metadata write workload. This also applies to the data in inode situation. The current workload can be inspected with standard iostat, mmdiag --iohist, mmpmon, and the GPFS perfmon stuff. We have SSDs in both RAID1 (metadata) and RAID5 configurations (data). We?re using the RAID controllers to split up the RAID sets into multiple virtual volumes so that we can have more NSD servers hosting the storage and increase the number of I/O commands (aka queue depth x N LUNs > queue depth x 1 LUN) being sent to the storage. Since there isn?t a seek penalty this is working well for us. As mentioned below, be sure to round-robin the ServerList for the NSDs to spread the load across servers. Hope that helps! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Anderson Ferreira Nobre Sent: Wednesday, September 5, 2018 11:51 AM To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] RAID type for system pool Note: External Email ________________________________ Hi Kevin, RAID5 is good when the read ratio of I/Os is 70% or more. About creating two RAID5 you need to consider the size of the disks and the time to rebuild the RAID in case of failure. Maybe a single RAID5 would be better because you have more disks working in the backend for a single RAID. I think since if you are using SSD disks the time to rebuild the RAID will always be fast. So you wouldn't need a RAID6. Maybe it's a good idea to read the manual of SAS RAID controller to see how long takes to rebuild the RAID in case of a failure. About the stripe size of controller vs block size in GPFS. This is just a guess, and you would need to do some performance test to make sure. You could consider the stripe width of RAID to be the block size of metadata. I think this is the best you can do. Break in several LUNs I consider a good idea for you don't have large queue length in the LUNs. 
Especially if the I/O profile is many I/Os with a small block size. Balancing the LUNs over the NSD servers is a best practice. Do not leave all the LUNs pointing to the first node. Just remember that when you create the NSDs, the device always corresponds to the first node in the server list. This can be laborious work. So to make things easier I create two NSD stanza files. The first one points everything at the first node, like this:

%nsd device=/dev/mapper/mpatha nsd=nsd001 servers=host1,host2,host3,host4 usage=metadataOnly failureGroup=1 pool=system
%nsd device=/dev/mapper/mpathb nsd=nsd002 servers=host1,host2,host3,host4 usage=metadataOnly failureGroup=1 pool=system

Then I use this stanza file to create the NSDs. And create a second stanza file that rotates the server list:

%nsd nsd=nsd001 servers=host1,host2,host3,host4 usage=metadataOnly failureGroup=1 pool=system
%nsd nsd=nsd002 servers=host2,host3,host4,host1 usage=metadataOnly failureGroup=1 pool=system

And apply the change with mmchnsd.

Abraços / Regards / Saludos,

Anderson Nobre
AIX & Power Consultant
Master Certified IT Specialist
IBM Systems Hardware Client Technical Team - IBM Systems Lab Services

[community_general_lab_services]
________________________________
Phone: 55-19-2132-4317
E-mail: anobre at br.ibm.com [IBM]

----- Original message -----
From: "Buterbaugh, Kevin L" >
Sent by: gpfsug-discuss-bounces at spectrumscale.org
To: gpfsug main discussion list >
Cc:
Subject: [gpfsug-discuss] RAID type for system pool
Date: Wed, Sep 5, 2018 12:35 PM

Hi All,

We are in the process of finalizing the purchase of some new storage arrays (so no sales people who might be monitoring this list need contact me) to life-cycle some older hardware. One of the things we are considering is the purchase of some new SSD's for our "/home" filesystem and I have a question or two related to that.

Currently, the existing home filesystem has its metadata on SSD's - two RAID 1 mirrors and metadata replication set to two. However, the filesystem itself is old enough that it uses 512 byte inodes. We have analyzed our users' files and know that if we create a new filesystem with 4K inodes that a very significant portion of the files would now have their _data_ stored in the inode as well due to the files being 3.5K or smaller (currently all data is on spinning HD RAID 1 mirrors).

Of course, if we increase the size of the inodes by a factor of 8 then we also need 8 times as much space to store those inodes. Given that Enterprise class SSDs are still very expensive and our budget is not unlimited, we're trying to get the best bang for the buck.

We have always - even back in the day when our metadata was on spinning disk and not SSD - used RAID 1 mirrors and metadata replication of two. However, we are wondering if it might be possible to switch to RAID 5? Specifically, what we are considering doing is buying 8 new SSDs and creating two 3+1P RAID 5 LUNs (metadata replication would stay at two). That would give us 50% more usable space than if we configured those same 8 drives as four RAID 1 mirrors.

Unfortunately, unless I'm misunderstanding something, that would mean that the RAID stripe size and the GPFS block size could not match. Therefore, even though we don't need the space, would we be much better off to buy 10 SSDs and create two 4+1P RAID 5 LUNs?

I've searched the mailing list archives and scanned the DeveloperWorks wiki and even glanced at the GPFS documentation and haven't found anything that says "bad idea, Kevin!" ;-)

Expanding on this further ...
if we just present those two RAID 5 LUNs to GPFS as NSDs then we can only have two NSD servers as primary for them. So another thing we?re considering is to take those RAID 5 LUNs and further sub-divide them into a total of 8 logical volumes, each of which could be a GPFS NSD and therefore would allow us to have each of our 8 NSD servers be primary for one of them. Even worse idea?!? Good idea? Anybody have any better ideas??? ;-) Oh, and currently we?re on GPFS 4.2.3-10, but are also planning on moving to GPFS 5.0.1-x before creating the new filesystem. Thanks much? ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential, or privileged information and/or personal data. If you are not the intended recipient, you are hereby notified that any review, dissemination, or copying of this email is strictly prohibited, and requested to notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request, or solicitation of any kind to buy, sell, subscribe, redeem, or perform any type of transaction of a financial product. Personal data, as defined by applicable data privacy laws, contained in this email may be processed by the Company, and any of its affiliated or related companies, for potential ongoing compliance and/or business-related purposes. You may have rights regarding your personal data; for information on exercising these rights or the Company?s treatment of personal data, please email datarequests at jumptrading.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: From stockf at us.ibm.com Wed Sep 5 18:37:24 2018 From: stockf at us.ibm.com (Frederick Stock) Date: Wed, 5 Sep 2018 13:37:24 -0400 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> Message-ID: Another option for saving space is to not keep 2 copies of the metadata within GPFS. The SSDs are mirrored so you have two copies though very likely they share a possible single point of failure and that could be a deal breaker. I have my doubts that RAID5 will perform well for the reasons Marc described but worth testing to see how it does perform. If you do test I presume you would also run equivalent tests with a RAID1 (mirrored) configuration. Regarding your point about making multiple volumes that would become GPFS NSDs for metadata. It has been my experience that for traditional RAID systems it is better to have many small metadata LUNs (more IO paths) then a few large metadata LUNs. This becomes less of an issue with ESS, i.e. there you can have a few metadata NSDs yet still get very good performance. 
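One way to make those RAID5 versus RAID1 tests comparable is to drive the same metadata-heavy workload against a scratch file system built on each candidate layout. A rough sketch, assuming mdtest and MPI are available and /gpfs/test is a throwaway file system on the LUNs under test (the directory name, process count and flags here are illustrative, so check them against the mdtest documentation for your build):

   # create/stat/remove many small files and directories per task
   mpirun -np 8 mdtest -d /gpfs/test/mdt -n 20000 -i 3 -u
   # repeat with file contents small enough to live in a 4K inode
   mpirun -np 8 mdtest -d /gpfs/test/mdt -n 20000 -i 3 -u -F -w 3072

Running the identical commands against both layouts and comparing the create/stat/unlink rates gives a reasonable proxy for the small-write behavior discussed above.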
Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Marc A Kaplan" To: gpfsug main discussion list Date: 09/05/2018 01:22 PM Subject: Re: [gpfsug-discuss] RAID type for system pool Sent by: gpfsug-discuss-bounces at spectrumscale.org It's good to try to reason and think this out... But there's a good likelihood that we don't understand ALL the details, some of which may negatively impact performance - so no matter what scheme you come up with - test, test, and re-test before deploying and depending on it in production. Having said that, I'm pretty sure that old "spinning" RAID 5 implementations had horrible performance for GPFS metadata/system pool. Why? Among other things, the large stripe size vs the almost random small writes directed to system pool. That random-small-writes pattern won't change when we go to SSD RAID 5 - so you'd have to see if the SSD implementation is somehow smarter than an old fashioned RAID 5 implementation which I believe requires several physical reads and writes, for each "small" logical write. (Top decent google result I found quickly http://rickardnobel.se/raid-5-write-penalty/But you will probably want to do more research!) Consider GPFS small write performance for: inode updates, log writes, small files (possibly in inode), directory updates, allocation map updates, index of indirect blocks. From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 09/05/2018 11:36 AM Subject: [gpfsug-discuss] RAID type for system pool Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, We are in the process of finalizing the purchase of some new storage arrays (so no sales people who might be monitoring this list need contact me) to life-cycle some older hardware. One of the things we are considering is the purchase of some new SSD?s for our ?/home? filesystem and I have a question or two related to that. Currently, the existing home filesystem has it?s metadata on SSD?s ? two RAID 1 mirrors and metadata replication set to two. However, the filesystem itself is old enough that it uses 512 byte inodes. We have analyzed our users files and know that if we create a new filesystem with 4K inodes that a very significant portion of the files would now have their _data_ stored in the inode as well due to the files being 3.5K or smaller (currently all data is on spinning HD RAID 1 mirrors). Of course, if we increase the size of the inodes by a factor of 8 then we also need 8 times as much space to store those inodes. Given that Enterprise class SSDs are still very expensive and our budget is not unlimited, we?re trying to get the best bang for the buck. We have always - even back in the day when our metadata was on spinning disk and not SSD - used RAID 1 mirrors and metadata replication of two. However, we are wondering if it might be possible to switch to RAID 5? Specifically, what we are considering doing is buying 8 new SSDs and creating two 3+1P RAID 5 LUNs (metadata replication would stay at two). That would give us 50% more usable space than if we configured those same 8 drives as four RAID 1 mirrors. Unfortunately, unless I?m misunderstanding something, mean that the RAID stripe size and the GPFS block size could not match. Therefore, even though we don?t need the space, would we be much better off to buy 10 SSDs and create two 4+1P RAID 5 LUNs? 
I've searched the mailing list archives and scanned the DeveloperWorks wiki and even glanced at the GPFS documentation and haven't found anything that says "bad idea, Kevin!" ;-)

Expanding on this further ... if we just present those two RAID 5 LUNs to GPFS as NSDs then we can only have two NSD servers as primary for them. So another thing we're considering is to take those RAID 5 LUNs and further sub-divide them into a total of 8 logical volumes, each of which could be a GPFS NSD and therefore would allow us to have each of our 8 NSD servers be primary for one of them. Even worse idea?!? Good idea?

Anybody have any better ideas??? ;-)

Oh, and currently we're on GPFS 4.2.3-10, but are also planning on moving to GPFS 5.0.1-x before creating the new filesystem.

Thanks much...

--
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From makaplan at us.ibm.com Wed Sep 5 19:05:35 2018
From: makaplan at us.ibm.com (Marc A Kaplan)
Date: Wed, 5 Sep 2018 14:05:35 -0400
Subject: [gpfsug-discuss] RAID type for system pool
In-Reply-To:
References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu>
Message-ID:

OR don't do RAID replication, but use GPFS triple replication.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/gif
Size: 21994 bytes
Desc: not available
URL:

From Rafael.Cezario at ibm.com Wed Sep 5 21:03:12 2018
From: Rafael.Cezario at ibm.com (Rafael Cezario)
Date: Wed, 5 Sep 2018 20:03:12 +0000
Subject: [gpfsug-discuss] mmbackup failed
Message-ID:

Hi All,

I have a filesystem "/dados" with 900TB of data. I have a backup routine with mmbackup and I receive several errors because of incorrect values in the file .mmbackupShadow.1. The problem was resolved after I removed the offending lines from the file.

Does anyone have a tool or utility to help me check the .mmbackupShadow file for incorrect rows?

Rafael Cezario
IBM Power
rafael.cezario at ibm.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From UWEFALKE at de.ibm.com Wed Sep 5 21:07:22 2018
From: UWEFALKE at de.ibm.com (Uwe Falke)
Date: Wed, 5 Sep 2018 22:07:22 +0200
Subject: [gpfsug-discuss] RAID type for system pool
In-Reply-To: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu>
References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu>
Message-ID:

Hi,

just think that your RAID controller on parity-backed redundancy needs to read the full stripe, modify it, and write it back (including parity) - the infamous Read-Modify-Write penalty. Even if your users don't bulk-create inodes and only amend some metadata now and then (create a file sometimes, e.g.), the writing of a 4k inode or the update of a 32k dir block causes your controller to read a full block (let's say you use 1MiB on MD) and write back the full block plus parity (on 4+1p RAID 5 at 1MiB that'll be 1.25MiB): overhead two orders of magnitude above the payload.
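To put rough numbers on that write amplification, here is a back-of-envelope sketch only; it assumes 256 KiB strips so that four data strips line up with a 1 MiB metadata block, and it ignores any help from controller write cache:

   payload_kib=4                                # one 4 KiB inode update
   strip_kib=256                                # 4 data strips of 256 KiB = 1 MiB stripe
   naive_kib=$(( 4*strip_kib + 5*strip_kib ))   # read the data stripe, write data + parity back
   smart_kib=$(( 2*strip_kib + 2*strip_kib ))   # read old data strip + parity strip, write both back
   echo "full-stripe RMW moves $(( naive_kib / payload_kib ))x the payload"
   echo "strip-level RMW moves $(( smart_kib / payload_kib ))x the payload"

Either way the array moves a few hundred times more data than the 4 KiB that actually changed, which is the penalty a mirrored layout avoids.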
SSDs have become better now and expensive enterprise SSDs will endure quite a lot of full rewrites, but you need to estimate the MD change rate, apply the RMW overhead and see where you end WRT lifetime (and performance). Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 05/09/2018 17:35 Subject: [gpfsug-discuss] RAID type for system pool Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, We are in the process of finalizing the purchase of some new storage arrays (so no sales people who might be monitoring this list need contact me) to life-cycle some older hardware. One of the things we are considering is the purchase of some new SSD?s for our ?/home? filesystem and I have a question or two related to that. Currently, the existing home filesystem has it?s metadata on SSD?s ? two RAID 1 mirrors and metadata replication set to two. However, the filesystem itself is old enough that it uses 512 byte inodes. We have analyzed our users files and know that if we create a new filesystem with 4K inodes that a very significant portion of the files would now have their _data_ stored in the inode as well due to the files being 3.5K or smaller (currently all data is on spinning HD RAID 1 mirrors). Of course, if we increase the size of the inodes by a factor of 8 then we also need 8 times as much space to store those inodes. Given that Enterprise class SSDs are still very expensive and our budget is not unlimited, we?re trying to get the best bang for the buck. We have always - even back in the day when our metadata was on spinning disk and not SSD - used RAID 1 mirrors and metadata replication of two. However, we are wondering if it might be possible to switch to RAID 5? Specifically, what we are considering doing is buying 8 new SSDs and creating two 3+1P RAID 5 LUNs (metadata replication would stay at two). That would give us 50% more usable space than if we configured those same 8 drives as four RAID 1 mirrors. Unfortunately, unless I?m misunderstanding something, mean that the RAID stripe size and the GPFS block size could not match. Therefore, even though we don?t need the space, would we be much better off to buy 10 SSDs and create two 4+1P RAID 5 LUNs? I?ve searched the mailing list archives and scanned the DeveloperWorks wiki and even glanced at the GPFS documentation and haven?t found anything that says ?bad idea, Kevin?? ;-) Expanding on this further ? if we just present those two RAID 5 LUNs to GPFS as NSDs then we can only have two NSD servers as primary for them. So another thing we?re considering is to take those RAID 5 LUNs and further sub-divide them into a total of 8 logical volumes, each of which could be a GPFS NSD and therefore would allow us to have each of our 8 NSD servers be primary for one of them. Even worse idea?!? 
Good idea? Anybody have any better ideas??? ;-) Oh, and currently we?re on GPFS 4.2.3-10, but are also planning on moving to GPFS 5.0.1-x before creating the new filesystem. Thanks much? ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From alex at calicolabs.com Wed Sep 5 21:13:17 2018 From: alex at calicolabs.com (Alex Chekholko) Date: Wed, 5 Sep 2018 13:13:17 -0700 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> Message-ID: Hi Kevin, Why not do single SSD devices and then just use -m DefaultMetadataReplicas = 3 and -M MaxMetadataReplicas = 3 for your mmcrfs ? And maybe you can even get away with -m 2 -M 3. You will get higher performance overall by having more devices. You will get good redundancy with GPFS replicas (just make sure your failure groups make sense). Maybe you can split your SSDs across different shelves or RAID controllers or something. In any case, if you are creating a new filesystem, you can test all this out. Regards, Alex On Wed, Sep 5, 2018 at 1:07 PM Uwe Falke wrote: > Hi, > > just think that your RAID controller on parity-backed redundancy needs to > read the full stripe, modify it, and write it back (including parity) - > the infamous Read-Modify-Write penalty. > As long as your users don't bulk-create inodes and doo amend some > metadata, (create a file sometimes, e.g.) The writing of a 4k inode, or > the update of a 32k dir block causes your controller to read a full block > (let's say you use 1MiB on MD) and write back the full block plus parity > (on 4+1p RAID 5 at 1MiB that'll be 1.25MiB. Overhead two orders of > magnitude above the payload. > SSDs have become better now and expensive enterprise SSDs will endure > quite a lot of full rewrites, but you need to estimate the MD change rate, > apply the RMW overhead and see where you end WRT lifetime (and > performance). > > > > > Mit freundlichen Gr??en / Kind regards > > > Dr. Uwe Falke > > IT Specialist > High Performance Computing Services / Integrated Technology Services / > Data Center Services > > ------------------------------------------------------------------------------------------------------------------------------------------- > IBM Deutschland > Rathausstr. 7 > 09111 Chemnitz > Phone: +49 371 6978 2165 > Mobile: +49 175 575 2877 > E-Mail: uwefalke at de.ibm.com > > ------------------------------------------------------------------------------------------------------------------------------------------- > IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: > Thomas Wolter, Sven Schoo? > Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, > HRB 17122 > > > > > From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 05/09/2018 17:35 > Subject: [gpfsug-discuss] RAID type for system pool > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Hi All, > > We are in the process of finalizing the purchase of some new storage > arrays (so no sales people who might be monitoring this list need contact > me) to life-cycle some older hardware. One of the things we are > considering is the purchase of some new SSD?s for our ?/home? 
filesystem > and I have a question or two related to that. > > Currently, the existing home filesystem has it?s metadata on SSD?s ? two > RAID 1 mirrors and metadata replication set to two. However, the > filesystem itself is old enough that it uses 512 byte inodes. We have > analyzed our users files and know that if we create a new filesystem with > 4K inodes that a very significant portion of the files would now have > their _data_ stored in the inode as well due to the files being 3.5K or > smaller (currently all data is on spinning HD RAID 1 mirrors). > > Of course, if we increase the size of the inodes by a factor of 8 then we > also need 8 times as much space to store those inodes. Given that > Enterprise class SSDs are still very expensive and our budget is not > unlimited, we?re trying to get the best bang for the buck. > > We have always - even back in the day when our metadata was on spinning > disk and not SSD - used RAID 1 mirrors and metadata replication of two. > However, we are wondering if it might be possible to switch to RAID 5? > Specifically, what we are considering doing is buying 8 new SSDs and > creating two 3+1P RAID 5 LUNs (metadata replication would stay at two). > That would give us 50% more usable space than if we configured those same > 8 drives as four RAID 1 mirrors. > > Unfortunately, unless I?m misunderstanding something, mean that the RAID > stripe size and the GPFS block size could not match. Therefore, even > though we don?t need the space, would we be much better off to buy 10 SSDs > and create two 4+1P RAID 5 LUNs? > > I?ve searched the mailing list archives and scanned the DeveloperWorks > wiki and even glanced at the GPFS documentation and haven?t found anything > that says ?bad idea, Kevin?? ;-) > > Expanding on this further ? if we just present those two RAID 5 LUNs to > GPFS as NSDs then we can only have two NSD servers as primary for them. So > another thing we?re considering is to take those RAID 5 LUNs and further > sub-divide them into a total of 8 logical volumes, each of which could be > a GPFS NSD and therefore would allow us to have each of our 8 NSD servers > be primary for one of them. Even worse idea?!? Good idea? > > Anybody have any better ideas??? ;-) > > Oh, and currently we?re on GPFS 4.2.3-10, but are also planning on moving > to GPFS 5.0.1-x before creating the new filesystem. > > Thanks much? > > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and > Education > Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stockf at us.ibm.com Wed Sep 5 21:20:51 2018 From: stockf at us.ibm.com (Frederick Stock) Date: Wed, 5 Sep 2018 16:20:51 -0400 Subject: [gpfsug-discuss] mmbackup failed In-Reply-To: References: Message-ID: There are options in the mmbackup command to rebuild the shadowDB file from data kept in TSM. Be aware that using this option will take time to rebuild the shadowDB file, i.e. it is not a fast procedure. 
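If I recall the options correctly it is something along these lines (the /dados file system name is taken from your note; the exact semantics of the two flags are worth double-checking in the mmbackup man page for your release before running them):

   # rebuild the shadow database from the Spectrum Protect / TSM server inventory
   mmbackup /dados --rebuild
   # or resynchronize the shadow database by querying the server as part of the next incremental
   mmbackup /dados -t incremental -q

On a 900TB file system expect the query of the server inventory to run for quite a while.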
Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Rafael Cezario" To: gpfsug-discuss at spectrumscale.org Date: 09/05/2018 04:04 PM Subject: [gpfsug-discuss] mmbackup failed Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, I have a filesystem ?/dados? with 900TB of data. I have a backup routine with mmbackup and I receive several errors because incorrect values in the file .mmbackupShadow.1. The problem was resolved after I removed the lines of the file. Anyone had any a tool or utility to help me check the file .mmbackupShadow looking incorrect rows? Rafael Cezario IBM Power rafael.cezario at ibm.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From ulmer at ulmer.org Wed Sep 5 21:33:55 2018 From: ulmer at ulmer.org (Stephen Ulmer) Date: Wed, 5 Sep 2018 16:33:55 -0400 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> Message-ID: > On Sep 5, 2018, at 11:34 AM, Buterbaugh, Kevin L > wrote: > > [?] > Of course, if we increase the size of the inodes by a factor of 8 then we also need 8 times as much space to store those inodes. Given that Enterprise class SSDs are still very expensive and our budget is not unlimited, we?re trying to get the best bang for the buck. > Nobody has gone in this direction yet, so I?ll play devil?s advocate: Are you sure you need enterprise class SSDs? The only practical difference between enterprise class SSDs and "read intensive" SSDs is the "endurance" in DWPD[1]. Read-intensive SSDs usually have a DWPD of 1-ish. Enterprise SSDs can have a DWPD as high as 30. So, how many times do you think you?ll actually write all of the data on the SSDs per day? I don?t know how much (meta)data you?ve got, but maybe consider buying the "cheap" SSDs (which will be *much* larger for your dollar) and just use fractions of them with GPFS replication[2] or maybe some vendor?s {distributed, de-clustererd} RAID. Keep some spares. This is probably bad advice, but the thought exercise will let you find the edges of what you meant. :) [1] DWPD = Drive Writes Per Day ? write all of the cells on the entire storage device every 24 hours. [2] Okay, somebody already said to use GPFS replication. ;) -- Stephen > We have always - even back in the day when our metadata was on spinning disk and not SSD - used RAID 1 mirrors and metadata replication of two. However, we are wondering if it might be possible to switch to RAID 5? Specifically, what we are considering doing is buying 8 new SSDs and creating two 3+1P RAID 5 LUNs (metadata replication would stay at two). That would give us 50% more usable space than if we configured those same 8 drives as four RAID 1 mirrors. > > Unfortunately, unless I?m misunderstanding something, mean that the RAID stripe size and the GPFS block size could not match. Therefore, even though we don?t need the space, would we be much better off to buy 10 SSDs and create two 4+1P RAID 5 LUNs? > > I?ve searched the mailing list archives and scanned the DeveloperWorks wiki and even glanced at the GPFS documentation and haven?t found anything that says ?bad idea, Kevin?? ;-) > > Expanding on this further ? 
if we just present those two RAID 5 LUNs to GPFS as NSDs then we can only have two NSD servers as primary for them. So another thing we?re considering is to take those RAID 5 LUNs and further sub-divide them into a total of 8 logical volumes, each of which could be a GPFS NSD and therefore would allow us to have each of our 8 NSD servers be primary for one of them. Even worse idea?!? Good idea? > > Anybody have any better ideas??? ;-) > > Oh, and currently we?re on GPFS 4.2.3-10, but are also planning on moving to GPFS 5.0.1-x before creating the new filesystem. > > Thanks much? > > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and Education > Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Wed Sep 5 23:42:05 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Wed, 5 Sep 2018 18:42:05 -0400 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> Message-ID: <5b172b0f-8a35-ec63-f122-7728aab25564@nasa.gov> I've heard it highly recommended (and have been *really* glad at times to have it) to have at least 2 replicas of metadata to help maintain fs consistency in the event of fs issues or hardware bugs (e.g. a torn write). -Aaron On 9/5/18 1:37 PM, Frederick Stock wrote: > Another option for saving space is to not keep 2 copies of the metadata > within GPFS. ?The SSDs are mirrored so you have two copies though very > likely they share a possible single point of failure and that could be a > deal breaker. ?I have my doubts that RAID5 will perform well for the > reasons Marc described but worth testing to see how it does perform. ?If > you do test I presume you would also run equivalent tests with a RAID1 > (mirrored) configuration. > > Regarding your point about making multiple volumes that would become > GPFS NSDs for metadata. ?It has been my experience that for traditional > RAID systems it is better to have many small metadata LUNs (more IO > paths) then a few large metadata LUNs. ?This becomes less of an issue > with ESS, i.e. there you can have a few metadata NSDs yet still get very > good performance. > > Fred > __________________________________________________ > Fred Stock | IBM Pittsburgh Lab | 720-430-8821 > stockf at us.ibm.com > > > > From: "Marc A Kaplan" > To: gpfsug main discussion list > Date: 09/05/2018 01:22 PM > Subject: Re: [gpfsug-discuss] RAID type for system pool > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------------------------------------------------ > > > > It's good to try to reason and think this out... But there's a good > likelihood that we don't understand ALL the details, some of which may > negatively impact performance - so no matter what scheme you come up > with - test, test, and re-test before deploying and depending on it in > production. > > Having said that, I'm pretty sure that old "spinning" RAID 5 > implementations had horrible performance for GPFS metadata/system pool. > Why? Among other things, the large stripe size vs the almost random > small writes directed to system pool. 
> > That random-small-writes pattern won't change when we go to SSD RAID 5 - > so you'd have to see if the SSD implementation is somehow smarter than > an old fashioned RAID 5 implementation which I believe requires several > physical reads and writes, for each "small" logical write. > (Top decent google result I found quickly > _http://rickardnobel.se/raid-5-write-penalty/_But you will probably want > to do more research!) > > Consider GPFS small write performance for: ?inode updates, log writes, > small files (possibly in inode), directory updates, allocation map > updates, index of indirect blocks. > > > > From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 09/05/2018 11:36 AM > Subject: [gpfsug-discuss] RAID type for system pool > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------------------------------------------------ > > > > Hi All, > > We are in the process of finalizing the purchase of some new storage > arrays (so no sales people who might be monitoring this list need > contact me) to life-cycle some older hardware. ?One of the things we are > considering is the purchase of some new SSD?s for our ?/home? filesystem > and I have a question or two related to that. > > Currently, the existing home filesystem has it?s metadata on SSD?s ? two > RAID 1 mirrors and metadata replication set to two. ?However, the > filesystem itself is old enough that it uses 512 byte inodes. ?We have > analyzed our users files and know that if we create a new filesystem > with 4K inodes that a very significant portion of the files would now > have their _data_ stored in the inode as well due to the files being > 3.5K or smaller (currently all data is on spinning HD RAID 1 mirrors). > > Of course, if we increase the size of the inodes by a factor of 8 then > we also need 8 times as much space to store those inodes. ?Given that > Enterprise class SSDs are still very expensive and our budget is not > unlimited, we?re trying to get the best bang for the buck. > > We have always - even back in the day when our metadata was on spinning > disk and not SSD - used RAID 1 mirrors and metadata replication of two. > ?However, we are wondering if it might be possible to switch to RAID 5? > ?Specifically, what we are considering doing is buying 8 new SSDs and > creating two 3+1P RAID 5 LUNs (metadata replication would stay at two). > ?That would give us 50% more usable space than if we configured those > same 8 drives as four RAID 1 mirrors. > > Unfortunately, unless I?m misunderstanding something, mean that the RAID > stripe size and the GPFS block size could not match. ?Therefore, even > though we don?t need the space, would we be much better off to buy 10 > SSDs and create two 4+1P RAID 5 LUNs? > > I?ve searched the mailing list archives and scanned the DeveloperWorks > wiki and even glanced at the GPFS documentation and haven?t found > anything that says ?bad idea, Kevin?? ;-) > > Expanding on this further ? if we just present those two RAID 5 LUNs to > GPFS as NSDs then we can only have two NSD servers as primary for them. > ?So another thing we?re considering is to take those RAID 5 LUNs and > further sub-divide them into a total of 8 logical volumes, each of which > could be a GPFS NSD and therefore would allow us to have each of our 8 > NSD servers be primary for one of them. ?Even worse idea?!? ?Good idea? > > Anybody have any better ideas??? 
?;-) > > Oh, and currently we?re on GPFS 4.2.3-10, but are also planning on > moving to GPFS 5.0.1-x before creating the new filesystem. > > Thanks much? > > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and > Education_ > __Kevin.Buterbaugh at vanderbilt.edu_ > - (615)875-9633 > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org_ > __http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From Achim.Rehor at de.ibm.com Thu Sep 6 09:15:58 2018 From: Achim.Rehor at de.ibm.com (Achim Rehor) Date: Thu, 6 Sep 2018 08:15:58 +0000 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> Message-ID: Hi Kevin, as you already pointed out, having a RAID stripe size (or a multiple of it) not matching GPFS blocksize, is a bad idea. Every write would cause a read-modify-write operation to keep the parity. So for data LUNs RAID5 with 4+P or 8+P is fully ok. For metadata, if you are keen on performance, I would stay with RAID1, or even RAID0, so you aren?t affected by possible RAID rebuild performance drops. Regards, Achim > Am 05.09.2018 um 17:35 schrieb Buterbaugh, Kevin L : > > Hi All, > > We are in the process of finalizing the purchase of some new storage arrays (so no sales people who might be monitoring this list need contact me) to life-cycle some older hardware. One of the things we are considering is the purchase of some new SSD?s for our ?/home? filesystem and I have a question or two related to that. > > Currently, the existing home filesystem has it?s metadata on SSD?s ? two RAID 1 mirrors and metadata replication set to two. However, the filesystem itself is old enough that it uses 512 byte inodes. We have analyzed our users files and know that if we create a new filesystem with 4K inodes that a very significant portion of the files would now have their _data_ stored in the inode as well due to the files being 3.5K or smaller (currently all data is on spinning HD RAID 1 mirrors). > > Of course, if we increase the size of the inodes by a factor of 8 then we also need 8 times as much space to store those inodes. Given that Enterprise class SSDs are still very expensive and our budget is not unlimited, we?re trying to get the best bang for the buck. > > We have always - even back in the day when our metadata was on spinning disk and not SSD - used RAID 1 mirrors and metadata replication of two. However, we are wondering if it might be possible to switch to RAID 5? Specifically, what we are considering doing is buying 8 new SSDs and creating two 3+1P RAID 5 LUNs (metadata replication would stay at two). That would give us 50% more usable space than if we configured those same 8 drives as four RAID 1 mirrors. > > Unfortunately, unless I?m misunderstanding something, mean that the RAID stripe size and the GPFS block size could not match. 
Therefore, even though we don?t need the space, would we be much better off to buy 10 SSDs and create two 4+1P RAID 5 LUNs? > > I?ve searched the mailing list archives and scanned the DeveloperWorks wiki and even glanced at the GPFS documentation and haven?t found anything that says ?bad idea, Kevin?? ;-) > > Expanding on this further ? if we just present those two RAID 5 LUNs to GPFS as NSDs then we can only have two NSD servers as primary for them. So another thing we?re considering is to take those RAID 5 LUNs and further sub-divide them into a total of 8 logical volumes, each of which could be a GPFS NSD and therefore would allow us to have each of our 8 NSD servers be primary for one of them. Even worse idea?!? Good idea? > > Anybody have any better ideas??? ;-) > > Oh, and currently we?re on GPFS 4.2.3-10, but are also planning on moving to GPFS 5.0.1-x before creating the new filesystem. > > Thanks much? > > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and Education > Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From luis.bolinches at fi.ibm.com Thu Sep 6 09:32:11 2018 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Thu, 6 Sep 2018 08:32:11 +0000 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Thu Sep 6 11:45:39 2018 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Thu, 06 Sep 2018 11:45:39 +0100 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> Message-ID: <1536230739.17046.18.camel@strath.ac.uk> On Wed, 2018-09-05 at 13:37 -0400, Frederick Stock wrote: > Another option for saving space is to not keep 2 copies of the > metadata within GPFS. ?The SSDs are mirrored so you have two copies > though very likely they share a possible single point of failure and > that could be a deal breaker. ?I have my doubts that RAID5 will > perform well for the reasons Marc described but worth testing to see > how it does perform. ?If you do test I presume you would also run > equivalent tests with a RAID1 (mirrored) configuration. > When you have been on the wrong end of a double disk failure in a RAID1 when the second disk failed during the rebuild you probably want to steer clear of such recklessness on a multi TB GPFS file system :-) JAB. -- Jonathan A. Buzzard?????????????????????????Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From Kevin.Buterbaugh at Vanderbilt.Edu Thu Sep 6 15:58:39 2018 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Thu, 6 Sep 2018 14:58:39 +0000 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> Message-ID: <3B1C05ED-3058-4644-BC54-5BD1ED583C88@vanderbilt.edu> Hi All, Wow - my query got more responses than I expected and my sincere thanks to all who took the time to respond! At this point in time we do have two GPFS filesystems ? one which is basically ?/home? and some software installations and the other which is ?/scratch? and ?/data? (former backed up, latter not). Both of them have their metadata on SSDs set up as RAID 1 mirrors and replication set to two. 
But at this point in time all of the SSDs are in a single storage array (albeit with dual redundant controllers) ? so the storage array itself is my only SPOF. As part of the hardware purchase we are in the process of making we will be buying a 2nd storage array that can house 2.5? SSDs. Therefore, we will be splitting our SSDs between chassis and eliminating that last SPOF. Of course, this includes the new SSDs we are getting for our new /home filesystem. Our plan right now is to buy 10 SSDs, which will allow us to test 3 configurations: 1) two 4+1P RAID 5 LUNs split up into a total of 8 LV?s (with each of my 8 NSD servers as primary for one of those LV?s and the other 7 as backups) and GPFS metadata replication set to 2. 2) four RAID 1 mirrors (which obviously leaves 2 SSDs unused) and GPFS metadata replication set to 2. This would mean that only 4 of my 8 NSD servers would be a primary. 3) nine RAID 0 / bare drives with GPFS metadata replication set to 3 (which leaves 1 SSD unused). All 8 NSD servers primary for one SSD and 1 serving up two. The responses I received concerning RAID 5 and performance were not a surprise to me. The main advantage that option gives is the most usable storage space for the money (in fact, it gives us way more storage space than we currently need) ? but if it tanks performance, then that?s a deal breaker. Personally, I like the four RAID 1 mirrors config like we?ve been using for years, but it has the disadvantage of giving us the least usable storage space ? that config would give us the minimum we need for right now, but doesn?t really allow for much future growth. I have no experience with metadata replication of 3 (but had actually thought of that option, so feel good that others suggested it) so option 3 will be a brand new experience for us. It is the most optimal in terms of meeting current needs plus allowing for future growth without giving us way more space than we are likely to need). I will be curious to see how long it takes GPFS to re-replicate the data when we simulate a drive failure as opposed to how long a RAID rebuild takes. I am a big believer in Murphy?s Law (Sunday I paid off a bill, Wednesday my refrigerator died!) ? and also believe that the definition of a pessimist is ?someone with experience? ? so we will definitely not set GPFS metadata replication to less than two, nor will we use non-Enterprise class SSDs for metadata ? but I do still appreciate the suggestions. If there is interest, I will report back on our findings. If anyone has any additional thoughts or suggestions, I?d also appreciate hearing them. Again, thank you! Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Thu Sep 6 16:20:43 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Thu, 6 Sep 2018 11:20:43 -0400 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> Message-ID: Perhaps repeating myself, but consider no-RAID or RAID "0" and -M MaxMetadataReplicas Specifies the default maximum number of copies of inodes, directories, and indirect blocks for a file. Valid values are 1, 2, and 3. This value cannot be less than the value of DefaultMetadataReplicas. The default is 2. SO you can have triple redundancy with no shared physical point of failure. 
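A minimal sketch of what that looks like at file system creation time (the NSD names, device paths and server names below are made up; the point is simply that each bare SSD sits in its own failure group, ideally behind a different enclosure and server pair, so the three metadata copies never share a physical point of failure):

   %nsd: device=/dev/sdx nsd=md_ssd01 servers=nsd01,nsd02 usage=metadataOnly failureGroup=101 pool=system
   %nsd: device=/dev/sdy nsd=md_ssd02 servers=nsd03,nsd04 usage=metadataOnly failureGroup=102 pool=system
   %nsd: device=/dev/sdz nsd=md_ssd03 servers=nsd05,nsd06 usage=metadataOnly failureGroup=103 pool=system

   mmcrnsd -F md_nsd.stanza
   mmcrfs home -F md_nsd.stanza -m 3 -M 3 -r 1 -R 2 ...

Here -m/-M set the default and maximum metadata replicas and -r/-R do the same for data; the data NSDs and the remaining mmcrfs options are omitted from the sketch.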
When you depend a particular RAID controller to do replication and subsequent recovery for you, then you are depending on that RAID controller. Of course, when you take this point of view to the extreme, you realize that for any individual datum you are depending on the single generator or source of that datum being correct, the OS and filesystem software and CPU, etc, etc.... Until you get to the point just beyond where the datum is replicated... -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Thu Sep 6 17:09:10 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Thu, 6 Sep 2018 12:09:10 -0400 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> Message-ID: A somewhat smarter RAID controller will "only" need to read the old values of the single changed segment of data and the corresponding parity segment, and know the new value of the data block. Then it can compute the new parity segment value. Not necessarily the entire stripe. Still 2 reads and 2 writes + access delay times ( guaranteed more than one full rotation time when on spinning disks, average something like 1.7x rotation time ). From: "Uwe Falke" To: gpfsug main discussion list Date: 09/05/2018 04:07 PM Subject: Re: [gpfsug-discuss] RAID type for system pool Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, just think that your RAID controller on parity-backed redundancy needs to read the full stripe, modify it, and write it back (including parity) - the infamous Read-Modify-Write penalty. As long as your users don't bulk-create inodes and doo amend some metadata, (create a file sometimes, e.g.) The writing of a 4k inode, or the update of a 32k dir block causes your controller to read a full block (let's say you use 1MiB on MD) and write back the full block plus parity (on 4+1p RAID 5 at 1MiB that'll be 1.25MiB. Overhead two orders of magnitude above the payload. SSDs have become better now and expensive enterprise SSDs will endure quite a lot of full rewrites, but you need to estimate the MD change rate, apply the RMW overhead and see where you end WRT lifetime (and performance). Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 05/09/2018 17:35 Subject: [gpfsug-discuss] RAID type for system pool Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, We are in the process of finalizing the purchase of some new storage arrays (so no sales people who might be monitoring this list need contact me) to life-cycle some older hardware. One of the things we are considering is the purchase of some new SSD?s for our ?/home? filesystem and I have a question or two related to that. 
Currently, the existing home filesystem has its metadata on SSDs - two RAID 1 mirrors and metadata replication set to two. However, the filesystem itself is old enough that it uses 512 byte inodes. We have analyzed our users' files and know that if we create a new filesystem with 4K inodes, a very significant portion of the files would now have their _data_ stored in the inode as well, due to the files being 3.5K or smaller (currently all data is on spinning HD RAID 1 mirrors). Of course, if we increase the size of the inodes by a factor of 8, then we also need 8 times as much space to store those inodes. Given that Enterprise class SSDs are still very expensive and our budget is not unlimited, we're trying to get the best bang for the buck.

We have always - even back in the day when our metadata was on spinning disk and not SSD - used RAID 1 mirrors and metadata replication of two. However, we are wondering if it might be possible to switch to RAID 5? Specifically, what we are considering doing is buying 8 new SSDs and creating two 3+1P RAID 5 LUNs (metadata replication would stay at two). That would give us 50% more usable space than if we configured those same 8 drives as four RAID 1 mirrors. Unfortunately, unless I'm misunderstanding something, that would mean that the RAID stripe size and the GPFS block size could not match. Therefore, even though we don't need the space, would we be much better off to buy 10 SSDs and create two 4+1P RAID 5 LUNs? I've searched the mailing list archives, scanned the DeveloperWorks wiki, and even glanced at the GPFS documentation, and haven't found anything that says "bad idea, Kevin!" ;-)

Expanding on this further - if we just present those two RAID 5 LUNs to GPFS as NSDs, then we can only have two NSD servers as primary for them. So another thing we're considering is to take those RAID 5 LUNs and further sub-divide them into a total of 8 logical volumes, each of which could be a GPFS NSD, which would allow us to have each of our 8 NSD servers be primary for one of them. Even worse idea?!? Good idea? Anybody have any better ideas??? ;-)

Oh, and currently we're on GPFS 4.2.3-10, but are also planning on moving to GPFS 5.0.1-x before creating the new filesystem.

Thanks much...

--
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From bbanister at jumptrading.com Thu Sep 6 17:19:00 2018
From: bbanister at jumptrading.com (Bryan Banister)
Date: Thu, 6 Sep 2018 16:19:00 +0000
Subject: [gpfsug-discuss] RAID type for system pool
In-Reply-To: <3B1C05ED-3058-4644-BC54-5BD1ED583C88@vanderbilt.edu>
References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> <3B1C05ED-3058-4644-BC54-5BD1ED583C88@vanderbilt.edu>
Message-ID: <269a9d7aeb3f4597adb22dfb3c2d8365@jumptrading.com>

I have questions about how the GPFS metadata replication of 3 works.

1. Is it basically the same as replication of 2 but just have one more copy, making recovery much more likely?

2.
If there is nothing that is checking that the data was correctly read off of the device (e.g. CRC checking ON READS like the DDNs do, T10PI or Data Integrity Field), then how does GPFS handle a corrupted read of the data? - unlikely with SSD, but a head could be off on a NLSAS read, no errors, but you get some garbage instead, plus no auto retries

3. Does GPFS read at least two of the three replicas and compare them to ensure the data is correct? - expensive operation, so very unlikely

4. If not reading multiple replicas for comparison, are reads round robin across all three copies?

5. If one replica is corrupted (bad blocks), what does GPFS do to recover this metadata copy? Is this automatic or does this require a manual `mmrestripefs -c` operation or something? - If not, seems like a pretty simple idea and maybe an RFE worthy submission

6. Would the idea of an option to run "background scrub/verifies" of the data/metadata be worthwhile to ensure no hidden bad blocks? - Using QoS this should be relatively painless

7. With a drive failure do you have to delete the NSD from the file system and cluster, recreate the NSD, add it back to the FS, then again run the `mmrestripefs -c` operation to restore the replication? - As Kevin mentions, this will end up being a FULL file system scan vs. a block-based scan and replication. That could take a long time depending on the number of inodes and type of storage!

Thanks for any insight,
-Bryan

From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Buterbaugh, Kevin L
Sent: Thursday, September 6, 2018 9:59 AM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] RAID type for system pool

Note: External Email
________________________________

Hi All,

Wow - my query got more responses than I expected and my sincere thanks to all who took the time to respond!

At this point in time we do have two GPFS filesystems - one which is basically "/home" and some software installations, and the other which is "/scratch" and "/data" (former backed up, latter not). Both of them have their metadata on SSDs set up as RAID 1 mirrors and replication set to two. But at this point in time all of the SSDs are in a single storage array (albeit with dual redundant controllers) - so the storage array itself is my only SPOF.

As part of the hardware purchase we are in the process of making, we will be buying a 2nd storage array that can house 2.5" SSDs. Therefore, we will be splitting our SSDs between chassis and eliminating that last SPOF. Of course, this includes the new SSDs we are getting for our new /home filesystem.

Our plan right now is to buy 10 SSDs, which will allow us to test 3 configurations:

1) two 4+1P RAID 5 LUNs split up into a total of 8 LVs (with each of my 8 NSD servers as primary for one of those LVs and the other 7 as backups) and GPFS metadata replication set to 2.

2) four RAID 1 mirrors (which obviously leaves 2 SSDs unused) and GPFS metadata replication set to 2. This would mean that only 4 of my 8 NSD servers would be a primary.

3) nine RAID 0 / bare drives with GPFS metadata replication set to 3 (which leaves 1 SSD unused). All 8 NSD servers primary for one SSD and 1 serving up two.

The responses I received concerning RAID 5 and performance were not a surprise to me. The main advantage that option gives is the most usable storage space for the money (in fact, it gives us way more storage space than we currently need) - but if it tanks performance, then that's a deal breaker.
Personally, I like the four RAID 1 mirrors config like we?ve been using for years, but it has the disadvantage of giving us the least usable storage space ? that config would give us the minimum we need for right now, but doesn?t really allow for much future growth. I have no experience with metadata replication of 3 (but had actually thought of that option, so feel good that others suggested it) so option 3 will be a brand new experience for us. It is the most optimal in terms of meeting current needs plus allowing for future growth without giving us way more space than we are likely to need). I will be curious to see how long it takes GPFS to re-replicate the data when we simulate a drive failure as opposed to how long a RAID rebuild takes. I am a big believer in Murphy?s Law (Sunday I paid off a bill, Wednesday my refrigerator died!) ? and also believe that the definition of a pessimist is ?someone with experience? ? so we will definitely not set GPFS metadata replication to less than two, nor will we use non-Enterprise class SSDs for metadata ? but I do still appreciate the suggestions. If there is interest, I will report back on our findings. If anyone has any additional thoughts or suggestions, I?d also appreciate hearing them. Again, thank you! Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential, or privileged information and/or personal data. If you are not the intended recipient, you are hereby notified that any review, dissemination, or copying of this email is strictly prohibited, and requested to notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request, or solicitation of any kind to buy, sell, subscribe, redeem, or perform any type of transaction of a financial product. Personal data, as defined by applicable data privacy laws, contained in this email may be processed by the Company, and any of its affiliated or related companies, for potential ongoing compliance and/or business-related purposes. You may have rights regarding your personal data; for information on exercising these rights or the Company?s treatment of personal data, please email datarequests at jumptrading.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Thu Sep 6 18:06:17 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Thu, 6 Sep 2018 13:06:17 -0400 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: <269a9d7aeb3f4597adb22dfb3c2d8365@jumptrading.com> References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> <3B1C05ED-3058-4644-BC54-5BD1ED583C88@vanderbilt.edu> <269a9d7aeb3f4597adb22dfb3c2d8365@jumptrading.com> Message-ID: Answers inline based on my recollection of experiences we've had here: On 9/6/18 12:19 PM, Bryan Banister wrote: > I have questions about how the GPFS metadata replication of 3 works. > > 1. Is it basically the same as replication of 2 but just have one more > copy, making recovery much more likely? 
That's my understanding. > 2. If there is nothing that is checking that the data was correctly > read off of the device (e.g. CRC checking ON READS like the DDNs do, > T10PI or Data Integrity Field) then how does GPFS handle a corrupted > read of the data? > - unlikely with SSD but head could be off on a NLSAS read, no > errors, but you get some garbage instead, plus no auto retries The inode itself is checksummed: # /usr/lpp/mmfs/bin/tsdbfs mysuperawesomespacefs Enter command or null to read next sector. Type ? for help. inode 20087366 Inode 20087366 [20087366] snap 0 (index 582 in block 9808): Inode address: 30:263275078 32:263264838 size 512 nAddrs 32 indirectionLevel=3 status=USERFILE objectVersion=49352 generation=0x2B519B3 nlink=1 owner uid=8675309 gid=999 mode=0200100600: -rw------- blocksize code=5 (32 subblocks) lastBlockSubblocks=1 checksum=0xF2EF3427 is Valid ... Disk pointers [32]: 0: 31:217629376 1: 30:217632960 2: (null) ... 31: (null) as are indirect blocks (I'm sure that's not an exhaustive list of checksummed metadata structures): ind 31:217629376 Indirect block starting in sector 31:217629376: magic=0x112DF307 generation=0x2B519B3 blockNum=0 inodeNum=20087366 indirection level=2 checksum=0x6BDAA92A CalcChecksum(0x5B6DC9FC000, 32768, 20)=0x6BDAA92A Data pointers: > 3. Does GPFS read at least two of the three replicas and compares them > to ensure the data is correct? > - expensive operation, so very unlikely I don't know, but I do know it verifies the checksum and I believe if that's wrong it will try another replica. > 4. If not reading multiple replicas for comparison, are reads round > robin across all three copies? I feel like we see pretty even distribution of reads across all replicas of our metadata LUNs, although this is looking overall at the array level so it may be a red herring. > 5. If one replica is corrupted (bad blocks) what does GPFS do to > recover this metadata copy?? Is this automatic or does this require > a manual `mmrestripefs -c` operation or something? > - If not, seems like a pretty simple idea and maybe an RFE worthy > submission My experience has been it will attempt to correct it (and maybe log an fsstruct error?). This was in the 3.5 days, though. > 6. Would the idea of an option to run ?background scrub/verifies? of > the data/metadata be worthwhile to ensure no hidden bad blocks? > - Using QoS this should be relatively painless If you don't have array-level background scrubbing, this is what I'd suggest. (e.g. mmrestripefs -c --metadata-only). > 7. With a drive failure do you have to delete the NSD from the file > system and cluster, recreate the NSD, add it back to the FS, then > again run the `mmrestripefs -c` operation to restore the replication? > - As Kevin mentions this will end up being a FULL file system scan > vs. a block-based scan and replication.? That could take a long time > depending on number of inodes and type of storage! > > Thanks for any insight, > > -Bryan > > *From:* gpfsug-discuss-bounces at spectrumscale.org > *On Behalf Of *Buterbaugh, > Kevin L > *Sent:* Thursday, September 6, 2018 9:59 AM > *To:* gpfsug main discussion list > *Subject:* Re: [gpfsug-discuss] RAID type for system pool > > /Note: External Email/ > > ------------------------------------------------------------------------ > > Hi All, > > Wow - my query got more responses than I expected and my sincere thanks > to all who took the time to respond! > > At this point in time we do have two GPFS filesystems ? one which is > basically ?/home? 
and some software installations and the other which is > ?/scratch? and ?/data? (former backed up, latter not). ?Both of them > have their metadata on SSDs set up as RAID 1 mirrors and replication set > to two. ?But at this point in time all of the SSDs are in a single > storage array (albeit with dual redundant controllers) ? so the storage > array itself is my only SPOF. > > As part of the hardware purchase we are in the process of making we will > be buying a 2nd storage array that can house 2.5? SSDs. ?Therefore, we > will be splitting our SSDs between chassis and eliminating that last > SPOF. ?Of course, this includes the new SSDs we are getting for our new > /home filesystem. > > Our plan right now is to buy 10 SSDs, which will allow us to test 3 > configurations: > > 1) two 4+1P RAID 5 LUNs split up into a total of 8 LV?s (with each of my > 8 NSD servers as primary for one of those LV?s and the other 7 as > backups) and GPFS metadata replication set to 2. > > 2) four RAID 1 mirrors (which obviously leaves 2 SSDs unused) and GPFS > metadata replication set to 2. ?This would mean that only 4 of my 8 NSD > servers would be a primary. > > 3) nine RAID 0 / bare drives with GPFS metadata replication set to 3 > (which leaves 1 SSD unused). ?All 8 NSD servers primary for one SSD and > 1 serving up two. > > The responses I received concerning RAID 5 and performance were not a > surprise to me. ?The main advantage that option gives is the most usable > storage space for the money (in fact, it gives us way more storage space > than we currently need) ? but if it tanks performance, then that?s a > deal breaker. > > Personally, I like the four RAID 1 mirrors config like we?ve been using > for years, but it has the disadvantage of giving us the least usable > storage space ? that config would give us the minimum we need for right > now, but doesn?t really allow for much future growth. > > I have no experience with metadata replication of 3 (but had actually > thought of that option, so feel good that others suggested it) so option > 3 will be a brand new experience for us. ?It is the most optimal in > terms of meeting current needs plus allowing for future growth without > giving us way more space than we are likely to need). ?I will be curious > to see how long it takes GPFS to re-replicate the data when we simulate > a drive failure as opposed to how long a RAID rebuild takes. > > I am a big believer in Murphy?s Law (Sunday I paid off a bill, Wednesday > my refrigerator died!) ? and also believe that the definition of a > pessimist is ?someone with experience? ? so we will definitely > not set GPFS metadata replication to less than two, nor will we use > non-Enterprise class SSDs for metadata ? but I do still appreciate the > suggestions. > > If there is interest, I will report back on our findings. ?If anyone has > any additional thoughts or suggestions, I?d also appreciate hearing > them. ?Again, thank you! > > Kevin > > ? > > Kevin Buterbaugh - Senior System Administrator > > Vanderbilt University - Advanced Computing Center for Research and Education > > Kevin.Buterbaugh at vanderbilt.edu > ?- (615)875-9633 > > > ------------------------------------------------------------------------ > > Note: This email is for the confidential use of the named addressee(s) > only and may contain proprietary, confidential, or privileged > information and/or personal data. 
If you are not the intended recipient, > you are hereby notified that any review, dissemination, or copying of > this email is strictly prohibited, and requested to notify the sender > immediately and destroy this email and any attachments. Email > transmission cannot be guaranteed to be secure or error-free. The > Company, therefore, does not make any guarantees as to the completeness > or accuracy of this email or any attachments. This email is for > informational purposes only and does not constitute a recommendation, > offer, request, or solicitation of any kind to buy, sell, subscribe, > redeem, or perform any type of transaction of a financial product. > Personal data, as defined by applicable data privacy laws, contained in > this email may be processed by the Company, and any of its affiliated or > related companies, for potential ongoing compliance and/or > business-related purposes. You may have rights regarding your personal > data; for information on exercising these rights or the Company?s > treatment of personal data, please email datarequests at jumptrading.com. > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From S.J.Thompson at bham.ac.uk Thu Sep 6 18:49:25 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Thu, 6 Sep 2018 17:49:25 +0000 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> <3B1C05ED-3058-4644-BC54-5BD1ED583C88@vanderbilt.edu> <269a9d7aeb3f4597adb22dfb3c2d8365@jumptrading.com>, Message-ID: I thought reads were always round robin's (in some form) unless you set readreplicapolicy. And I thought with fsstruct you had to use mmfsck offline to fix. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [aaron.s.knister at nasa.gov] Sent: 06 September 2018 18:06 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] RAID type for system pool Answers inline based on my recollection of experiences we've had here: On 9/6/18 12:19 PM, Bryan Banister wrote: > I have questions about how the GPFS metadata replication of 3 works. > > 1. Is it basically the same as replication of 2 but just have one more > copy, making recovery much more likely? That's my understanding. > 2. If there is nothing that is checking that the data was correctly > read off of the device (e.g. CRC checking ON READS like the DDNs do, > T10PI or Data Integrity Field) then how does GPFS handle a corrupted > read of the data? > - unlikely with SSD but head could be off on a NLSAS read, no > errors, but you get some garbage instead, plus no auto retries The inode itself is checksummed: # /usr/lpp/mmfs/bin/tsdbfs mysuperawesomespacefs Enter command or null to read next sector. Type ? for help. inode 20087366 Inode 20087366 [20087366] snap 0 (index 582 in block 9808): Inode address: 30:263275078 32:263264838 size 512 nAddrs 32 indirectionLevel=3 status=USERFILE objectVersion=49352 generation=0x2B519B3 nlink=1 owner uid=8675309 gid=999 mode=0200100600: -rw------- blocksize code=5 (32 subblocks) lastBlockSubblocks=1 checksum=0xF2EF3427 is Valid ... Disk pointers [32]: 0: 31:217629376 1: 30:217632960 2: (null) ... 
31: (null) as are indirect blocks (I'm sure that's not an exhaustive list of checksummed metadata structures): ind 31:217629376 Indirect block starting in sector 31:217629376: magic=0x112DF307 generation=0x2B519B3 blockNum=0 inodeNum=20087366 indirection level=2 checksum=0x6BDAA92A CalcChecksum(0x5B6DC9FC000, 32768, 20)=0x6BDAA92A Data pointers: > 3. Does GPFS read at least two of the three replicas and compares them > to ensure the data is correct? > - expensive operation, so very unlikely I don't know, but I do know it verifies the checksum and I believe if that's wrong it will try another replica. > 4. If not reading multiple replicas for comparison, are reads round > robin across all three copies? I feel like we see pretty even distribution of reads across all replicas of our metadata LUNs, although this is looking overall at the array level so it may be a red herring. > 5. If one replica is corrupted (bad blocks) what does GPFS do to > recover this metadata copy? Is this automatic or does this require > a manual `mmrestripefs -c` operation or something? > - If not, seems like a pretty simple idea and maybe an RFE worthy > submission My experience has been it will attempt to correct it (and maybe log an fsstruct error?). This was in the 3.5 days, though. > 6. Would the idea of an option to run ?background scrub/verifies? of > the data/metadata be worthwhile to ensure no hidden bad blocks? > - Using QoS this should be relatively painless If you don't have array-level background scrubbing, this is what I'd suggest. (e.g. mmrestripefs -c --metadata-only). > 7. With a drive failure do you have to delete the NSD from the file > system and cluster, recreate the NSD, add it back to the FS, then > again run the `mmrestripefs -c` operation to restore the replication? > - As Kevin mentions this will end up being a FULL file system scan > vs. a block-based scan and replication. That could take a long time > depending on number of inodes and type of storage! > > Thanks for any insight, > > -Bryan > > *From:* gpfsug-discuss-bounces at spectrumscale.org > *On Behalf Of *Buterbaugh, > Kevin L > *Sent:* Thursday, September 6, 2018 9:59 AM > *To:* gpfsug main discussion list > *Subject:* Re: [gpfsug-discuss] RAID type for system pool > > /Note: External Email/ > > ------------------------------------------------------------------------ > > Hi All, > > Wow - my query got more responses than I expected and my sincere thanks > to all who took the time to respond! > > At this point in time we do have two GPFS filesystems ? one which is > basically ?/home? and some software installations and the other which is > ?/scratch? and ?/data? (former backed up, latter not). Both of them > have their metadata on SSDs set up as RAID 1 mirrors and replication set > to two. But at this point in time all of the SSDs are in a single > storage array (albeit with dual redundant controllers) ? so the storage > array itself is my only SPOF. > > As part of the hardware purchase we are in the process of making we will > be buying a 2nd storage array that can house 2.5? SSDs. Therefore, we > will be splitting our SSDs between chassis and eliminating that last > SPOF. Of course, this includes the new SSDs we are getting for our new > /home filesystem. 
> > Our plan right now is to buy 10 SSDs, which will allow us to test 3 > configurations: > > 1) two 4+1P RAID 5 LUNs split up into a total of 8 LV?s (with each of my > 8 NSD servers as primary for one of those LV?s and the other 7 as > backups) and GPFS metadata replication set to 2. > > 2) four RAID 1 mirrors (which obviously leaves 2 SSDs unused) and GPFS > metadata replication set to 2. This would mean that only 4 of my 8 NSD > servers would be a primary. > > 3) nine RAID 0 / bare drives with GPFS metadata replication set to 3 > (which leaves 1 SSD unused). All 8 NSD servers primary for one SSD and > 1 serving up two. > > The responses I received concerning RAID 5 and performance were not a > surprise to me. The main advantage that option gives is the most usable > storage space for the money (in fact, it gives us way more storage space > than we currently need) ? but if it tanks performance, then that?s a > deal breaker. > > Personally, I like the four RAID 1 mirrors config like we?ve been using > for years, but it has the disadvantage of giving us the least usable > storage space ? that config would give us the minimum we need for right > now, but doesn?t really allow for much future growth. > > I have no experience with metadata replication of 3 (but had actually > thought of that option, so feel good that others suggested it) so option > 3 will be a brand new experience for us. It is the most optimal in > terms of meeting current needs plus allowing for future growth without > giving us way more space than we are likely to need). I will be curious > to see how long it takes GPFS to re-replicate the data when we simulate > a drive failure as opposed to how long a RAID rebuild takes. > > I am a big believer in Murphy?s Law (Sunday I paid off a bill, Wednesday > my refrigerator died!) ? and also believe that the definition of a > pessimist is ?someone with experience? ? so we will definitely > not set GPFS metadata replication to less than two, nor will we use > non-Enterprise class SSDs for metadata ? but I do still appreciate the > suggestions. > > If there is interest, I will report back on our findings. If anyone has > any additional thoughts or suggestions, I?d also appreciate hearing > them. Again, thank you! > > Kevin > > ? > > Kevin Buterbaugh - Senior System Administrator > > Vanderbilt University - Advanced Computing Center for Research and Education > > Kevin.Buterbaugh at vanderbilt.edu > ?- (615)875-9633 > > > ------------------------------------------------------------------------ > > Note: This email is for the confidential use of the named addressee(s) > only and may contain proprietary, confidential, or privileged > information and/or personal data. If you are not the intended recipient, > you are hereby notified that any review, dissemination, or copying of > this email is strictly prohibited, and requested to notify the sender > immediately and destroy this email and any attachments. Email > transmission cannot be guaranteed to be secure or error-free. The > Company, therefore, does not make any guarantees as to the completeness > or accuracy of this email or any attachments. This email is for > informational purposes only and does not constitute a recommendation, > offer, request, or solicitation of any kind to buy, sell, subscribe, > redeem, or perform any type of transaction of a financial product. 
> Personal data, as defined by applicable data privacy laws, contained in > this email may be processed by the Company, and any of its affiliated or > related companies, for potential ongoing compliance and/or > business-related purposes. You may have rights regarding your personal > data; for information on exercising these rights or the Company?s > treatment of personal data, please email datarequests at jumptrading.com. > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From chair at spectrumscale.org Fri Sep 7 11:00:08 2018 From: chair at spectrumscale.org (Simon Thompson (Spectrum Scale User Group Chair)) Date: Fri, 07 Sep 2018 11:00:08 +0100 Subject: [gpfsug-discuss] Request for Enhancements (RFE) Forum - Submission Deadline October 1 In-Reply-To: <263e53c18647421f8b3cd936da0075df@jumptrading.com> References: <52220937-CE0A-4949-89A0-6EA41D5ECF93@lbl.gov> <263e53c18647421f8b3cd936da0075df@jumptrading.com> Message-ID: <0341213A-6CB7-434F-A575-9099C2D0D703@spectrumscale.org> GPFS/Spectrum Scale Users, Here?s a long-ish note about our plans to try and improve the RFE process. We?ve tried to include a tl;dr version if you just read the headers. You?ll find the details underneath ;-) and reading to the end is ideal. IMPROVING THE RFE PROCESS As you?ve heard on the list, and at some of the in-person User Group events, we?ve been talking about ways we can improve the RFE process. We?d like to begin having an RFE forum, and have it be de-coupled from the in-person events because we know not everyone can travel. LIGHTNING PRESENTATIONS ON-LINE In general terms, we?d have regular on-line events, where RFEs could be very briefly (5 minutes, lightning talk) presented by the requester. There would then be time for brief follow-on discussion and questions. The session would be recorded to deal with large time zone differences. The live meeting is planned for October 10th 2018, at 4PM BST (that should be 11am EST if we worked is out right!) FOLLOW UP POLL A poll, independent of current RFE voting, would be conducted a couple days after the recording was available to gather votes and feedback on the RFEs submitted ?we may collect site name, to see how many votes are coming from a certain site. MAY NOT GET IT RIGHT THE FIRST TIME We view this supplemental RFE process as organic, that is, we?ll learn as we go and make modifications. The overall goal here is to highlight the RFEs that matter the most to the largest number of UG members by providing a venue for people to speak about their RFEs and collect feedback from fellow community members. RFE PRESENTERS WANTED, SUBMISSION DEADLINE OCTOBER 1ST We?d like to guide a small handful of RFE submitters through this process the first time around, so if you?re interested in being a presenter, let us know now. We?re planning on doing the online meeting and poll for the first time in mid-October, so the submission deadline for your RFE is October 1st. If it?s useful, when you?re drafting your RFE feel free to use the list as a sounding board for feedback. 
Often sites have similar needs and you may find someone to collaborate with on your RFE to make it useful to more sites, and thereby get more votes. Some guidelines are here: https://drive.google.com/file/d/1o8nN39DTU32qj_EFia5wRhnvfvNfr3cI/view?usp=sharing You can submit you RFE by email to: rfe at spectrumscaleug.org PARTICIPANTS (AKA YOU!!), VIEW AND VOTE We are seeking very good participation in the RFE on-line events needed to make this an effective method of Spectrum Scale Community and IBM Developer collaboration. It is to your benefit to participate and help set priorities on Spectrum Scale enhancements!! We want to make this process light lifting for you as a participant. We will limit the duration of the meeting to 1 hour to minimize the use of your valuable time. Please register for the online meeting via Eventbrite (https://www.eventbrite.com/e/spectrum-scale-request-for-enhancements-voting-tickets-49979954389) ? we?ll send details of how to join the online meeting nearer the time. Thanks! Simon, Kristy, Bob, Bryan and Carl! -------------- next part -------------- An HTML attachment was scrubbed... URL: From Matthias.Knigge at rohde-schwarz.com Fri Sep 7 12:51:15 2018 From: Matthias.Knigge at rohde-schwarz.com (Matthias Knigge) Date: Fri, 7 Sep 2018 11:51:15 +0000 Subject: [gpfsug-discuss] Problem with mmlscluster and callback scripts Message-ID: Hello together, I am using the version 5.0.2.0 of GPFS and have problems with the command mmlscluster and callback-scripts. It is a small cluster of two nodes only. If I shutdown one of the nodes sometimes mmlscluster reports the following output: [root at gpfs-tier1 gpfs5.2]# mmgetstate Node number Node name GPFS state ------------------------------------------- 1 gpfs-tier1 arbitrating [root at gpfs-tier1 gpfs5.2]# mmlscluster ssh: connect to host gpfs-tier2 port 22: No route to host mmlscluster: Unable to retrieve GPFS cluster files from node gpfs-tier2 mmlscluster: Command failed. Examine previous error messages to determine cause. Normally the output is like this: [root at gpfs-tier1 gpfs5.2]# mmlscluster GPFS cluster information ======================== GPFS cluster name: TIERCLUSTER.gpfs-tier1 GPFS cluster id: 12458173498278694815 GPFS UID domain: TIERCLUSTER.gpfs-tier1 Remote shell command: /usr/bin/ssh Remote file copy command: /usr/bin/scp Repository type: server-based GPFS cluster configuration servers: ----------------------------------- Primary server: gpfs-tier2 Secondary server: gpfs-tier1 Node Daemon node name IP address Admin node name Designation ---------------------------------------------------------------------- 1 gpfs-tier1 192.168.178.10 gpfs-tier1 quorum-manager 2 gpfs-tier2 192.168.178.11 gpfs-tier2 quorum-manager [root at gpfs-tier1 gpfs5.2]# mmlscallback NodeDownCallback command = /var/mmfs/rs/nodedown.ksh priority = 1 event = quorumNodeLeave parms = %eventNode %quorumNodes NodeUpCallback command = /var/mmfs/rs/nodeup.ksh priority = 1 event = quorumNodeJoin parms = %eventNode %quorumNodes If I shutdown the filesystem via mmshutdown the callback script works but if I shutdown the whole node the scripts does not run. The latest log-entry in mmfs.log.latest shows only this information: 2018-09-07_13:12:36.724+0200: [I] Cluster Manager connection broke. Probing cluster TIERCLUSTER.gpfs-tier1 2018-09-07_13:12:37.226+0200: [E] Unable to contact enough other quorum nodes during cluster probe. 2018-09-07_13:12:37.226+0200: [E] Lost membership in cluster TIERCLUSTER.gpfs-tier1. Unmounting file systems. 
2018-09-07_13:12:38.448+0200: [N] Connecting to 192.168.178.11 gpfs-tier2 Could anybody help me in this case? I want to try to start a script if one node goes down or up to change the roles for starting the filesystem. The callback event NodeLeave and NodeJoin do not run too. Any more information required? If yes, please let me know! Many thanks in advance and a nice weekend! Matthias Best Regards Matthias Knigge R&D File Based Media Solutions Rohde & Schwarz GmbH & Co. KG Hanomaghof 1 30449 Hannover Telefon +49 511 67 80 7 213 Fax +49 511 37 19 74 Internet: Matthias.Knigge at rohde-schwarz.com ------------------------------------------------------------ Gesch?ftsf?hrung / Executive Board: Christian Leicher (Vorsitzender / Chairman), Peter Riedel, Sitz der Gesellschaft / Company's Place of Business: M?nchen, Registereintrag / Commercial Register No.: HRA 16 270, Pers?nlich haftender Gesellschafter / Personally Liable Partner: RUSEG Verwaltungs-GmbH, Sitz der Gesellschaft / Company's Place of Business: M?nchen, Registereintrag / Commercial Register No.: HRB 7 534, Umsatzsteuer-Identifikationsnummer (USt-IdNr.) / VAT Identification No.: DE 130 256 683, Elektro-Altger?te Register (EAR) / WEEE Register No.: DE 240 437 86 -------------- next part -------------- An HTML attachment was scrubbed... URL: From stockf at us.ibm.com Fri Sep 7 14:19:51 2018 From: stockf at us.ibm.com (Frederick Stock) Date: Fri, 7 Sep 2018 09:19:51 -0400 Subject: [gpfsug-discuss] Problem with mmlscluster and callback scripts In-Reply-To: References: Message-ID: Are you really running version 5.0.2? If so then I presume you have a beta version since it has not yet been released. For beta problems there is a specific feedback mechanism that should be used to report problems. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: Matthias Knigge To: "gpfsug-discuss at spectrumscale.org" Date: 09/07/2018 08:08 AM Subject: [gpfsug-discuss] Problem with mmlscluster and callback scripts Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello together, I am using the version 5.0.2.0 of GPFS and have problems with the command mmlscluster and callback-scripts. It is a small cluster of two nodes only. If I shutdown one of the nodes sometimes mmlscluster reports the following output: [root at gpfs-tier1 gpfs5.2]# mmgetstate Node number Node name GPFS state ------------------------------------------- 1 gpfs-tier1 arbitrating [root at gpfs-tier1 gpfs5.2]# mmlscluster ssh: connect to host gpfs-tier2 port 22: No route to host mmlscluster: Unable to retrieve GPFS cluster files from node gpfs-tier2 mmlscluster: Command failed. Examine previous error messages to determine cause. 
Normally the output is like this: [root at gpfs-tier1 gpfs5.2]# mmlscluster GPFS cluster information ======================== GPFS cluster name: TIERCLUSTER.gpfs-tier1 GPFS cluster id: 12458173498278694815 GPFS UID domain: TIERCLUSTER.gpfs-tier1 Remote shell command: /usr/bin/ssh Remote file copy command: /usr/bin/scp Repository type: server-based GPFS cluster configuration servers: ----------------------------------- Primary server: gpfs-tier2 Secondary server: gpfs-tier1 Node Daemon node name IP address Admin node name Designation ---------------------------------------------------------------------- 1 gpfs-tier1 192.168.178.10 gpfs-tier1 quorum-manager 2 gpfs-tier2 192.168.178.11 gpfs-tier2 quorum-manager [root at gpfs-tier1 gpfs5.2]# mmlscallback NodeDownCallback command = /var/mmfs/rs/nodedown.ksh priority = 1 event = quorumNodeLeave parms = %eventNode %quorumNodes NodeUpCallback command = /var/mmfs/rs/nodeup.ksh priority = 1 event = quorumNodeJoin parms = %eventNode %quorumNodes If I shutdown the filesystem via mmshutdown the callback script works but if I shutdown the whole node the scripts does not run. The latest log-entry in mmfs.log.latest shows only this information: 2018-09-07_13:12:36.724+0200: [I] Cluster Manager connection broke. Probing cluster TIERCLUSTER.gpfs-tier1 2018-09-07_13:12:37.226+0200: [E] Unable to contact enough other quorum nodes during cluster probe. 2018-09-07_13:12:37.226+0200: [E] Lost membership in cluster TIERCLUSTER.gpfs-tier1. Unmounting file systems. 2018-09-07_13:12:38.448+0200: [N] Connecting to 192.168.178.11 gpfs-tier2 Could anybody help me in this case? I want to try to start a script if one node goes down or up to change the roles for starting the filesystem. The callback event NodeLeave and NodeJoin do not run too. Any more information required? If yes, please let me know! Many thanks in advance and a nice weekend! Matthias Best Regards Matthias Knigge R&D File Based Media Solutions Rohde & Schwarz GmbH & Co. KG Hanomaghof 1 30449 Hannover Telefon +49 511 67 80 7 213 Fax +49 511 37 19 74 Internet: Matthias.Knigge at rohde-schwarz.com ------------------------------------------------------------ Gesch?ftsf?hrung / Executive Board: Christian Leicher (Vorsitzender / Chairman), Peter Riedel, Sitz der Gesellschaft / Company's Place of Business: M?nchen, Registereintrag / Commercial Register No.: HRA 16 270, Pers?nlich haftender Gesellschafter / Personally Liable Partner: RUSEG Verwaltungs-GmbH, Sitz der Gesellschaft / Company's Place of Business: M?nchen, Registereintrag / Commercial Register No.: HRB 7 534, Umsatzsteuer-Identifikationsnummer (USt-IdNr.) / VAT Identification No.: DE 130 256 683, Elektro-Altger?te Register (EAR) / WEEE Register No.: DE 240 437 86 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Fri Sep 7 14:35:24 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 7 Sep 2018 09:35:24 -0400 Subject: [gpfsug-discuss] Problem with mmlscluster and callback scripts In-Reply-To: References: Message-ID: <44c29793-ef1a-8b58-2ad0-75c8328d9364@nasa.gov> Hi Matthias, Looks like you lost quorum in the cluster (you've got to have (n/2+1) quorum nodes up if you're using node-based quorum). Do you have a tiebreaker disk defined? (i.e. mmlsconfig tiebreakerdisk). 
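If not, something along these lines is the usual approach for a two-node cluster (the NSD name below is invented, and depending on the code level you may need GPFS down on the quorum nodes while you change it - check the mmchconfig documentation for your release first):

  # see whether tiebreaker disks are already defined
  mmlsconfig tiebreakerDisks

  # nominate one (or three) NSDs that both nodes can see as tiebreakers,
  # so one surviving quorum node plus the disk(s) can keep quorum
  mmchconfig tiebreakerDisks="tier_nsd_01"

  # to go back to plain node-based quorum later
  mmchconfig tiebreakerDisks=no

With only two quorum nodes and no tiebreaker, losing either node means losing quorum, which would explain why the surviving node sits in "arbitrating" and might also explain why your quorumNodeLeave callback never seems to get a chance to run.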
-Aaron On 9/7/18 7:51 AM, Matthias Knigge wrote: > Hello together, > > I am using the version 5.0.2.0 of GPFS and have problems with the > command mmlscluster and callback-scripts. It is a small cluster of two > nodes only. If I shutdown one of the nodes sometimes mmlscluster reports > the following output: > > [root at gpfs-tier1 gpfs5.2]# mmgetstate > > Node number? Node name??????? GPFS state > > ------------------------------------------- > > ?????? 1????? gpfs-tier1?????? arbitrating > > [root at gpfs-tier1 gpfs5.2]# mmlscluster > > ssh: connect to host gpfs-tier2 port 22: No route to host > > mmlscluster: Unable to retrieve GPFS cluster files from node gpfs-tier2 > > mmlscluster: Command failed. Examine previous error messages to > determine cause. > > Normally the output is like this: > > [root at gpfs-tier1 gpfs5.2]# mmlscluster > > GPFS cluster information > > ======================== > > ? GPFS cluster name:???????? TIERCLUSTER.gpfs-tier1 > > ? GPFS cluster id:?????????? 12458173498278694815 > > ? GPFS UID domain:?????????? TIERCLUSTER.gpfs-tier1 > > ? Remote shell command:????? /usr/bin/ssh > > ? Remote file copy command:? /usr/bin/scp > > ? Repository type:?????????? server-based > > GPFS cluster configuration servers: > > ----------------------------------- > > ? Primary server:??? gpfs-tier2 > > ? Secondary server:? gpfs-tier1 > > Node? Daemon node name? IP address????? Admin node name? Designation > > ---------------------------------------------------------------------- > > ?? 1?? gpfs-tier1??????? 192.168.178.10? gpfs-tier1?????? quorum-manager > > ?? 2?? gpfs-tier2??????? 192.168.178.11? gpfs-tier2?????? quorum-manager > > [root at gpfs-tier1 gpfs5.2]# mmlscallback > > NodeDownCallback > > ??????? command?????? = /var/mmfs/rs/nodedown.ksh > > ??????? priority????? = 1 > > ??????? event???????? = quorumNodeLeave > > ??????? parms???????? = %eventNode %quorumNodes > > NodeUpCallback > > ??????? command?????? = /var/mmfs/rs/nodeup.ksh > > ??????? priority????? = 1 > > ??????? event???????? = quorumNodeJoin > > ??????? parms???????? = %eventNode %quorumNodes > > If I shutdown the filesystem via mmshutdown the callback script works > but if I shutdown the whole node the scripts does not run. > > The latest log-entry in mmfs.log.latest shows only this information: > > 2018-09-07_13:12:36.724+0200: [I] Cluster Manager connection broke. > Probing cluster TIERCLUSTER.gpfs-tier1 > > 2018-09-07_13:12:37.226+0200: [E] Unable to contact enough other quorum > nodes during cluster probe. > > 2018-09-07_13:12:37.226+0200: [E] Lost membership in cluster > TIERCLUSTER.gpfs-tier1. Unmounting file systems. > > 2018-09-07_13:12:38.448+0200: [N] Connecting to 192.168.178.11 > gpfs-tier2 > > Could anybody help me in this case? I want to try to start a script if > one node goes down or up to change the roles for starting the > filesystem. The callback event NodeLeave and NodeJoin do not run too. > > Any more information required? If yes, please let me know! > > Many thanks in advance and a nice weekend! > > Matthias > > Best Regards > > Matthias Knigge > R&D File Based Media Solutions > > Rohde & Schwarz > GmbH & Co. 
KG > Hanomaghof 1 > 30449 Hannover > Telefon +49 511 67 80 7 213 > Fax +49 511 37 19 74 > Internet: Matthias.Knigge at rohde-schwarz.com > ------------------------------------------------------------ > Gesch?ftsf?hrung / Executive Board: Christian Leicher (Vorsitzender / > Chairman), Peter Riedel, Sitz der Gesellschaft / Company's Place of > Business: M?nchen, Registereintrag / Commercial Register No.: HRA 16 > 270, Pers?nlich haftender Gesellschafter / Personally Liable Partner: > RUSEG Verwaltungs-GmbH, Sitz der Gesellschaft / Company's Place of > Business: M?nchen, Registereintrag / Commercial Register No.: HRB 7 534, > Umsatzsteuer-Identifikationsnummer (USt-IdNr.) / VAT Identification No.: > DE 130 256 683, Elektro-Altger?te Register (EAR) / WEEE Register No.: DE > 240 437 86 > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.knister at gmail.com Fri Sep 7 18:27:04 2018 From: aaron.knister at gmail.com (Aaron Knister) Date: Fri, 7 Sep 2018 13:27:04 -0400 Subject: [gpfsug-discuss] mmfsadm dump condvar event blocks Message-ID: Looking at the output of mmfsadm dump condvar I see that the various condvar entries are grouped into event blocks. I?m curious of the significance of that. If you?ve got say two sets of condvars in the same event block what does that mean? Is there necessarily any relation between them? -Aaron From ty.tran at applieddatasystems.com Fri Sep 7 18:34:06 2018 From: ty.tran at applieddatasystems.com (Ty Tran) Date: Fri, 7 Sep 2018 17:34:06 +0000 Subject: [gpfsug-discuss] Spectrum V5.0.0 and CentOS 7.5 Message-ID: Good Morning ? We have been trying to install V5.0.0 and CentOS 7.5 but it doesn?t seem to like the new Kernel. Does anyone have this running and do we need to do anything special? Or must we go to V5.0.1? TQT Ty Q. Tran Managing Partner Applied Data Systems 12180 Dearborn Place Poway, CA 92064 (714) 392- 6690 (Cell) (844) 371- 4949 x100 (Work) (858) 842- 4678 (Fax) ty.tran at applieddatasystems.com www.applieddatasystems.com [cid:image001.png at 01D44696.4DB62B50] -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 18825 bytes Desc: image001.png URL: From knop at us.ibm.com Fri Sep 7 23:08:02 2018 From: knop at us.ibm.com (Felipe Knop) Date: Fri, 7 Sep 2018 18:08:02 -0400 Subject: [gpfsug-discuss] Spectrum V5.0.0 and CentOS 7.5 In-Reply-To: References: Message-ID: Ty, For queries on Scale versions and specific distros, please refer to the FAQ: https://www.ibm.com/support/knowledgecenter/en/STXKQY/gpfsclustersfaq.html Table 33. 
IBM Spectrum Scale for Linux RedHat kernel support - the RHEL 7.5 row (kernel 3.10.0-862.el7) lists support from V4.1.1.20 in the 4.1.1 release, from V4.2.3.9 in the 4.2 release, and from V5.0.1.1 in the 5.0 release.

Assuming the levels of CentOS and RHEL are the same (they are supposed to be?), then CentOS 7.5 should work with Scale V5.0.1.1 or later.

Felipe

----
Felipe Knop knop at us.ibm.com
GPFS Development and Security
IBM Systems
IBM Building 008
2455 South Rd, Poughkeepsie, NY 12601
(845) 433-9314 T/L 293-9314

From: Ty Tran
To: "gpfsug-discuss at spectrumscale.org"
Date: 09/07/2018 05:05 PM
Subject: [gpfsug-discuss] Spectrum V5.0.0 and CentOS 7.5
Sent by: gpfsug-discuss-bounces at spectrumscale.org

Good Morning -

We have been trying to install V5.0.0 and CentOS 7.5 but it doesn't seem to like the new Kernel. Does anyone have this running and do we need to do anything special? Or must we go to V5.0.1?

TQT

Ty Q. Tran
Managing Partner
Applied Data Systems
12180 Dearborn Place
Poway, CA 92064
(714) 392-6690 (Cell)
(844) 371-4949 x100 (Work)
(858) 842-4678 (Fax)
ty.tran at applieddatasystems.com
www.applieddatasystems.com

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: graycol.gif
Type: image/gif
Size: 105 bytes
Desc: not available
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 1B307238.gif
Type: image/gif
Size: 18825 bytes
Desc: not available
URL: 

From kkr at lbl.gov Fri Sep 7 23:13:48 2018
From: kkr at lbl.gov (Kristy Kallback-Rose)
Date: Fri, 7 Sep 2018 15:13:48 -0700
Subject: [gpfsug-discuss] SC18 Planning
Message-ID: <26F104F4-C367-4F77-938D-BFB2937FBB2D@lbl.gov>

Hi all,

If you're planning on going to SC18 in November, we'd love to hear how you're using Scale (GPFS) at your site. If you'd be willing to give a 20-30 minute user talk about something you're doing at your site, please let us know. We'll be working to fill up the agenda soon.

Thanks,
Kristy

From cblack at nygenome.org Sat Sep 8 01:09:00 2018
From: cblack at nygenome.org (Christopher Black)
Date: Sat, 8 Sep 2018 00:09:00 +0000
Subject: [gpfsug-discuss] Spectrum V5.0.0 and CentOS 7.5
In-Reply-To: 
References: 
Message-ID: 

I can confirm gpfs 5.0.1.1 works with CentOS 7.5 for us (kernel package version 3.10.0-862.el7.x86_64).

Best,
Chris

From: on behalf of Felipe Knop
Reply-To: gpfsug main discussion list
Date: Friday, September 7, 2018 at 6:08 PM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] Spectrum V5.0.0 and CentOS 7.5

Ty,

For queries on Scale versions and specific distros, please refer to the FAQ: https://www.ibm.com/support/knowledgecenter/en/STXKQY/gpfsclustersfaq.html

Table 33.
IBM Spectrum Scale for Linux RedHat kernel support 7.5 3.10.0-862.el7 3.10.0-862.el7 From V4.1.1.20 in the 4.1.1 release From V4.2.3.9 in the 4.2 release From V5.0.1.1 in the 5.0 release From V4.1.1.20 in the 4.1.1 release From V4.2.3.9 in the 4.2 release From V5.0.1.1 in the 5.0 release Assuming the levels of CentOS and RHEL are the same (they are supposed to be?), then CentOS 7.5 should work with Scale V5.0.1.1 or later. Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 [Inactive hide details for Ty Tran ---09/07/2018 05:05:52 PM---Good Morning ? We have been trying to install V5.0.0 and CentOS]Ty Tran ---09/07/2018 05:05:52 PM---Good Morning ? We have been trying to install V5.0.0 and CentOS 7.5 but it doesn?t seem to like the From: Ty Tran To: "gpfsug-discuss at spectrumscale.org" Date: 09/07/2018 05:05 PM Subject: [gpfsug-discuss] Spectrum V5.0.0 and CentOS 7.5 Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Good Morning ? We have been trying to install V5.0.0 and CentOS 7.5 but it doesn?t seem to like the new Kernel. Does anyone have this running and do we need to do anything special? Or must we go to V5.0.1? TQT Ty Q. Tran Managing Partner Applied Data Systems 12180 Dearborn Place Poway, CA 92064 (714) 392- 6690 (Cell) (844) 371- 4949 x100 (Work) (858) 842- 4678 (Fax) ty.tran at applieddatasystems.com www.applieddatasystems.com [cid:2__=8FBB0992DFEAA5C28f9e8a93df938690918c8FB@] _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ This message is for the recipient?s use only, and may contain confidential, privileged or protected information. Any unauthorized use or dissemination of this communication is prohibited. If you received this message in error, please immediately notify the sender and destroy all copies of this message. The recipient should check this email and any attachments for the presence of viruses, as we accept no liability for any damage caused by any virus transmitted by this email. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 106 bytes Desc: image001.gif URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.png Type: image/png Size: 18826 bytes Desc: image002.png URL: From novosirj at rutgers.edu Sat Sep 8 03:13:18 2018 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Sat, 8 Sep 2018 02:13:18 +0000 Subject: [gpfsug-discuss] Spectrum V5.0.0 and CentOS 7.5 In-Reply-To: References: Message-ID: Someone asked me this the other day and I wasn?t quite sure of the answer: how likely is it that we will ever see/have we ever seen a kernel update (eg. 862.9.1 to 862.11.6) that breaks GPFS compatibility, or can one generally expect it will continue to work for 862*? > On Sep 7, 2018, at 6:08 PM, Felipe Knop wrote: > > Ty, > > For queries on Scale versions and specific distros, please refer to the FAQ: > > https://www.ibm.com/support/knowledgecenter/en/STXKQY/gpfsclustersfaq.html > > Table 33. 
IBM Spectrum Scale for Linux RedHat kernel support > > > 7.5 3.10.0-862.el7 3.10.0-862.el7 From V4.1.1.20 in the 4.1.1 release > From V4.2.3.9 in the 4.2 release > > From V5.0.1.1 in the 5.0 release > > From V4.1.1.20 in the 4.1.1 release > From V4.2.3.9 in the 4.2 release > > From V5.0.1.1 in the 5.0 release > > > Assuming the levels of CentOS and RHEL are the same (they are supposed to be?), then CentOS 7.5 should work with Scale V5.0.1.1 or later. > > Felipe > > ---- > Felipe Knop knop at us.ibm.com > GPFS Development and Security > IBM Systems > IBM Building 008 > 2455 South Rd, Poughkeepsie, NY 12601 > (845) 433-9314 T/L 293-9314 > > > > Ty Tran ---09/07/2018 05:05:52 PM---Good Morning ? We have been trying to install V5.0.0 and CentOS 7.5 but it doesn?t seem to like the > > From: Ty Tran > To: "gpfsug-discuss at spectrumscale.org" > Date: 09/07/2018 05:05 PM > Subject: [gpfsug-discuss] Spectrum V5.0.0 and CentOS 7.5 > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Good Morning ? > > We have been trying to install V5.0.0 and CentOS 7.5 but it doesn?t seem to like the new Kernel. Does anyone have this running and do we need to do anything special? Or must we go to V5.0.1? > > TQT > > Ty Q. Tran > Managing Partner > Applied Data Systems > 12180 Dearborn Place > Poway, CA 92064 > (714) 392- 6690 (Cell) > (844) 371- 4949 x100 (Work) > (858) 842- 4678 (Fax) > ty.tran at applieddatasystems.com > www.applieddatasystems.com > > <1B307238.gif> > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From bachmann.f at gmail.com Sat Sep 8 15:54:07 2018 From: bachmann.f at gmail.com (Florian Bachmann) Date: Sat, 8 Sep 2018 16:54:07 +0200 Subject: [gpfsug-discuss] Spectrum V5.0.0 and CentOS 7.5 In-Reply-To: References: Message-ID: <73cf0385-a410-bba6-5338-71cab7ffe34f@gmail.com> From my experience you are better off with locking kernel packages at a known-to-work version in production (e.g. install yum-plugin-versionlock and do a yum versionlock "kernel*") and test new kernel versions in a test environment. You cannot rely on made up rules like "minor version updates will never break GPFS" or similiar; Linux kernel developers do not care if GPFS works or not. Kind Regards Florian On 08.09.2018 04:13, Ryan Novosielski wrote: > Someone asked me this the other day and I wasn?t quite sure of the answer: how likely is it that we will ever see/have we ever seen a kernel update (eg. 862.9.1 to 862.11.6) that breaks GPFS compatibility, or can one generally expect it will continue to work for 862*? From UWEFALKE at de.ibm.com Mon Sep 10 00:04:12 2018 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Mon, 10 Sep 2018 01:04:12 +0200 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> Message-ID: Hi, Marc, I was clearly unaware of that function. If my understanding of parity-based redundancy is about correct, then that method would only work with RAID 5, because that is a simple XOR-based hash, but RAID 6, if used, would not allow that stripped-down RMW. Is that correct? Mit freundlichen Gr??en / Kind regards Dr. 
Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Marc A Kaplan" To: gpfsug main discussion list Date: 06/09/2018 18:09 Subject: Re: [gpfsug-discuss] RAID type for system pool Sent by: gpfsug-discuss-bounces at spectrumscale.org A somewhat smarter RAID controller will "only" need to read the old values of the single changed segment of data and the corresponding parity segment, and know the new value of the data block. Then it can compute the new parity segment value. Not necessarily the entire stripe. Still 2 reads and 2 writes + access delay times ( guaranteed more than one full rotation time when on spinning disks, average something like 1.7x rotation time ). From: "Uwe Falke" To: gpfsug main discussion list Date: 09/05/2018 04:07 PM Subject: Re: [gpfsug-discuss] RAID type for system pool Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, just think that your RAID controller on parity-backed redundancy needs to read the full stripe, modify it, and write it back (including parity) - the infamous Read-Modify-Write penalty. As long as your users don't bulk-create inodes and doo amend some metadata, (create a file sometimes, e.g.) The writing of a 4k inode, or the update of a 32k dir block causes your controller to read a full block (let's say you use 1MiB on MD) and write back the full block plus parity (on 4+1p RAID 5 at 1MiB that'll be 1.25MiB. Overhead two orders of magnitude above the payload. SSDs have become better now and expensive enterprise SSDs will endure quite a lot of full rewrites, but you need to estimate the MD change rate, apply the RMW overhead and see where you end WRT lifetime (and performance). Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 05/09/2018 17:35 Subject: [gpfsug-discuss] RAID type for system pool Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, We are in the process of finalizing the purchase of some new storage arrays (so no sales people who might be monitoring this list need contact me) to life-cycle some older hardware. One of the things we are considering is the purchase of some new SSD?s for our ?/home? 
filesystem and I have a question or two related to that.

Currently, the existing home filesystem has its metadata on SSDs - two RAID 1 mirrors and metadata replication set to two. However, the filesystem itself is old enough that it uses 512 byte inodes. We have analyzed our users' files and know that if we create a new filesystem with 4K inodes, a very significant portion of the files would then have their _data_ stored in the inode as well, due to the files being 3.5K or smaller (currently all data is on spinning HD RAID 1 mirrors). Of course, if we increase the size of the inodes by a factor of 8 then we also need 8 times as much space to store those inodes. Given that Enterprise class SSDs are still very expensive and our budget is not unlimited, we're trying to get the best bang for the buck.

We have always - even back in the day when our metadata was on spinning disk and not SSD - used RAID 1 mirrors and metadata replication of two. However, we are wondering if it might be possible to switch to RAID 5? Specifically, what we are considering doing is buying 8 new SSDs and creating two 3+1P RAID 5 LUNs (metadata replication would stay at two). That would give us 50% more usable space than if we configured those same 8 drives as four RAID 1 mirrors. Unfortunately, unless I'm misunderstanding something, that would mean that the RAID stripe size and the GPFS block size could not match. Therefore, even though we don't need the space, would we be much better off to buy 10 SSDs and create two 4+1P RAID 5 LUNs? I've searched the mailing list archives, scanned the DeveloperWorks wiki, and even glanced at the GPFS documentation, and haven't found anything that says "bad idea, Kevin!" ;-)

Expanding on this further - if we just present those two RAID 5 LUNs to GPFS as NSDs then we can only have two NSD servers as primary for them. So another thing we're considering is to take those RAID 5 LUNs and further sub-divide them into a total of 8 logical volumes, each of which could be a GPFS NSD, which would allow each of our 8 NSD servers to be primary for one of them. Even worse idea?!? Good idea? Anybody have any better ideas??? ;-)

Oh, and currently we're on GPFS 4.2.3-10, but are also planning on moving to GPFS 5.0.1-x before creating the new filesystem.

Thanks much...

Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

From spectrumscale at kiranghag.com Mon Sep 10 09:21:48 2018
From: spectrumscale at kiranghag.com (KG)
Date: Mon, 10 Sep 2018 13:51:48 +0530
Subject: [gpfsug-discuss] Spectrum V5.0.0 and CentOS 7.5
In-Reply-To:
References:
Message-ID:

If the release level is supported then all patch levels should work; you just need to run mmbuildgpl to re-compile the portability layer for the new kernel revision.
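As a minimal sketch (paths assume the default /usr/lpp/mmfs install location), the per-node steps after booting into the updated kernel are roughly:

# rebuild the GPL portability layer against the now-running kernel
# (needs the matching kernel-devel headers and a compiler on the node)
/usr/lpp/mmfs/bin/mmbuildgpl
# then bring GPFS back up on that node
/usr/lpp/mmfs/bin/mmstartup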
KG On Sat, Sep 8, 2018 at 7:43 AM, Ryan Novosielski wrote: > Someone asked me this the other day and I wasn?t quite sure of the answer: > how likely is it that we will ever see/have we ever seen a kernel update > (eg. 862.9.1 to 862.11.6) that breaks GPFS compatibility, or can one > generally expect it will continue to work for 862*? > > > On Sep 7, 2018, at 6:08 PM, Felipe Knop wrote: > > > > Ty, > > > > For queries on Scale versions and specific distros, please refer to the > FAQ: > > > > https://www.ibm.com/support/knowledgecenter/en/STXKQY/ > gpfsclustersfaq.html > > > > Table 33. IBM Spectrum Scale for Linux RedHat kernel support > > > > > > 7.5 3.10.0-862.el7 3.10.0-862.el7 From V4.1.1.20 in the 4.1.1 release > > From V4.2.3.9 in the 4.2 release > > > > From V5.0.1.1 in the 5.0 release > > > > From V4.1.1.20 in the 4.1.1 release > > From V4.2.3.9 in the 4.2 release > > > > From V5.0.1.1 in the 5.0 release > > > > > > Assuming the levels of CentOS and RHEL are the same (they are supposed > to be?), then CentOS 7.5 should work with Scale V5.0.1.1 or later. > > > > Felipe > > > > ---- > > Felipe Knop knop at us.ibm.com > > GPFS Development and Security > > IBM Systems > > IBM Building 008 > > 2455 South Rd, Poughkeepsie, NY 12601 > > (845) 433-9314 T/L 293-9314 > > > > > > > > Ty Tran ---09/07/2018 05:05:52 PM---Good Morning ? We have > been trying to install V5.0.0 and CentOS 7.5 but it doesn?t seem to like the > > > > From: Ty Tran > > To: "gpfsug-discuss at spectrumscale.org" org> > > Date: 09/07/2018 05:05 PM > > Subject: [gpfsug-discuss] Spectrum V5.0.0 and CentOS 7.5 > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > > > > Good Morning ? > > > > We have been trying to install V5.0.0 and CentOS 7.5 but it doesn?t seem > to like the new Kernel. Does anyone have this running and do we need to do > anything special? Or must we go to V5.0.1? > > > > TQT > > > > Ty Q. Tran > > Managing Partner > > Applied Data Systems > > 12180 Dearborn Place > > Poway, CA 92064 > > (714) 392- 6690 (Cell) > > (844) 371- 4949 x100 (Work) > > (858) 842- 4678 (Fax) > > ty.tran at applieddatasystems.com > > www.applieddatasystems.com > > > > <1B307238.gif> > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Matthias.Knigge at rohde-schwarz.com Mon Sep 10 12:19:14 2018 From: Matthias.Knigge at rohde-schwarz.com (Matthias Knigge) Date: Mon, 10 Sep 2018 11:19:14 +0000 Subject: [gpfsug-discuss] [Newsletter] Re: Problem with mmlscluster and callback scripts Message-ID: <685e1f5c26c548ec85046e761563f583@rohde-schwarz.com> Hi Araon, in my setup I have no chance to define a tiebreaker disk. So if one node goes down I would change the role if this node. mmchnode --nonquorum -N nodename --force After that I can start the filesystem and mount it. Thanks, Matthias Best Regards Matthias Knigge R&D File Based Media Solutions Rohde & Schwarz GmbH & Co. 
KG Hanomaghof 1 30449 Hannover Telefon +49 511 67 80 7 213 Fax +49 511 37 19 74 Internet: Matthias.Knigge at rohde-schwarz.com ------------------------------------------------------------ Gesch?ftsf?hrung / Executive Board: Christian Leicher (Vorsitzender / Chairman), Peter Riedel, Sitz der Gesellschaft / Company's Place of Business: M?nchen, Registereintrag / Commercial Register No.: HRA 16 270, Pers?nlich haftender Gesellschafter / Personally Liable Partner: RUSEG Verwaltungs-GmbH, Sitz der Gesellschaft / Company's Place of Business: M?nchen, Registereintrag / Commercial Register No.: HRB 7 534, Umsatzsteuer-Identifikationsnummer (USt-IdNr.) / VAT Identification No.: DE 130 256 683, Elektro-Altger?te Register (EAR) / WEEE Register No.: DE 240 437 86 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Aaron Knister Sent: Friday, September 07, 2018 3:35 PM To: gpfsug-discuss at spectrumscale.org Subject: *EXT* [Newsletter] Re: [gpfsug-discuss] Problem with mmlscluster and callback scripts Hi Matthias, Looks like you lost quorum in the cluster (you've got to have (n/2+1) quorum nodes up if you're using node-based quorum). Do you have a tiebreaker disk defined? (i.e. mmlsconfig tiebreakerdisk). -Aaron On 9/7/18 7:51 AM, Matthias Knigge wrote: > Hello together, > > I am using the version 5.0.2.0 of GPFS and have problems with the > command mmlscluster and callback-scripts. It is a small cluster of two > nodes only. If I shutdown one of the nodes sometimes mmlscluster > reports the following output: > > [root at gpfs-tier1 gpfs5.2]# mmgetstate > > Node number? Node name??????? GPFS state > > ------------------------------------------- > > ?????? 1????? gpfs-tier1?????? arbitrating > > [root at gpfs-tier1 gpfs5.2]# mmlscluster > > ssh: connect to host gpfs-tier2 port 22: No route to host > > mmlscluster: Unable to retrieve GPFS cluster files from node > gpfs-tier2 > > mmlscluster: Command failed. Examine previous error messages to > determine cause. > > Normally the output is like this: > > [root at gpfs-tier1 gpfs5.2]# mmlscluster > > GPFS cluster information > > ======================== > > ? GPFS cluster name:???????? TIERCLUSTER.gpfs-tier1 > > ? GPFS cluster id:?????????? 12458173498278694815 > > ? GPFS UID domain:?????????? TIERCLUSTER.gpfs-tier1 > > ? Remote shell command:????? /usr/bin/ssh > > ? Remote file copy command:? /usr/bin/scp > > ? Repository type:?????????? server-based > > GPFS cluster configuration servers: > > ----------------------------------- > > ? Primary server:??? gpfs-tier2 > > ? Secondary server:? gpfs-tier1 > > Node? Daemon node name? IP address????? Admin node name? Designation > > ---------------------------------------------------------------------- > > ?? 1?? gpfs-tier1??????? 192.168.178.10? gpfs-tier1?????? > quorum-manager > > ?? 2?? gpfs-tier2??????? 192.168.178.11? gpfs-tier2?????? > quorum-manager > > [root at gpfs-tier1 gpfs5.2]# mmlscallback > > NodeDownCallback > > ??????? command?????? = /var/mmfs/rs/nodedown.ksh > > ??????? priority????? = 1 > > ??????? event???????? = quorumNodeLeave > > ??????? parms???????? = %eventNode %quorumNodes > > NodeUpCallback > > ??????? command?????? = /var/mmfs/rs/nodeup.ksh > > ??????? priority????? = 1 > > ??????? event???????? = quorumNodeJoin > > ??????? parms???????? = %eventNode %quorumNodes > > If I shutdown the filesystem via mmshutdown the callback script works > but if I shutdown the whole node the scripts does not run. 
> > The latest log-entry in mmfs.log.latest shows only this information: > > 2018-09-07_13:12:36.724+0200: [I] Cluster Manager connection broke. > Probing cluster TIERCLUSTER.gpfs-tier1 > > 2018-09-07_13:12:37.226+0200: [E] Unable to contact enough other > quorum nodes during cluster probe. > > 2018-09-07_13:12:37.226+0200: [E] Lost membership in cluster > TIERCLUSTER.gpfs-tier1. Unmounting file systems. > > 2018-09-07_13:12:38.448+0200: [N] Connecting to 192.168.178.11 > gpfs-tier2 > > Could anybody help me in this case? I want to try to start a script if > one node goes down or up to change the roles for starting the > filesystem. The callback event NodeLeave and NodeJoin do not run too. > > Any more information required? If yes, please let me know! > > Many thanks in advance and a nice weekend! > > Matthias > > Best Regards > > Matthias Knigge > R&D File Based Media Solutions > > Rohde & Schwarz > GmbH & Co. KG > Hanomaghof 1 > 30449 Hannover > Telefon +49 511 67 80 7 213 > Fax +49 511 37 19 74 > Internet: Matthias.Knigge at rohde-schwarz.com > ------------------------------------------------------------ > Gesch?ftsf?hrung / Executive Board: Christian Leicher (Vorsitzender / > Chairman), Peter Riedel, Sitz der Gesellschaft / Company's Place of > Business: M?nchen, Registereintrag / Commercial Register No.: HRA 16 > 270, Pers?nlich haftender Gesellschafter / Personally Liable Partner: > RUSEG Verwaltungs-GmbH, Sitz der Gesellschaft / Company's Place of > Business: M?nchen, Registereintrag / Commercial Register No.: HRB 7 > 534, Umsatzsteuer-Identifikationsnummer (USt-IdNr.) / VAT Identification No.: > DE 130 256 683, Elektro-Altger?te Register (EAR) / WEEE Register No.: > DE > 240 437 86 > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From Matthias.Knigge at rohde-schwarz.com Mon Sep 10 12:21:21 2018 From: Matthias.Knigge at rohde-schwarz.com (Matthias Knigge) Date: Mon, 10 Sep 2018 11:21:21 +0000 Subject: [gpfsug-discuss] [Newsletter] Re: Problem with mmlscluster and callback scripts Message-ID: Hi Fred, I have the same problem with the version 5.0.1.0. Thanks, Matthias Best Regards Matthias Knigge R&D File Based Media Solutions Rohde & Schwarz GmbH & Co. KG Hanomaghof 1 30449 Hannover Telefon +49 511 67 80 7 213 Fax +49 511 37 19 74 Internet: Matthias.Knigge at rohde-schwarz.com ------------------------------------------------------------ Gesch?ftsf?hrung / Executive Board: Christian Leicher (Vorsitzender / Chairman), Peter Riedel, Sitz der Gesellschaft / Company's Place of Business: M?nchen, Registereintrag / Commercial Register No.: HRA 16 270, Pers?nlich haftender Gesellschafter / Personally Liable Partner: RUSEG Verwaltungs-GmbH, Sitz der Gesellschaft / Company's Place of Business: M?nchen, Registereintrag / Commercial Register No.: HRB 7 534, Umsatzsteuer-Identifikationsnummer (USt-IdNr.) 
/ VAT Identification No.: DE 130 256 683, Elektro-Altger?te Register (EAR) / WEEE Register No.: DE 240 437 86 From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Frederick Stock Sent: Friday, September 07, 2018 3:20 PM To: gpfsug main discussion list Subject: *EXT* [Newsletter] Re: [gpfsug-discuss] Problem with mmlscluster and callback scripts Are you really running version 5.0.2? If so then I presume you have a beta version since it has not yet been released. For beta problems there is a specific feedback mechanism that should be used to report problems. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: Matthias Knigge > To: "gpfsug-discuss at spectrumscale.org" > Date: 09/07/2018 08:08 AM Subject: [gpfsug-discuss] Problem with mmlscluster and callback scripts Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hello together, I am using the version 5.0.2.0 of GPFS and have problems with the command mmlscluster and callback-scripts. It is a small cluster of two nodes only. If I shutdown one of the nodes sometimes mmlscluster reports the following output: [root at gpfs-tier1 gpfs5.2]# mmgetstate Node number Node name GPFS state ------------------------------------------- 1 gpfs-tier1 arbitrating [root at gpfs-tier1 gpfs5.2]# mmlscluster ssh: connect to host gpfs-tier2 port 22: No route to host mmlscluster: Unable to retrieve GPFS cluster files from node gpfs-tier2 mmlscluster: Command failed. Examine previous error messages to determine cause. Normally the output is like this: [root at gpfs-tier1 gpfs5.2]# mmlscluster GPFS cluster information ======================== GPFS cluster name: TIERCLUSTER.gpfs-tier1 GPFS cluster id: 12458173498278694815 GPFS UID domain: TIERCLUSTER.gpfs-tier1 Remote shell command: /usr/bin/ssh Remote file copy command: /usr/bin/scp Repository type: server-based GPFS cluster configuration servers: ----------------------------------- Primary server: gpfs-tier2 Secondary server: gpfs-tier1 Node Daemon node name IP address Admin node name Designation ---------------------------------------------------------------------- 1 gpfs-tier1 192.168.178.10 gpfs-tier1 quorum-manager 2 gpfs-tier2 192.168.178.11 gpfs-tier2 quorum-manager [root at gpfs-tier1 gpfs5.2]# mmlscallback NodeDownCallback command = /var/mmfs/rs/nodedown.ksh priority = 1 event = quorumNodeLeave parms = %eventNode %quorumNodes NodeUpCallback command = /var/mmfs/rs/nodeup.ksh priority = 1 event = quorumNodeJoin parms = %eventNode %quorumNodes If I shutdown the filesystem via mmshutdown the callback script works but if I shutdown the whole node the scripts does not run. The latest log-entry in mmfs.log.latest shows only this information: 2018-09-07_13:12:36.724+0200: [I] Cluster Manager connection broke. Probing cluster TIERCLUSTER.gpfs-tier1 2018-09-07_13:12:37.226+0200: [E] Unable to contact enough other quorum nodes during cluster probe. 2018-09-07_13:12:37.226+0200: [E] Lost membership in cluster TIERCLUSTER.gpfs-tier1. Unmounting file systems. 2018-09-07_13:12:38.448+0200: [N] Connecting to 192.168.178.11 gpfs-tier2 Could anybody help me in this case? I want to try to start a script if one node goes down or up to change the roles for starting the filesystem. The callback event NodeLeave and NodeJoin do not run too. Any more information required? If yes, please let me know! Many thanks in advance and a nice weekend! 
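P.S. For reference, the two callbacks were registered more or less like this - reconstructed from the mmlscallback output above, so treat it as illustrative only:

mmaddcallback NodeDownCallback --command /var/mmfs/rs/nodedown.ksh \
    --event quorumNodeLeave --priority 1 --parms "%eventNode %quorumNodes"
mmaddcallback NodeUpCallback --command /var/mmfs/rs/nodeup.ksh \
    --event quorumNodeJoin --priority 1 --parms "%eventNode %quorumNodes"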
Matthias Best Regards Matthias Knigge R&D File Based Media Solutions Rohde & Schwarz GmbH & Co. KG Hanomaghof 1 30449 Hannover Telefon +49 511 67 80 7 213 Fax +49 511 37 19 74 Internet: Matthias.Knigge at rohde-schwarz.com ------------------------------------------------------------ Gesch?ftsf?hrung / Executive Board: Christian Leicher (Vorsitzender / Chairman), Peter Riedel, Sitz der Gesellschaft / Company's Place of Business: M?nchen, Registereintrag / Commercial Register No.: HRA 16 270, Pers?nlich haftender Gesellschafter / Personally Liable Partner: RUSEG Verwaltungs-GmbH, Sitz der Gesellschaft / Company's Place of Business: M?nchen, Registereintrag / Commercial Register No.: HRB 7 534, Umsatzsteuer-Identifikationsnummer (USt-IdNr.) / VAT Identification No.: DE 130 256 683, Elektro-Altger?te Register (EAR) / WEEE Register No.: DE 240 437 86 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From abeattie at au1.ibm.com Mon Sep 10 13:19:31 2018 From: abeattie at au1.ibm.com (Andrew Beattie) Date: Mon, 10 Sep 2018 12:19:31 +0000 Subject: [gpfsug-discuss] [Newsletter] Re: Problem with mmlscluster and callback scripts In-Reply-To: Message-ID: An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Mon Sep 10 15:36:52 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Mon, 10 Sep 2018 10:36:52 -0400 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> Message-ID: No, but of course for a RAID with additional parity (error correction) bits the controller needs to read and write more. So if, for example, n+2 sub-blocks per stripe = n data and 2 error correction... Then the smallest update requires read-compute-write on 1 data and 2 ecc. = 3 reads and 3 writes. The calculation each parity block of the requires "subtracting" out the contribution of the old data value and adding in the contribution of the new data value Ref: http://igoro.com/archive/how-raid-6-dual-parity-calculation-works/ Look at it this way: The k'th parity value is Parityk= Ak*(data1) + Bk*(data2) + Ck*(data3) + ... (Ak, Bk, Ck, ... are coefficients for the computation of the k'th parity value) When updating data2 to data2x we update Parityk to Paritykx with Paritykx = Pariktyk - Bk*(data2) + Bk*(data2x) (Arithmetic done in a Galois Field chosen to make error correction practical.) From: "Uwe Falke" Hi, Marc, I was clearly unaware of that function. If my understanding of parity-based redundancy is about correct, then that method would only work with RAID 5, because that is a simple XOR-based hash, but RAID 6, if used, would not allow that stripped-down RMW. Is that correct? -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Mon Sep 10 19:26:34 2018 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Mon, 10 Sep 2018 18:26:34 +0000 Subject: [gpfsug-discuss] RAID type for system pool Message-ID: <6232FD43-12A5-49BA-84DA-B3801F42EAF6@vanderbilt.edu> Hi All, So while I?m waiting for the purchase of new hardware to go thru, I?m trying to gather more data about the current workload. One of the things I?m trying to do is get a handle on the ratio of reads versus writes for my metadata. I?m using ?mmdiag ?iohist? ? in this case ?dm-12? 
is one of my metadataOnly disks and I?m running this on the primary NSD server for that NSD. I?m seeing output like: 11:22:13.931117 W inode 4:299844163 1 0.448 srv dm-12 11:22:13.932344 R metadata 4:36659676 4 0.307 srv dm-12 11:22:13.932005 W logData 4:49676176 1 0.726 srv dm-12 And I?m confused as to the difference between ?inode? and ?metadata? (I at least _think_ I understand ?logData?)?!? The man page for mmdiag doesn?t help and I?ve not found anything useful yet in my Googling. This is on a filesystem that currently uses 512 byte inodes, if that matters. Thanks? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Mon Sep 10 17:37:26 2018 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Mon, 10 Sep 2018 16:37:26 +0000 Subject: [gpfsug-discuss] RAID type for system pool References: Message-ID: From: gpfsug-discuss-owner at spectrumscale.org Subject: Re: [gpfsug-discuss] RAID type for system pool Date: September 10, 2018 at 11:35:05 AM CDT To: klb at accre.vanderbilt.edu Hi All, So while I?m waiting for the purchase of new hardware to go thru, I?m trying to gather more data about the current workload. One of the things I?m trying to do is get a handle on the ratio of reads versus writes for my metadata. I?m using ?mmdiag ?iohist? ? in this case ?dm-12? is one of my metadataOnly disks and I?m running this on the primary NSD server for that NSD. I?m seeing output like: 11:22:13.931117 W inode 4:299844163 1 0.448 srv dm-12 11:22:13.932344 R metadata 4:36659676 4 0.307 srv dm-12 11:22:13.932005 W logData 4:49676176 1 0.726 srv dm-12 And I?m confused as to the difference between ?inode? and ?metadata? (I at least _think_ I understand ?logData?)?!? The man page for mmdiag doesn?t help and I?ve not found anything useful yet in my Googling. This is on a filesystem that currently uses 512 byte inodes, if that matters. Thanks? Kevin -------------- next part -------------- An HTML attachment was scrubbed... URL: From stockf at us.ibm.com Mon Sep 10 20:49:36 2018 From: stockf at us.ibm.com (Frederick Stock) Date: Mon, 10 Sep 2018 15:49:36 -0400 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: <6232FD43-12A5-49BA-84DA-B3801F42EAF6@vanderbilt.edu> References: <6232FD43-12A5-49BA-84DA-B3801F42EAF6@vanderbilt.edu> Message-ID: My guess is that the "metadata" IO is for either for directory data since directories are considered metadata, or fileset metadata. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 09/10/2018 02:27 PM Subject: [gpfsug-discuss] RAID type for system pool Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, So while I?m waiting for the purchase of new hardware to go thru, I?m trying to gather more data about the current workload. One of the things I?m trying to do is get a handle on the ratio of reads versus writes for my metadata. I?m using ?mmdiag ?iohist? ? in this case ?dm-12? is one of my metadataOnly disks and I?m running this on the primary NSD server for that NSD. 
I?m seeing output like: 11:22:13.931117 W inode 4:299844163 1 0.448 srv dm-12 11:22:13.932344 R metadata 4:36659676 4 0.307 srv dm-12 11:22:13.932005 W logData 4:49676176 1 0.726 srv dm-12 And I?m confused as to the difference between ?inode? and ?metadata? (I at least _think_ I understand ?logData?)?!? The man page for mmdiag doesn?t help and I?ve not found anything useful yet in my Googling. This is on a filesystem that currently uses 512 byte inodes, if that matters. Thanks? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From alvise.dorigo at psi.ch Tue Sep 11 10:05:23 2018 From: alvise.dorigo at psi.ch (Dorigo Alvise (PSI)) Date: Tue, 11 Sep 2018 09:05:23 +0000 Subject: [gpfsug-discuss] mmperfmon report some "null" data Message-ID: <83A6EEB0EC738F459A39439733AE80452675425E@MBX214.d.ethz.ch> Dear experts, during a intensive writing into a GPFS FS (~9.5 GB/s), if I run mmperfmon to collect performance data I get many "null" strings instead of real data:: [root at sf-dss-1 ~]# date;mmperfmon query 'sf-dssio-.*.psi.ch|GPFSNSDFS|RAW|gpfs_nsdfs_bytes_written' --short --number-buckets 10 -b 1 Tue Sep 11 10:57:06 CEST 2018 Legend: 1: sf-dssio-1.psi.ch|GPFSNSDFS|RAW|gpfs_nsdfs_bytes_written 2: sf-dssio-2.psi.ch|GPFSNSDFS|RAW|gpfs_nsdfs_bytes_written Row Timestamp _1 _2 1 2018-09-11-10:56:57 4135583744 4329193472 2 2018-09-11-10:56:58 4799332352 4697755648 3 2018-09-11-10:56:59 4799332352 4697755648 4 2018-09-11-10:57:00 null null 5 2018-09-11-10:57:01 null null 6 2018-09-11-10:57:02 null null 7 2018-09-11-10:57:03 null null 8 2018-09-11-10:57:04 null null 9 2018-09-11-10:57:05 null null 10 2018-09-11-10:57:06 null null Even worse if I reduce the number of buckets: [root at sf-dss-1 ~]# date;mmperfmon query 'sf-dssio-.*.psi.ch|GPFSNSDFS|RAW|gpfs_nsdfs_bytes_written' --short --number-buckets 5 -b 1 Tue Sep 11 10:59:26 CEST 2018 Legend: 1: sf-dssio-1.psi.ch|GPFSNSDFS|RAW|gpfs_nsdfs_bytes_written 2: sf-dssio-2.psi.ch|GPFSNSDFS|RAW|gpfs_nsdfs_bytes_written Row Timestamp _1 _2 1 2018-09-11-10:59:21 null null 2 2018-09-11-10:59:22 null null 3 2018-09-11-10:59:23 null null 4 2018-09-11-10:59:24 null null 5 2018-09-11-10:59:25 null null To get real data the number of buckets must be at least 6, but sometime it is better to set it to 10 otherwise there's the risk to get only "null" data anyway. The question is: which particular configuration can be wrong in my mmperfmon's configuration file (see below for the dump of "config show") that produces those null data ? My system is a Lenovo DSS-G220 updated to version dss-g-2.0a (gpfs version 4.2.3-7). 
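For now I work around it by querying a longer window with coarser buckets, which would fit the idea that the newest seconds simply have not been reported by the sensors yet - but that is only my guess at the cause, not a fix:

# same query, 5-second buckets over a minute, so the newest unreported seconds
# do not blank out the whole result
mmperfmon query 'sf-dssio-.*.psi.ch|GPFSNSDFS|RAW|gpfs_nsdfs_bytes_written' \
    --short --number-buckets 12 -b 5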
thanks, Alvise ------------------------------------ cephMon = "/opt/IBM/zimon/CephMonProxy" cephRados = "/opt/IBM/zimon/CephRadosProxy" colCandidates = "sf-dss-1", "daas-mon.psi.ch" colRedundancy = 2 collectors = { host = "" port = "4739" } config = "/opt/IBM/zimon/ZIMonSensors.cfg" ctdbstat = "" daemonize = T hostname = "" ipfixinterface = "0.0.0.0" logfile = "/var/log/zimon/ZIMonSensors.log" loglevel = "info" mmcmd = "/opt/IBM/zimon/MMCmdProxy" mmdfcmd = "/opt/IBM/zimon/MMDFProxy" mmpmon = "/opt/IBM/zimon/MmpmonSockProxy" piddir = "/var/run" release = "4.2.3-4" sensors = { name = "CPU" period = 5 }, { name = "Load" period = 5 }, { name = "Memory" period = 5 }, { name = "Network" period = 1 }, { name = "Netstat" period = 0 }, { name = "Diskstat" period = 0 }, { name = "DiskFree" period = 60 restrict = "sf-dss-1.psi.ch" }, { name = "Infiniband" period = 1 }, { name = "GPFSDisk" period = 1 restrict = "nsdNodes" }, { name = "GPFSFilesystem" period = 1 }, { name = "GPFSNSDDisk" period = 1 restrict = "nsdNodes" }, { name = "GPFSNSDFS" period = 1 restrict = "nsdNodes" }, { name = "GPFSPoolIO" period = 1 }, { name = "GPFSVFS" period = 1 }, { name = "GPFSIOC" period = 1 }, { name = "GPFSVIO" period = 1 }, { name = "GPFSPDDisk" period = 1 restrict = "nsdNodes" }, { name = "GPFSvFLUSH" period = 1 }, { name = "GPFSNode" period = 1 }, { name = "GPFSNodeAPI" period = 1 }, { name = "GPFSFilesystemAPI" period = 1 }, { name = "GPFSLROC" period = 1 }, { name = "GPFSCHMS" period = 1 }, { name = "GPFSAFM" period = 5 }, { name = "GPFSAFMFS" period = 5 }, { name = "GPFSAFMFSET" period = 5 }, { name = "GPFSRPCS" period = 1 }, { name = "GPFSWaiters" period = 5 }, { name = "GPFSFilesetQuota" period = 60 restrict = "sf-dss-1" }, { name = "GPFSFileset" period = 60 restrict = "sf-dss-1" }, { name = "GPFSPool" period = 60 restrict = "sf-dss-1" }, { name = "GPFSDiskCap" period = 0 } smbstat = "" -------------- next part -------------- An HTML attachment was scrubbed... URL: From Michael.Dutchak at ibm.com Tue Sep 11 14:20:19 2018 From: Michael.Dutchak at ibm.com (Michael Dutchak) Date: Tue, 11 Sep 2018 09:20:19 -0400 Subject: [gpfsug-discuss] Optimal range on inode count for a single folder Message-ID: I would like to find out what the limitation, or optimal range on inode count for a single folder is in GPFS. We have several users that have caused issues with our current files system by adding up to a million small files (1 ~ 40k) to a single directory. This causes issues during system remount where restarting the system can take excessive amounts of time. -------------- next part -------------- An HTML attachment was scrubbed... URL: From stockf at us.ibm.com Tue Sep 11 15:04:33 2018 From: stockf at us.ibm.com (Frederick Stock) Date: Tue, 11 Sep 2018 10:04:33 -0400 Subject: [gpfsug-discuss] Optimal range on inode count for a single folder In-Reply-To: References: Message-ID: I am not sure I can provide you an optimal range but I can list some factors to consider. In general the guideline is to keep directories to 500K files or so. Keeping your metadata on separate NSDs, and preferably fast NSDs, helps especially with directory listings. And running the latest version of Scale also helps. It is unclear to me why the number of files in a directory would impact remount unless these are exported directories and the remount is occurring on a user node that also attempts to scan through the directory. 
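If you want to flag offending directories before they become a problem, a crude sweep along these lines would do it (untested sketch; adjust the mount point and the 500K threshold, and note that on a very large tree a policy-engine scan would be the faster way to gather the same numbers):

# count the direct entries of every directory and report any over the guideline
find /gpfs/fs1 -xdev -type d -print0 | while IFS= read -r -d '' d; do
    n=$(find "$d" -mindepth 1 -maxdepth 1 | wc -l)
    [ "$n" -gt 500000 ] && echo "$n $d"
done | sort -rn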
Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Michael Dutchak" To: gpfsug-discuss at spectrumscale.org Date: 09/11/2018 09:21 AM Subject: [gpfsug-discuss] Optimal range on inode count for a single folder Sent by: gpfsug-discuss-bounces at spectrumscale.org I would like to find out what the limitation, or optimal range on inode count for a single folder is in GPFS. We have several users that have caused issues with our current files system by adding up to a million small files (1 ~ 40k) to a single directory. This causes issues during system remount where restarting the system can take excessive amounts of time. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Tue Sep 11 15:03:24 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 11 Sep 2018 10:03:24 -0400 Subject: [gpfsug-discuss] Optimal range on inode count for a single folder In-Reply-To: References: Message-ID: There is no single "optimal" number of files per directory. GPFS can handle millions of files in a directory, rather efficiently. It uses fairly modern extensible hashing and caching techniques that makes lookup, insertions and deletions go fast. But of course, reading or "listing" all directory entries is going to require reading all the disk sectors that contain the directory... "during system remount... restarting the system" -- NO! There is no relation between directory sizes and mount and startup times... If you are experiencing long mount times, something else is happening. IF restart is after a crash of some kind, then it is possible GPFS may need to process many log entries -- but that would be proportional to the number of directory updates "in flight" at the time of the crash... Having said that there are some changeover conditions in the way directories are stored, as one adds more and more entries. Since directory entries are of variable size, varying with the size of the file names, the exact numbers depend on file name length, inode size and (meta)data block size: A) All directory entries fit in the directory inode. Best performance! But I do not recommend deliberately changing apps to avoid spilling to ... B) All directory entries fit in one metadata block. C) Directory entries are spread over several blocks. You can determine how much storage a directory is using by a `stat /path` command or equivalent. From: "Michael Dutchak" To: gpfsug-discuss at spectrumscale.org Date: 09/11/2018 09:21 AM Subject: [gpfsug-discuss] Optimal range on inode count for a single folder Sent by: gpfsug-discuss-bounces at spectrumscale.org I would like to find out what the limitation, or optimal range on inode count for a single folder is in GPFS. We have several users that have caused issues with our current files system by adding up to a million small files (1 ~ 40k) to a single directory. This causes issues during system remount where restarting the system can take excessive amounts of time. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/gif Size: 21994 bytes Desc: not available URL: From makaplan at us.ibm.com Tue Sep 11 15:12:39 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 11 Sep 2018 10:12:39 -0400 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: Message-ID: Metadata is anything besides the data contents of your files. Inodes, directories, indirect blocks, allocation maps, log data ... are the biggies. Apparently, --iohist may sometimes distinguish some metadata as "inode", "logData", ... that doesn't mean those aren't metadata also. From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 09/10/2018 03:12 PM Subject: [gpfsug-discuss] RAID type for system pool Sent by: gpfsug-discuss-bounces at spectrumscale.org From: gpfsug-discuss-owner at spectrumscale.org Subject: Re: [gpfsug-discuss] RAID type for system pool Date: September 10, 2018 at 11:35:05 AM CDT To: klb at accre.vanderbilt.edu Hi All, So while I?m waiting for the purchase of new hardware to go thru, I?m trying to gather more data about the current workload. One of the things I?m trying to do is get a handle on the ratio of reads versus writes for my metadata. I?m using ?mmdiag ?iohist? ? in this case ?dm-12? is one of my metadataOnly disks and I?m running this on the primary NSD server for that NSD. I?m seeing output like: 11:22:13.931117 W inode 4:299844163 1 0.448 srv dm-12 11:22:13.932344 R metadata 4:36659676 4 0.307 srv dm-12 11:22:13.932005 W logData 4:49676176 1 0.726 srv dm-12 And I?m confused as to the difference between ?inode? and ?metadata? (I at least _think_ I understand ?logData?)?!? The man page for mmdiag doesn?t help and I?ve not found anything useful yet in my Googling. This is on a filesystem that currently uses 512 byte inodes, if that matters. Thanks? Kevin _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Tue Sep 11 16:48:51 2018 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Tue, 11 Sep 2018 15:48:51 +0000 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: Message-ID: Hi Marc, Understood ? I?m just trying to understand why some I/O?s are flagged as metadata, while others are flagged as inode?!? Since this filesystem uses 512 byte inodes, there is no data content from any files involved (for a metadata only disk), correct? Thanks? Kevin On Sep 11, 2018, at 9:12 AM, Marc A Kaplan > wrote: Metadata is anything besides the data contents of your files. Inodes, directories, indirect blocks, allocation maps, log data ... are the biggies. Apparently, --iohist may sometimes distinguish some metadata as "inode", "logData", ... that doesn't mean those aren't metadata also. From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 09/10/2018 03:12 PM Subject: [gpfsug-discuss] RAID type for system pool Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ From: gpfsug-discuss-owner at spectrumscale.org Subject: Re: [gpfsug-discuss] RAID type for system pool Date: September 10, 2018 at 11:35:05 AM CDT To: klb at accre.vanderbilt.edu Hi All, So while I?m waiting for the purchase of new hardware to go thru, I?m trying to gather more data about the current workload. 
One of the things I?m trying to do is get a handle on the ratio of reads versus writes for my metadata. I?m using ?mmdiag ?iohist? ? in this case ?dm-12? is one of my metadataOnly disks and I?m running this on the primary NSD server for that NSD. I?m seeing output like: 11:22:13.931117 W inode 4:299844163 1 0.448 srv dm-12 11:22:13.932344 R metadata 4:36659676 4 0.307 srv dm-12 11:22:13.932005 W logData 4:49676176 1 0.726 srv dm-12 And I?m confused as to the difference between ?inode? and ?metadata? (I at least _think_ I understand ?logData?)?!? The man page for mmdiag doesn?t help and I?ve not found anything useful yet in my Googling. This is on a filesystem that currently uses 512 byte inodes, if that matters. Thanks? Kevin _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C2dbbb1fe9f5a4b80aa6b08d617f0a664%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636722719686369131&sdata=uOPawxUhx4Wvxja5%2FLvJJMpAHj3uRb0Q1eiogmRXGgw%3D&reserved=0 -------------- next part -------------- An HTML attachment was scrubbed... URL: From valdis.kletnieks at vt.edu Tue Sep 11 17:31:46 2018 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Tue, 11 Sep 2018 12:31:46 -0400 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: <6232FD43-12A5-49BA-84DA-B3801F42EAF6@vanderbilt.edu> Message-ID: <9333.1536683506@turing-police.cc.vt.edu> On Mon, 10 Sep 2018 15:49:36 -0400, "Frederick Stock" said: > My guess is that the "metadata" IO is for either for directory data since > directories are considered metadata, or fileset metadata. Plus things like free block lists, etc... -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From makaplan at us.ibm.com Tue Sep 11 19:00:15 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 11 Sep 2018 14:00:15 -0400 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: Message-ID: 1. (Guessing) perhaps it was considered useful to distinguish inode traffic from log traffic and just lump other metadata together. 2. A 512 byte inode has space for up to 384 bytes of data-in-inode data. (Says tsdbfs ) From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 09/11/2018 11:49 AM Subject: Re: [gpfsug-discuss] RAID type for system pool Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Marc, Understood ? I?m just trying to understand why some I/O?s are flagged as metadata, while others are flagged as inode?!? Since this filesystem uses 512 byte inodes, there is no data content from any files involved (for a metadata only disk), correct? Thanks? Kevin On Sep 11, 2018, at 9:12 AM, Marc A Kaplan wrote: Metadata is anything besides the data contents of your files. Inodes, directories, indirect blocks, allocation maps, log data ... are the biggies. Apparently, --iohist may sometimes distinguish some metadata as "inode", "logData", ... that doesn't mean those aren't metadata also. 
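If you want to estimate how many of your files would end up with data-in-inode after a move to 4K inodes, a quick policy scan along these lines counts the candidates. This is an illustrative, untested sketch: the file system name is a placeholder, and the 3500-byte cutoff is only an approximation of the usable data-in-inode space of a 4K inode.

cat > /tmp/small.pol <<'EOF'
RULE EXTERNAL LIST 'small' EXEC ''
RULE 'countsmall' LIST 'small' WHERE FILE_SIZE <= 3500
EOF
# deferred execution just writes the candidate list instead of calling an external program
mmapplypolicy yourfs -P /tmp/small.pol -I defer -f /tmp/small
# the deferred candidate list lands in a prefix.list.<listname> file under /tmp
wc -l /tmp/small.list.small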
From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 09/10/2018 03:12 PM Subject: [gpfsug-discuss] RAID type for system pool Sent by: gpfsug-discuss-bounces at spectrumscale.org From: gpfsug-discuss-owner at spectrumscale.org Subject: Re: [gpfsug-discuss] RAID type for system pool Date: September 10, 2018 at 11:35:05 AM CDT To: klb at accre.vanderbilt.edu Hi All, So while I?m waiting for the purchase of new hardware to go thru, I?m trying to gather more data about the current workload. One of the things I?m trying to do is get a handle on the ratio of reads versus writes for my metadata. I?m using ?mmdiag ?iohist? ? in this case ?dm-12? is one of my metadataOnly disks and I?m running this on the primary NSD server for that NSD. I?m seeing output like: 11:22:13.931117 W inode 4:299844163 1 0.448 srv dm-12 11:22:13.932344 R metadata 4:36659676 4 0.307 srv dm-12 11:22:13.932005 W logData 4:49676176 1 0.726 srv dm-12 And I?m confused as to the difference between ?inode? and ?metadata? (I at least _think_ I understand ?logData?)?!? The man page for mmdiag doesn?t help and I?ve not found anything useful yet in my Googling. This is on a filesystem that currently uses 512 byte inodes, if that matters. Thanks? Kevin _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C2dbbb1fe9f5a4b80aa6b08d617f0a664%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636722719686369131&sdata=uOPawxUhx4Wvxja5%2FLvJJMpAHj3uRb0Q1eiogmRXGgw%3D&reserved=0 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From ewahl at osc.edu Tue Sep 11 19:48:26 2018 From: ewahl at osc.edu (Ed Wahl) Date: Tue, 11 Sep 2018 14:48:26 -0400 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: Message-ID: <1f36431b-823b-aeae-205d-f213cca7c737@osc.edu> That isn't necessarily true.? The ONLY way to ensure you aren't saving tiny little <~400 byte files is to have encryption enabled with the advanced license.?? There are QUITE a few types of MD that can be saved, just look at the output from a file using mmlsattr -L... Ed On 09/11/2018 11:48 AM, Buterbaugh, Kevin L wrote: > Hi Marc, > > Understood ? I?m just trying to understand why some I/O?s are flagged > as metadata, while others are flagged as inode?!? ?Since this > filesystem uses 512 byte inodes, there is no data content from any > files involved (for a metadata only disk), correct? ?Thanks? > > Kevin > >> On Sep 11, 2018, at 9:12 AM, Marc A Kaplan > > wrote: >> >> Metadata is anything besides the data contents of your files. >> Inodes, directories, indirect blocks, allocation maps, log data ... >> ?are the biggies. >> >> Apparently, --iohist may sometimes distinguish some metadata as >> "inode", "logData", ... ?that doesn't mean those aren't metadata also. 
>> >> >> >> >> From: "Buterbaugh, Kevin L" > > >> To: gpfsug main discussion list > > >> Date: 09/10/2018 03:12 PM >> Subject: [gpfsug-discuss] ?RAID type for system pool >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> >> ------------------------------------------------------------------------ >> >> >> >> >> *From: * _gpfsug-discuss-owner at spectrumscale.org_ >> >> *Subject: Re: [gpfsug-discuss] RAID type for system pool* >> *Date: * September 10, 2018 at 11:35:05 AM CDT >> *To: * _klb at accre.vanderbilt.edu_ >> >> Hi All, >> >> So while I?m waiting for the purchase of new hardware to go thru, I?m >> trying to gather more data about the current workload. ?One of the >> things I?m trying to do is get a handle on the ratio of reads versus >> writes for my metadata. >> >> I?m using ?mmdiag ?iohist? ? in this case ?dm-12? is one of my >> metadataOnly disks and I?m running this on the primary NSD server for >> that NSD. ?I?m seeing output like: >> >> 11:22:13.931117 ?W ? ? ? inode ? ?4:299844163 ? ? ?1 ? ?0.448 ?srv ? >> dm-12 >> 11:22:13.932344 ?R ? ?metadata ? ?4:36659676 ? ? ? 4 ? ?0.307 ?srv ? >> dm-12 >> 11:22:13.932005 ?W ? ? logData ? ?4:49676176 ? ? ? 1 ? ?0.726 ?srv ? >> dm-12 >> >> And I?m confused as to the difference between ?inode? and ?metadata? >> (I at least _think_ I understand ?logData?)?!? ?The man page for >> mmdiag doesn?t help and I?ve not found anything useful yet in my >> Googling. >> >> This is on a filesystem that currently uses 512 byte inodes, if that >> matters. ?Thanks? >> >> Kevin >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C2dbbb1fe9f5a4b80aa6b08d617f0a664%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636722719686369131&sdata=uOPawxUhx4Wvxja5%2FLvJJMpAHj3uRb0Q1eiogmRXGgw%3D&reserved=0 > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From aaron.s.knister at nasa.gov Wed Sep 12 23:23:58 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Wed, 12 Sep 2018 18:23:58 -0400 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> <3B1C05ED-3058-4644-BC54-5BD1ED583C88@vanderbilt.edu> <269a9d7aeb3f4597adb22dfb3c2d8365@jumptrading.com> Message-ID: <9d8d93a6-9bbd-fbc0-b8f7-2f651ca64312@nasa.gov> It's a good question, Simon. I don't know the answer. At least, when I started composing this e-mail what, 5 days ago now, I didn't. I did a little test using dd to write directly to the NSD (not in production just to be clear...I've got co-workers on this list ;-) ). 
Here's a partial dump of the inode prior: # /usr/lpp/mmfs/bin/tsdbfs fs1 inode 23808 Inode 23808 [23808] snap 0 (index 1280 in block 11): Inode address: 1:4207872 2:4207872 size 512 nAddrs 25 indirectionLevel=INDIRECT status=USERFILE objectVersion=103 generation=0x6E256E16 nlink=1 owner uid=0 gid=0 mode=0200100644: -rw-r--r-- blocksize code=5 (32 subblocks) lastBlockSubblocks=32 checksum=0xF74A31AA is Valid This is me writing junk to that sector of the NSD: # dd if=/dev/urandom bs=512 of=/dev/sda seek=4207872 count=1 Post-junkifying: # /usr/lpp/mmfs/bin/tsdbfs fs1 sector 1:4207872 Contents of 1 sector(s) from 1:4207872 = 0x1:403500, width1 0000000000000000: 4FA27C86 5D2076BB 6CD011DE D582F7CE *O.|.].v.l.......* 0000000000000010: 60A708F1 A3C60FCD 7D796E3D CC97F586 *`.......}yn=....* 0000000000000020: 57B643A7 FABD7235 A2BD9B75 6DDA0771 *W.C...r5...um..q* 0000000000000030: 6A818411 0D59D1D3 2C4C7F39 2B2B529D *j....Y..,L.9++R.* 0000000000000040: 9AE06C7D A8FB1DC9 7E783DB4 90A9E9E4 *..l}....~x=.....* 0000000000000050: B2D0E9C9 CC7FEBC0 85F23DF8 F18D19C0 *..........=.....* 0000000000000060: DA9C817C D20C0FB2 F30AAF55 C86D4155 *...|.......U.mAU* Dump of the inode post-junkifying: # /usr/lpp/mmfs/bin/tsdbfs fs1 inode 23808 Inode 23808 [23808] snap 0 (index 1280 in block 11): Inode address: 1:4207872 2:4207872 size 512 nAddrs 0 indirectionLevel=13 status=4 objectVersion=5738285791753303739 generation=0x9AE06C7D nlink=3955281023 owner uid=2121809332 gid=-1867912732 mode=025076616711: prws--s--x flags set: exposed illCompressed dataUpdateMissRRPlus metaUpdateMiss blocksize code=8 (256 subblocks) lastBlockSubblocks=15582 checksum=0xD582F7CE is INVALID (computed checksum=0x2A2FA283) Attempts to access the file succeed but I get an fsstruct error: # /usr/lpp/mmfs/samples/debugtools/fsstructlx.awk /var/log/messages 09/12 at 17:38:03 gpfs-adm1 FSSTRUCT fs1 108 FSErrValidate type=inode da=00000001:0000000000403500(1:4207872) sectors=0001 repda=[nVal=2 00000001:0000000000403500(1:4207872) 00000002:0000000000403500(2:4207872)] data=(len=00000200) 4FA27C86 5D2076BB 6CD011DE D582F7CE 60A708F1 A3C60FCD 7D796E3D CC97F586 57B643A7 FABD7235 A2BD9B75 6DDA0771 6A818411 0D59D1D3 2C4C7F39 2B2B529D 9AE06C7D A8FB1DC9 7E783DB4 90A9E9E4 B2D0E9C9 CC7FEBC0 85F23DF8 F18D19C0 DA9C817C D20C0FB2 F30AAF55 It *didn't* automatically repair it, it seems. The restripe did pick it up: # /usr/lpp/mmfs/bin/mmrestripefs fs1 -c --read-only --metadata-only Scanning file system metadata, phase 1 ... Inode 0 [fileset 0, snapshot 0 ] has mismatch in replicated disk address 1:4206592 2:4206592 Scan completed successfully. Scanning file system metadata, phase 2 ... Scan completed successfully. Scanning file system metadata, phase 3 ... Scan completed successfully. Scanning file system metadata, phase 4 ... Scan completed successfully. Scanning user file metadata ... 100.00 % complete on Sun Aug 26 18:10:36 2018 ( 69632 inodes with total 406 MB data processed) Scan completed successfully. I ran this to fix it: # /usr/lpp/mmfs/bin/mmrestripefs fs1 -c --metadata-only And things appear better afterwards: # /usr/lpp/mmfs/bin/tsdbfs fs1 inode 23808 Inode 23808 [23808] snap 0 (index 1280 in block 11): Inode address: 1:4207872 2:4207872 size 512 nAddrs 25 indirectionLevel=INDIRECT status=USERFILE objectVersion=103 generation=0x6E256E16 nlink=1 owner uid=0 gid=0 mode=0200100644: -rw-r--r-- blocksize code=5 (32 subblocks) lastBlockSubblocks=32 checksum=0xF74A31AA is Valid This is with 4.2.3-10. 
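On the background-scrub idea (Bryan's question 6 below), something like this out of cron is roughly what I had in mind. Sketch only: the QoS syntax is from memory for 4.2.x, and the IOPS cap is just a placeholder to tune for your arrays.

# throttle maintenance-class commands so the verify doesn't trample production I/O
mmchqos fs1 --enable pool=system,maintenance=300IOPS,other=unlimited
# periodic read-only metadata verify; reports replica mismatches without changing anything
mmrestripefs fs1 -c --read-only --metadata-only --qos maintenance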
-Aaron On 9/6/18 1:49 PM, Simon Thompson wrote: > I thought reads were always round robin's (in some form) unless you set readreplicapolicy. > > And I thought with fsstruct you had to use mmfsck offline to fix. > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [aaron.s.knister at nasa.gov] > Sent: 06 September 2018 18:06 > To: gpfsug-discuss at spectrumscale.org > Subject: Re: [gpfsug-discuss] RAID type for system pool > > Answers inline based on my recollection of experiences we've had here: > > On 9/6/18 12:19 PM, Bryan Banister wrote: >> I have questions about how the GPFS metadata replication of 3 works. >> >> 1. Is it basically the same as replication of 2 but just have one more >> copy, making recovery much more likely? > > That's my understanding. > >> 2. If there is nothing that is checking that the data was correctly >> read off of the device (e.g. CRC checking ON READS like the DDNs do, >> T10PI or Data Integrity Field) then how does GPFS handle a corrupted >> read of the data? >> - unlikely with SSD but head could be off on a NLSAS read, no >> errors, but you get some garbage instead, plus no auto retries > > The inode itself is checksummed: > > # /usr/lpp/mmfs/bin/tsdbfs mysuperawesomespacefs > Enter command or null to read next sector. Type ? for help. > inode 20087366 > Inode 20087366 [20087366] snap 0 (index 582 in block 9808): > Inode address: 30:263275078 32:263264838 size 512 nAddrs 32 > indirectionLevel=3 status=USERFILE > objectVersion=49352 generation=0x2B519B3 nlink=1 > owner uid=8675309 gid=999 mode=0200100600: -rw------- > blocksize code=5 (32 subblocks) > lastBlockSubblocks=1 > checksum=0xF2EF3427 is Valid > ... > Disk pointers [32]: > 0: 31:217629376 1: 30:217632960 2: (null) ... > 31: (null) > > as are indirect blocks (I'm sure that's not an exhaustive list of > checksummed metadata structures): > > ind 31:217629376 > Indirect block starting in sector 31:217629376: > magic=0x112DF307 generation=0x2B519B3 blockNum=0 inodeNum=20087366 > indirection level=2 > checksum=0x6BDAA92A > CalcChecksum(0x5B6DC9FC000, 32768, 20)=0x6BDAA92A > Data pointers: > >> 3. Does GPFS read at least two of the three replicas and compares them >> to ensure the data is correct? >> - expensive operation, so very unlikely > > I don't know, but I do know it verifies the checksum and I believe if > that's wrong it will try another replica. > >> 4. If not reading multiple replicas for comparison, are reads round >> robin across all three copies? > > I feel like we see pretty even distribution of reads across all replicas > of our metadata LUNs, although this is looking overall at the array > level so it may be a red herring. > >> 5. If one replica is corrupted (bad blocks) what does GPFS do to >> recover this metadata copy? Is this automatic or does this require >> a manual `mmrestripefs -c` operation or something? >> - If not, seems like a pretty simple idea and maybe an RFE worthy >> submission > > My experience has been it will attempt to correct it (and maybe log an > fsstruct error?). This was in the 3.5 days, though. > >> 6. Would the idea of an option to run ?background scrub/verifies? of >> the data/metadata be worthwhile to ensure no hidden bad blocks? >> - Using QoS this should be relatively painless > > If you don't have array-level background scrubbing, this is what I'd > suggest. (e.g. mmrestripefs -c --metadata-only). > >> 7. 
With a drive failure do you have to delete the NSD from the file >> system and cluster, recreate the NSD, add it back to the FS, then >> again run the `mmrestripefs -c` operation to restore the replication? >> - As Kevin mentions this will end up being a FULL file system scan >> vs. a block-based scan and replication. That could take a long time >> depending on number of inodes and type of storage! >> >> Thanks for any insight, >> >> -Bryan >> >> *From:* gpfsug-discuss-bounces at spectrumscale.org >> *On Behalf Of *Buterbaugh, >> Kevin L >> *Sent:* Thursday, September 6, 2018 9:59 AM >> *To:* gpfsug main discussion list >> *Subject:* Re: [gpfsug-discuss] RAID type for system pool >> >> /Note: External Email/ >> >> ------------------------------------------------------------------------ >> >> Hi All, >> >> Wow - my query got more responses than I expected and my sincere thanks >> to all who took the time to respond! >> >> At this point in time we do have two GPFS filesystems ? one which is >> basically ?/home? and some software installations and the other which is >> ?/scratch? and ?/data? (former backed up, latter not). Both of them >> have their metadata on SSDs set up as RAID 1 mirrors and replication set >> to two. But at this point in time all of the SSDs are in a single >> storage array (albeit with dual redundant controllers) ? so the storage >> array itself is my only SPOF. >> >> As part of the hardware purchase we are in the process of making we will >> be buying a 2nd storage array that can house 2.5? SSDs. Therefore, we >> will be splitting our SSDs between chassis and eliminating that last >> SPOF. Of course, this includes the new SSDs we are getting for our new >> /home filesystem. >> >> Our plan right now is to buy 10 SSDs, which will allow us to test 3 >> configurations: >> >> 1) two 4+1P RAID 5 LUNs split up into a total of 8 LV?s (with each of my >> 8 NSD servers as primary for one of those LV?s and the other 7 as >> backups) and GPFS metadata replication set to 2. >> >> 2) four RAID 1 mirrors (which obviously leaves 2 SSDs unused) and GPFS >> metadata replication set to 2. This would mean that only 4 of my 8 NSD >> servers would be a primary. >> >> 3) nine RAID 0 / bare drives with GPFS metadata replication set to 3 >> (which leaves 1 SSD unused). All 8 NSD servers primary for one SSD and >> 1 serving up two. >> >> The responses I received concerning RAID 5 and performance were not a >> surprise to me. The main advantage that option gives is the most usable >> storage space for the money (in fact, it gives us way more storage space >> than we currently need) ? but if it tanks performance, then that?s a >> deal breaker. >> >> Personally, I like the four RAID 1 mirrors config like we?ve been using >> for years, but it has the disadvantage of giving us the least usable >> storage space ? that config would give us the minimum we need for right >> now, but doesn?t really allow for much future growth. >> >> I have no experience with metadata replication of 3 (but had actually >> thought of that option, so feel good that others suggested it) so option >> 3 will be a brand new experience for us. It is the most optimal in >> terms of meeting current needs plus allowing for future growth without >> giving us way more space than we are likely to need). I will be curious >> to see how long it takes GPFS to re-replicate the data when we simulate >> a drive failure as opposed to how long a RAID rebuild takes. 
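For readers who want to see what option 3 above (bare SSDs with GPFS metadata replication set to 3) translates to on the command line, here is a rough sketch. The NSD names, device paths, server names, failure groups, and stanza file names are placeholders invented for illustration; only the stanza layout and the -m/-M replication flags are the point.

# Sketch: one metadata-only NSD per bare SSD, each in its own failure group
# so the three metadata copies land on different drives/servers (names made up).
cat > meta-nsds.stanza <<'EOF'
%nsd: nsd=md_ssd01 device=/dev/sdb servers=nsd01 usage=metadataOnly failureGroup=101 pool=system
%nsd: nsd=md_ssd02 device=/dev/sdb servers=nsd02 usage=metadataOnly failureGroup=102 pool=system
%nsd: nsd=md_ssd03 device=/dev/sdb servers=nsd03 usage=metadataOnly failureGroup=103 pool=system
EOF
/usr/lpp/mmfs/bin/mmcrnsd -F meta-nsds.stanza

# all-nsds.stanza stands in for the full NSD list (metadata plus data NSDs).
# -m/-M set default/maximum metadata replicas, -r/-R default/maximum data replicas.
/usr/lpp/mmfs/bin/mmcrfs home -F all-nsds.stanza -m 3 -M 3 -r 2 -R 2 -A yes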
>> >> I am a big believer in Murphy?s Law (Sunday I paid off a bill, Wednesday >> my refrigerator died!) ? and also believe that the definition of a >> pessimist is ?someone with experience? ? so we will definitely >> not set GPFS metadata replication to less than two, nor will we use >> non-Enterprise class SSDs for metadata ? but I do still appreciate the >> suggestions. >> >> If there is interest, I will report back on our findings. If anyone has >> any additional thoughts or suggestions, I?d also appreciate hearing >> them. Again, thank you! >> >> Kevin >> >> ? >> >> Kevin Buterbaugh - Senior System Administrator >> >> Vanderbilt University - Advanced Computing Center for Research and Education >> >> Kevin.Buterbaugh at vanderbilt.edu >> ?- (615)875-9633 >> >> >> ------------------------------------------------------------------------ >> >> Note: This email is for the confidential use of the named addressee(s) >> only and may contain proprietary, confidential, or privileged >> information and/or personal data. If you are not the intended recipient, >> you are hereby notified that any review, dissemination, or copying of >> this email is strictly prohibited, and requested to notify the sender >> immediately and destroy this email and any attachments. Email >> transmission cannot be guaranteed to be secure or error-free. The >> Company, therefore, does not make any guarantees as to the completeness >> or accuracy of this email or any attachments. This email is for >> informational purposes only and does not constitute a recommendation, >> offer, request, or solicitation of any kind to buy, sell, subscribe, >> redeem, or perform any type of transaction of a financial product. >> Personal data, as defined by applicable data privacy laws, contained in >> this email may be processed by the Company, and any of its affiliated or >> related companies, for potential ongoing compliance and/or >> business-related purposes. You may have rights regarding your personal >> data; for information on exercising these rights or the Company?s >> treatment of personal data, please email datarequests at jumptrading.com. >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From bbanister at jumptrading.com Fri Sep 14 00:33:08 2018 From: bbanister at jumptrading.com (Bryan Banister) Date: Thu, 13 Sep 2018 23:33:08 +0000 Subject: [gpfsug-discuss] replicating ACLs across GPFS's? Message-ID: I'm checking in on this thread. Is this patch still working for people with the latest rsync releases? https://github.com/gpfsug/gpfsug-tools/tree/master/bin/rsync Thanks! -Bryan ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential, or privileged information and/or personal data. 
If you are not the intended recipient, you are hereby notified that any review, dissemination, or copying of this email is strictly prohibited, and requested to notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request, or solicitation of any kind to buy, sell, subscribe, redeem, or perform any type of transaction of a financial product. Personal data, as defined by applicable data privacy laws, contained in this email may be processed by the Company, and any of its affiliated or related companies, for potential ongoing compliance and/or business-related purposes. You may have rights regarding your personal data; for information on exercising these rights or the Company's treatment of personal data, please email datarequests at jumptrading.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Fri Sep 14 09:41:38 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Fri, 14 Sep 2018 08:41:38 +0000 Subject: [gpfsug-discuss] replicating ACLs across GPFS's? Message-ID: Last time I built was still against 3.0.9, note there is also a PR in there which fixes the bug with symlinks. If anyone wants to rebase the patches against 3.1.3, I?ll happily take a PR ? Simon From: on behalf of "bbanister at jumptrading.com" Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Friday, 14 September 2018 at 00:33 To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] replicating ACLs across GPFS's? I?m checking in on this thread. Is this patch still working for people with the latest rsync releases? https://github.com/gpfsug/gpfsug-tools/tree/master/bin/rsync Thanks! -Bryan ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential, or privileged information and/or personal data. If you are not the intended recipient, you are hereby notified that any review, dissemination, or copying of this email is strictly prohibited, and requested to notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request, or solicitation of any kind to buy, sell, subscribe, redeem, or perform any type of transaction of a financial product. Personal data, as defined by applicable data privacy laws, contained in this email may be processed by the Company, and any of its affiliated or related companies, for potential ongoing compliance and/or business-related purposes. You may have rights regarding your personal data; for information on exercising these rights or the Company?s treatment of personal data, please email datarequests at jumptrading.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: From kevindjo at us.ibm.com Fri Sep 14 14:24:04 2018 From: kevindjo at us.ibm.com (Kevin D Johnson) Date: Fri, 14 Sep 2018 13:24:04 +0000 Subject: [gpfsug-discuss] replicating ACLs across GPFS's? 
In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Fri Sep 14 14:36:38 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Fri, 14 Sep 2018 13:36:38 +0000 Subject: [gpfsug-discuss] replicating ACLs across GPFS's? In-Reply-To: References: Message-ID: <14417376-CF84-45F8-9461-DBE1A86777D8@bham.ac.uk> Oh I also heard a rumour of some sort of mmcopy type sample script, but I can?t see it in samples on 5.0.1-2? Simon From: on behalf of Simon Thompson Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Friday, 14 September 2018 at 09:41 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] replicating ACLs across GPFS's? Last time I built was still against 3.0.9, note there is also a PR in there which fixes the bug with symlinks. If anyone wants to rebase the patches against 3.1.3, I?ll happily take a PR ? Simon From: on behalf of "bbanister at jumptrading.com" Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Friday, 14 September 2018 at 00:33 To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] replicating ACLs across GPFS's? I?m checking in on this thread. Is this patch still working for people with the latest rsync releases? https://github.com/gpfsug/gpfsug-tools/tree/master/bin/rsync Thanks! -Bryan ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential, or privileged information and/or personal data. If you are not the intended recipient, you are hereby notified that any review, dissemination, or copying of this email is strictly prohibited, and requested to notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request, or solicitation of any kind to buy, sell, subscribe, redeem, or perform any type of transaction of a financial product. Personal data, as defined by applicable data privacy laws, contained in this email may be processed by the Company, and any of its affiliated or related companies, for potential ongoing compliance and/or business-related purposes. You may have rights regarding your personal data; for information on exercising these rights or the Company?s treatment of personal data, please email datarequests at jumptrading.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Fri Sep 14 20:36:57 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 14 Sep 2018 19:36:57 +0000 Subject: [gpfsug-discuss] FYI - Spectrum Scale 5.0.2 Available Message-ID: <2F377221-7DF6-4C71-892B-7FE8B5D4AFE4@nuance.com> I did not see this on Fix Central yet (I?m assuming you can upgrade from 5.0.1,X) , but it is on Passport Advantage. Looks like a nice set of changes, which is summarized here: https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.2/com.ibm.spectrum.scale.v5r02.doc/bl1xx_soc.htm Here are few highlights I think are worth noting (not all of them) ? I beta tested watch folders and It?s a great addition to Scale. 
File system core improvements * Combined gpfs.base and gpfs.ext into a single package on Linux * File system maintenance mode provides a safe access window for file system maintenance * The maxActiveIallocSegs attribute improves the performance of deletes and unlinks * The mmnetverify command checks the connectivity of remote clusters * The GPFS portability layer (GPL) can be rebuilt automatically * You can now configure a cluster to automatically rebuild the GPL whenever a new level of the Linux kernel is installed or whenever a new level of IBM Spectrum Scale is installed. For more information, see the description of the autoBuildGPL attribute in the topic mmchconfig command. * Two features cope with long I/O waits on directly attached disks * The diskIOHang callback event allows you to add notification and data collection scripts to analyze the cause of a local I/O request that has been pending in the node kernel for more than 5 minutes. For more information, see mmaddcallback command. * The panicOnIOHang attribute controls whether the GPFS daemon panics the node kernel when a local I/O request has been pending in the kernel for more than five minutes. For more information, see mmchconfig command. Watch folder Watch folder is a flexible API that allows programmatic actions to be taken based on file system events. For more information, see Introduction to watch folder. It has the following features: * Watch folder can be run against folders, filesets, and inode spaces. * Watch folder is modeled after Linux inotify, but works with clustered file systems and supports recursive watches for filesets and inode spaces. * Watch folder has two primary components: o The GPFS programming interfaces, which are included within . For more information, see Watch folder API. o The mmwatch command, which provides information for all of the watches running within a cluster. For more information, see mmwatch command. * A watch folder application uses the API to run on a node within an IBM Spectrum Scale cluster. o It utilizes the message queue to receive events from multiple nodes and consume from the node that is running the application. o Lightweight events come in from all eligible nodes within a cluster and from accessing clusters. * Watch folder is integrated into call home, IBM Spectrum Scale snap log collection, and IBM Spectrum Scale trace. Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 -------------- next part -------------- An HTML attachment was scrubbed... URL: From mutantllama at gmail.com Mon Sep 17 05:04:03 2018 From: mutantllama at gmail.com (Carl) Date: Mon, 17 Sep 2018 14:04:03 +1000 Subject: [gpfsug-discuss] SC18 UG meeting date Message-ID: Hi folks, Im just starting the process of obtaining the various travel approvals that I need to attend SC this year. Does anyone have the dates/times for the usergroup meeting? Thanks, Carl. -------------- next part -------------- An HTML attachment was scrubbed... URL: From kkr at lbl.gov Mon Sep 17 07:31:59 2018 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Sun, 16 Sep 2018 23:31:59 -0700 Subject: [gpfsug-discuss] SC18 UG meeting date In-Reply-To: References: Message-ID: Date will be Sunday the 11th. Will check on time. Kristy Sent from my iPhone > On Sep 16, 2018, at 9:04 PM, Carl wrote: > > Hi folks, > > Im just starting the process of obtaining the various travel approvals that I need to attend SC this year. > > Does anyone have the dates/times for the usergroup meeting? > > Thanks, > > Carl. 
> > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Mon Sep 17 11:10:16 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Mon, 17 Sep 2018 10:10:16 +0000 Subject: [gpfsug-discuss] FYI - Spectrum Scale 5.0.2 Available Message-ID: <17CD664C-6034-4D39-B612-3FE862C45051@bham.ac.uk> Looks like its also moved to a unified installer ? i.e. no separate download for protocols anymore. Simon From: on behalf of "Robert.Oesterlin at nuance.com" Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Friday, 14 September 2018 at 20:37 To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] FYI - Spectrum Scale 5.0.2 Available I did not see this on Fix Central yet (I?m assuming you can upgrade from 5.0.1,X) , but it is on Passport Advantage. Looks like a nice set of changes, which is summarized here: https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.2/com.ibm.spectrum.scale.v5r02.doc/bl1xx_soc.htm Here are few highlights I think are worth noting (not all of them) ? I beta tested watch folders and It?s a great addition to Scale. File system core improvements * Combined gpfs.base and gpfs.ext into a single package on Linux * File system maintenance mode provides a safe access window for file system maintenance * The maxActiveIallocSegs attribute improves the performance of deletes and unlinks * The mmnetverify command checks the connectivity of remote clusters * The GPFS portability layer (GPL) can be rebuilt automatically * You can now configure a cluster to automatically rebuild the GPL whenever a new level of the Linux kernel is installed or whenever a new level of IBM Spectrum Scale is installed. For more information, see the description of the autoBuildGPL attribute in the topic mmchconfig command. * Two features cope with long I/O waits on directly attached disks * The diskIOHang callback event allows you to add notification and data collection scripts to analyze the cause of a local I/O request that has been pending in the node kernel for more than 5 minutes. For more information, see mmaddcallback command. * The panicOnIOHang attribute controls whether the GPFS daemon panics the node kernel when a local I/O request has been pending in the kernel for more than five minutes. For more information, see mmchconfig command. Watch folder Watch folder is a flexible API that allows programmatic actions to be taken based on file system events. For more information, see Introduction to watch folder. It has the following features: * Watch folder can be run against folders, filesets, and inode spaces. * Watch folder is modeled after Linux inotify, but works with clustered file systems and supports recursive watches for filesets and inode spaces. * Watch folder has two primary components: o The GPFS programming interfaces, which are included within . For more information, see Watch folder API. o The mmwatch command, which provides information for all of the watches running within a cluster. For more information, see mmwatch command. * A watch folder application uses the API to run on a node within an IBM Spectrum Scale cluster. o It utilizes the message queue to receive events from multiple nodes and consume from the node that is running the application. o Lightweight events come in from all eligible nodes within a cluster and from accessing clusters. 
* Watch folder is integrated into call home, IBM Spectrum Scale snap log collection, and IBM Spectrum Scale trace.

Bob Oesterlin
Sr Principal Storage Engineer, Nuance
507-269-0413

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From Robert.Oesterlin at nuance.com  Mon Sep 17 13:02:49 2018
From: Robert.Oesterlin at nuance.com (Oesterlin, Robert)
Date: Mon, 17 Sep 2018 12:02:49 +0000
Subject: [gpfsug-discuss] SC18 UG meeting date
Message-ID: <21EBAFDD-A1A5-454D-B103-8745EEAF9CF1@nuance.com>

As Kristy stated, it's Sunday November 11th. Approximate times are 12PM-5PM (we're still working on the schedule details).

Bob Oesterlin
Sr Principal Storage Engineer, Nuance
507-269-0413

From: on behalf of Carl
Reply-To: gpfsug main discussion list
Date: Sunday, September 16, 2018 at 11:04 PM
To: gpfsug main discussion list
Subject: [EXTERNAL] [gpfsug-discuss] SC18 UG meeting date

Hi folks,

I'm just starting the process of obtaining the various travel approvals that I need to attend SC this year.

Does anyone have the dates/times for the usergroup meeting?

Thanks,

Carl.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From carlz at us.ibm.com  Mon Sep 17 16:46:56 2018
From: carlz at us.ibm.com (Carl Zetie)
Date: Mon, 17 Sep 2018 15:46:56 +0000
Subject: [gpfsug-discuss] FYI - Spectrum Scale 5.0.2 Available
Message-ID: 

ST> Looks like it's also moved to a unified installer - i.e. no separate download for protocols anymore.

Correct. And to be absolutely clear, you still have the choice whether or not to install protocols. We just simplified it all to a common download image. (We did a bit of surveying beforehand, including the UG, and responses ranged from disinterest to strong enthusiasm.)

regards,

Carl Zetie
Program Director
Offering Management for Spectrum Scale, IBM
----
(540) 882 9353 ][ Research Triangle Park
carlz at us.ibm.com

From xhejtman at ics.muni.cz  Mon Sep 17 16:54:26 2018
From: xhejtman at ics.muni.cz (Lukas Hejtmanek)
Date: Mon, 17 Sep 2018 17:54:26 +0200
Subject: [gpfsug-discuss] mmfsd and oom settings
Message-ID: <20180917155426.fi54lkmduizegpow@ics.muni.cz>

Hello,

I accidentally had mmfsd killed by the OOM killer. The pagepool size is normally fine, but a memory leak in an smbd process caused the system to run out of memory (a 64 GB pagepool plus two 32 GB smbd processes).

Shouldn't the GPFS startup script set oom_score_adj to a sensible value so that mmfsd never gets killed? Losing mmfsd basically ruins everything.

-- 
Lukáš Hejtmánek

Linux Administrator only because
  Full Time Multitasking Ninja
  is not an official job title

From Robert.Oesterlin at nuance.com  Mon Sep 17 17:59:35 2018
From: Robert.Oesterlin at nuance.com (Oesterlin, Robert)
Date: Mon, 17 Sep 2018 16:59:35 +0000
Subject: [gpfsug-discuss] Scale 5.0.2.0 - Now available on Fix Central
Message-ID: <576A5F3D-E867-48C0-B95F-AF5ECB68BE99@nuance.com>

A follow-up to my previous posting - the update images are now available on Fix Central - as pointed out by Simon and Carl, packages now reflect a unified install.

Bob Oesterlin
Sr Principal Storage Engineer, Nuance

-------------- next part --------------
An HTML attachment was scrubbed...
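One way to address what Lukas describes is to pin mmfsd's OOM score from outside of GPFS itself. The sketch below shows two variants for a systemd host; the gpfs.service unit name and the -1000 value are assumptions for illustration rather than an IBM-documented setting.

# Variant 1: systemd drop-in so the daemon always starts with a protected OOM score
# (assumes the GPFS daemon is managed by a unit called gpfs.service on this distro)
mkdir -p /etc/systemd/system/gpfs.service.d
cat > /etc/systemd/system/gpfs.service.d/oom.conf <<'EOF'
[Service]
# -1000 makes the process effectively exempt from the OOM killer
OOMScoreAdjust=-1000
EOF
systemctl daemon-reload

# Variant 2: adjust the running daemon directly; this has to be reapplied after
# every mmfsd restart, e.g. from a startup callback or rc.local
for pid in $(pidof mmfsd); do
    echo -1000 > /proc/$pid/oom_score_adj
done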
URL: From valleru at cbio.mskcc.org Tue Sep 18 18:13:44 2018 From: valleru at cbio.mskcc.org (valleru at cbio.mskcc.org) Date: Tue, 18 Sep 2018 13:13:44 -0400 Subject: [gpfsug-discuss] GPFS, Pagepool and Block size -> Perfomance reduces with larger block size In-Reply-To: References: Message-ID: <6bb509b7-b7c5-422d-8e27-599333b6b7c4@Spark> Hello All, This is a continuation to the previous discussion that i had with Sven. However against what i had mentioned previously - i realize that this is ?not? related to mmap, and i see it when doing random freads. I see that block-size of the filesystem matters when reading from Page pool. I see a major difference in performance when compared 1M to 16M, when doing lot of random small freads with all of the data in pagepool. Performance for 1M is a magnitude ?more? than the performance that i see for 16M. The GPFS that we have currently is : Version :?5.0.1-0.5 Filesystem version:?19.01 (5.0.1.0) Block-size : 16M I had made the filesystem block-size to be 16M, thinking that i would get the most performance for both random/sequential reads from 16M than the smaller block-sizes. With GPFS 5.0, i made use the 1024 sub-blocks instead of 32 and thus not loose lot of storage space even with 16M. I had run few benchmarks and i did see that 16M was performing better ?when hitting storage/disks? with respect to bandwidth for random/sequential on small/large reads. However, with this particular workload - where it freads a chunk of data randomly from hundreds of files -> I see that the number of page-faults increase with block-size and actually reduce the performance. 1M performs a lot better than 16M, and may be i will get better performance with less than 1M. It gives the best performance when reading from local disk, with 4K block size filesystem. What i mean by performance when it comes to this workload - is not the bandwidth but the amount of time that it takes to do each iteration/read batch of data. I figure what is happening is: fread is trying to read a full block size of 16M - which is good in a way, when it hits the hard disk. But the application could be using just a small part of that 16M. Thus when randomly reading(freads) lot of data of 16M chunk size - it is page faulting a lot more and causing the performance to drop . I could try to make the application do read instead of freads, but i fear that could be bad too since it might be hitting the disk with a very small block size and that is not good. With the way i see things now - I believe it could be best if the application does random reads of 4k/1M from pagepool but some how does 16M from rotating disks. I don?t see any way of doing the above other than following a different approach where i create a filesystem with a smaller block size ( 1M or less than 1M ), on SSDs as a tier. May i please ask for advise, if what i am understanding/seeing is right and the best solution possible for the above scenario. Regards, Lohit On Apr 11, 2018, 10:36 AM -0400, Lohit Valleru , wrote: > Hey Sven, > > This is regarding mmap issues and GPFS. > We had discussed previously of experimenting with GPFS 5. > > I now have upgraded all of compute nodes and NSD nodes to GPFS 5.0.0.2 > > I am yet to experiment with mmap performance, but before that - I am seeing weird hangs with GPFS 5 and I think it could be related to mmap. > > Have you seen GPFS ever hang on this syscall? 
> [Tue Apr 10 04:20:13 2018] [] _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26] > > I see the above ,when kernel hangs and throws out a series of trace calls. > > I somehow think the above trace is related to processes hanging on GPFS forever. There are no errors in GPFS however. > > Also, I think the above happens only when the mmap threads go above a particular number. > > We had faced a similar issue in 4.2.3 and it was resolved in a patch to 4.2.3.2 . At that time , the issue happened when mmap threads go more than worker1threads. According to the ticket - it was a mmap race condition that GPFS was not handling well. > > I am not sure if this issue is a repeat and I am yet to isolate the incident and test with increasing number of mmap threads. > > I am not 100 percent sure if this is related to mmap yet but just wanted to ask you if you have seen anything like above. > > Thanks, > > Lohit > > On Feb 22, 2018, 3:59 PM -0500, Sven Oehme , wrote: > > Hi Lohit, > > > > i am working with ray on a mmap performance improvement right now, which most likely has the same root cause as yours , see -->??http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html > > the thread above is silent after a couple of back and rorth, but ray and i have active communication in the background and will repost as soon as there is something new to share. > > i am happy to look at this issue after we finish with ray's workload if there is something missing, but first let's finish his, get you try the same fix and see if there is something missing. > > > > btw. if people would share their use of MMAP , what applications they use (home grown, just use lmdb which uses mmap under the cover, etc) please let me know so i get a better picture on how wide the usage is with GPFS. i know a lot of the ML/DL workloads are using it, but i would like to know what else is out there i might not think about. feel free to drop me a personal note, i might not reply to it right away, but eventually. > > > > thx. sven > > > > > > > On Thu, Feb 22, 2018 at 12:33 PM wrote: > > > > Hi all, > > > > > > > > I wanted to know, how does mmap interact with GPFS pagepool with respect to filesystem block-size? > > > > Does the efficiency depend on the mmap read size and the block-size of the filesystem even if all the data is cached in pagepool? > > > > > > > > GPFS 4.2.3.2 and CentOS7. > > > > > > > > Here is what i observed: > > > > > > > > I was testing a user script that uses mmap to read from 100M to 500MB files. > > > > > > > > The above files are stored on 3 different filesystems. > > > > > > > > Compute nodes - 10G pagepool and 5G seqdiscardthreshold. > > > > > > > > 1. 4M block size GPFS filesystem, with separate metadata and data. Data on Near line and metadata on SSDs > > > > 2. 1M block size GPFS filesystem as a AFM cache cluster, "with all the required files fully cached" from the above GPFS cluster as home. Data and Metadata together on SSDs > > > > 3. 16M block size GPFS filesystem, with separate metadata and data. Data on Near line and metadata on SSDs > > > > > > > > When i run the script first time for ?each" filesystem: > > > > I see that GPFS reads from the files, and caches into the pagepool as it reads, from mmdiag -- iohist > > > > > > > > When i run the second time, i see that there are no IO requests from the compute node to GPFS NSD servers, which is expected since all the data from the 3 filesystems is cached. 
> > > >
> > > > However - the time taken for the script to run for the files in the 3
> > > > different filesystems is different - although i know that they are just
> > > > "mmapping"/reading from pagepool/cache and not from disk.
> > > >
> > > > Here is the difference in time, for IO just from pagepool:
> > > >
> > > > 20s 4M block size
> > > > 15s 1M block size
> > > > 40s 16M block size.
> > > >
> > > > Why do i see a difference when trying to mmap reads from different block-size filesystems, although i see that the IO requests are not hitting disks and just the pagepool?
> > > >
> > > > I am willing to share the strace output and mmdiag outputs if needed.
> > > >
> > > > Thanks,
> > > > Lohit
> > > >
> > > > _______________________________________________
> > > > gpfsug-discuss mailing list
> > > > gpfsug-discuss at spectrumscale.org
> > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> > _______________________________________________
> > gpfsug-discuss mailing list
> > gpfsug-discuss at spectrumscale.org
> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From oehmes at gmail.com  Tue Sep 18 22:23:09 2018
From: oehmes at gmail.com (Sven Oehme)
Date: Tue, 18 Sep 2018 14:23:09 -0700
Subject: [gpfsug-discuss] GPFS, Pagepool and Block size -> Perfomance reduces with larger block size
In-Reply-To: <6bb509b7-b7c5-422d-8e27-599333b6b7c4@Spark>
References: <6bb509b7-b7c5-422d-8e27-599333b6b7c4@Spark>
Message-ID: 

Hi,

Taking a trace would tell for sure, but I suspect you might be hitting one or even multiple issues that have similar negative performance impacts but different root causes.

1. This could be serialization around buffer locks. The larger your blocksize gets, the larger the amount of data each pagepool buffer maintains. If there is a lot of concurrency on a smaller amount of data, more threads potentially compete for the same buffer lock to copy data in and out of a particular buffer, so things go slower compared to the same amount of data spread across more buffers, each of smaller size.

2. Your data set is small-ish, say a couple of times bigger than the pagepool, and you access it randomly with multiple threads. Because it doesn't fit into the cache it will be read from the backend. If multiple threads hit the same 16 MB block at once with multiple 4k random reads, the whole 16 MB block is read in because GPFS expects to benefit from it later out of cache, but because the access is fully random the same happens with the next block and the next and so on, and before you get back to this block it has been pushed out of the cache for lack of enough pagepool.

I could think of multiple other scenarios, which is why it's so hard to accurately benchmark an application: you design a benchmark to test an application, but the application almost always behaves differently than you think it does :-)

So the best approach is to run the real application and see under which configuration it works best.

You could also take a trace with trace=io and then look at

TRACE_VNOP: READ:
TRACE_VNOP: WRITE:

and compare them to

TRACE_IO: QIO: read
TRACE_IO: QIO: write

and see if the numbers summed up for both are somewhat equal.
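As a rough way to do that comparison, the following could be run against a formatted trace report; the report file name is a placeholder, and simply counting matching records (rather than summing their sizes) is my simplification.

# Sketch: compare application-level reads/writes (VNOP) with back-end queued
# I/Os (QIO) in a formatted report from a trace=io capture.
RPT=trcrpt.out   # placeholder name for the formatted trace report

vnop_reads=$(grep -c 'TRACE_VNOP: READ:'   "$RPT")
qio_reads=$(grep -c 'TRACE_IO: QIO: read'  "$RPT")
vnop_writes=$(grep -c 'TRACE_VNOP: WRITE:' "$RPT")
qio_writes=$(grep -c 'TRACE_IO: QIO: write' "$RPT")

echo "reads:  VNOP=$vnop_reads  QIO=$qio_reads"
echo "writes: VNOP=$vnop_writes  QIO=$qio_writes"
# QIO far above VNOP suggests the node does much more back-end I/O than the
# application asked for, i.e. prefetch/read amplification.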
if TRACE_VNOP is significant smaller than TRACE_IO you most likely do more i/o than you should and turning prefetching off might actually make things faster . keep in mind i am no longer working for IBM so all i say might be obsolete by now, i no longer have access to the one and only truth aka the source code ... but if i am wrong i am sure somebody will point this out soon ;-) sven On Tue, Sep 18, 2018 at 10:31 AM wrote: > Hello All, > > This is a continuation to the previous discussion that i had with Sven. > However against what i had mentioned previously - i realize that this is > ?not? related to mmap, and i see it when doing random freads. > > I see that block-size of the filesystem matters when reading from Page > pool. > I see a major difference in performance when compared 1M to 16M, when > doing lot of random small freads with all of the data in pagepool. > > Performance for 1M is a magnitude ?more? than the performance that i see > for 16M. > > The GPFS that we have currently is : > Version : 5.0.1-0.5 > Filesystem version: 19.01 (5.0.1.0) > Block-size : 16M > > I had made the filesystem block-size to be 16M, thinking that i would get > the most performance for both random/sequential reads from 16M than the > smaller block-sizes. > With GPFS 5.0, i made use the 1024 sub-blocks instead of 32 and thus not > loose lot of storage space even with 16M. > I had run few benchmarks and i did see that 16M was performing better > ?when hitting storage/disks? with respect to bandwidth for > random/sequential on small/large reads. > > However, with this particular workload - where it freads a chunk of data > randomly from hundreds of files -> I see that the number of page-faults > increase with block-size and actually reduce the performance. > 1M performs a lot better than 16M, and may be i will get better > performance with less than 1M. > It gives the best performance when reading from local disk, with 4K block > size filesystem. > > What i mean by performance when it comes to this workload - is not the > bandwidth but the amount of time that it takes to do each iteration/read > batch of data. > > I figure what is happening is: > fread is trying to read a full block size of 16M - which is good in a way, > when it hits the hard disk. > But the application could be using just a small part of that 16M. Thus > when randomly reading(freads) lot of data of 16M chunk size - it is page > faulting a lot more and causing the performance to drop . > I could try to make the application do read instead of freads, but i fear > that could be bad too since it might be hitting the disk with a very small > block size and that is not good. > > With the way i see things now - > I believe it could be best if the application does random reads of 4k/1M > from pagepool but some how does 16M from rotating disks. > > I don?t see any way of doing the above other than following a different > approach where i create a filesystem with a smaller block size ( 1M or less > than 1M ), on SSDs as a tier. > > May i please ask for advise, if what i am understanding/seeing is right > and the best solution possible for the above scenario. > > Regards, > Lohit > > On Apr 11, 2018, 10:36 AM -0400, Lohit Valleru , > wrote: > > Hey Sven, > > This is regarding mmap issues and GPFS. > We had discussed previously of experimenting with GPFS 5. 
> > I now have upgraded all of compute nodes and NSD nodes to GPFS 5.0.0.2 > > I am yet to experiment with mmap performance, but before that - I am > seeing weird hangs with GPFS 5 and I think it could be related to mmap. > > Have you seen GPFS ever hang on this syscall? > [Tue Apr 10 04:20:13 2018] [] > _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26] > > I see the above ,when kernel hangs and throws out a series of trace calls. > > I somehow think the above trace is related to processes hanging on GPFS > forever. There are no errors in GPFS however. > > Also, I think the above happens only when the mmap threads go above a > particular number. > > We had faced a similar issue in 4.2.3 and it was resolved in a patch to > 4.2.3.2 . At that time , the issue happened when mmap threads go more than > worker1threads. According to the ticket - it was a mmap race condition that > GPFS was not handling well. > > I am not sure if this issue is a repeat and I am yet to isolate the > incident and test with increasing number of mmap threads. > > I am not 100 percent sure if this is related to mmap yet but just wanted > to ask you if you have seen anything like above. > > Thanks, > > Lohit > > On Feb 22, 2018, 3:59 PM -0500, Sven Oehme , wrote: > > Hi Lohit, > > i am working with ray on a mmap performance improvement right now, which > most likely has the same root cause as yours , see --> > http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html > the thread above is silent after a couple of back and rorth, but ray and i > have active communication in the background and will repost as soon as > there is something new to share. > i am happy to look at this issue after we finish with ray's workload if > there is something missing, but first let's finish his, get you try the > same fix and see if there is something missing. > > btw. if people would share their use of MMAP , what applications they use > (home grown, just use lmdb which uses mmap under the cover, etc) please let > me know so i get a better picture on how wide the usage is with GPFS. i > know a lot of the ML/DL workloads are using it, but i would like to know > what else is out there i might not think about. feel free to drop me a > personal note, i might not reply to it right away, but eventually. > > thx. sven > > > On Thu, Feb 22, 2018 at 12:33 PM wrote: > >> Hi all, >> >> I wanted to know, how does mmap interact with GPFS pagepool with respect >> to filesystem block-size? >> Does the efficiency depend on the mmap read size and the block-size of >> the filesystem even if all the data is cached in pagepool? >> >> GPFS 4.2.3.2 and CentOS7. >> >> Here is what i observed: >> >> I was testing a user script that uses mmap to read from 100M to 500MB >> files. >> >> The above files are stored on 3 different filesystems. >> >> Compute nodes - 10G pagepool and 5G seqdiscardthreshold. >> >> 1. 4M block size GPFS filesystem, with separate metadata and data. Data >> on Near line and metadata on SSDs >> 2. 1M block size GPFS filesystem as a AFM cache cluster, "with all the >> required files fully cached" from the above GPFS cluster as home. Data and >> Metadata together on SSDs >> 3. 16M block size GPFS filesystem, with separate metadata and data. 
Data >> on Near line and metadata on SSDs >> >> When i run the script first time for ?each" filesystem: >> I see that GPFS reads from the files, and caches into the pagepool as it >> reads, from mmdiag -- iohist >> >> When i run the second time, i see that there are no IO requests from the >> compute node to GPFS NSD servers, which is expected since all the data from >> the 3 filesystems is cached. >> >> However - the time taken for the script to run for the files in the 3 >> different filesystems is different - although i know that they are just >> "mmapping"/reading from pagepool/cache and not from disk. >> >> Here is the difference in time, for IO just from pagepool: >> >> 20s 4M block size >> 15s 1M block size >> 40S 16M block size. >> >> Why do i see a difference when trying to mmap reads from different >> block-size filesystems, although i see that the IO requests are not hitting >> disks and just the pagepool? >> >> I am willing to share the strace output and mmdiag outputs if needed. >> >> Thanks, >> Lohit >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From valleru at cbio.mskcc.org Wed Sep 19 19:05:24 2018 From: valleru at cbio.mskcc.org (valleru at cbio.mskcc.org) Date: Wed, 19 Sep 2018 14:05:24 -0400 Subject: [gpfsug-discuss] GPFS, Pagepool and Block size -> Perfomance reduces with larger block size In-Reply-To: References: <6bb509b7-b7c5-422d-8e27-599333b6b7c4@Spark> Message-ID: Thank you Sven. I mostly think it could be 1. or some other issue. I don?t think it could be 2. , because i can replicate this issue no matter what is the size of the dataset. It happens for few files that could easily fit in the page pool too. I do see a lot more page faults for 16M compared to 1M, so it could be related to many threads trying to compete for the same buffer space. I will try to take the trace with trace=io option and see if can find something. How do i turn of prefetching? Can i turn it off for a single node/client? Regards, Lohit On Sep 18, 2018, 5:23 PM -0400, Sven Oehme , wrote: > Hi, > > taking a trace would tell for sure, but i suspect what you might be hitting one or even multiple issues which have similar negative performance impacts but different root causes. > > 1. this could be serialization around buffer locks. as larger your blocksize gets as larger is the amount of data one of this pagepool buffers will maintain, if there is a lot of concurrency on smaller amount of data more threads potentially compete for the same buffer lock to copy stuff in and out of a particular buffer, hence things go slower compared to the same amount of data spread across more buffers, each of smaller size. > > 2. your data set is small'ish, lets say a couple of time bigger than the pagepool and you random access it with multiple threads. 
what will happen is that because it doesn't fit into the cache it will be read from the backend. if multiple threads hit the same 16 mb block at once with multiple 4k random reads, it will read the whole 16mb block because it thinks it will benefit from it later on out of cache, but because it fully random the same happens with the next block and the next and so on and before you get back to this block it was pushed out of the cache because of lack of enough pagepool. > > i could think?of multiple other scenarios , which is why its so hard to accurately benchmark an application because you will design a benchmark to test an application, but it actually almost always behaves different then you think it does :-) > > so best is to run the real application and see under which configuration it works best. > > you could also take a trace with trace=io and then look at > > TRACE_VNOP: READ: > TRACE_VNOP: WRITE: > > and compare them to > > TRACE_IO: QIO: read > TRACE_IO: QIO: write > > and see if the numbers summed up for both are somewhat equal. if TRACE_VNOP is significant smaller than TRACE_IO you most likely do more i/o than you should and turning prefetching off might actually make things faster . > > keep in mind i am no longer working for IBM so all i say might be obsolete by now, i no longer have access to the one and only truth aka the source code ... but if i am wrong i am sure somebody will point this out soon ;-) > > sven > > > > > > On Tue, Sep 18, 2018 at 10:31 AM wrote: > > > Hello All, > > > > > > This is a continuation to the previous discussion that i had with Sven. > > > However against what i had mentioned previously - i realize that this is ?not? related to mmap, and i see it when doing random freads. > > > > > > I see that block-size of the filesystem matters when reading from Page pool. > > > I see a major difference in performance when compared 1M to 16M, when doing lot of random small freads with all of the data in pagepool. > > > > > > Performance for 1M is a magnitude ?more? than the performance that i see for 16M. > > > > > > The GPFS that we have currently is : > > > Version :?5.0.1-0.5 > > > Filesystem version:?19.01 (5.0.1.0) > > > Block-size : 16M > > > > > > I had made the filesystem block-size to be 16M, thinking that i would get the most performance for both random/sequential reads from 16M than the smaller block-sizes. > > > With GPFS 5.0, i made use the 1024 sub-blocks instead of 32 and thus not loose lot of storage space even with 16M. > > > I had run few benchmarks and i did see that 16M was performing better ?when hitting storage/disks? with respect to bandwidth for random/sequential on small/large reads. > > > > > > However, with this particular workload - where it freads a chunk of data randomly from hundreds of files -> I see that the number of page-faults increase with block-size and actually reduce the performance. > > > 1M performs a lot better than 16M, and may be i will get better performance with less than 1M. > > > It gives the best performance when reading from local disk, with 4K block size filesystem. > > > > > > What i mean by performance when it comes to this workload - is not the bandwidth but the amount of time that it takes to do each iteration/read batch of data. > > > > > > I figure what is happening is: > > > fread is trying to read a full block size of 16M - which is good in a way, when it hits the hard disk. > > > But the application could be using just a small part of that 16M. 
Thus when randomly reading(freads) lot of data of 16M chunk size - it is page faulting a lot more and causing the performance to drop . > > > I could try to make the application do read instead of freads, but i fear that could be bad too since it might be hitting the disk with a very small block size and that is not good. > > > > > > With the way i see things now - > > > I believe it could be best if the application does random reads of 4k/1M from pagepool but some how does 16M from rotating disks. > > > > > > I don?t see any way of doing the above other than following a different approach where i create a filesystem with a smaller block size ( 1M or less than 1M ), on SSDs as a tier. > > > > > > May i please ask for advise, if what i am understanding/seeing is right and the best solution possible for the above scenario. > > > > > > Regards, > > > Lohit > > > > > > On Apr 11, 2018, 10:36 AM -0400, Lohit Valleru , wrote: > > > > Hey Sven, > > > > > > > > This is regarding mmap issues and GPFS. > > > > We had discussed previously of experimenting with GPFS 5. > > > > > > > > I now have upgraded all of compute nodes and NSD nodes to GPFS 5.0.0.2 > > > > > > > > I am yet to experiment with mmap performance, but before that - I am seeing weird hangs with GPFS 5 and I think it could be related to mmap. > > > > > > > > Have you seen GPFS ever hang on this syscall? > > > > [Tue Apr 10 04:20:13 2018] [] _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26] > > > > > > > > I see the above ,when kernel hangs and throws out a series of trace calls. > > > > > > > > I somehow think the above trace is related to processes hanging on GPFS forever. There are no errors in GPFS however. > > > > > > > > Also, I think the above happens only when the mmap threads go above a particular number. > > > > > > > > We had faced a similar issue in 4.2.3 and it was resolved in a patch to 4.2.3.2 . At that time , the issue happened when mmap threads go more than worker1threads. According to the ticket - it was a mmap race condition that GPFS was not handling well. > > > > > > > > I am not sure if this issue is a repeat and I am yet to isolate the incident and test with increasing number of mmap threads. > > > > > > > > I am not 100 percent sure if this is related to mmap yet but just wanted to ask you if you have seen anything like above. > > > > > > > > Thanks, > > > > > > > > Lohit > > > > > > > > On Feb 22, 2018, 3:59 PM -0500, Sven Oehme , wrote: > > > > > Hi Lohit, > > > > > > > > > > i am working with ray on a mmap performance improvement right now, which most likely has the same root cause as yours , see -->??http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html > > > > > the thread above is silent after a couple of back and rorth, but ray and i have active communication in the background and will repost as soon as there is something new to share. > > > > > i am happy to look at this issue after we finish with ray's workload if there is something missing, but first let's finish his, get you try the same fix and see if there is something missing. > > > > > > > > > > btw. if people would share their use of MMAP , what applications they use (home grown, just use lmdb which uses mmap under the cover, etc) please let me know so i get a better picture on how wide the usage is with GPFS. i know a lot of the ML/DL workloads are using it, but i would like to know what else is out there i might not think about. feel free to drop me a personal note, i might not reply to it right away, but eventually. 
> > > > > > > > > > thx. sven > > > > > > > > > > > > > > > > On Thu, Feb 22, 2018 at 12:33 PM wrote: > > > > > > > Hi all, > > > > > > > > > > > > > > I wanted to know, how does mmap interact with GPFS pagepool with respect to filesystem block-size? > > > > > > > Does the efficiency depend on the mmap read size and the block-size of the filesystem even if all the data is cached in pagepool? > > > > > > > > > > > > > > GPFS 4.2.3.2 and CentOS7. > > > > > > > > > > > > > > Here is what i observed: > > > > > > > > > > > > > > I was testing a user script that uses mmap to read from 100M to 500MB files. > > > > > > > > > > > > > > The above files are stored on 3 different filesystems. > > > > > > > > > > > > > > Compute nodes - 10G pagepool and 5G seqdiscardthreshold. > > > > > > > > > > > > > > 1. 4M block size GPFS filesystem, with separate metadata and data. Data on Near line and metadata on SSDs > > > > > > > 2. 1M block size GPFS filesystem as a AFM cache cluster, "with all the required files fully cached" from the above GPFS cluster as home. Data and Metadata together on SSDs > > > > > > > 3. 16M block size GPFS filesystem, with separate metadata and data. Data on Near line and metadata on SSDs > > > > > > > > > > > > > > When i run the script first time for ?each" filesystem: > > > > > > > I see that GPFS reads from the files, and caches into the pagepool as it reads, from mmdiag -- iohist > > > > > > > > > > > > > > When i run the second time, i see that there are no IO requests from the compute node to GPFS NSD servers, which is expected since all the data from the 3 filesystems is cached. > > > > > > > > > > > > > > However - the time taken for the script to run for the files in the 3 different filesystems is different - although i know that they are just "mmapping"/reading from pagepool/cache and not from disk. > > > > > > > > > > > > > > Here is the difference in time, for IO just from pagepool: > > > > > > > > > > > > > > 20s 4M block size > > > > > > > 15s 1M block size > > > > > > > 40S 16M block size. > > > > > > > > > > > > > > Why do i see a difference when trying to mmap reads from different block-size filesystems, although i see that the IO requests are not hitting disks and just the pagepool? > > > > > > > > > > > > > > I am willing to share the strace output and mmdiag outputs if needed. > > > > > > > > > > > > > > Thanks, > > > > > > > Lohit > > > > > > > > > > > > > > _______________________________________________ > > > > > > > gpfsug-discuss mailing list > > > > > > > gpfsug-discuss at spectrumscale.org > > > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > > > > > gpfsug-discuss mailing list > > > > > gpfsug-discuss at spectrumscale.org > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > > > > gpfsug-discuss mailing list > > > > gpfsug-discuss at spectrumscale.org > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From oehmes at gmail.com Wed Sep 19 19:11:42 2018 From: oehmes at gmail.com (Sven Oehme) Date: Wed, 19 Sep 2018 11:11:42 -0700 Subject: [gpfsug-discuss] GPFS, Pagepool and Block size -> Perfomance reduces with larger block size In-Reply-To: References: <6bb509b7-b7c5-422d-8e27-599333b6b7c4@Spark> Message-ID: seem like you never read my performance presentation from a few years ago ;-) you can control this on a per node basis , either for all i/o : prefetchAggressiveness = X or individual for reads or writes : prefetchAggressivenessRead = X prefetchAggressivenessWrite = X for a start i would turn it off completely via : mmchconfig prefetchAggressiveness=0 -I -N nodename that will turn it off only for that node and only until you restart the node. then see what happens sven On Wed, Sep 19, 2018 at 11:07 AM wrote: > Thank you Sven. > > I mostly think it could be 1. or some other issue. > I don?t think it could be 2. , because i can replicate this issue no > matter what is the size of the dataset. It happens for few files that could > easily fit in the page pool too. > > I do see a lot more page faults for 16M compared to 1M, so it could be > related to many threads trying to compete for the same buffer space. > > I will try to take the trace with trace=io option and see if can find > something. > > How do i turn of prefetching? Can i turn it off for a single node/client? > > Regards, > Lohit > > On Sep 18, 2018, 5:23 PM -0400, Sven Oehme , wrote: > > Hi, > > taking a trace would tell for sure, but i suspect what you might be > hitting one or even multiple issues which have similar negative performance > impacts but different root causes. > > 1. this could be serialization around buffer locks. as larger your > blocksize gets as larger is the amount of data one of this pagepool buffers > will maintain, if there is a lot of concurrency on smaller amount of data > more threads potentially compete for the same buffer lock to copy stuff in > and out of a particular buffer, hence things go slower compared to the same > amount of data spread across more buffers, each of smaller size. > > 2. your data set is small'ish, lets say a couple of time bigger than the > pagepool and you random access it with multiple threads. what will happen > is that because it doesn't fit into the cache it will be read from the > backend. if multiple threads hit the same 16 mb block at once with multiple > 4k random reads, it will read the whole 16mb block because it thinks it > will benefit from it later on out of cache, but because it fully random the > same happens with the next block and the next and so on and before you get > back to this block it was pushed out of the cache because of lack of enough > pagepool. > > i could think of multiple other scenarios , which is why its so hard to > accurately benchmark an application because you will design a benchmark to > test an application, but it actually almost always behaves different then > you think it does :-) > > so best is to run the real application and see under which configuration > it works best. > > you could also take a trace with trace=io and then look at > > TRACE_VNOP: READ: > TRACE_VNOP: WRITE: > > and compare them to > > TRACE_IO: QIO: read > TRACE_IO: QIO: write > > and see if the numbers summed up for both are somewhat equal. if > TRACE_VNOP is significant smaller than TRACE_IO you most likely do more i/o > than you should and turning prefetching off might actually make things > faster . 
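To confirm that the per-node override shown at the top of this reply took effect, and to undo it after testing, something like the following might work; whether mmdiag --config lists this particular tunable, and the DEFAULT reset keyword, are assumptions on my part.

# Check the value currently in effect on the client after the mmchconfig -I change
/usr/lpp/mmfs/bin/mmdiag --config | grep -i prefetchAggressiveness

# Put the node back on the cluster-wide default once the experiment is done
# (-I again keeps the change in memory only, so a daemon restart also clears it)
/usr/lpp/mmfs/bin/mmchconfig prefetchAggressiveness=DEFAULT -I -N nodename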
> > keep in mind i am no longer working for IBM so all i say might be obsolete > by now, i no longer have access to the one and only truth aka the source > code ... but if i am wrong i am sure somebody will point this out soon ;-) > > sven > > > > > On Tue, Sep 18, 2018 at 10:31 AM wrote: > >> Hello All, >> >> This is a continuation to the previous discussion that i had with Sven. >> However against what i had mentioned previously - i realize that this is >> ?not? related to mmap, and i see it when doing random freads. >> >> I see that block-size of the filesystem matters when reading from Page >> pool. >> I see a major difference in performance when compared 1M to 16M, when >> doing lot of random small freads with all of the data in pagepool. >> >> Performance for 1M is a magnitude ?more? than the performance that i see >> for 16M. >> >> The GPFS that we have currently is : >> Version : 5.0.1-0.5 >> Filesystem version: 19.01 (5.0.1.0) >> Block-size : 16M >> >> I had made the filesystem block-size to be 16M, thinking that i would get >> the most performance for both random/sequential reads from 16M than the >> smaller block-sizes. >> With GPFS 5.0, i made use the 1024 sub-blocks instead of 32 and thus not >> loose lot of storage space even with 16M. >> I had run few benchmarks and i did see that 16M was performing better >> ?when hitting storage/disks? with respect to bandwidth for >> random/sequential on small/large reads. >> >> However, with this particular workload - where it freads a chunk of data >> randomly from hundreds of files -> I see that the number of page-faults >> increase with block-size and actually reduce the performance. >> 1M performs a lot better than 16M, and may be i will get better >> performance with less than 1M. >> It gives the best performance when reading from local disk, with 4K block >> size filesystem. >> >> What i mean by performance when it comes to this workload - is not the >> bandwidth but the amount of time that it takes to do each iteration/read >> batch of data. >> >> I figure what is happening is: >> fread is trying to read a full block size of 16M - which is good in a >> way, when it hits the hard disk. >> But the application could be using just a small part of that 16M. Thus >> when randomly reading(freads) lot of data of 16M chunk size - it is page >> faulting a lot more and causing the performance to drop . >> I could try to make the application do read instead of freads, but i fear >> that could be bad too since it might be hitting the disk with a very small >> block size and that is not good. >> >> With the way i see things now - >> I believe it could be best if the application does random reads of 4k/1M >> from pagepool but some how does 16M from rotating disks. >> >> I don?t see any way of doing the above other than following a different >> approach where i create a filesystem with a smaller block size ( 1M or less >> than 1M ), on SSDs as a tier. >> >> May i please ask for advise, if what i am understanding/seeing is right >> and the best solution possible for the above scenario. >> >> Regards, >> Lohit >> >> On Apr 11, 2018, 10:36 AM -0400, Lohit Valleru , >> wrote: >> >> Hey Sven, >> >> This is regarding mmap issues and GPFS. >> We had discussed previously of experimenting with GPFS 5. >> >> I now have upgraded all of compute nodes and NSD nodes to GPFS 5.0.0.2 >> >> I am yet to experiment with mmap performance, but before that - I am >> seeing weird hangs with GPFS 5 and I think it could be related to mmap. 
>> >> Have you seen GPFS ever hang on this syscall? >> [Tue Apr 10 04:20:13 2018] [] >> _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26] >> >> I see the above ,when kernel hangs and throws out a series of trace calls. >> >> I somehow think the above trace is related to processes hanging on GPFS >> forever. There are no errors in GPFS however. >> >> Also, I think the above happens only when the mmap threads go above a >> particular number. >> >> We had faced a similar issue in 4.2.3 and it was resolved in a patch to >> 4.2.3.2 . At that time , the issue happened when mmap threads go more than >> worker1threads. According to the ticket - it was a mmap race condition that >> GPFS was not handling well. >> >> I am not sure if this issue is a repeat and I am yet to isolate the >> incident and test with increasing number of mmap threads. >> >> I am not 100 percent sure if this is related to mmap yet but just wanted >> to ask you if you have seen anything like above. >> >> Thanks, >> >> Lohit >> >> On Feb 22, 2018, 3:59 PM -0500, Sven Oehme , wrote: >> >> Hi Lohit, >> >> i am working with ray on a mmap performance improvement right now, which >> most likely has the same root cause as yours , see --> >> http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html >> the thread above is silent after a couple of back and rorth, but ray and >> i have active communication in the background and will repost as soon as >> there is something new to share. >> i am happy to look at this issue after we finish with ray's workload if >> there is something missing, but first let's finish his, get you try the >> same fix and see if there is something missing. >> >> btw. if people would share their use of MMAP , what applications they use >> (home grown, just use lmdb which uses mmap under the cover, etc) please let >> me know so i get a better picture on how wide the usage is with GPFS. i >> know a lot of the ML/DL workloads are using it, but i would like to know >> what else is out there i might not think about. feel free to drop me a >> personal note, i might not reply to it right away, but eventually. >> >> thx. sven >> >> >> On Thu, Feb 22, 2018 at 12:33 PM wrote: >> >>> Hi all, >>> >>> I wanted to know, how does mmap interact with GPFS pagepool with respect >>> to filesystem block-size? >>> Does the efficiency depend on the mmap read size and the block-size of >>> the filesystem even if all the data is cached in pagepool? >>> >>> GPFS 4.2.3.2 and CentOS7. >>> >>> Here is what i observed: >>> >>> I was testing a user script that uses mmap to read from 100M to 500MB >>> files. >>> >>> The above files are stored on 3 different filesystems. >>> >>> Compute nodes - 10G pagepool and 5G seqdiscardthreshold. >>> >>> 1. 4M block size GPFS filesystem, with separate metadata and data. Data >>> on Near line and metadata on SSDs >>> 2. 1M block size GPFS filesystem as a AFM cache cluster, "with all the >>> required files fully cached" from the above GPFS cluster as home. Data and >>> Metadata together on SSDs >>> 3. 16M block size GPFS filesystem, with separate metadata and data. Data >>> on Near line and metadata on SSDs >>> >>> When i run the script first time for ?each" filesystem: >>> I see that GPFS reads from the files, and caches into the pagepool as it >>> reads, from mmdiag -- iohist >>> >>> When i run the second time, i see that there are no IO requests from the >>> compute node to GPFS NSD servers, which is expected since all the data from >>> the 3 filesystems is cached. 
>>> >>> However - the time taken for the script to run for the files in the 3 >>> different filesystems is different - although i know that they are just >>> "mmapping"/reading from pagepool/cache and not from disk. >>> >>> Here is the difference in time, for IO just from pagepool: >>> >>> 20s 4M block size >>> 15s 1M block size >>> 40S 16M block size. >>> >>> Why do i see a difference when trying to mmap reads from different >>> block-size filesystems, although i see that the IO requests are not hitting >>> disks and just the pagepool? >>> >>> I am willing to share the strace output and mmdiag outputs if needed. >>> >>> Thanks, >>> Lohit >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From valleru at cbio.mskcc.org Wed Sep 19 19:15:17 2018 From: valleru at cbio.mskcc.org (valleru at cbio.mskcc.org) Date: Wed, 19 Sep 2018 14:15:17 -0400 Subject: [gpfsug-discuss] GPFS, Pagepool and Block size -> Perfomance reduces with larger block size In-Reply-To: References: <6bb509b7-b7c5-422d-8e27-599333b6b7c4@Spark> Message-ID: <013aeb31-ebd2-4cc7-97d1-06883d9569f7@Spark> Thanks Sven. I will disable it completely and see how it behaves. Is this the presentation? http://files.gpfsug.org/presentations/2014/UG10_GPFS_Performance_Session_v10.pdf I guess i read it, but it did not strike me at this situation. I will try to read it again and see if i could make use of it. Regards, Lohit On Sep 19, 2018, 2:12 PM -0400, Sven Oehme , wrote: > seem like you never read my performance presentation from a few years ago ;-) > > you can control this on a per node basis , either for all i/o : > > ? ?prefetchAggressiveness = X > > or individual for reads or writes : > > ? ?prefetchAggressivenessRead = X > ? ?prefetchAggressivenessWrite = X > > for a start i would turn it off completely via : > > mmchconfig prefetchAggressiveness=0 -I -N nodename > > that will turn it off only for that node and only until you restart the node. > then see what happens > > sven > > > > On Wed, Sep 19, 2018 at 11:07 AM wrote: > > > Thank you Sven. > > > > > > I mostly think it could be 1. or some other issue. > > > I don?t think it could be 2. , because i can replicate this issue no matter what is the size of the dataset. It happens for few files that could easily fit in the page pool too. > > > > > > I do see a lot more page faults for 16M compared to 1M, so it could be related to many threads trying to compete for the same buffer space. 
> > > > > > I will try to take the trace with trace=io option and see if can find something. > > > > > > How do i turn of prefetching? Can i turn it off for a single node/client? > > > > > > Regards, > > > Lohit > > > > > > On Sep 18, 2018, 5:23 PM -0400, Sven Oehme , wrote: > > > > Hi, > > > > > > > > taking a trace would tell for sure, but i suspect what you might be hitting one or even multiple issues which have similar negative performance impacts but different root causes. > > > > > > > > 1. this could be serialization around buffer locks. as larger your blocksize gets as larger is the amount of data one of this pagepool buffers will maintain, if there is a lot of concurrency on smaller amount of data more threads potentially compete for the same buffer lock to copy stuff in and out of a particular buffer, hence things go slower compared to the same amount of data spread across more buffers, each of smaller size. > > > > > > > > 2. your data set is small'ish, lets say a couple of time bigger than the pagepool and you random access it with multiple threads. what will happen is that because it doesn't fit into the cache it will be read from the backend. if multiple threads hit the same 16 mb block at once with multiple 4k random reads, it will read the whole 16mb block because it thinks it will benefit from it later on out of cache, but because it fully random the same happens with the next block and the next and so on and before you get back to this block it was pushed out of the cache because of lack of enough pagepool. > > > > > > > > i could think?of multiple other scenarios , which is why its so hard to accurately benchmark an application because you will design a benchmark to test an application, but it actually almost always behaves different then you think it does :-) > > > > > > > > so best is to run the real application and see under which configuration it works best. > > > > > > > > you could also take a trace with trace=io and then look at > > > > > > > > TRACE_VNOP: READ: > > > > TRACE_VNOP: WRITE: > > > > > > > > and compare them to > > > > > > > > TRACE_IO: QIO: read > > > > TRACE_IO: QIO: write > > > > > > > > and see if the numbers summed up for both are somewhat equal. if TRACE_VNOP is significant smaller than TRACE_IO you most likely do more i/o than you should and turning prefetching off might actually make things faster . > > > > > > > > keep in mind i am no longer working for IBM so all i say might be obsolete by now, i no longer have access to the one and only truth aka the source code ... but if i am wrong i am sure somebody will point this out soon ;-) > > > > > > > > sven > > > > > > > > > > > > > > > > > > > > > On Tue, Sep 18, 2018 at 10:31 AM wrote: > > > > > > Hello All, > > > > > > > > > > > > This is a continuation to the previous discussion that i had with Sven. > > > > > > However against what i had mentioned previously - i realize that this is ?not? related to mmap, and i see it when doing random freads. > > > > > > > > > > > > I see that block-size of the filesystem matters when reading from Page pool. > > > > > > I see a major difference in performance when compared 1M to 16M, when doing lot of random small freads with all of the data in pagepool. > > > > > > > > > > > > Performance for 1M is a magnitude ?more? than the performance that i see for 16M. 
> > > > > > > > > > > > The GPFS that we have currently is : > > > > > > Version :?5.0.1-0.5 > > > > > > Filesystem version:?19.01 (5.0.1.0) > > > > > > Block-size : 16M > > > > > > > > > > > > I had made the filesystem block-size to be 16M, thinking that i would get the most performance for both random/sequential reads from 16M than the smaller block-sizes. > > > > > > With GPFS 5.0, i made use the 1024 sub-blocks instead of 32 and thus not loose lot of storage space even with 16M. > > > > > > I had run few benchmarks and i did see that 16M was performing better ?when hitting storage/disks? with respect to bandwidth for random/sequential on small/large reads. > > > > > > > > > > > > However, with this particular workload - where it freads a chunk of data randomly from hundreds of files -> I see that the number of page-faults increase with block-size and actually reduce the performance. > > > > > > 1M performs a lot better than 16M, and may be i will get better performance with less than 1M. > > > > > > It gives the best performance when reading from local disk, with 4K block size filesystem. > > > > > > > > > > > > What i mean by performance when it comes to this workload - is not the bandwidth but the amount of time that it takes to do each iteration/read batch of data. > > > > > > > > > > > > I figure what is happening is: > > > > > > fread is trying to read a full block size of 16M - which is good in a way, when it hits the hard disk. > > > > > > But the application could be using just a small part of that 16M. Thus when randomly reading(freads) lot of data of 16M chunk size - it is page faulting a lot more and causing the performance to drop . > > > > > > I could try to make the application do read instead of freads, but i fear that could be bad too since it might be hitting the disk with a very small block size and that is not good. > > > > > > > > > > > > With the way i see things now - > > > > > > I believe it could be best if the application does random reads of 4k/1M from pagepool but some how does 16M from rotating disks. > > > > > > > > > > > > I don?t see any way of doing the above other than following a different approach where i create a filesystem with a smaller block size ( 1M or less than 1M ), on SSDs as a tier. > > > > > > > > > > > > May i please ask for advise, if what i am understanding/seeing is right and the best solution possible for the above scenario. > > > > > > > > > > > > Regards, > > > > > > Lohit > > > > > > > > > > > > On Apr 11, 2018, 10:36 AM -0400, Lohit Valleru , wrote: > > > > > > > Hey Sven, > > > > > > > > > > > > > > This is regarding mmap issues and GPFS. > > > > > > > We had discussed previously of experimenting with GPFS 5. > > > > > > > > > > > > > > I now have upgraded all of compute nodes and NSD nodes to GPFS 5.0.0.2 > > > > > > > > > > > > > > I am yet to experiment with mmap performance, but before that - I am seeing weird hangs with GPFS 5 and I think it could be related to mmap. > > > > > > > > > > > > > > Have you seen GPFS ever hang on this syscall? > > > > > > > [Tue Apr 10 04:20:13 2018] [] _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26] > > > > > > > > > > > > > > I see the above ,when kernel hangs and throws out a series of trace calls. > > > > > > > > > > > > > > I somehow think the above trace is related to processes hanging on GPFS forever. There are no errors in GPFS however. > > > > > > > > > > > > > > Also, I think the above happens only when the mmap threads go above a particular number. 
> > > > > > > > > > > > > > We had faced a similar issue in 4.2.3 and it was resolved in a patch to 4.2.3.2 . At that time , the issue happened when mmap threads go more than worker1threads. According to the ticket - it was a mmap race condition that GPFS was not handling well. > > > > > > > > > > > > > > I am not sure if this issue is a repeat and I am yet to isolate the incident and test with increasing number of mmap threads. > > > > > > > > > > > > > > I am not 100 percent sure if this is related to mmap yet but just wanted to ask you if you have seen anything like above. > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > Lohit > > > > > > > > > > > > > > On Feb 22, 2018, 3:59 PM -0500, Sven Oehme , wrote: > > > > > > > > Hi Lohit, > > > > > > > > > > > > > > > > i am working with ray on a mmap performance improvement right now, which most likely has the same root cause as yours , see -->??http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html > > > > > > > > the thread above is silent after a couple of back and rorth, but ray and i have active communication in the background and will repost as soon as there is something new to share. > > > > > > > > i am happy to look at this issue after we finish with ray's workload if there is something missing, but first let's finish his, get you try the same fix and see if there is something missing. > > > > > > > > > > > > > > > > btw. if people would share their use of MMAP , what applications they use (home grown, just use lmdb which uses mmap under the cover, etc) please let me know so i get a better picture on how wide the usage is with GPFS. i know a lot of the ML/DL workloads are using it, but i would like to know what else is out there i might not think about. feel free to drop me a personal note, i might not reply to it right away, but eventually. > > > > > > > > > > > > > > > > thx. sven > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Feb 22, 2018 at 12:33 PM wrote: > > > > > > > > > > Hi all, > > > > > > > > > > > > > > > > > > > > I wanted to know, how does mmap interact with GPFS pagepool with respect to filesystem block-size? > > > > > > > > > > Does the efficiency depend on the mmap read size and the block-size of the filesystem even if all the data is cached in pagepool? > > > > > > > > > > > > > > > > > > > > GPFS 4.2.3.2 and CentOS7. > > > > > > > > > > > > > > > > > > > > Here is what i observed: > > > > > > > > > > > > > > > > > > > > I was testing a user script that uses mmap to read from 100M to 500MB files. > > > > > > > > > > > > > > > > > > > > The above files are stored on 3 different filesystems. > > > > > > > > > > > > > > > > > > > > Compute nodes - 10G pagepool and 5G seqdiscardthreshold. > > > > > > > > > > > > > > > > > > > > 1. 4M block size GPFS filesystem, with separate metadata and data. Data on Near line and metadata on SSDs > > > > > > > > > > 2. 1M block size GPFS filesystem as a AFM cache cluster, "with all the required files fully cached" from the above GPFS cluster as home. Data and Metadata together on SSDs > > > > > > > > > > 3. 16M block size GPFS filesystem, with separate metadata and data. 
Data on Near line and metadata on SSDs > > > > > > > > > > > > > > > > > > > > When i run the script first time for ?each" filesystem: > > > > > > > > > > I see that GPFS reads from the files, and caches into the pagepool as it reads, from mmdiag -- iohist > > > > > > > > > > > > > > > > > > > > When i run the second time, i see that there are no IO requests from the compute node to GPFS NSD servers, which is expected since all the data from the 3 filesystems is cached. > > > > > > > > > > > > > > > > > > > > However - the time taken for the script to run for the files in the 3 different filesystems is different - although i know that they are just "mmapping"/reading from pagepool/cache and not from disk. > > > > > > > > > > > > > > > > > > > > Here is the difference in time, for IO just from pagepool: > > > > > > > > > > > > > > > > > > > > 20s 4M block size > > > > > > > > > > 15s 1M block size > > > > > > > > > > 40S 16M block size. > > > > > > > > > > > > > > > > > > > > Why do i see a difference when trying to mmap reads from different block-size filesystems, although i see that the IO requests are not hitting disks and just the pagepool? > > > > > > > > > > > > > > > > > > > > I am willing to share the strace output and mmdiag outputs if needed. > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > Lohit > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > > > gpfsug-discuss mailing list > > > > > > > > > > gpfsug-discuss at spectrumscale.org > > > > > > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > _______________________________________________ > > > > > > > > gpfsug-discuss mailing list > > > > > > > > gpfsug-discuss at spectrumscale.org > > > > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > _______________________________________________ > > > > > > > gpfsug-discuss mailing list > > > > > > > gpfsug-discuss at spectrumscale.org > > > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > > > > > > gpfsug-discuss mailing list > > > > > > gpfsug-discuss at spectrumscale.org > > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > > > > gpfsug-discuss mailing list > > > > gpfsug-discuss at spectrumscale.org > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at gmail.com Wed Sep 19 20:11:07 2018 From: oehmes at gmail.com (Sven Oehme) Date: Wed, 19 Sep 2018 12:11:07 -0700 Subject: [gpfsug-discuss] GPFS, Pagepool and Block size -> Perfomance reduces with larger block size In-Reply-To: <013aeb31-ebd2-4cc7-97d1-06883d9569f7@Spark> References: <6bb509b7-b7c5-422d-8e27-599333b6b7c4@Spark> <013aeb31-ebd2-4cc7-97d1-06883d9569f7@Spark> Message-ID: the document primarily explains all performance specific knobs. general advice would be to longer set anything beside workerthreads, pagepool and filecache on 5.X systems as most other settings are no longer relevant (thats a client side statement) . 
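as a concrete example, that boils down to something like the line below (values are placeholders to show the syntax, not a recommendation; workerthreads and filecache map to the workerThreads and maxFilesToCache options, and those two only take full effect after a daemon restart on the nodes):

mmchconfig pagepool=16G,maxFilesToCache=1000000,workerThreads=512 -N clientNodes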
thats is true until you hit strange workloads , which is why all the knobs are still there :-) sven On Wed, Sep 19, 2018 at 11:17 AM wrote: > Thanks Sven. > I will disable it completely and see how it behaves. > > Is this the presentation? > > http://files.gpfsug.org/presentations/2014/UG10_GPFS_Performance_Session_v10.pdf > > I guess i read it, but it did not strike me at this situation. I will try > to read it again and see if i could make use of it. > > Regards, > Lohit > > On Sep 19, 2018, 2:12 PM -0400, Sven Oehme , wrote: > > seem like you never read my performance presentation from a few years ago > ;-) > > you can control this on a per node basis , either for all i/o : > > prefetchAggressiveness = X > > or individual for reads or writes : > > prefetchAggressivenessRead = X > prefetchAggressivenessWrite = X > > for a start i would turn it off completely via : > > mmchconfig prefetchAggressiveness=0 -I -N nodename > > that will turn it off only for that node and only until you restart the > node. > then see what happens > > sven > > > On Wed, Sep 19, 2018 at 11:07 AM wrote: > >> Thank you Sven. >> >> I mostly think it could be 1. or some other issue. >> I don?t think it could be 2. , because i can replicate this issue no >> matter what is the size of the dataset. It happens for few files that could >> easily fit in the page pool too. >> >> I do see a lot more page faults for 16M compared to 1M, so it could be >> related to many threads trying to compete for the same buffer space. >> >> I will try to take the trace with trace=io option and see if can find >> something. >> >> How do i turn of prefetching? Can i turn it off for a single node/client? >> >> Regards, >> Lohit >> >> On Sep 18, 2018, 5:23 PM -0400, Sven Oehme , wrote: >> >> Hi, >> >> taking a trace would tell for sure, but i suspect what you might be >> hitting one or even multiple issues which have similar negative performance >> impacts but different root causes. >> >> 1. this could be serialization around buffer locks. as larger your >> blocksize gets as larger is the amount of data one of this pagepool buffers >> will maintain, if there is a lot of concurrency on smaller amount of data >> more threads potentially compete for the same buffer lock to copy stuff in >> and out of a particular buffer, hence things go slower compared to the same >> amount of data spread across more buffers, each of smaller size. >> >> 2. your data set is small'ish, lets say a couple of time bigger than the >> pagepool and you random access it with multiple threads. what will happen >> is that because it doesn't fit into the cache it will be read from the >> backend. if multiple threads hit the same 16 mb block at once with multiple >> 4k random reads, it will read the whole 16mb block because it thinks it >> will benefit from it later on out of cache, but because it fully random the >> same happens with the next block and the next and so on and before you get >> back to this block it was pushed out of the cache because of lack of enough >> pagepool. >> >> i could think of multiple other scenarios , which is why its so hard to >> accurately benchmark an application because you will design a benchmark to >> test an application, but it actually almost always behaves different then >> you think it does :-) >> >> so best is to run the real application and see under which configuration >> it works best. 
>> >> you could also take a trace with trace=io and then look at >> >> TRACE_VNOP: READ: >> TRACE_VNOP: WRITE: >> >> and compare them to >> >> TRACE_IO: QIO: read >> TRACE_IO: QIO: write >> >> and see if the numbers summed up for both are somewhat equal. if >> TRACE_VNOP is significant smaller than TRACE_IO you most likely do more i/o >> than you should and turning prefetching off might actually make things >> faster . >> >> keep in mind i am no longer working for IBM so all i say might be >> obsolete by now, i no longer have access to the one and only truth aka the >> source code ... but if i am wrong i am sure somebody will point this out >> soon ;-) >> >> sven >> >> >> >> >> On Tue, Sep 18, 2018 at 10:31 AM wrote: >> >>> Hello All, >>> >>> This is a continuation to the previous discussion that i had with Sven. >>> However against what i had mentioned previously - i realize that this is >>> ?not? related to mmap, and i see it when doing random freads. >>> >>> I see that block-size of the filesystem matters when reading from Page >>> pool. >>> I see a major difference in performance when compared 1M to 16M, when >>> doing lot of random small freads with all of the data in pagepool. >>> >>> Performance for 1M is a magnitude ?more? than the performance that i see >>> for 16M. >>> >>> The GPFS that we have currently is : >>> Version : 5.0.1-0.5 >>> Filesystem version: 19.01 (5.0.1.0) >>> Block-size : 16M >>> >>> I had made the filesystem block-size to be 16M, thinking that i would >>> get the most performance for both random/sequential reads from 16M than the >>> smaller block-sizes. >>> With GPFS 5.0, i made use the 1024 sub-blocks instead of 32 and thus not >>> loose lot of storage space even with 16M. >>> I had run few benchmarks and i did see that 16M was performing better >>> ?when hitting storage/disks? with respect to bandwidth for >>> random/sequential on small/large reads. >>> >>> However, with this particular workload - where it freads a chunk of data >>> randomly from hundreds of files -> I see that the number of page-faults >>> increase with block-size and actually reduce the performance. >>> 1M performs a lot better than 16M, and may be i will get better >>> performance with less than 1M. >>> It gives the best performance when reading from local disk, with 4K >>> block size filesystem. >>> >>> What i mean by performance when it comes to this workload - is not the >>> bandwidth but the amount of time that it takes to do each iteration/read >>> batch of data. >>> >>> I figure what is happening is: >>> fread is trying to read a full block size of 16M - which is good in a >>> way, when it hits the hard disk. >>> But the application could be using just a small part of that 16M. Thus >>> when randomly reading(freads) lot of data of 16M chunk size - it is page >>> faulting a lot more and causing the performance to drop . >>> I could try to make the application do read instead of freads, but i >>> fear that could be bad too since it might be hitting the disk with a very >>> small block size and that is not good. >>> >>> With the way i see things now - >>> I believe it could be best if the application does random reads of 4k/1M >>> from pagepool but some how does 16M from rotating disks. >>> >>> I don?t see any way of doing the above other than following a different >>> approach where i create a filesystem with a smaller block size ( 1M or less >>> than 1M ), on SSDs as a tier. 
>>> >>> May i please ask for advise, if what i am understanding/seeing is right >>> and the best solution possible for the above scenario. >>> >>> Regards, >>> Lohit >>> >>> On Apr 11, 2018, 10:36 AM -0400, Lohit Valleru , >>> wrote: >>> >>> Hey Sven, >>> >>> This is regarding mmap issues and GPFS. >>> We had discussed previously of experimenting with GPFS 5. >>> >>> I now have upgraded all of compute nodes and NSD nodes to GPFS 5.0.0.2 >>> >>> I am yet to experiment with mmap performance, but before that - I am >>> seeing weird hangs with GPFS 5 and I think it could be related to mmap. >>> >>> Have you seen GPFS ever hang on this syscall? >>> [Tue Apr 10 04:20:13 2018] [] >>> _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26] >>> >>> I see the above ,when kernel hangs and throws out a series of trace >>> calls. >>> >>> I somehow think the above trace is related to processes hanging on GPFS >>> forever. There are no errors in GPFS however. >>> >>> Also, I think the above happens only when the mmap threads go above a >>> particular number. >>> >>> We had faced a similar issue in 4.2.3 and it was resolved in a patch to >>> 4.2.3.2 . At that time , the issue happened when mmap threads go more than >>> worker1threads. According to the ticket - it was a mmap race condition that >>> GPFS was not handling well. >>> >>> I am not sure if this issue is a repeat and I am yet to isolate the >>> incident and test with increasing number of mmap threads. >>> >>> I am not 100 percent sure if this is related to mmap yet but just wanted >>> to ask you if you have seen anything like above. >>> >>> Thanks, >>> >>> Lohit >>> >>> On Feb 22, 2018, 3:59 PM -0500, Sven Oehme , wrote: >>> >>> Hi Lohit, >>> >>> i am working with ray on a mmap performance improvement right now, which >>> most likely has the same root cause as yours , see --> >>> http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html >>> the thread above is silent after a couple of back and rorth, but ray and >>> i have active communication in the background and will repost as soon as >>> there is something new to share. >>> i am happy to look at this issue after we finish with ray's workload if >>> there is something missing, but first let's finish his, get you try the >>> same fix and see if there is something missing. >>> >>> btw. if people would share their use of MMAP , what applications they >>> use (home grown, just use lmdb which uses mmap under the cover, etc) please >>> let me know so i get a better picture on how wide the usage is with GPFS. i >>> know a lot of the ML/DL workloads are using it, but i would like to know >>> what else is out there i might not think about. feel free to drop me a >>> personal note, i might not reply to it right away, but eventually. >>> >>> thx. sven >>> >>> >>> On Thu, Feb 22, 2018 at 12:33 PM wrote: >>> >>>> Hi all, >>>> >>>> I wanted to know, how does mmap interact with GPFS pagepool with >>>> respect to filesystem block-size? >>>> Does the efficiency depend on the mmap read size and the block-size of >>>> the filesystem even if all the data is cached in pagepool? >>>> >>>> GPFS 4.2.3.2 and CentOS7. >>>> >>>> Here is what i observed: >>>> >>>> I was testing a user script that uses mmap to read from 100M to 500MB >>>> files. >>>> >>>> The above files are stored on 3 different filesystems. >>>> >>>> Compute nodes - 10G pagepool and 5G seqdiscardthreshold. >>>> >>>> 1. 4M block size GPFS filesystem, with separate metadata and data. Data >>>> on Near line and metadata on SSDs >>>> 2. 
1M block size GPFS filesystem as a AFM cache cluster, "with all the >>>> required files fully cached" from the above GPFS cluster as home. Data and >>>> Metadata together on SSDs >>>> 3. 16M block size GPFS filesystem, with separate metadata and data. >>>> Data on Near line and metadata on SSDs >>>> >>>> When i run the script first time for ?each" filesystem: >>>> I see that GPFS reads from the files, and caches into the pagepool as >>>> it reads, from mmdiag -- iohist >>>> >>>> When i run the second time, i see that there are no IO requests from >>>> the compute node to GPFS NSD servers, which is expected since all the data >>>> from the 3 filesystems is cached. >>>> >>>> However - the time taken for the script to run for the files in the 3 >>>> different filesystems is different - although i know that they are just >>>> "mmapping"/reading from pagepool/cache and not from disk. >>>> >>>> Here is the difference in time, for IO just from pagepool: >>>> >>>> 20s 4M block size >>>> 15s 1M block size >>>> 40S 16M block size. >>>> >>>> Why do i see a difference when trying to mmap reads from different >>>> block-size filesystems, although i see that the IO requests are not hitting >>>> disks and just the pagepool? >>>> >>>> I am willing to share the strace output and mmdiag outputs if needed. >>>> >>>> Thanks, >>>> Lohit >>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Thu Sep 20 18:20:01 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Thu, 20 Sep 2018 17:20:01 +0000 Subject: [gpfsug-discuss] CES: ldap_down event Message-ID: <72E63738-2F05-4DE5-BCC6-7AA4543D6161@nuance.com> What caused this error, and how do clear this error? The LDAP server is fine as far as I can tell? 
Scale version 5.0.1-2

Event        Parameter  Severity  Active Since  Event Message
-----------------------------------------------------------------------------------------------------------------
ldap_down    AUTH       ERROR     1 hour ago    external LDAP server ldap://10.30.21.11 is unresponsive

Bob Oesterlin
Sr Principal Storage Engineer, Nuance

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From S.J.Thompson at bham.ac.uk  Thu Sep 20 18:29:05 2018
From: S.J.Thompson at bham.ac.uk (Simon Thompson)
Date: Thu, 20 Sep 2018 17:29:05 +0000
Subject: [gpfsug-discuss] Metadata with GNR code
Message-ID:

Just wondering if anyone has any strong views/recommendations with metadata when using GNR code?

I know in 'SAN' based GPFS, there is a recommendation to have data and metadata split, with the metadata on SSD.

I've also heard that with GNR there isn't much difference in splitting data and metadata.

We're looking at two systems and want to replicate metadata, but not data (mostly), between them, so I'm not really sure how we'd do this without having a separate system pool (and then NSDs in different failure groups)...

If we used 8+2P vdisks for metadata only, would we still see no difference in performance compared to mixed (I guess the 8+2P is still spread over a DA, so we'd get half the drives in the GNR system active?).

Or should we stick SSD based storage in as well for the metadata pool? (Which brings an interesting question about RAID code related to the recent discussions on mirroring vs RAID5?)

Thoughts welcome!

Simon

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From will.schmied at stjude.org  Thu Sep 20 18:33:59 2018
From: will.schmied at stjude.org (Schmied, Will)
Date: Thu, 20 Sep 2018 17:33:59 +0000
Subject: [gpfsug-discuss] CES: ldap_down event
Message-ID: <025BC4D1-1D71-430F-B278-ED1D10FB5386@stjude.org>

You can either reboot that node, or try:

# systemctl restart gpfs-winbind.service
# systemctl status gpfs-winbind.service

The LDAP server was found unresponsive at some point. Why, that's the million dollar question as always. What's the output of 'mmces state show -a' look like?

Thanks,
Will

From:  on behalf of "Oesterlin, Robert"
Reply-To: gpfsug main discussion list
Date: Thursday, September 20, 2018 at 12:20 PM
To: gpfsug main discussion list
Subject: [gpfsug-discuss] CES: ldap_down event

What caused this error, and how do I clear this error? The LDAP server is fine as far as I can tell?

Scale version 5.0.1-2

Event        Parameter  Severity  Active Since  Event Message
-----------------------------------------------------------------------------------------------------------------
ldap_down    AUTH       ERROR     1 hour ago    external LDAP server ldap://10.30.21.11 is unresponsive

Bob Oesterlin
Sr Principal Storage Engineer, Nuance

________________________________
Email Disclaimer: www.stjude.org/emaildisclaimer
Consultation Disclaimer: www.stjude.org/consultationdisclaimer

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From Robert.Oesterlin at nuance.com  Thu Sep 20 19:05:43 2018
From: Robert.Oesterlin at nuance.com (Oesterlin, Robert)
Date: Thu, 20 Sep 2018 18:05:43 +0000
Subject: [gpfsug-discuss] CES: ldap_down event
In-Reply-To: <025BC4D1-1D71-430F-B278-ED1D10FB5386@stjude.org>
References: <025BC4D1-1D71-430F-B278-ED1D10FB5386@stjude.org>
Message-ID: <0B7FA2C9-8645-42AF-953D-AEFCB114B081@nuance.com>

We're not using LDAP for SMB, we're using it for NFS, so gpfs.winbind isn't active.
As for the state of things:

[root at nrg-ces01 ~]# mmces state show -a
NODE                                AUTH      BLOCK     NETWORK   AUTH_OBJ  NFS       OBJ       SMB       CES
xxxxxxxxx.xxxx.us.grid.nuance.com   DEGRADED  DISABLED  HEALTHY   DISABLED  DEGRADED  DISABLED  DISABLED  DEGRADED
yyyyyyyyy.yyyy.us.grid.nuance.com   HEALTHY   DISABLED  HEALTHY   DISABLED  HEALTHY   DISABLED  DISABLED  HEALTHY

NFS is showing degraded because AUTH is degraded due to the LDAP error. However, I think the error is erroneous because the NFS service on that node is fine.
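For anyone else chasing this: the event detail and history, plus a basic reachability check of the server named in the event, can be pulled from the CES node itself with something along these lines (exact options from memory, so treat it as a sketch rather than a recipe):

mmhealth node show AUTH -v
mmhealth node eventlog | grep -i ldap
ldapsearch -x -H ldap://10.30.21.11 -s base -b "" "(objectclass=*)" namingContexts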
Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 From: on behalf of "Schmied, Will" Reply-To: gpfsug main discussion list Date: Thursday, September 20, 2018 at 12:46 PM To: gpfsug main discussion list Subject: [EXTERNAL] Re: [gpfsug-discuss] CES: ldap_down event You can either reboot that node, or try: # systemctl restart gpfs-winbind.service # systemctl status gpfs-winbind.service The LDAP server was found unresponsive at some point. Why, that?s the million dollar question as always. What?s the output of ?mmces state show -a? look like? Thanks, Will From: on behalf of "Oesterlin, Robert" Reply-To: gpfsug main discussion list Date: Thursday, September 20, 2018 at 12:20 PM To: gpfsug main discussion list Subject: [gpfsug-discuss] CES: ldap_down event What caused this error, and how do clear this error? The LDAP server is fine as far as I can tell? Scale version 5.0.1-2 Event Parameter Severity Active Since Event Message ----------------------------------------------------------------------------------------------------------------- ldap_down AUTH ERROR 1 hour ago external LDAP server ldap://10.30.21.11 is unresponsive Bob Oesterlin Sr Principal Storage Engineer, Nuance ________________________________ Email Disclaimer: www.stjude.org/emaildisclaimer Consultation Disclaimer: www.stjude.org/consultationdisclaimer -------------- next part -------------- An HTML attachment was scrubbed... URL: From abeattie at au1.ibm.com Fri Sep 21 01:33:48 2018 From: abeattie at au1.ibm.com (Andrew Beattie) Date: Fri, 21 Sep 2018 00:33:48 +0000 Subject: [gpfsug-discuss] Metadata with GNR code In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Fri Sep 21 09:22:27 2018 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Fri, 21 Sep 2018 10:22:27 +0200 Subject: [gpfsug-discuss] Metadata with GNR code In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From janfrode at tanso.net Fri Sep 21 11:13:51 2018 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Fri, 21 Sep 2018 12:13:51 +0200 Subject: [gpfsug-discuss] Metadata with GNR code In-Reply-To: References: Message-ID: That reminds me of a point Sven made when I was trying to optimize mdtest results with metadata on FlashSystem... He sent me the following: -- started at 11/15/2015 15:20:39 -- mdtest-1.9.3 was launched with 138 total task(s) on 23 node(s) Command line used: /ghome/oehmes/mpi/bin/mdtest-pcmpi9131-existingdir -d /ibm/fs2-4m-02/shared/mdtest-ec -i 1 -n 70000 -F -i 1 -w 0 -Z -u Path: /ibm/fs2-4m-02/ sharedFS: 32.0 TiB Used FS: 6.7% Inodes: 145.4 Mi Used Inodes: 22.0% 138 tasks, 9660000 files SUMMARY: (of 1 iterations) Operation Max Min Mean Std Dev --------- --- --- ---- ------- File creation : 650440.486 650440.486 650440.486 0.000 File stat : 23599134.618 23599134.618 23599134.618 0.000 File read : 2171391.097 2171391.097 2171391.097 0.000 File removal : 1007566.981 1007566.981 1007566.981 0.000 Tree creation : 3.072 3.072 3.072 0.000 Tree removal : 1.471 1.471 1.471 0.000 -- finished at 11/15/2015 15:21:10 -- from a GL6 -- only spinning disks -- pointing out that mdtest doesn't really require Flash/SSD. The key to good results are a) large GPFS log ( mmchfs -L 128m) b) high maxfilestocache (you need to be able to cache all entries , so for 10 million across 20 nodes you need to have at least 750k per node) c) fast network, thats key to handle the token requests and metadata operations that need to get over the network. 
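Translated into commands, a) and b) look roughly like this -- file system name and node class are placeholders, the bigger log typically only takes effect once the file system is remounted, and maxFilesToCache needs the daemon restarted on those clients:

mmchfs gpfs01 -L 128m
mmchconfig maxFilesToCache=750000 -N clientNodes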
-jf On Fri, Sep 21, 2018 at 10:22 AM Olaf Weiser wrote: > see a mdtest for a default block size file system ... > 4 MB blocksize.. > mdata is on SSD > data is on HDD ... which is not really relevant for this mdtest ;-) > > > -- started at 09/07/2018 06:54:54 -- > > mdtest-1.9.3 was launched with 40 total task(s) on 20 node(s) > Command line used: mdtest -n 25000 -i 3 -u -d > /homebrewed/gh24_4m_4m/mdtest > Path: /homebrewed/gh24_4m_4m > FS: 10.0 TiB Used FS: 0.0% Inodes: 12.0 Mi Used Inodes: 2.3% > > 40 tasks, 1000000 files/directories > > SUMMARY: (of 3 iterations) > Operation Max Min Mean > Std Dev > --------- --- --- ---- > ------- > Directory creation: 449160.409 430869.822 437002.187 > 8597.272 > Directory stat : 6664420.560 5785712.544 6324276.731 > 385192.527 > Directory removal : 398360.058 351503.369 371630.648 > 19690.580 > File creation : 288985.217 270550.129 279096.800 > 7585.659 > File stat : 6720685.117 6641301.499 6674123.407 > 33833.182 > File read : 3055661.372 2871044.881 2945513.966 > 79479.638 > File removal : 215187.602 146639.435 179898.441 > 28021.467 > Tree creation : 10.215 3.165 6.603 > 2.881 > Tree removal : 5.484 0.880 2.418 > 2.168 > > -- finished at 09/07/2018 06:55:42 -- > > > > > Mit freundlichen Gr??en / Kind regards > > > Olaf Weiser > > EMEA Storage Competence Center Mainz, German / IBM Systems, Storage > Platform, > > ------------------------------------------------------------------------------------------------------------------------------------------- > IBM Deutschland > IBM Allee 1 > 71139 Ehningen > Phone: +49-170-579-44-66 > E-Mail: olaf.weiser at de.ibm.com > > ------------------------------------------------------------------------------------------------------------------------------------------- > IBM Deutschland GmbH / Vorsitzender des Aufsichtsrats: Martin Jetter > Gesch?ftsf?hrung: Martina Koederitz (Vorsitzende), Susanne Peter, Norbert > Janzen, Dr. Christian Keller, Ivo Koerner, Markus Koerner > Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, > HRB 14562 / WEEE-Reg.-Nr. DE 99369940 > > > > From: "Andrew Beattie" > To: gpfsug-discuss at spectrumscale.org > Date: 09/21/2018 02:34 AM > Subject: Re: [gpfsug-discuss] Metadata with GNR code > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------ > > > > Simon, > > My recommendation is still very much to use SSD for Metadata and NL-SAS > for data and > the GH14 / GH24 Building blocks certainly make this much easier. > > Unless your filesystem is massive (Summit sized) you will typically still > continue to benefit from the Random IO performance of SSD (even RI SSD) in > comparison to NL-SAS. > > It still makes more sense to me to continue to use 2 copy or 3 copy for > Metadata even in ESS / GNR style environments. The read performance for > metadata using 3copy is still significantly better than any other scenario. > > As with anything there are exceptions to the rule, but my experiences with > ESS and ESS with SSD so far still maintain that the standard thoughts on > managing Metadata and Small file IO remain the same -- even with the > improvements around sub blocks with Scale V5. 
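To make the 'replicate metadata but not data between the two systems' part of Simon's question concrete -- purely a sketch, NSD and server names are made up, and on ESS/GNR the NSDs would of course come out of vdisks created with mmvdisk/mmcrvdisk rather than plain disks:

%nsd: nsd=site1_md_001 usage=metadataOnly pool=system failureGroup=1 servers=ess01a,ess01b
%nsd: nsd=site2_md_001 usage=metadataOnly pool=system failureGroup=2 servers=ess02a,ess02b
%nsd: nsd=site1_data_001 usage=dataOnly pool=data failureGroup=1 servers=ess01a,ess01b

mmcrfs fs1 -F stanzas.txt -m 2 -M 2 -r 1 -R 2

i.e. metadata-only NSDs in one failure group per building block, default metadata replication of 2 and data replication of 1.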
> > MDtest is still the typical benchmark for this comparison and MDTest shows > some very clear differences even on SSD when you use a large filesystem > block size with more sub blocks vs a smaller block size with 1/32 subblocks > > This only gets worse if you change the storage media from SSD to NL-SAS > *Andrew Beattie* > *Software Defined Storage - IT Specialist* > *Phone: *614-2133-7927 > *E-mail: **abeattie at au1.ibm.com* > > > ----- Original message ----- > From: Simon Thompson > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: "gpfsug-discuss at spectrumscale.org" > Cc: > Subject: [gpfsug-discuss] Metadata with GNR code > Date: Fri, Sep 21, 2018 3:29 AM > > Just wondering if anyone has any strong views/recommendations with > metadata when using GNR code? > > > > I know in ?san? based GPFS, there is a recommendation to have data and > metadata split with the metadata on SSD. > > > > I?ve also heard that with GNR there isn?t much difference in splitting > data and metadata. > > > > We?re looking at two systems and want to replicate metadata, but not data > (mostly) between them, so I?m not really sure how we?d do this without > having separate system pool (and then NSDs in different failure groups)?. > > > > If we used 8+2P vdisks for metadata only, would we still see no difference > in performance compared to mixed (I guess the 8+2P is still spread over a > DA so we?d get half the drives in the GNR system active?). > > > > Or should we stick SSD based storage in as well for the metadata pool? > (Which brings an interesting question about RAID code related to the recent > discussions on mirroring vs RAID5?) > > > > Thoughts welcome! > > > > Simon > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at gmail.com Fri Sep 21 17:45:34 2018 From: oehmes at gmail.com (Sven Oehme) Date: Fri, 21 Sep 2018 09:45:34 -0700 Subject: [gpfsug-discuss] Metadata with GNR code In-Reply-To: References: Message-ID: somebody did listen and remembered to what i did say :-D ... and you are absolute correct, there is no need for SSD's to get great zero length mdtest results, most people don't know that create workloads unless carefully executed in general almost exclusively stresses a filesystem client and has almost no load to the storage or the server side UNLESS any of the following is true : 1. your mdtest runs longer than the OS sync interval, exact duration depends on the OS, but is typically 60 or 5 seconds. 2. the amount of files you create exceed your file cache setting 3. you haven't filed your filesystem recovery log to the point where log wrap kicks in there are other possible reasons, but the above is the top 3 list. the network is kind of critical, but not as critical as long as there are enough tasks and nodes in parallel, but for small number of nodes you need some fast (fast as in low latency not throughput) network to handle the parallel token traffic. 
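(before assuming any of 1-3 applies, a quick way to check where you stand -- fs name is a placeholder, flags from memory so double check on your level:

mmlsfs fs1 -L
mmlsconfig maxFilesToCache
mmlsconfig pagepool

the first shows the recovery log size, the other two the cache settings in play above)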
given that there is an unused inode/token prefetcher in Scale, means some inodes are already 'owned' by the client before you even create your first file network speed is less relevant as long as you stay under that limit. all above is obvious a burst, short period statement, if you have a sustained create workload then obvious all this needs to be written to the storage device in the right place, this is the point where the storage controller write cache followed by the sustained de-staging rate to media is the most critical piece, not the speed of the individual media e.g. NLSAS or SSD. as long as the write cache can coalesce data good enough and/or there is enough bandwidth to the media combined with the fact that Scale tries to align things in data and metadata blocks where possible NLSAS drives are just fine for many cases. But, there is one particular part where flash in any form is unbeatable and that is the main reason people should buy it for metadata - eventually one needs to read all that metadata back, does a directory listing, deletes a folder, etc. in all this cases you need to READ stuff back. while write caches and smart logic in a filesystem client can help significant with writes, on reads there is no magic that can be done (except massive caches again but thats unrealistic for larger systems), so you have to get the data from the media and now having 100 usec response time on flash vs 10 ms average will make a significant difference for real world applications. btw. i offered to speak at the SC18 event in Dallas about Scale related work, even i don't work for IBM anymore. if my talk gets accepted i will see some of you in Dallas :-D Sven On Fri, Sep 21, 2018 at 3:14 AM Jan-Frode Myklebust wrote: > That reminds me of a point Sven made when I was trying to optimize mdtest > results with metadata on FlashSystem... He sent me the following: > > -- started at 11/15/2015 15:20:39 -- > mdtest-1.9.3 was launched with 138 total task(s) on 23 node(s) > Command line used: /ghome/oehmes/mpi/bin/mdtest-pcmpi9131-existingdir -d > /ibm/fs2-4m-02/shared/mdtest-ec -i 1 -n 70000 -F -i 1 -w 0 -Z -u > Path: /ibm/fs2-4m-02/ > sharedFS: 32.0 TiB Used FS: 6.7% Inodes: 145.4 Mi Used Inodes: 22.0% > 138 tasks, 9660000 files > SUMMARY: (of 1 iterations) > Operation Max Min Mean > Std Dev > --------- --- --- ---- > ------- > File creation : 650440.486 650440.486 650440.486 > 0.000 > File stat : 23599134.618 23599134.618 23599134.618 > 0.000 > File read : 2171391.097 2171391.097 2171391.097 > 0.000 > File removal : 1007566.981 1007566.981 1007566.981 > 0.000 > Tree creation : 3.072 3.072 3.072 > 0.000 > Tree removal : 1.471 1.471 1.471 > 0.000 > -- finished at 11/15/2015 15:21:10 -- > > from a GL6 -- only spinning disks -- pointing out that mdtest doesn't > really require Flash/SSD. The key to good results are > > a) large GPFS log ( mmchfs -L 128m) > > b) high maxfilestocache (you need to be able to cache all entries , so for > 10 million across 20 nodes you need to have at least 750k per node) > > c) fast network, thats key to handle the token requests and metadata > operations that need to get over the network. > > > > -jf > > On Fri, Sep 21, 2018 at 10:22 AM Olaf Weiser > wrote: > >> see a mdtest for a default block size file system ... >> 4 MB blocksize.. >> mdata is on SSD >> data is on HDD ... 
>> which is not really relevant for this mdtest ;-)
>>
>> -- started at 09/07/2018 06:54:54 --
>>
>> mdtest-1.9.3 was launched with 40 total task(s) on 20 node(s)
>> Command line used: mdtest -n 25000 -i 3 -u -d /homebrewed/gh24_4m_4m/mdtest
>> Path: /homebrewed/gh24_4m_4m
>> FS: 10.0 TiB   Used FS: 0.0%   Inodes: 12.0 Mi   Used Inodes: 2.3%
>>
>> 40 tasks, 1000000 files/directories
>>
>> SUMMARY: (of 3 iterations)
>>    Operation             Max             Min            Mean        Std Dev
>>    ---------             ---             ---            ----        -------
>>    Directory creation:    449160.409      430869.822     437002.187    8597.272
>>    Directory stat    :   6664420.560     5785712.544    6324276.731  385192.527
>>    Directory removal :    398360.058      351503.369     371630.648   19690.580
>>    File creation     :    288985.217      270550.129     279096.800    7585.659
>>    File stat         :   6720685.117     6641301.499    6674123.407   33833.182
>>    File read         :   3055661.372     2871044.881    2945513.966   79479.638
>>    File removal      :    215187.602      146639.435     179898.441   28021.467
>>    Tree creation     :        10.215           3.165          6.603       2.881
>>    Tree removal      :         5.484           0.880          2.418       2.168
>>
>> -- finished at 09/07/2018 06:55:42 --
>>
>> Mit freundlichen Grüßen / Kind regards
>>
>> Olaf Weiser
>>
>> EMEA Storage Competence Center Mainz, Germany / IBM Systems, Storage Platform,
>> -------------------------------------------------------------------------------------------------------------------------------------------
>> IBM Deutschland
>> IBM Allee 1
>> 71139 Ehningen
>> Phone: +49-170-579-44-66
>> E-Mail: olaf.weiser at de.ibm.com
>> -------------------------------------------------------------------------------------------------------------------------------------------
>> IBM Deutschland GmbH / Vorsitzender des Aufsichtsrats: Martin Jetter
>> Geschäftsführung: Martina Koederitz (Vorsitzende), Susanne Peter, Norbert
>> Janzen, Dr. Christian Keller, Ivo Koerner, Markus Koerner
>> Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart,
>> HRB 14562 / WEEE-Reg.-Nr. DE 99369940
>>
>> From: "Andrew Beattie"
>> To: gpfsug-discuss at spectrumscale.org
>> Date: 09/21/2018 02:34 AM
>> Subject: Re: [gpfsug-discuss] Metadata with GNR code
>> Sent by: gpfsug-discuss-bounces at spectrumscale.org
>> ------------------------------
>>
>> Simon,
>>
>> My recommendation is still very much to use SSD for Metadata and NL-SAS
>> for data, and the GH14 / GH24 building blocks certainly make this much easier.
>>
>> Unless your filesystem is massive (Summit sized) you will typically still
>> continue to benefit from the random IO performance of SSD (even RI SSD) in
>> comparison to NL-SAS.
>>
>> It still makes more sense to me to continue to use 2-copy or 3-copy for
>> metadata even in ESS / GNR style environments. The read performance for
>> metadata using 3-copy is still significantly better than any other scenario.
>>
>> As with anything there are exceptions to the rule, but my experiences
>> with ESS and ESS with SSD so far still maintain that the standard thoughts
>> on managing metadata and small file IO remain the same -- even with the
>> improvements around sub-blocks with Scale V5.
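The comparison described above can be run directly against two filesystems that
differ only in block/sub-block geometry. A rough sketch; the filesystem names,
mount paths and MPI launcher are assumptions, and the mdtest flags mirror Olaf's
run above:

    # filesystem formatted with Scale 5 defaults: 4 MiB blocks, 8 KiB sub-blocks (512 per block)
    mpirun -np 40 mdtest -n 25000 -i 3 -u -d /gpfs_4m_new/mdtest
    # filesystem with the older 1/32 sub-block geometry: 4 MiB blocks, 128 KiB fragments
    mpirun -np 40 mdtest -n 25000 -i 3 -u -d /gpfs_4m_old/mdtest
    # confirm the block and sub-block size of each filesystem
    mmlsfs gpfs_4m_new -B -f
    mmlsfs gpfs_4m_old -B -f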
>> >> MDtest is still the typical benchmark for this comparison and MDTest >> shows some very clear differences even on SSD when you use a large >> filesystem block size with more sub blocks vs a smaller block size with >> 1/32 subblocks >> >> This only gets worse if you change the storage media from SSD to NL-SAS >> *Andrew Beattie* >> *Software Defined Storage - IT Specialist* >> *Phone: *614-2133-7927 >> *E-mail: **abeattie at au1.ibm.com* >> >> >> ----- Original message ----- >> From: Simon Thompson >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> To: "gpfsug-discuss at spectrumscale.org" >> Cc: >> Subject: [gpfsug-discuss] Metadata with GNR code >> Date: Fri, Sep 21, 2018 3:29 AM >> >> Just wondering if anyone has any strong views/recommendations with >> metadata when using GNR code? >> >> >> >> I know in ?san? based GPFS, there is a recommendation to have data and >> metadata split with the metadata on SSD. >> >> >> >> I?ve also heard that with GNR there isn?t much difference in splitting >> data and metadata. >> >> >> >> We?re looking at two systems and want to replicate metadata, but not data >> (mostly) between them, so I?m not really sure how we?d do this without >> having separate system pool (and then NSDs in different failure groups)?. >> >> >> >> If we used 8+2P vdisks for metadata only, would we still see no >> difference in performance compared to mixed (I guess the 8+2P is still >> spread over a DA so we?d get half the drives in the GNR system active?). >> >> >> >> Or should we stick SSD based storage in as well for the metadata pool? >> (Which brings an interesting question about RAID code related to the recent >> discussions on mirroring vs RAID5?) >> >> >> >> Thoughts welcome! >> >> >> >> Simon >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Mon Sep 24 00:19:03 2018 From: aaron.s.knister at nasa.gov (Knister, Aaron S. (GSFC-606.2)[InuTeq, LLC]) Date: Sun, 23 Sep 2018 23:19:03 +0000 Subject: [gpfsug-discuss] P_Key question Message-ID: <91BB1F6E-487A-4740-9787-9A688631D851@nasa.gov> Dear GPFS?ers (or I guess Scalers...), How do P_Keys work with remote clusters? E.g if I have a cluster using the default pkey and a remote cluster with verbsRdmaPkey set can the two communicate without using rdma cm? To take it a step further, what if cluster A has verbsRdmaCm disabled but the remote cluster has it enabled? Do I have any hope of making that work without changing the rdmaCm setting? Trying to secure cross cluster RDMA communication in any reasonable way is making my head hurt... -Aaron -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 

From aaron.s.knister at nasa.gov  Mon Sep 24 23:44:14 2018
From: aaron.s.knister at nasa.gov (Aaron Knister)
Date: Mon, 24 Sep 2018 18:44:14 -0400
Subject: [gpfsug-discuss] [non-nasa source] P_Key question
In-Reply-To: <91BB1F6E-487A-4740-9787-9A688631D851@nasa.gov>
References: <91BB1F6E-487A-4740-9787-9A688631D851@nasa.gov>
Message-ID: <610270d2-cbdf-7650-e70c-37fbde75f147@nasa.gov>

Answered my own question here. In both cases (at least with GPFS 4.2) the answer
is "it doesn't work", which is what I figured, certainly within a single cluster;
but between clusters I had hoped there would be some smarts to deal with
multi-cluster situations, given that all the plumbing seems to be there to deal
with it.

-Aaron

On 9/23/18 7:19 PM, Knister, Aaron S. (GSFC-606.2)[InuTeq, LLC] wrote:
> Dear GPFS'ers (or I guess Scalers...),
>
> How do P_Keys work with remote clusters? E.g. if I have a cluster using
> the default pkey and a remote cluster with verbsRdmaPkey set, can the two
> communicate without using RDMA CM? To take it a step further, what if
> cluster A has verbsRdmaCm disabled but the remote cluster has it
> enabled? Do I have any hope of making that work without changing the
> rdmaCm setting? Trying to secure cross-cluster RDMA communication in any
> reasonable way is making my head hurt...
>
> -Aaron
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>

-- 
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776

From bbanister at jumptrading.com  Tue Sep 25 15:43:16 2018
From: bbanister at jumptrading.com (Bryan Banister)
Date: Tue, 25 Sep 2018 14:43:16 +0000
Subject: [gpfsug-discuss] mmfsd and oom settings
In-Reply-To: <20180917155426.fi54lkmduizegpow@ics.muni.cz>
References: <20180917155426.fi54lkmduizegpow@ics.muni.cz>
Message-ID: <2db81e09701f42b98b92f5dc30d2b712@jumptrading.com>

The latest versions of the IBM-provided gpfs.service systemd unit do set the OOM
score so that the mmfsd process will not be killed. I would prefer to have a GPFS
config option for the OOM score adjustment which I could set on a per-node basis.
There are times when you may want to let GPFS be killed on a node in order to
preserve the health of the overall cluster, helping to prevent deadlocks on a
node that is under heavy memory pressure.

Hope that helps,
-Bryan

-----Original Message-----
From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Lukas Hejtmanek
Sent: Monday, September 17, 2018 10:54 AM
To: gpfsug-discuss at spectrumscale.org
Subject: [gpfsug-discuss] mmfsd and oom settings

[EXTERNAL EMAIL]

Hello,

I accidentally got mmfsd killed by the OOM killer because of the pagepool size,
which is normally OK, but there was a memory leak in an smbd process so the
system ran out of memory (a 64 GB pagepool but also two 32 GB smbd processes).

Shouldn't the GPFS startup script set oom_score_adj to some proper value so
that mmfsd never gets killed? It basically ruins things..

-- 
Lukáš Hejtmánek

Linux Administrator only because
Full Time Multitasking Ninja is not an official job title

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
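A minimal sketch of the per-node override Bryan describes above; the drop-in path
and the score values are assumptions, and recent gpfs.service units may already
ship an equivalent setting:

    # Pin mmfsd well below the OOM killer's preferred victims on this node
    mkdir -p /etc/systemd/system/gpfs.service.d
    cat > /etc/systemd/system/gpfs.service.d/50-oom.conf <<'EOF'
    [Service]
    OOMScoreAdjust=-1000
    EOF
    systemctl daemon-reload

    # Or, on a node where you would rather sacrifice GPFS first to keep the rest
    # of the cluster healthy, raise the score of the running daemon (value illustrative):
    echo 500 > /proc/$(pidof mmfsd)/oom_score_adj

The drop-in takes effect on the next GPFS restart; the /proc write only lasts
until mmfsd is restarted.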
From bbanister at jumptrading.com  Tue Sep 25 18:22:13 2018
From: bbanister at jumptrading.com (Bryan Banister)
Date: Tue, 25 Sep 2018 17:22:13 +0000
Subject: [gpfsug-discuss] replicating ACLs across GPFS's?
In-Reply-To: <14417376-CF84-45F8-9461-DBE1A86777D8@bham.ac.uk>
References: <14417376-CF84-45F8-9461-DBE1A86777D8@bham.ac.uk>
Message-ID: <3c036815d65d4ac0a061f3f1bec11159@jumptrading.com>

Thanks Simon,

I tried out the older patched version of rsync to see if that would work, but am
still not able to preserve ACLs from a non-GPFS source.
Simon From: > on behalf of Simon Thompson > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Friday, 14 September 2018 at 09:41 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] replicating ACLs across GPFS's? Last time I built was still against 3.0.9, note there is also a PR in there which fixes the bug with symlinks. If anyone wants to rebase the patches against 3.1.3, I?ll happily take a PR ? Simon From: > on behalf of "bbanister at jumptrading.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Friday, 14 September 2018 at 00:33 To: "gpfsug-discuss at spectrumscale.org" > Subject: [gpfsug-discuss] replicating ACLs across GPFS's? I?m checking in on this thread. Is this patch still working for people with the latest rsync releases? https://github.com/gpfsug/gpfsug-tools/tree/master/bin/rsync Thanks! -Bryan ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential, or privileged information and/or personal data. If you are not the intended recipient, you are hereby notified that any review, dissemination, or copying of this email is strictly prohibited, and requested to notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request, or solicitation of any kind to buy, sell, subscribe, redeem, or perform any type of transaction of a financial product. Personal data, as defined by applicable data privacy laws, contained in this email may be processed by the Company, and any of its affiliated or related companies, for potential ongoing compliance and/or business-related purposes. You may have rights regarding your personal data; for information on exercising these rights or the Company?s treatment of personal data, please email datarequests at jumptrading.com. ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential, or privileged information and/or personal data. If you are not the intended recipient, you are hereby notified that any review, dissemination, or copying of this email is strictly prohibited, and requested to notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request, or solicitation of any kind to buy, sell, subscribe, redeem, or perform any type of transaction of a financial product. Personal data, as defined by applicable data privacy laws, contained in this email may be processed by the Company, and any of its affiliated or related companies, for potential ongoing compliance and/or business-related purposes. You may have rights regarding your personal data; for information on exercising these rights or the Company?s treatment of personal data, please email datarequests at jumptrading.com. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From bbanister at jumptrading.com Tue Sep 25 19:05:37 2018 From: bbanister at jumptrading.com (Bryan Banister) Date: Tue, 25 Sep 2018 18:05:37 +0000 Subject: [gpfsug-discuss] replicating ACLs across GPFS's? In-Reply-To: <3c036815d65d4ac0a061f3f1bec11159@jumptrading.com> References: <14417376-CF84-45F8-9461-DBE1A86777D8@bham.ac.uk> <3c036815d65d4ac0a061f3f1bec11159@jumptrading.com> Message-ID: <83f8d6b783e74b9ea8268e6aca0b4a07@jumptrading.com> I have to correct myself, looks like using nfs4_getacl, nfs4_setfacl, nfs4_editfacl on the NFSv4 client mount of the GPFS file system from a CES protocol node is working. So could use that to basically crawl the file system, getting the ?outside source? NFSv4 ACL and then applying that to the file on the NFSv4 client mount of the GPFS file system. Sorry for confusion, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Bryan Banister Sent: Tuesday, September 25, 2018 12:22 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] replicating ACLs across GPFS's? [EXTERNAL EMAIL] Thanks Simon, I tried out the older patched version of rsync to see if that would work, but still not able to preserve ACLs from an non-GPFS source. There was another thread about this on the user group some time ago as well (2013!), but doesn?t look like any real solution was found (Copy ACLs from outside sources). I?ve also tried tar | tar, but not luck with that either. GPFS doesn?t support the nfs4_getacl, nfs4_setfacl, nfs4_editfacl suite of commands, but maybe that could be added?? I could maybe hack something up that would basically crawl the ?outside source? namespace, using the nfs4_getacl operation get the NFSv4 ACLs, parse that output, then attempt to use GPFS `mmputacl` to store the ACL again. This seems like a horrible way to go, likely prone to mistakes, tough to validate, nightmare to maintain. Anybody got better ideas? Thanks! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org > On Behalf Of Simon Thompson Sent: Friday, September 14, 2018 8:37 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] replicating ACLs across GPFS's? [EXTERNAL EMAIL] Oh I also heard a rumour of some sort of mmcopy type sample script, but I can?t see it in samples on 5.0.1-2? Simon From: > on behalf of Simon Thompson > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Friday, 14 September 2018 at 09:41 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] replicating ACLs across GPFS's? Last time I built was still against 3.0.9, note there is also a PR in there which fixes the bug with symlinks. If anyone wants to rebase the patches against 3.1.3, I?ll happily take a PR ? Simon From: > on behalf of "bbanister at jumptrading.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Friday, 14 September 2018 at 00:33 To: "gpfsug-discuss at spectrumscale.org" > Subject: [gpfsug-discuss] replicating ACLs across GPFS's? I?m checking in on this thread. Is this patch still working for people with the latest rsync releases? https://github.com/gpfsug/gpfsug-tools/tree/master/bin/rsync Thanks! -Bryan ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential, or privileged information and/or personal data. 
You may have rights regarding your personal data; for information on exercising these rights or the Company?s treatment of personal data, please email datarequests at jumptrading.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: From janfrode at tanso.net Tue Sep 25 20:18:01 2018 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Tue, 25 Sep 2018 21:18:01 +0200 Subject: [gpfsug-discuss] replicating ACLs across GPFS's? In-Reply-To: <83f8d6b783e74b9ea8268e6aca0b4a07@jumptrading.com> References: <14417376-CF84-45F8-9461-DBE1A86777D8@bham.ac.uk> <3c036815d65d4ac0a061f3f1bec11159@jumptrading.com> <83f8d6b783e74b9ea8268e6aca0b4a07@jumptrading.com> Message-ID: Not sure if better or worse idea, but I believe robocopy support syncing just the ACLs, so if you do SMB mounts from both sides, that might be an option. -jf tir. 25. sep. 2018 kl. 20:05 skrev Bryan Banister : > I have to correct myself, looks like using nfs4_getacl, nfs4_setfacl, > nfs4_editfacl on the NFSv4 client mount of the GPFS file system from a CES > protocol node is working. So could use that to basically crawl the file > system, getting the ?outside source? NFSv4 ACL and then applying that to > the file on the NFSv4 client mount of the GPFS file system. > > > > Sorry for confusion, > > -Bryan > > > > *From:* gpfsug-discuss-bounces at spectrumscale.org < > gpfsug-discuss-bounces at spectrumscale.org> *On Behalf Of *Bryan Banister > *Sent:* Tuesday, September 25, 2018 12:22 PM > > > *To:* gpfsug main discussion list > *Subject:* Re: [gpfsug-discuss] replicating ACLs across GPFS's? > > > > [EXTERNAL EMAIL] > > Thanks Simon, > > > > I tried out the older patched version of rsync to see if that would work, > but still not able to preserve ACLs from an non-GPFS source. There was > another thread about this on the user group some time ago as well (2013!), > but doesn?t look like any real solution was found (Copy ACLs from outside > sources > > ). > > > > I?ve also tried tar | tar, but not luck with that either. > > > > GPFS doesn?t support the nfs4_getacl, nfs4_setfacl, nfs4_editfacl suite of > commands, but maybe that could be added?? > > > > I could maybe hack something up that would basically crawl the ?outside > source? namespace, using the nfs4_getacl operation get the NFSv4 ACLs, > parse that output, then attempt to use GPFS `mmputacl` to store the ACL > again. This seems like a horrible way to go, likely prone to mistakes, > tough to validate, nightmare to maintain. > > > > Anybody got better ideas? > > > > Thanks! > > -Bryan > > > > *From:* gpfsug-discuss-bounces at spectrumscale.org < > gpfsug-discuss-bounces at spectrumscale.org> *On Behalf Of *Simon Thompson > *Sent:* Friday, September 14, 2018 8:37 AM > *To:* gpfsug main discussion list > *Subject:* Re: [gpfsug-discuss] replicating ACLs across GPFS's? > > > > [EXTERNAL EMAIL] > > Oh I also heard a rumour of some sort of mmcopy type sample script, but I > can?t see it in samples on 5.0.1-2? > > > > Simon > > > > *From: * on behalf of Simon > Thompson > *Reply-To: *"gpfsug-discuss at spectrumscale.org" < > gpfsug-discuss at spectrumscale.org> > *Date: *Friday, 14 September 2018 at 09:41 > *To: *"gpfsug-discuss at spectrumscale.org" > > *Subject: *Re: [gpfsug-discuss] replicating ACLs across GPFS's? > > > > Last time I built was still against 3.0.9, note there is also a PR in > there which fixes the bug with symlinks. > > > > If anyone wants to rebase the patches against 3.1.3, I?ll happily take a > PR ? 
> > > > Simon > > > > *From: * on behalf of " > bbanister at jumptrading.com" > *Reply-To: *"gpfsug-discuss at spectrumscale.org" < > gpfsug-discuss at spectrumscale.org> > *Date: *Friday, 14 September 2018 at 00:33 > *To: *"gpfsug-discuss at spectrumscale.org" > > *Subject: *[gpfsug-discuss] replicating ACLs across GPFS's? > > > > I?m checking in on this thread. Is this patch still working for people > with the latest rsync releases? > > https://github.com/gpfsug/gpfsug-tools/tree/master/bin/rsync > > > > > Thanks! > > -Bryan > > > ------------------------------ > > > Note: This email is for the confidential use of the named addressee(s) > only and may contain proprietary, confidential, or privileged information > and/or personal data. If you are not the intended recipient, you are hereby > notified that any review, dissemination, or copying of this email is > strictly prohibited, and requested to notify the sender immediately and > destroy this email and any attachments. Email transmission cannot be > guaranteed to be secure or error-free. The Company, therefore, does not > make any guarantees as to the completeness or accuracy of this email or any > attachments. This email is for informational purposes only and does not > constitute a recommendation, offer, request, or solicitation of any kind to > buy, sell, subscribe, redeem, or perform any type of transaction of a > financial product. Personal data, as defined by applicable data privacy > laws, contained in this email may be processed by the Company, and any of > its affiliated or related companies, for potential ongoing compliance > and/or business-related purposes. You may have rights regarding your > personal data; for information on exercising these rights or the Company?s > treatment of personal data, please email datarequests at jumptrading.com. > > > ------------------------------ > > > Note: This email is for the confidential use of the named addressee(s) > only and may contain proprietary, confidential, or privileged information > and/or personal data. If you are not the intended recipient, you are hereby > notified that any review, dissemination, or copying of this email is > strictly prohibited, and requested to notify the sender immediately and > destroy this email and any attachments. Email transmission cannot be > guaranteed to be secure or error-free. The Company, therefore, does not > make any guarantees as to the completeness or accuracy of this email or any > attachments. This email is for informational purposes only and does not > constitute a recommendation, offer, request, or solicitation of any kind to > buy, sell, subscribe, redeem, or perform any type of transaction of a > financial product. Personal data, as defined by applicable data privacy > laws, contained in this email may be processed by the Company, and any of > its affiliated or related companies, for potential ongoing compliance > and/or business-related purposes. You may have rights regarding your > personal data; for information on exercising these rights or the Company?s > treatment of personal data, please email datarequests at jumptrading.com. > > ------------------------------ > > Note: This email is for the confidential use of the named addressee(s) > only and may contain proprietary, confidential, or privileged information > and/or personal data. 
If you are not the intended recipient, you are hereby > notified that any review, dissemination, or copying of this email is > strictly prohibited, and requested to notify the sender immediately and > destroy this email and any attachments. Email transmission cannot be > guaranteed to be secure or error-free. The Company, therefore, does not > make any guarantees as to the completeness or accuracy of this email or any > attachments. This email is for informational purposes only and does not > constitute a recommendation, offer, request, or solicitation of any kind to > buy, sell, subscribe, redeem, or perform any type of transaction of a > financial product. Personal data, as defined by applicable data privacy > laws, contained in this email may be processed by the Company, and any of > its affiliated or related companies, for potential ongoing compliance > and/or business-related purposes. You may have rights regarding your > personal data; for information on exercising these rights or the Company?s > treatment of personal data, please email datarequests at jumptrading.com. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Tue Sep 25 22:40:50 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Tue, 25 Sep 2018 17:40:50 -0400 Subject: [gpfsug-discuss] replicating ACLs across GPFS's? In-Reply-To: <3c036815d65d4ac0a061f3f1bec11159@jumptrading.com> References: <14417376-CF84-45F8-9461-DBE1A86777D8@bham.ac.uk> <3c036815d65d4ac0a061f3f1bec11159@jumptrading.com> Message-ID: <317dae7d-fbfa-791b-78de-0d5eb2e6a490@nasa.gov> Just to clarify for myself, is it *all* ACLs that aren't being preserved or just NFS4 ACLs that aren't being preserved (e.g. POSIX ACLs work just fine). If it's just NFS4 ACLs, I suspect it might not be too hard to modify rsync based on the existing patches to translate the nfs4_getfacl output to a gpfs_acl_t struct and use gpfs_putacl to write it. https://www.ibm.com/support/knowledgecenter/SSFKCN_4.1.0.4/com.ibm.cluster.gpfs.v4r104.gpfs100.doc/bl1adm_gpfs_acl_t.htm Just bear in mind that, to the best of my knowledge, calls like gpfs_putacl can be vulnerable to symlink attacks. -Aaron On 9/25/18 1:22 PM, Bryan Banister wrote: > Thanks Simon, > > I tried out the older patched version of rsync to see if that would > work, but still not able to preserve ACLs from an non-GPFS source. > There was another thread about this on the user group some time ago as > well (2013!), but doesn?t look like any real solution was found (Copy > ACLs from outside sources > ). > > I?ve also tried tar | tar, but not luck with that either. > > GPFS doesn?t support the nfs4_getacl, nfs4_setfacl, nfs4_editfacl suite > of commands, but maybe that could be added?? > > I could maybe hack something up that would basically crawl the ?outside > source? namespace, using the nfs4_getacl operation get the NFSv4 ACLs, > parse that output, then attempt to use GPFS `mmputacl` to store the ACL > again.? This seems like a horrible way to go, likely prone to mistakes, > tough to validate, nightmare to maintain. > > Anybody got better ideas? > > Thanks! 
> > -Bryan > > *From:* gpfsug-discuss-bounces at spectrumscale.org > *On Behalf Of *Simon Thompson > *Sent:* Friday, September 14, 2018 8:37 AM > *To:* gpfsug main discussion list > *Subject:* Re: [gpfsug-discuss] replicating ACLs across GPFS's? > > [EXTERNAL EMAIL] > > Oh I also heard a rumour of some sort of mmcopy type sample script, but > I can?t see it in samples on 5.0.1-2? > > Simon > > *From: * > on behalf of Simon > Thompson > > *Reply-To: *"gpfsug-discuss at spectrumscale.org > " > > > *Date: *Friday, 14 September 2018 at 09:41 > *To: *"gpfsug-discuss at spectrumscale.org > " > > > *Subject: *Re: [gpfsug-discuss] replicating ACLs across GPFS's? > > Last time I built was still against 3.0.9, note there is also a PR in > there which fixes the bug with symlinks. > > If anyone wants to rebase the patches against 3.1.3, I?ll happily take a > PR ? > > Simon > > *From: * > on behalf of > "bbanister at jumptrading.com " > > > *Reply-To: *"gpfsug-discuss at spectrumscale.org > " > > > *Date: *Friday, 14 September 2018 at 00:33 > *To: *"gpfsug-discuss at spectrumscale.org > " > > > *Subject: *[gpfsug-discuss] replicating ACLs across GPFS's? > > I?m checking in on this thread.? Is this patch still working for people > with the latest rsync releases? > > https://github.com/gpfsug/gpfsug-tools/tree/master/bin/rsync > > Thanks! > > -Bryan > > ------------------------------------------------------------------------ > > > Note: This email is for the confidential use of the named addressee(s) > only and may contain proprietary, confidential, or privileged > information and/or personal data. If you are not the intended recipient, > you are hereby notified that any review, dissemination, or copying of > this email is strictly prohibited, and requested to notify the sender > immediately and destroy this email and any attachments. Email > transmission cannot be guaranteed to be secure or error-free. The > Company, therefore, does not make any guarantees as to the completeness > or accuracy of this email or any attachments. This email is for > informational purposes only and does not constitute a recommendation, > offer, request, or solicitation of any kind to buy, sell, subscribe, > redeem, or perform any type of transaction of a financial product. > Personal data, as defined by applicable data privacy laws, contained in > this email may be processed by the Company, and any of its affiliated or > related companies, for potential ongoing compliance and/or > business-related purposes. You may have rights regarding your personal > data; for information on exercising these rights or the Company?s > treatment of personal data, please email datarequests at jumptrading.com > . > > > ------------------------------------------------------------------------ > > Note: This email is for the confidential use of the named addressee(s) > only and may contain proprietary, confidential, or privileged > information and/or personal data. If you are not the intended recipient, > you are hereby notified that any review, dissemination, or copying of > this email is strictly prohibited, and requested to notify the sender > immediately and destroy this email and any attachments. Email > transmission cannot be guaranteed to be secure or error-free. The > Company, therefore, does not make any guarantees as to the completeness > or accuracy of this email or any attachments. 
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>

-- 
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776

From jonathan.buzzard at strath.ac.uk  Wed Sep 26 18:13:41 2018
From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard)
Date: Wed, 26 Sep 2018 18:13:41 +0100
Subject: [gpfsug-discuss] replicating ACLs across GPFS's?
In-Reply-To: <3c036815d65d4ac0a061f3f1bec11159@jumptrading.com>
References: <14417376-CF84-45F8-9461-DBE1A86777D8@bham.ac.uk>
	<3c036815d65d4ac0a061f3f1bec11159@jumptrading.com>
Message-ID: <1537982021.17046.48.camel@strath.ac.uk>

On Tue, 2018-09-25 at 17:22 +0000, Bryan Banister wrote:
> Thanks Simon,
>
> I tried out the older patched version of rsync to see if that would
> work, but still not able to preserve ACLs from a non-GPFS source.
> There was another thread about this on the user group some time ago
> as well (2013!), but doesn't look like any real solution was found
> (Copy ACLs from outside sources).
>
> I've also tried tar | tar, but no luck with that either.
>
> GPFS doesn't support the nfs4_getacl, nfs4_setfacl, nfs4_editfacl
> suite of commands, but maybe that could be added??
>

Well no, they work completely differently. However I did write about
this last month. You can do this by modifying just nfs4_acl_for_path.c
and nfs4_set_acl.c so they read/write the GPFS ACL struct and convert
between the GPFS representation and the internal data structure used by
the nfs4-acl-tools to hold NFSv4 ACL's. I have it working for
nfs4_getacl. Though this in of itself gets nothing over mmgetacl, other
than proving the concept valid. I don't have a test GPFS cluster these
days so I need to tread very lightly.

However I had some questions that I was hoping someone from IBM might
answer but didn't, and have been busy since. Namely

1. What's the purpose of a special flag to indicate that it is smbd
setting the ACL? Does this tie in with the undocumented "mmchfs -k
samba" feature?

2. There is a whole bunch of stuff in the documentation about v4.1
ACL's. How does one trigger that? All I seem to be able to do is
get POSIX and v4 ACL's. Do you get v4.1 ACL's if you set the file
system to "Samba" ACL's?

> I could maybe hack something up that would basically crawl the
> "outside source" namespace, using the nfs4_getacl operation to get the
> NFSv4 ACLs, parse that output, then attempt to use GPFS `mmputacl` to
> store the ACL again. This seems like a horrible way to go, likely
> prone to mistakes, tough to validate, nightmare to maintain.
>

I have said it before and will say it again, mmputacl is an
abomination that needs to be put down with extreme prejudice.
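A minimal sketch of the crawl Bryan describes, but using the route he reports
working earlier in the thread (applying the ACLs through an NFSv4 client mount of
the GPFS filesystem exported from a CES node, rather than mmputacl); the mount
points are assumptions:

    # SRC: the non-GPFS source tree, NFSv4-mounted.
    # DST: the GPFS filesystem, NFSv4-mounted from a CES protocol node.
    SRC=/mnt/source
    DST=/mnt/gpfs-nfs4
    tmp=$(mktemp)
    ( cd "$SRC" && find . -depth -print0 ) | while IFS= read -r -d '' f; do
        # read the NFSv4 ACL from the source and re-apply it to the copied file
        if nfs4_getfacl "$SRC/$f" > "$tmp" 2>/dev/null; then
            nfs4_setfacl -S "$tmp" "$DST/$f" || echo "ACL failed: $f" >&2
        fi
    done
    rm -f "$tmp"

This assumes both ends are mounted with NFS version 4 so the ACLs are visible to
the nfs4-acl-tools, and it sidesteps mmputacl entirely.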
I still think that longer term it would be better to modify?FreeBSD's setfacl/getfacl (say renamed to mmsetfacl and mmgetfacl) to do the job, on the basis that they handle both POSIX and NFSv4 ACL's in a? single command. Though strictly speaking you only need an mmsetfacl. Perhaps a RFE? JAB. -- Jonathan A. Buzzard?????????????????????????Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From S.J.Thompson at bham.ac.uk Wed Sep 26 18:40:26 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Wed, 26 Sep 2018 17:40:26 +0000 Subject: [gpfsug-discuss] replicating ACLs across GPFS's? In-Reply-To: <1537982021.17046.48.camel@strath.ac.uk> References: <14417376-CF84-45F8-9461-DBE1A86777D8@bham.ac.uk> <3c036815d65d4ac0a061f3f1bec11159@jumptrading.com>, <1537982021.17046.48.camel@strath.ac.uk> Message-ID: Don't forget we have the upcoming pitch you RFE online meeting. RFEs have not been flooding in and registrations for the pitch meeting are rather thin on the ground... Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jonathan Buzzard [jonathan.buzzard at strath.ac.uk] Sent: 26 September 2018 18:13 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] replicating ACLs across GPFS's? On Tue, 2018-09-25 at 17:22 +0000, Bryan Banister wrote: > Thanks Simon, > > I tried out the older patched version of rsync to see if that would > work, but still not able to preserve ACLs from an non-GPFS source. > There was another thread about this on the user group some time ago > as well (2013!), but doesn?t look like any real solution was found > (Copy ACLs from outside sources). > > I?ve also tried tar | tar, but not luck with that either. > > GPFS doesn?t support the nfs4_getacl, nfs4_setfacl, nfs4_editfacl > suite of commands, but maybe that coulnfs4_acl_for_path.d be added?? > Well no they work completely differently. However I did write about this last month. You can do this by modifying just nfs4_acl_for_path.c and nfs4_set_acl.c so they read/write the GPFS ACL struct and convert between the GPFS representation and the internal data structure used by the nfs4-acl-tools to hold NFSv4 ACL's. I have it working for nfs4_getacl. Though this in of itself gets nothing over mmgetacl, other than proving the concept valid. I don't have a test GPFS cluster these days so I need to tread very lightly. However I had some questions that I was hoping someone from IBM might answer but didn't and have been busy since. Namely 1. What's the purpose of a special flag to indicate that it is smbd setting the ACL? Does this tie in with the undocumented "mmchfs -k samba" feature? 2. There is a whole bunch of stuff in the documentation about v4.1 ACL's. How does one trigger that. All I seem to be able to do is get POSIX and v4 ACL's. Do you get v4.1 ACL's if you set the file system to "Samba" ACL's? > I could maybe hack something up that would basically crawl the > ?outside source? namespace, using the nfs4_getacl operation get the > NFSv4 ACLs, parse that output, then attempt to use GPFS `mmputacl` to > store the ACL again. This seems like a horrible way to go, likely > prone to mistakes, tough to validate, nightmare to maintain. > I have said it before and will say it again, mmputacl is an abomination that needs to be put down with extreme prejudice. 
I still think that longer term it would be better to modify FreeBSD's setfacl/getfacl (say renamed to mmsetfacl and mmgetfacl) to do the job, on the basis that they handle both POSIX and NFSv4 ACL's in a single command. Though strictly speaking you only need an mmsetfacl. Perhaps a RFE? JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From jtucker at pixitmedia.com Wed Sep 26 18:43:35 2018 From: jtucker at pixitmedia.com (Jez Tucker) Date: Wed, 26 Sep 2018 18:43:35 +0100 Subject: [gpfsug-discuss] replicating ACLs across GPFS's? In-Reply-To: References: <14417376-CF84-45F8-9461-DBE1A86777D8@bham.ac.uk> <3c036815d65d4ac0a061f3f1bec11159@jumptrading.com> <1537982021.17046.48.camel@strath.ac.uk> Message-ID: Hey Carl, ? Are the 4 RFEs I've sent you Q2/18 under this new system or do I need to resubmit them? Jez On 26/09/18 18:40, Simon Thompson wrote: > Don't forget we have the upcoming pitch you RFE online meeting. > > RFEs have not been flooding in and registrations for the pitch meeting are rather thin on the ground... > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jonathan Buzzard [jonathan.buzzard at strath.ac.uk] > Sent: 26 September 2018 18:13 > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] replicating ACLs across GPFS's? > > On Tue, 2018-09-25 at 17:22 +0000, Bryan Banister wrote: >> Thanks Simon, >> >> I tried out the older patched version of rsync to see if that would >> work, but still not able to preserve ACLs from an non-GPFS source. >> There was another thread about this on the user group some time ago >> as well (2013!), but doesn?t look like any real solution was found >> (Copy ACLs from outside sources). >> >> I?ve also tried tar | tar, but not luck with that either. >> >> GPFS doesn?t support the nfs4_getacl, nfs4_setfacl, nfs4_editfacl >> suite of commands, but maybe that coulnfs4_acl_for_path.d be added?? >> > Well no they work completely differently. However I did write about > this last month. You can do this by modifying just nfs4_acl_for_path.c > and nfs4_set_acl.c so they read/write the GPFS ACL struct and convert > between the GPFS representation and the internal data structure used by > the nfs4-acl-tools to hold NFSv4 ACL's. I have it working for > nfs4_getacl. Though this in of itself gets nothing over mmgetacl, other > than proving the concept valid. I don't have a test GPFS cluster these > days so I need to tread very lightly. > > However I had some questions that I was hoping someone from IBM might > answer but didn't and have been busy since. Namely > > 1. What's the purpose of a special flag to indicate that it is smbd > setting the ACL? Does this tie in with the undocumented "mmchfs -k > samba" feature? > > 2. There is a whole bunch of stuff in the documentation about v4.1 > ACL's. How does one trigger that. All I seem to be able to do is > get POSIX and v4 ACL's. Do you get v4.1 ACL's if you set the file > system to "Samba" ACL's? > >> I could maybe hack something up that would basically crawl the >> ?outside source? namespace, using the nfs4_getacl operation get the >> NFSv4 ACLs, parse that output, then attempt to use GPFS `mmputacl` to >> store the ACL again. 
This seems like a horrible way to go, likely >> prone to mistakes, tough to validate, nightmare to maintain. >> > I have said it before and will say it again, mmputacl is an > abomination that needs to be put down with extreme prejudice. > > I still think that longer term it would be better to modify FreeBSD's > setfacl/getfacl (say renamed to mmsetfacl and mmgetfacl) to do the job, > on the basis that they handle both POSIX and NFSv4 ACL's in a > single command. Though strictly speaking you only need an mmsetfacl. > > Perhaps a RFE? > > JAB. > > -- > Jonathan A. Buzzard Tel: +44141-5483420 > HPC System Administrator, ARCHIE-WeSt. > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- *Jez Tucker* Head of Research and Development, Pixit Media 07764193820 | jtucker at pixitmedia.com www.pixitmedia.com | Tw:@pixitmedia.com -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Wed Sep 26 18:50:00 2018 From: bbanister at jumptrading.com (Bryan Banister) Date: Wed, 26 Sep 2018 17:50:00 +0000 Subject: [gpfsug-discuss] replicating ACLs across GPFS's? In-Reply-To: References: <14417376-CF84-45F8-9461-DBE1A86777D8@bham.ac.uk> <3c036815d65d4ac0a061f3f1bec11159@jumptrading.com>, <1537982021.17046.48.camel@strath.ac.uk> Message-ID: <2e40bed94c9744638aabbc9e4dbfb525@jumptrading.com> I was thinking the same thing Simon. Johnathan, if you're interested in working together on this RFE, then I'm happy to help! Just hit me up off list. Thanks for your response! -Bryan -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Simon Thompson Sent: Wednesday, September 26, 2018 12:40 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] replicating ACLs across GPFS's? [EXTERNAL EMAIL] Don't forget we have the upcoming pitch you RFE online meeting. RFEs have not been flooding in and registrations for the pitch meeting are rather thin on the ground... Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jonathan Buzzard [jonathan.buzzard at strath.ac.uk] Sent: 26 September 2018 18:13 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] replicating ACLs across GPFS's? On Tue, 2018-09-25 at 17:22 +0000, Bryan Banister wrote: > Thanks Simon, > > I tried out the older patched version of rsync to see if that would > work, but still not able to preserve ACLs from an non-GPFS source. 
> There was another thread about this on the user group some time ago as > well (2013!), but doesn?t look like any real solution was found (Copy > ACLs from outside sources). > > I?ve also tried tar | tar, but not luck with that either. > > GPFS doesn?t support the nfs4_getacl, nfs4_setfacl, nfs4_editfacl > suite of commands, but maybe that coulnfs4_acl_for_path.d be added?? > Well no they work completely differently. However I did write about this last month. You can do this by modifying just nfs4_acl_for_path.c and nfs4_set_acl.c so they read/write the GPFS ACL struct and convert between the GPFS representation and the internal data structure used by the nfs4-acl-tools to hold NFSv4 ACL's. I have it working for nfs4_getacl. Though this in of itself gets nothing over mmgetacl, other than proving the concept valid. I don't have a test GPFS cluster these days so I need to tread very lightly. However I had some questions that I was hoping someone from IBM might answer but didn't and have been busy since. Namely 1. What's the purpose of a special flag to indicate that it is smbd setting the ACL? Does this tie in with the undocumented "mmchfs -k samba" feature? 2. There is a whole bunch of stuff in the documentation about v4.1 ACL's. How does one trigger that. All I seem to be able to do is get POSIX and v4 ACL's. Do you get v4.1 ACL's if you set the file system to "Samba" ACL's? > I could maybe hack something up that would basically crawl the > ?outside source? namespace, using the nfs4_getacl operation get the > NFSv4 ACLs, parse that output, then attempt to use GPFS `mmputacl` to > store the ACL again. This seems like a horrible way to go, likely > prone to mistakes, tough to validate, nightmare to maintain. > I have said it before and will say it again, mmputacl is an abomination that needs to be put down with extreme prejudice. I still think that longer term it would be better to modify FreeBSD's setfacl/getfacl (say renamed to mmsetfacl and mmgetfacl) to do the job, on the basis that they handle both POSIX and NFSv4 ACL's in a single command. Though strictly speaking you only need an mmsetfacl. Perhaps a RFE? JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7Cbbanister%40jumptrading.com%7C79b6b05f19774e69f7c508d623d72bd8%7C11f2af738873424085a3063ce66fc61c%7C1%7C0%7C636735804397065948&sdata=UXl8e7i4Lw9aT023MqI5ys3hG3t8Trk1rMaq1toluxM%3D&reserved=0 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7Cbbanister%40jumptrading.com%7C79b6b05f19774e69f7c508d623d72bd8%7C11f2af738873424085a3063ce66fc61c%7C1%7C0%7C636735804397065948&sdata=UXl8e7i4Lw9aT023MqI5ys3hG3t8Trk1rMaq1toluxM%3D&reserved=0 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential, or privileged information and/or personal data. 
From olaf.weiser at de.ibm.com  Thu Sep 27 13:55:35 2018
From: olaf.weiser at de.ibm.com (Olaf Weiser)
Date: Thu, 27 Sep 2018 14:55:35 +0200
Subject: [gpfsug-discuss] IBM ESS - certified now for SAP
Message-ID: 

An HTML attachment was scrubbed...
URL: 

From Robert.Oesterlin at nuance.com  Thu Sep 27 15:22:23 2018
From: Robert.Oesterlin at nuance.com (Oesterlin, Robert)
Date: Thu, 27 Sep 2018 14:22:23 +0000
Subject: [gpfsug-discuss] Spectrum Scale User Group Meeting - NYC - New York Genome Center
Message-ID: <7E34B1A5-2412-4415-9095-C52EDDCE2A04@nuance.com>

For those of you in the NE US or NYC area, here is the agenda for the NYC meeting
coming up on October 24th. Special thanks to Richard Rupp at IBM for helping to
organize this event. If you can make it, please register at the Eventbrite link below.

Spectrum Scale User Group - NYC
October 24th, 2018
The New York Genome Center
101 Avenue of the Americas, New York, NY 10013
First Floor Auditorium

Register Here: https://www.eventbrite.com/e/2018-spectrum-scale-user-group-nyc-tickets-49786782607

08:45-09:00 Coffee & Registration
09:00-09:15 Welcome
09:15-09:45 What is new in IBM Spectrum Scale?
09:45-10:00 What is new in ESS?
10:00-10:20 How does CORAL help other workloads?
10:20-10:40 --- Break ---
10:40-11:00 Customer Talk - The New York Genome Center
11:00-11:20 Spinning up a Hadoop cluster on demand
11:20-11:40 Customer Talk - Mt. Sinai School of Medicine
11:40-12:10 Spectrum Scale Network Flow
12:10-13:00 --- Lunch ---
13:00-13:40 Special Announcement and Demonstration
13:40-14:00 Multi-cloud Transparent Cloud Tiering
14:00-14:20 Customer Talk - Princeton University
14:20-14:40 AI Reference Architecture
14:40-15:00 Updates on Container Support
15:00-15:20 Customer Talk - TBD
15:20-15:40 --- Break ---
15:40-16:10 IBM Spectrum Scale Tuning and Troubleshooting
16:10-16:40 Service Update
16:40-17:10 Open Forum
17:10-17:30 Wrap-Up
17:30-      Social Event

Bob Oesterlin
Sr Principal Storage Engineer, Nuance

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From Kevin.Buterbaugh at Vanderbilt.Edu  Thu Sep 27 16:04:05 2018
From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L)
Date: Thu, 27 Sep 2018 15:04:05 +0000
Subject: [gpfsug-discuss] What is this error message telling me?
Message-ID:

Hi All,

2018-09-27_09:48:50.923-0500: [E] The TCP connection to IP address 1.2.3.4 some client (socket 442) state is unexpected: ca_state=1 unacked=3 rto=27008000

Seeing errors like the above and trying to track down the root cause. I know that at last week's GPFS User Group meeting at ORNL this very error message was discussed, but I don't recall the details and the slides haven't been posted to the website yet. IIRC, the 'rto' is significant ...

I've Googled, but haven't gotten any hits, nor have I found anything in the GPFS 4.2.2 Problem Determination Guide.

Thanks in advance...

Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jjdoherty at yahoo.com Thu Sep 27 17:00:03 2018
From: jjdoherty at yahoo.com (Jim Doherty)
Date: Thu, 27 Sep 2018 16:00:03 +0000 (UTC)
Subject: [gpfsug-discuss] What is this error message telling me?
In-Reply-To:
References:
Message-ID: <955514653.983560.1538064003171@mail.yahoo.com>

The data is also shown in an internaldump as part of the mmfsadm dump tscomm data; the RTO & RTT times are listed in microseconds. So the RTO here in my example is 18.5 seconds (see below). You can get the same information from the Linux networking command ss -i. The normal setting for RTO is 200 ms. Seeing retransmits and backoffs will drive up the RTO time. When I look at internaldumps from node expels it is not unusual to see 13 backoffs and retransmits and the RTO to have hit 120 seconds, at which point the tcp/ip connection times out.

 10.0.0.31.24/0
    state 1 established snd_wscale 10 rcv_wscale 10 rto 18558000 ato 40000
    retransmits 4 probes 0 backoff 4 options: TSTAMP SACK WSCALE
    rtt 2761650 rttvar 3238039 snd_ssthresh 4 snd_cwnd 5 unacked 0
    snd_mss 1992 rcv_mss 1992 pmtu 2044 advmss 1992 rcv_ssthresh 157708
    sacked 0 lost 0 retrans 0 fackets 0 reordering 3 ca_state 'open'

Jim

On Thursday, September 27, 2018, 11:14:43 AM EDT, Buterbaugh, Kevin L wrote:

Hi All,

2018-09-27_09:48:50.923-0500: [E] The TCP connection to IP address 1.2.3.4 some client (socket 442) state is unexpected: ca_state=1 unacked=3 rto=27008000

Seeing errors like the above and trying to track down the root cause. I know that at last week's GPFS User Group meeting at ORNL this very error message was discussed, but I don't recall the details and the slides haven't been posted to the website yet. IIRC, the 'rto' is significant ...

I've Googled, but haven't gotten any hits, nor have I found anything in the GPFS 4.2.2 Problem Determination Guide.

Thanks in advance...

Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From aaron.s.knister at nasa.gov Thu Sep 27 17:03:31 2018
From: aaron.s.knister at nasa.gov (Aaron Knister)
Date: Thu, 27 Sep 2018 12:03:31 -0400
Subject: [gpfsug-discuss] What is this error message telling me?
In-Reply-To:
References:
Message-ID:

Kevin,

Is the communication in this case by chance using IPoIB in connected mode?
-Aaron On 9/27/18 11:04 AM, Buterbaugh, Kevin L wrote: > Hi All, > > 2018-09-27_09:48:50.923-0500: [E] The TCP connection to IP address > 1.2.3.4 some client (socket 442) state is unexpected: > ca_state=1 unacked=3 rto=27008000 > > Seeing errors like the above and trying to track down the root cause. ?I > know that at last weeks? GPFS User Group meeting at ORNL this very error > message was discussed, but I don?t recall the details and the slides > haven?t been posted to the website yet. ?IIRC, the ?rto? is significant ? > > I?ve Googled, but haven?t gotten any hits, nor have I found anything in > the GPFS 4.2.2 Problem Determination Guide. > > Thanks in advance? > > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and Education > Kevin.Buterbaugh at vanderbilt.edu > ?- (615)875-9633 > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From jlewars at us.ibm.com Thu Sep 27 17:37:34 2018 From: jlewars at us.ibm.com (John Lewars) Date: Thu, 27 Sep 2018 12:37:34 -0400 Subject: [gpfsug-discuss] Fw: What is this error message telling me? In-Reply-To: References: Message-ID: Hi Kevin, The message below indicates that the mmfsd code had a pending message on a socket, and, when it looked at the low level socket statistics, GPFS found indications that the TCP connection was in a 'bad state'. GPFS determines a connection to be a 'bad state' if: 1) the CA_STATE for the socket is not in 0 (or open) state, which means the state must be disorder, recovery, or loss. See this paper for more details on CA_STATE: https://wiki.aalto.fi/download/attachments/69901948/TCP-CongestionControlFinal.pdf or 2) the RTO is greater than 10 seconds and there are unacknowledged messages pending on the socket (unacked > 0). In the example below we see that rto=27008000, which means that the non-fast path TCP retransmission timeout is about 27 seconds, and that probably means the connection has experienced significant packet loss. If there was no expel following this message, I would suspect there was some transient packet loss that was recovered from. There are plenty of places in which to find more details on RTO, but you might want to start with wikipedia ( https://en.wikipedia.org/wiki/Transmission_Control_Protocol) which states: In addition, senders employ a retransmission timeout (RTO) that is based on the estimated round-trip time (or RTT) between the sender and receiver, as well as the variance in this round trip time. The behavior of this timer is specified in RFC 6298. There are subtleties in the estimation of RTT. For example, senders must be careful when calculating RTT samples for retransmitted packets; typically they use Karn's Algorithm or TCP timestamps (see RFC 1323). These individual RTT samples are then averaged over time to create a Smoothed Round Trip Time (SRTT) using Jacobson's algorithm. This SRTT value is what is finally used as the round-trip time estimate. [. . .] Reliability is achieved by the sender detecting lost data and retransmitting it. TCP uses two primary techniques to identify loss. Retransmission timeout (abbreviated as RTO) and duplicate cumulative acknowledgements (DupAcks). 
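As Jim showed earlier in the thread, the kernel exposes the same counters the daemon is checking here. For a quick spot check on a suspect node, something like the following can be used (a sketch only; it assumes iproute2's ss is available and that the daemon is using the default GPFS port 1191):

# Show kernel TCP info for Spectrum Scale daemon connections.
# Note that ss reports rto/rtt in milliseconds, whereas the mmfs.log
# message and 'mmfsadm dump tscomm' report them in microseconds.
ss -tin '( sport = :1191 or dport = :1191 )'

A connection in trouble will typically show a large rto together with non-zero retrans/unacked counters.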
Note that older versions of the Spectrum Scale code had a third criteria in checking for 'bad state', which included checking if unacked was greater than 8, but that check would sometimes call-out a socket that was working fine, so this third check has been removed via the APAR IJ02566. All Spectrum Scale V5 code has this fix and the 4.2.X code stream picked up this fix in PTF 7 (4.2.3.7 ships APAR IJ02566). More details on debugging expels using these TCP connection messages are in the presentation you referred to, which I posted here: https://www.ibm.com/developerworks/community/wikis/home?lang=en_us#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/DEBUG%20Expels Regards, John Lewars Technical Computing Development, IBM Poughkeepsie ----- Forwarded by Lyle Gayne/Poughkeepsie/IBM on 09/27/2018 11:15 AM ----- Hi All, 2018-09-27_09:48:50.923-0500: [E] The TCP connection to IP address 1.2.3.4 some client (socket 442) state is unexpected: ca_state=1 unacked=3 rto=27008000 Seeing errors like the above and trying to track down the root cause. I know that at last weeks? GPFS User Group meeting at ORNL this very error message was discussed, but I don?t recall the details and the slides haven?t been posted to the website yet. IIRC, the ?rto? is significant ? I?ve Googled, but haven?t gotten any hits, nor have I found anything in the GPFS 4.2.2 Problem Determination Guide. Thanks in advance? ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From valleru at cbio.mskcc.org Thu Sep 27 20:31:00 2018 From: valleru at cbio.mskcc.org (valleru at cbio.mskcc.org) Date: Thu, 27 Sep 2018 15:31:00 -0400 Subject: [gpfsug-discuss] GPFS, Pagepool and Block size -> Perfomance reduces with larger block size In-Reply-To: References: <6bb509b7-b7c5-422d-8e27-599333b6b7c4@Spark> <013aeb31-ebd2-4cc7-97d1-06883d9569f7@Spark> Message-ID: <243c5d36-f25e-4ebb-b9f3-6fc47bc6d93c@Spark> Thank you Sven, Turning of prefetching did not improve the performance, but it did degrade a bit. I have made the prefetching default and took trace dump, for tracectl with trace=io. Let me know if you want me to paste/attach it here. May i know, how could i confirm if the below is true? > > > > > > > 1. this could be serialization around buffer locks. as larger your blocksize gets as larger is the amount of data one of this pagepool buffers will maintain, if there is a lot of concurrency on smaller amount of data more threads potentially compete for the same buffer lock to copy stuff in and out of a particular buffer, hence things go slower compared to the same amount of data spread across more buffers, each of smaller size. > > > > > > > Will the above trace help in understanding if it is a serialization issue? I had been discussing the same with GPFS support for past few months, and it seems to be that most of the time is being spent at?cxiUXfer. They could not understand on why it is taking spending so much of time in cxiUXfer. I was seeing the same from perf top, and pagefaults. Below is snippet from what the support had said : ???????????????????????????? I searched all of the gpfsRead from trace and sort them by spending-time. 
Except 2 reads which need fetch data from nsd server, the slowest read is in the thread 72170. It took 112470.362 us. trcrpt.2018-08-06_12.27.39.55538.lt15.trsum:?? 72165?????? 6.860911319 rdwr?????????????????? 141857.076 us + NSDIO trcrpt.2018-08-06_12.26.28.39794.lt15.trsum:?? 72170?????? 1.483947593 rdwr?????????????????? 112470.362 us + cxiUXfer trcrpt.2018-08-06_12.27.39.55538.lt15.trsum:?? 72165?????? 6.949042593 rdwr??????????????????? 88126.278 us + NSDIO trcrpt.2018-08-06_12.27.03.47706.lt15.trsum:?? 72156?????? 2.919334474 rdwr??????????????????? 81057.657 us + cxiUXfer trcrpt.2018-08-06_12.23.30.72745.lt15.trsum:?? 72154?????? 1.167484466 rdwr??????????????????? 76033.488 us + cxiUXfer trcrpt.2018-08-06_12.24.06.7508.lt15.trsum:?? 72187?????? 0.685237501 rdwr??????????????????? 70772.326 us + cxiUXfer trcrpt.2018-08-06_12.25.17.23989.lt15.trsum:?? 72193?????? 4.757996530 rdwr??????????????????? 70447.838 us + cxiUXfer I check each of the slow IO as above, and find they all spend much time in the function cxiUXfer. This function is used to copy data from kernel buffer to user buffer. I am not sure why it took so much time. This should be related to the pagefaults and pgfree you observed. Below is the trace data for thread 72170. ?????????????????? 1.371477231? 72170 TRACE_VNODE: gpfs_f_rdwr enter: fP 0xFFFF882541649400 f_flags 0x8000 flags 0x8001 op 0 iovec 0xFFFF881F2AFB3E70 count 1 offset 0x168F30D dentry 0xFFFF887C0CC298C0 private 0xFFFF883F607175C0 iP 0xFFFF8823AA3CBFC0 name '410513.svs' ????????????? .... ?????????????????? 1.371483547? 72170 TRACE_KSVFS: cachedReadFast exit: uio_resid 16777216 code 1 err 11 ????????????? .... ?????????????????? 1.371498780? 72170 TRACE_KSVFS: kSFSReadFast: oiP 0xFFFFC90060B46740 offset 0x168F30D dataBufP FFFFC9003645A5A8 nDesc 64 buf 200043C0000 valid words 64 dirty words 0 blkOff 0 ?????????????????? 1.371499035? 72170 TRACE_LOG: UpdateLogger::beginDataUpdate begin ul 0xFFFFC900333F1A40 holdCount 0 ioType 0x2 inProg 0x15 ?????????????????? 1.371500157? 72170 TRACE_LOG: UpdateLogger::beginDataUpdate ul 0xFFFFC900333F1A40 holdCount 0 ioType 0x2 inProg 0x16 err 0 ?????????????????? 1.371500606? 72170 TRACE_KSVFS: cxiUXfer: nDesc 64 1st dataPtr 0x200043C0000 plP 0xFFFF887F7B90D600 toIOBuf 0 offset 6877965 len 9899251 ?????????????????? 1.371500793? 72170 TRACE_KSVFS: cxiUXfer: ndesc 0 skip dataAddrP 0x200043C0000 currOffset 0 currLen 262144 bufOffset 6877965 ????????????? .... ?????????????????? 1.371505949? 72170 TRACE_KSVFS: cxiUXfer: ndesc 25 skip dataAddrP 0x2001AF80000 currOffset 6553600 currLen 262144 bufOffset 6877965 ?????????????????? 1.371506236? 72170 TRACE_KSVFS: cxiUXfer: nDesc 26 currOffset 6815744 tmpLen 262144 dataAddrP 0x2001AFCF30D currLen 199923 pageOffset 781 pageLen 3315 plP 0xFFFF887F7B90D600 ?????????????????? 1.373649823? 72170 TRACE_KSVFS: cxiUXfer: nDesc 27 currOffset 7077888 tmpLen 262144 dataAddrP 0x20027400000 currLen 262144 pageOffset 0 pageLen 4096 plP 0xFFFF887F7B90D600 ?????????????????? 1.375158799? 72170 TRACE_KSVFS: cxiUXfer: nDesc 28 currOffset 7340032 tmpLen 262144 dataAddrP 0x20027440000 currLen 262144 pageOffset 0 pageLen 4096 plP 0xFFFF887F7B90D600 ?????????????????? 1.376661566? 72170 TRACE_KSVFS: cxiUXfer: nDesc 29 currOffset 7602176 tmpLen 262144 dataAddrP 0x2002C180000 currLen 262144 pageOffset 0 pageLen 4096 plP 0xFFFF887F7B90D600 ?????????????????? 1.377892653? 
72170 TRACE_KSVFS: cxiUXfer: nDesc 30 currOffset 7864320 tmpLen 262144 dataAddrP 0x2002C1C0000 currLen 262144 pageOffset 0 pageLen 4096 plP 0xFFFF887F7B90D600 ????????????? .... ?????????????????? 1.471389843? 72170 TRACE_KSVFS: cxiUXfer: nDesc 62 currOffset 16252928 tmpLen 262144 dataAddrP 0x2001D2C0000 currLen 262144 pageOffset 0 pageLen 4096 plP 0xFFFF887F7B90D600 ?????????????????? 1.471845629? 72170 TRACE_KSVFS: cxiUXfer: nDesc 63 currOffset 16515072 tmpLen 262144 dataAddrP 0x2003EC80000 currLen 262144 pageOffset 0 pageLen 4096 plP 0xFFFF887F7B90D600 ?????????????????? 1.472417149? 72170 TRACE_KSVFS: cxiDetachIOBuffer: dataPtr 0x200043C0000 plP 0xFFFF887F7B90D600 ?????????????????? 1.472417775? 72170 TRACE_LOCK: unlock_vfs: type Data, key 0000000000000004:000000001B1F24BF:0000000000000001 lock_mode have ro token xw lock_state old [ ro:27 ] new [ ro:26 ] holdCount now 27 ?????????????????? 1.472418427? 72170 TRACE_LOCK: hash tab lookup vfs: found cP 0xFFFFC9005FC0CDE0 holdCount now 14 ?????????????????? 1.472418592? 72170 TRACE_LOCK: lock_vfs: type Data key 0000000000000004:000000001B1F24BF:0000000000000002 lock_mode want ro status valid token xw/xw lock_state [ ro:12 ] flags 0x0 holdCount 14 ?????????????????? 1.472419842? 72170 TRACE_KSVFS: kSFSReadFast: oiP 0xFFFFC90060B46740 offset 0x2000000 dataBufP FFFFC9003643C908 nDesc 64 buf 38033480000 valid words 64 dirty words 0 blkOff 0 ?????????????????? 1.472420029? 72170 TRACE_LOG: UpdateLogger::beginDataUpdate begin ul 0xFFFFC9005FC0CF98 holdCount 0 ioType 0x2 inProg 0xC ?????????????????? 1.472420187? 72170 TRACE_LOG: UpdateLogger::beginDataUpdate ul 0xFFFFC9005FC0CF98 holdCount 0 ioType 0x2 inProg 0xD err 0 ?????????????????? 1.472420652? 72170 TRACE_KSVFS: cxiUXfer: nDesc 64 1st dataPtr 0x38033480000 plP 0xFFFF887F7B934320 toIOBuf 0 offset 0 len 6877965 ?????????????????? 1.472420936? 72170 TRACE_KSVFS: cxiUXfer: nDesc 0 currOffset 0 tmpLen 262144 dataAddrP 0x38033480000 currLen 262144 pageOffset 0 pageLen 4096 plP 0xFFFF887F7B934320 ?????????????????? 1.472824790? 72170 TRACE_KSVFS: cxiUXfer: nDesc 1 currOffset 262144 tmpLen 262144 dataAddrP 0x380334C0000 currLen 262144 pageOffset 0 pageLen 4096 plP 0xFFFF887F7B934320 ?????????????????? 1.473243905? 72170 TRACE_KSVFS: cxiUXfer: nDesc 2 currOffset 524288 tmpLen 262144 dataAddrP 0x38024280000 currLen 262144 pageOffset 0 pageLen 4096 plP 0xFFFF887F7B934320 ????????????? .... ?????????????????? 1.482949347? 72170 TRACE_KSVFS: cxiUXfer: nDesc 24 currOffset 6291456 tmpLen 262144 dataAddrP 0x38025E80000 currLen 262144 pageOffset 0 pageLen 4096 plP 0xFFFF887F7B934320 ?????????????????? 1.483354265? 72170 TRACE_KSVFS: cxiUXfer: nDesc 25 currOffset 6553600 tmpLen 262144 dataAddrP 0x38025EC0000 currLen 262144 pageOffset 0 pageLen 4096 plP 0xFFFF887F7B934320 ?????????????????? 1.483766631? 72170 TRACE_KSVFS: cxiUXfer: nDesc 26 currOffset 6815744 tmpLen 262144 dataAddrP 0x38003B00000 currLen 262144 pageOffset 0 pageLen 4096 plP 0xFFFF887F7B934320 ?????????????????? 1.483943894? 72170 TRACE_KSVFS: cxiDetachIOBuffer: dataPtr 0x38033480000 plP 0xFFFF887F7B934320 ?????????????????? 1.483944339? 72170 TRACE_LOCK: unlock_vfs: type Data, key 0000000000000004:000000001B1F24BF:0000000000000002 lock_mode have ro token xw lock_state old [ ro:14 ] new [ ro:13 ] holdCount now 14 ?????????????????? 1.483944683? 72170 TRACE_BRL: brUnlockM: ofP 0xFFFFC90069346B68 inode 455025855 snap 0 handle 0xFFFFC9003637D020 range 0x168F30D-0x268F30C mode ro ?????????????????? 1.483944985? 
72170 TRACE_KSVFS: kSFSReadFast exit: uio_resid 0 err 0 ?????????????????? 1.483945264? 72170 TRACE_LOCK: unlock_vfs_m: type Inode, key 305F105B9701E60A:000000001B1F24BF:0000000000000000 lock_mode have ro status valid token rs lock_state old [ ro:25 ] new [ ro:24 ] ?????????????????? 1.483945423? 72170 TRACE_LOCK: unlock_vfs_m: cP 0xFFFFC90069346B68 holdCount 25 ?????????????????? 1.483945624? 72170 TRACE_VNODE: gpfsRead exit: fast err 0 ?????????????????? 1.483946831? 72170 TRACE_KSVFS: ReleSG: sli 38 sgP 0xFFFFC90035E52F78 NotQuiesced vfsOp 2 ?????????????????? 1.483946975? 72170 TRACE_KSVFS: ReleSG: sli 38 sgP 0xFFFFC90035E52F78 vfsOp 2 users 1-1 ?????????????????? 1.483947116? 72170 TRACE_KSVFS: ReleaseDaemonSegAndSG: sli 38 count 2 needCleanup 0 ?????????????????? 1.483947593? 72170 TRACE_VNODE: gpfs_f_rdwr exit: fP 0xFFFF882541649400 total_len 16777216 uio_resid 0 offset 0x268F30D rc 0 ??????????????????????????????????????????? Regards, Lohit On Sep 19, 2018, 3:11 PM -0400, Sven Oehme , wrote: > the document primarily explains all performance specific knobs. general advice would be to longer set anything beside workerthreads, pagepool and filecache on 5.X systems as most other settings are no longer relevant (thats a client side statement) . thats is true until you hit strange workloads , which is why all the knobs are still there :-) > > sven > > > > On Wed, Sep 19, 2018 at 11:17 AM wrote: > > > Thanks Sven. > > > I will disable it completely and see how it behaves. > > > > > > Is this the presentation? > > > http://files.gpfsug.org/presentations/2014/UG10_GPFS_Performance_Session_v10.pdf > > > > > > I guess i read it, but it did not strike me at this situation. I will try to read it again and see if i could make use of it. > > > > > > Regards, > > > Lohit > > > > > > On Sep 19, 2018, 2:12 PM -0400, Sven Oehme , wrote: > > > > seem like you never read my performance presentation from a few years ago ;-) > > > > > > > > you can control this on a per node basis , either for all i/o : > > > > > > > > ? ?prefetchAggressiveness = X > > > > > > > > or individual for reads or writes : > > > > > > > > ? ?prefetchAggressivenessRead = X > > > > ? ?prefetchAggressivenessWrite = X > > > > > > > > for a start i would turn it off completely via : > > > > > > > > mmchconfig prefetchAggressiveness=0 -I -N nodename > > > > > > > > that will turn it off only for that node and only until you restart the node. > > > > then see what happens > > > > > > > > sven > > > > > > > > > > > > > On Wed, Sep 19, 2018 at 11:07 AM wrote: > > > > > > Thank you Sven. > > > > > > > > > > > > I mostly think it could be 1. or some other issue. > > > > > > I don?t think it could be 2. , because i can replicate this issue no matter what is the size of the dataset. It happens for few files that could easily fit in the page pool too. > > > > > > > > > > > > I do see a lot more page faults for 16M compared to 1M, so it could be related to many threads trying to compete for the same buffer space. > > > > > > > > > > > > I will try to take the trace with trace=io option and see if can find something. > > > > > > > > > > > > How do i turn of prefetching? Can i turn it off for a single node/client? 
> > > > > > > > > > > > Regards, > > > > > > Lohit > > > > > > > > > > > > On Sep 18, 2018, 5:23 PM -0400, Sven Oehme , wrote: > > > > > > > Hi, > > > > > > > > > > > > > > taking a trace would tell for sure, but i suspect what you might be hitting one or even multiple issues which have similar negative performance impacts but different root causes. > > > > > > > > > > > > > > 1. this could be serialization around buffer locks. as larger your blocksize gets as larger is the amount of data one of this pagepool buffers will maintain, if there is a lot of concurrency on smaller amount of data more threads potentially compete for the same buffer lock to copy stuff in and out of a particular buffer, hence things go slower compared to the same amount of data spread across more buffers, each of smaller size. > > > > > > > > > > > > > > 2. your data set is small'ish, lets say a couple of time bigger than the pagepool and you random access it with multiple threads. what will happen is that because it doesn't fit into the cache it will be read from the backend. if multiple threads hit the same 16 mb block at once with multiple 4k random reads, it will read the whole 16mb block because it thinks it will benefit from it later on out of cache, but because it fully random the same happens with the next block and the next and so on and before you get back to this block it was pushed out of the cache because of lack of enough pagepool. > > > > > > > > > > > > > > i could think?of multiple other scenarios , which is why its so hard to accurately benchmark an application because you will design a benchmark to test an application, but it actually almost always behaves different then you think it does :-) > > > > > > > > > > > > > > so best is to run the real application and see under which configuration it works best. > > > > > > > > > > > > > > you could also take a trace with trace=io and then look at > > > > > > > > > > > > > > TRACE_VNOP: READ: > > > > > > > TRACE_VNOP: WRITE: > > > > > > > > > > > > > > and compare them to > > > > > > > > > > > > > > TRACE_IO: QIO: read > > > > > > > TRACE_IO: QIO: write > > > > > > > > > > > > > > and see if the numbers summed up for both are somewhat equal. if TRACE_VNOP is significant smaller than TRACE_IO you most likely do more i/o than you should and turning prefetching off might actually make things faster . > > > > > > > > > > > > > > keep in mind i am no longer working for IBM so all i say might be obsolete by now, i no longer have access to the one and only truth aka the source code ... but if i am wrong i am sure somebody will point this out soon ;-) > > > > > > > > > > > > > > sven > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Sep 18, 2018 at 10:31 AM wrote: > > > > > > > > > Hello All, > > > > > > > > > > > > > > > > > > This is a continuation to the previous discussion that i had with Sven. > > > > > > > > > However against what i had mentioned previously - i realize that this is ?not? related to mmap, and i see it when doing random freads. > > > > > > > > > > > > > > > > > > I see that block-size of the filesystem matters when reading from Page pool. > > > > > > > > > I see a major difference in performance when compared 1M to 16M, when doing lot of random small freads with all of the data in pagepool. > > > > > > > > > > > > > > > > > > Performance for 1M is a magnitude ?more? than the performance that i see for 16M. 
> > > > > > > > > > > > > > > > > > The GPFS that we have currently is : > > > > > > > > > Version :?5.0.1-0.5 > > > > > > > > > Filesystem version:?19.01 (5.0.1.0) > > > > > > > > > Block-size : 16M > > > > > > > > > > > > > > > > > > I had made the filesystem block-size to be 16M, thinking that i would get the most performance for both random/sequential reads from 16M than the smaller block-sizes. > > > > > > > > > With GPFS 5.0, i made use the 1024 sub-blocks instead of 32 and thus not loose lot of storage space even with 16M. > > > > > > > > > I had run few benchmarks and i did see that 16M was performing better ?when hitting storage/disks? with respect to bandwidth for random/sequential on small/large reads. > > > > > > > > > > > > > > > > > > However, with this particular workload - where it freads a chunk of data randomly from hundreds of files -> I see that the number of page-faults increase with block-size and actually reduce the performance. > > > > > > > > > 1M performs a lot better than 16M, and may be i will get better performance with less than 1M. > > > > > > > > > It gives the best performance when reading from local disk, with 4K block size filesystem. > > > > > > > > > > > > > > > > > > What i mean by performance when it comes to this workload - is not the bandwidth but the amount of time that it takes to do each iteration/read batch of data. > > > > > > > > > > > > > > > > > > I figure what is happening is: > > > > > > > > > fread is trying to read a full block size of 16M - which is good in a way, when it hits the hard disk. > > > > > > > > > But the application could be using just a small part of that 16M. Thus when randomly reading(freads) lot of data of 16M chunk size - it is page faulting a lot more and causing the performance to drop . > > > > > > > > > I could try to make the application do read instead of freads, but i fear that could be bad too since it might be hitting the disk with a very small block size and that is not good. > > > > > > > > > > > > > > > > > > With the way i see things now - > > > > > > > > > I believe it could be best if the application does random reads of 4k/1M from pagepool but some how does 16M from rotating disks. > > > > > > > > > > > > > > > > > > I don?t see any way of doing the above other than following a different approach where i create a filesystem with a smaller block size ( 1M or less than 1M ), on SSDs as a tier. > > > > > > > > > > > > > > > > > > May i please ask for advise, if what i am understanding/seeing is right and the best solution possible for the above scenario. > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > Lohit > > > > > > > > > > > > > > > > > > On Apr 11, 2018, 10:36 AM -0400, Lohit Valleru , wrote: > > > > > > > > > > Hey Sven, > > > > > > > > > > > > > > > > > > > > This is regarding mmap issues and GPFS. > > > > > > > > > > We had discussed previously of experimenting with GPFS 5. > > > > > > > > > > > > > > > > > > > > I now have upgraded all of compute nodes and NSD nodes to GPFS 5.0.0.2 > > > > > > > > > > > > > > > > > > > > I am yet to experiment with mmap performance, but before that - I am seeing weird hangs with GPFS 5 and I think it could be related to mmap. > > > > > > > > > > > > > > > > > > > > Have you seen GPFS ever hang on this syscall? > > > > > > > > > > [Tue Apr 10 04:20:13 2018] [] _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26] > > > > > > > > > > > > > > > > > > > > I see the above ,when kernel hangs and throws out a series of trace calls. 
> > > > > > > > > > > > > > > > > > > > I somehow think the above trace is related to processes hanging on GPFS forever. There are no errors in GPFS however. > > > > > > > > > > > > > > > > > > > > Also, I think the above happens only when the mmap threads go above a particular number. > > > > > > > > > > > > > > > > > > > > We had faced a similar issue in 4.2.3 and it was resolved in a patch to 4.2.3.2 . At that time , the issue happened when mmap threads go more than worker1threads. According to the ticket - it was a mmap race condition that GPFS was not handling well. > > > > > > > > > > > > > > > > > > > > I am not sure if this issue is a repeat and I am yet to isolate the incident and test with increasing number of mmap threads. > > > > > > > > > > > > > > > > > > > > I am not 100 percent sure if this is related to mmap yet but just wanted to ask you if you have seen anything like above. > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > > > Lohit > > > > > > > > > > > > > > > > > > > > On Feb 22, 2018, 3:59 PM -0500, Sven Oehme , wrote: > > > > > > > > > > > Hi Lohit, > > > > > > > > > > > > > > > > > > > > > > i am working with ray on a mmap performance improvement right now, which most likely has the same root cause as yours , see -->??http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html > > > > > > > > > > > the thread above is silent after a couple of back and rorth, but ray and i have active communication in the background and will repost as soon as there is something new to share. > > > > > > > > > > > i am happy to look at this issue after we finish with ray's workload if there is something missing, but first let's finish his, get you try the same fix and see if there is something missing. > > > > > > > > > > > > > > > > > > > > > > btw. if people would share their use of MMAP , what applications they use (home grown, just use lmdb which uses mmap under the cover, etc) please let me know so i get a better picture on how wide the usage is with GPFS. i know a lot of the ML/DL workloads are using it, but i would like to know what else is out there i might not think about. feel free to drop me a personal note, i might not reply to it right away, but eventually. > > > > > > > > > > > > > > > > > > > > > > thx. sven > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Feb 22, 2018 at 12:33 PM wrote: > > > > > > > > > > > > > Hi all, > > > > > > > > > > > > > > > > > > > > > > > > > > I wanted to know, how does mmap interact with GPFS pagepool with respect to filesystem block-size? > > > > > > > > > > > > > Does the efficiency depend on the mmap read size and the block-size of the filesystem even if all the data is cached in pagepool? > > > > > > > > > > > > > > > > > > > > > > > > > > GPFS 4.2.3.2 and CentOS7. > > > > > > > > > > > > > > > > > > > > > > > > > > Here is what i observed: > > > > > > > > > > > > > > > > > > > > > > > > > > I was testing a user script that uses mmap to read from 100M to 500MB files. > > > > > > > > > > > > > > > > > > > > > > > > > > The above files are stored on 3 different filesystems. > > > > > > > > > > > > > > > > > > > > > > > > > > Compute nodes - 10G pagepool and 5G seqdiscardthreshold. > > > > > > > > > > > > > > > > > > > > > > > > > > 1. 4M block size GPFS filesystem, with separate metadata and data. Data on Near line and metadata on SSDs > > > > > > > > > > > > > 2. 
1M block size GPFS filesystem as a AFM cache cluster, "with all the required files fully cached" from the above GPFS cluster as home. Data and Metadata together on SSDs > > > > > > > > > > > > > 3. 16M block size GPFS filesystem, with separate metadata and data. Data on Near line and metadata on SSDs > > > > > > > > > > > > > > > > > > > > > > > > > > When i run the script first time for ?each" filesystem: > > > > > > > > > > > > > I see that GPFS reads from the files, and caches into the pagepool as it reads, from mmdiag -- iohist > > > > > > > > > > > > > > > > > > > > > > > > > > When i run the second time, i see that there are no IO requests from the compute node to GPFS NSD servers, which is expected since all the data from the 3 filesystems is cached. > > > > > > > > > > > > > > > > > > > > > > > > > > However - the time taken for the script to run for the files in the 3 different filesystems is different - although i know that they are just "mmapping"/reading from pagepool/cache and not from disk. > > > > > > > > > > > > > > > > > > > > > > > > > > Here is the difference in time, for IO just from pagepool: > > > > > > > > > > > > > > > > > > > > > > > > > > 20s 4M block size > > > > > > > > > > > > > 15s 1M block size > > > > > > > > > > > > > 40S 16M block size. > > > > > > > > > > > > > > > > > > > > > > > > > > Why do i see a difference when trying to mmap reads from different block-size filesystems, although i see that the IO requests are not hitting disks and just the pagepool? > > > > > > > > > > > > > > > > > > > > > > > > > > I am willing to share the strace output and mmdiag outputs if needed. > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > Lohit > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > > > > > > gpfsug-discuss mailing list > > > > > > > > > > > > > gpfsug-discuss at spectrumscale.org > > > > > > > > > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > _______________________________________________ > > > > > > > > > > > gpfsug-discuss mailing list > > > > > > > > > > > gpfsug-discuss at spectrumscale.org > > > > > > > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > _______________________________________________ > > > > > > > > > > gpfsug-discuss mailing list > > > > > > > > > > gpfsug-discuss at spectrumscale.org > > > > > > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > _______________________________________________ > > > > > > > > > gpfsug-discuss mailing list > > > > > > > > > gpfsug-discuss at spectrumscale.org > > > > > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > _______________________________________________ > > > > > > > gpfsug-discuss mailing list > > > > > > > gpfsug-discuss at spectrumscale.org > > > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > > > > > > gpfsug-discuss mailing list > > > > > > gpfsug-discuss at spectrumscale.org > > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > > > > gpfsug-discuss mailing list > > > > gpfsug-discuss at spectrumscale.org > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss 
> _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From kkr at lbl.gov Thu Sep 27 21:35:29 2018 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Thu, 27 Sep 2018 13:35:29 -0700 Subject: [gpfsug-discuss] Request for Enhancements (RFE) Forum - Submission Deadline October 1 In-Reply-To: <0341213A-6CB7-434F-A575-9099C2D0D703@spectrumscale.org> References: <52220937-CE0A-4949-89A0-6EA41D5ECF93@lbl.gov> <263e53c18647421f8b3cd936da0075df@jumptrading.com> <0341213A-6CB7-434F-A575-9099C2D0D703@spectrumscale.org> Message-ID: Reminder, the October 1st deadline is approaching. We?re looking for at least a few RFEs (Requests For Enhancements) for this first forum, so if you?re interesting in promoting your RFE please reach out to one of us, or even here on the list. Thanks, Kristy > On Sep 7, 2018, at 3:00 AM, Simon Thompson (Spectrum Scale User Group Chair) wrote: > > GPFS/Spectrum Scale Users, > > Here?s a long-ish note about our plans to try and improve the RFE process. We?ve tried to include a tl;dr version if you just read the headers. You?ll find the details underneath ;-) and reading to the end is ideal. > > IMPROVING THE RFE PROCESS > As you?ve heard on the list, and at some of the in-person User Group events, we?ve been talking about ways we can improve the RFE process. We?d like to begin having an RFE forum, and have it be de-coupled from the in-person events because we know not everyone can travel. > > LIGHTNING PRESENTATIONS ON-LINE > In general terms, we?d have regular on-line events, where RFEs could be very briefly (5 minutes, lightning talk) presented by the requester. There would then be time for brief follow-on discussion and questions. The session would be recorded to deal with large time zone differences. > > The live meeting is planned for October 10th 2018, at 4PM BST (that should be 11am EST if we worked is out right!) > > FOLLOW UP POLL > A poll, independent of current RFE voting, would be conducted a couple days after the recording was available to gather votes and feedback on the RFEs submitted ?we may collect site name, to see how many votes are coming from a certain site. > > MAY NOT GET IT RIGHT THE FIRST TIME > We view this supplemental RFE process as organic, that is, we?ll learn as we go and make modifications. The overall goal here is to highlight the RFEs that matter the most to the largest number of UG members by providing a venue for people to speak about their RFEs and collect feedback from fellow community members. > > RFE PRESENTERS WANTED, SUBMISSION DEADLINE OCTOBER 1ST > We?d like to guide a small handful of RFE submitters through this process the first time around, so if you?re interested in being a presenter, let us know now. We?re planning on doing the online meeting and poll for the first time in mid-October, so the submission deadline for your RFE is October 1st. If it?s useful, when you?re drafting your RFE feel free to use the list as a sounding board for feedback. Often sites have similar needs and you may find someone to collaborate with on your RFE to make it useful to more sites, and thereby get more votes. 
Some guidelines are here: https://drive.google.com/file/d/1o8nN39DTU32qj_EFia5wRhnvfvNfr3cI/view?usp=sharing > > You can submit you RFE by email to: rfe at spectrumscaleug.org > > PARTICIPANTS (AKA YOU!!), VIEW AND VOTE > We are seeking very good participation in the RFE on-line events needed to make this an effective method of Spectrum Scale Community and IBM Developer collaboration. It is to your benefit to participate and help set priorities on Spectrum Scale enhancements!! We want to make this process light lifting for you as a participant. We will limit the duration of the meeting to 1 hour to minimize the use of your valuable time. > > Please register for the online meeting via Eventbrite (https://www.eventbrite.com/e/spectrum-scale-request-for-enhancements-voting-tickets-49979954389 ) ? we?ll send details of how to join the online meeting nearer the time. > > Thanks! > > Simon, Kristy, Bob, Bryan and Carl! > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Thu Sep 27 22:52:01 2018 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Thu, 27 Sep 2018 21:52:01 +0000 Subject: [gpfsug-discuss] What is this error message telling me? In-Reply-To: References: Message-ID: <28086630-FC5B-4214-829D-CF410C3F06D3@vanderbilt.edu> Hi Aaron, No ? just plain old ethernet. Thanks! Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 On Sep 27, 2018, at 11:03 AM, Aaron Knister > wrote: Kevin, Is the communication in this case by chance using IPoIB in connected mode? -Aaron On 9/27/18 11:04 AM, Buterbaugh, Kevin L wrote: Hi All, 2018-09-27_09:48:50.923-0500: [E] The TCP connection to IP address 1.2.3.4 some client (socket 442) state is unexpected: ca_state=1 unacked=3 rto=27008000 Seeing errors like the above and trying to track down the root cause. I know that at last weeks? GPFS User Group meeting at ORNL this very error message was discussed, but I don?t recall the details and the slides haven?t been posted to the website yet. IIRC, the ?rto? is significant ? I?ve Googled, but haven?t gotten any hits, nor have I found anything in the GPFS 4.2.2 Problem Determination Guide. Thanks in advance? ? 
Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C639e397dfb514469f48d08d62492c8a2%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636736610191929732&sdata=GE1IIRL77bjWiFaa2%2FpV68sPtXJNUrtGPrc68GsOrtg%3D&reserved=0 -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C639e397dfb514469f48d08d62492c8a2%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636736610191929732&sdata=GE1IIRL77bjWiFaa2%2FpV68sPtXJNUrtGPrc68GsOrtg%3D&reserved=0 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Thu Sep 27 22:53:23 2018 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Thu, 27 Sep 2018 21:53:23 +0000 Subject: [gpfsug-discuss] What is this error message telling me? In-Reply-To: References: Message-ID: <6BC51193-5F38-4749-81A9-F137FE331D5F@vanderbilt.edu> Hi John, Thanks for the explanation and the link to your presentation ? just what I was needing. Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 On Sep 27, 2018, at 11:37 AM, John Lewars > wrote: Hi Kevin, The message below indicates that the mmfsd code had a pending message on a socket, and, when it looked at the low level socket statistics, GPFS found indications that the TCP connection was in a 'bad state'. GPFS determines a connection to be a 'bad state' if: 1) the CA_STATE for the socket is not in 0 (or open) state, which means the state must be disorder, recovery, or loss. See this paper for more details on CA_STATE: https://wiki.aalto.fi/download/attachments/69901948/TCP-CongestionControlFinal.pdf or 2) the RTO is greater than 10 seconds and there are unacknowledged messages pending on the socket (unacked > 0). In the example below we see that rto=27008000, which means that the non-fast path TCP retransmission timeout is about 27 seconds, and that probably means the connection has experienced significant packet loss. If there was no expel following this message, I would suspect there was some transient packet loss that was recovered from. There are plenty of places in which to find more details on RTO, but you might want to start with wikipedia (https://en.wikipedia.org/wiki/Transmission_Control_Protocol) which states: In addition, senders employ a retransmission timeout(RTO) that is based on the estimated round-trip time (or RTT) between the sender and receiver, as well as the variance in this round trip time. The behavior of this timer is specified in RFC 6298. There are subtleties in the estimation of RTT. For example, senders must be careful when calculating RTT samples for retransmitted packets; typically they use Karn's Algorithm or TCP timestamps (see RFC 1323). 
These individual RTT samples are then averaged over time to create a Smoothed Round Trip Time (SRTT) using Jacobson's algorithm. This SRTT value is what is finally used as the round-trip time estimate. [. . .] Reliability is achieved by the sender detecting lost data and retransmitting it. TCP uses two primary techniques to identify loss. Retransmission timeout (abbreviated as RTO) and duplicate cumulative acknowledgements (DupAcks). Note that older versions of the Spectrum Scale code had a third criteria in checking for 'bad state', which included checking if unacked was greater than 8, but that check would sometimes call-out a socket that was working fine, so this third check has been removed via the APAR IJ02566. All Spectrum Scale V5 code has this fix and the 4.2.X code stream picked up this fix in PTF 7 (4.2.3.7 ships APAR IJ02566). More details on debugging expels using these TCP connection messages are in the presentation you referred to, which I posted here:https://www.ibm.com/developerworks/community/wikis/home?lang=en_us#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/DEBUG%20Expels Regards, John Lewars Technical Computing Development, IBM Poughkeepsie ----- Forwarded by Lyle Gayne/Poughkeepsie/IBM on 09/27/2018 11:15 AM ----- Hi All, 2018-09-27_09:48:50.923-0500: [E] The TCP connection to IP address 1.2.3.4 some client (socket 442) state is unexpected: ca_state=1 unacked=3 rto=27008000 Seeing errors like the above and trying to track down the root cause. I know that at last weeks? GPFS User Group meeting at ORNL this very error message was discussed, but I don?t recall the details and the slides haven?t been posted to the website yet. IIRC, the ?rto? is significant ? I?ve Googled, but haven?t gotten any hits, nor have I found anything in the GPFS 4.2.2 Problem Determination Guide. Thanks in advance? ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Fri Sep 28 13:52:38 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 28 Sep 2018 08:52:38 -0400 Subject: [gpfsug-discuss] Request for Enhancements (RFE) Forum - Submission Deadline October 1 In-Reply-To: References: <52220937-CE0A-4949-89A0-6EA41D5ECF93@lbl.gov> <263e53c18647421f8b3cd936da0075df@jumptrading.com> <0341213A-6CB7-434F-A575-9099C2D0D703@spectrumscale.org> Message-ID: <585b21e7-d437-380f-65d8-d24fa236ce3b@nasa.gov> Hi Kristy, At some point I thought I'd read there was a per-site limit of the number of RFEs that could be submitted but I can't find it skimming through email. I'd think submitting 10 would be unreasonable but would 2 or 3 be OK? -Aaron On 9/27/18 4:35 PM, Kristy Kallback-Rose wrote: > Reminder, the*October 1st* deadline is approaching. We?re looking for at > least a few RFEs (Requests For Enhancements) for this first forum, so if > you?re interesting in promoting your RFE please reach out to one of us, > or even here on the list. > > Thanks, > Kristy > >> On Sep 7, 2018, at 3:00 AM, Simon Thompson (Spectrum Scale User Group >> Chair) > wrote: >> >> GPFS/Spectrum Scale Users, >> Here?s a long-ish note about our plans to try and improve the RFE >> process. 
We?ve tried to include a tl;dr version if you just read the >> headers. You?ll find the details underneath ;-) and reading to the end >> is ideal. >> >> IMPROVING THE RFE PROCESS >> As you?ve heard on the list, and at some of the in-person User Group >> events, we?ve been talking about ways we can?improve the RFE process. >> We?d like to begin having an RFE forum, and have it be de-coupled from >> the in-person?events because we know not everyone can travel. >> LIGHTNING PRESENTATIONS ON-LINE >> In general terms, we?d have regular on-line events, where RFEs could >> be/very briefly/(5?minutes, lightning talk) presented by the >> requester.?There would then be time for brief follow-on discussion >> and?questions. The session would be recorded to deal with large time >> zone differences. >> The live meeting is planned for October 10^th 2018, at 4PM BST (that >> should be 11am EST if we worked is out right!) >> FOLLOW UP POLL >> A poll, independent of current?RFE voting, would be conducted a couple >> days after the recording was available to gather votes and feedback >> on?the RFEs submitted ?we may collect site name, to see how many votes >> are coming from a certain site. >> >> MAY NOT GET IT RIGHT THE FIRST TIME >> We view this supplemental RFE process as organic, that is, we?ll learn >> as we go and make modifications. The overall?goal here is to highlight >> the RFEs that matter the most to the largest number of UG members by >> providing a venue?for people to speak about their RFEs and collect >> feedback from fellow community members. >> >> *RFE PRESENTERS WANTED, SUBMISSION DEADLINE OCTOBER 1ST >> *We?d like to guide a small handful of RFE submitters through this >> process the first time around, so if you?re?interested in being a >> presenter, let us know now. We?re planning on doing the online meeting >> and poll for the first time in mid-October, so the submission deadline >> for your RFE is October 1st. If it?s useful, when you?re drafting your >> RFE feel free to use the list as a sounding board for feedback. Often >> sites?have similar needs and you may find someone to collaborate with >> on your RFE to make it useful to more sites, and?thereby get more >> votes. Some guidelines are here: >> https://drive.google.com/file/d/1o8nN39DTU32qj_EFia5wRhnvfvNfr3cI/view?usp=sharing >> You can submit you RFE by email to:rfe at spectrumscaleug.org >> >> >> *PARTICIPANTS (AKA YOU!!), VIEW AND VOTE >> *We are seeking very good participation in the RFE on-line events >> needed to make this an effective method of?Spectrum Scale Community >> and IBM Developer collaboration. *?It is to your benefit to >> participate and help set?priorities on Spectrum Scale enhancements!! >> *We want to make this process light lifting for you as a?participant. >> ?We will limit the duration of the meeting to 1 hour to minimize the >> use of your valuable time. >> >> Please register for the online meeting via Eventbrite >> (https://www.eventbrite.com/e/spectrum-scale-request-for-enhancements-voting-tickets-49979954389) >> ? we?ll send details of how to join the online meeting nearer the time. >> >> Thanks! >> >> Simon, Kristy, Bob, Bryan and Carl! 
>> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss atspectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From S.J.Thompson at bham.ac.uk Fri Sep 28 14:42:12 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Fri, 28 Sep 2018 13:42:12 +0000 Subject: [gpfsug-discuss] File system descriptor quorum Message-ID: <05F25CA0-B8D7-4BD2-A77E-961A6E1CA667@bham.ac.uk> I?ve been trying to get my head around how GPFS file-system descriptor quorum is placed. I read the docs and I can see: Based on the number of failure groups and disks, GPFS creates one to five replicas of the descriptor: * If there are at least five different failure groups, five replicas are created. * If there are at least three different disks, three replicas are created. * If there are only one or two disks, a replica is created on each disk. And I also know I can have a disk type of descOnly. Now what I'm not clear on is, if I have three disks with descOnly (and different failure groups), will GPFS prefer to place descriptors on those disks vs other "data" or "metadata" disks with shared failure groups with the descOnly failure groups. Basically I want to do: Site 1: Lots of storage, FG100 Quorum/manager node, FG100 (with descOnly) Site 2: Some storage, FG200 Quorum/manager node, FG200 (with descOnly) Site 3: Quorum/manager node, FG300 (with descOnly) (I'm assuming I have networking such that if a site is lost the other two sites can continue to talk). i.e. can I be sure that GPFS will place the descriptors on the quorum nodes with the descOnly disks and be sure about the placement? Or in site 1, might it place the descriptor on "lots of storage" in FG100. I did wonder about using more FGs, but I'm a little hesitant as the docs are unclear if there are more than 5 ... (and us accidentally ending up with too many replicas in say Site 1) Thanks Simon From S.J.Thompson at bham.ac.uk Fri Sep 28 14:44:34 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Fri, 28 Sep 2018 13:44:34 +0000 Subject: [gpfsug-discuss] Request for Enhancements (RFE) Forum - Submission Deadline October 1 In-Reply-To: <585b21e7-d437-380f-65d8-d24fa236ce3b@nasa.gov> References: <52220937-CE0A-4949-89A0-6EA41D5ECF93@lbl.gov> <263e53c18647421f8b3cd936da0075df@jumptrading.com> <0341213A-6CB7-434F-A575-9099C2D0D703@spectrumscale.org> <585b21e7-d437-380f-65d8-d24fa236ce3b@nasa.gov> Message-ID: <841FA5CA-5C6B-4626-8137-BA5994C3A651@bham.ac.uk> There is a limit on votes, not submissions. i.e. your site gets three votes, so you can't have three votes and someone else from Goddard also have three. We have to review the submissions, so as you say 10 we'd think unreasonable and skip, but a sensible number is OK. Simon ?On 28/09/2018, 13:52, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Aaron Knister" wrote: Hi Kristy, At some point I thought I'd read there was a per-site limit of the number of RFEs that could be submitted but I can't find it skimming through email. I'd think submitting 10 would be unreasonable but would 2 or 3 be OK? -Aaron On 9/27/18 4:35 PM, Kristy Kallback-Rose wrote: > Reminder, the*October 1st* deadline is approaching. 
We?re looking for at > least a few RFEs (Requests For Enhancements) for this first forum, so if > you?re interesting in promoting your RFE please reach out to one of us, > or even here on the list. > > Thanks, > Kristy > >> On Sep 7, 2018, at 3:00 AM, Simon Thompson (Spectrum Scale User Group >> Chair) > wrote: >> >> GPFS/Spectrum Scale Users, >> Here?s a long-ish note about our plans to try and improve the RFE >> process. We?ve tried to include a tl;dr version if you just read the >> headers. You?ll find the details underneath ;-) and reading to the end >> is ideal. >> >> IMPROVING THE RFE PROCESS >> As you?ve heard on the list, and at some of the in-person User Group >> events, we?ve been talking about ways we can improve the RFE process. >> We?d like to begin having an RFE forum, and have it be de-coupled from >> the in-person events because we know not everyone can travel. >> LIGHTNING PRESENTATIONS ON-LINE >> In general terms, we?d have regular on-line events, where RFEs could >> be/very briefly/(5 minutes, lightning talk) presented by the >> requester. There would then be time for brief follow-on discussion >> and questions. The session would be recorded to deal with large time >> zone differences. >> The live meeting is planned for October 10^th 2018, at 4PM BST (that >> should be 11am EST if we worked is out right!) >> FOLLOW UP POLL >> A poll, independent of current RFE voting, would be conducted a couple >> days after the recording was available to gather votes and feedback >> on the RFEs submitted ?we may collect site name, to see how many votes >> are coming from a certain site. >> >> MAY NOT GET IT RIGHT THE FIRST TIME >> We view this supplemental RFE process as organic, that is, we?ll learn >> as we go and make modifications. The overall goal here is to highlight >> the RFEs that matter the most to the largest number of UG members by >> providing a venue for people to speak about their RFEs and collect >> feedback from fellow community members. >> >> *RFE PRESENTERS WANTED, SUBMISSION DEADLINE OCTOBER 1ST >> *We?d like to guide a small handful of RFE submitters through this >> process the first time around, so if you?re interested in being a >> presenter, let us know now. We?re planning on doing the online meeting >> and poll for the first time in mid-October, so the submission deadline >> for your RFE is October 1st. If it?s useful, when you?re drafting your >> RFE feel free to use the list as a sounding board for feedback. Often >> sites have similar needs and you may find someone to collaborate with >> on your RFE to make it useful to more sites, and thereby get more >> votes. Some guidelines are here: >> https://drive.google.com/file/d/1o8nN39DTU32qj_EFia5wRhnvfvNfr3cI/view?usp=sharing >> You can submit you RFE by email to:rfe at spectrumscaleug.org >> >> >> *PARTICIPANTS (AKA YOU!!), VIEW AND VOTE >> *We are seeking very good participation in the RFE on-line events >> needed to make this an effective method of Spectrum Scale Community >> and IBM Developer collaboration. * It is to your benefit to >> participate and help set priorities on Spectrum Scale enhancements!! >> *We want to make this process light lifting for you as a participant. >> We will limit the duration of the meeting to 1 hour to minimize the >> use of your valuable time. >> >> Please register for the online meeting via Eventbrite >> (https://www.eventbrite.com/e/spectrum-scale-request-for-enhancements-voting-tickets-49979954389) >> ? we?ll send details of how to join the online meeting nearer the time. 
>>
>> Thanks!
>>
>> Simon, Kristy, Bob, Bryan and Carl!

--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776

From Robert.Oesterlin at nuance.com Fri Sep 28 18:37:53 2018
From: Robert.Oesterlin at nuance.com (Oesterlin, Robert)
Date: Fri, 28 Sep 2018 17:37:53 +0000
Subject: [gpfsug-discuss] Upcoming SSUG Meetups - IBM Technical University
Message-ID: <7AF6FC30-D45C-4D4D-9FFB-F5B808632F54@nuance.com>

If you are attending either of the upcoming IBM Technical Universities, be sure to register and stop by for the Scale User Group Meetups.

1. NA Systems Technical University - https://www-03.ibm.com/services/learning/ites.wss/zz-en?pageType=page&c=C954862Z05206G17
October 15 - 19 | Hollywood, FL
The Diplomat Beach Resort

Date: Wednesday 17th October 2018
Time: 17:30 - 19:30
Room: 204

Agenda:
17:30 - Welcome & Introductions
17:40 - IBM Spectrum Scale Enhancements and CORAL
18:30 - Spectrum Scale Use Cases
18:50 - Client Presentation - Why Spectrum Scale?
19:10 - AI with ESS: Better Inference, Throughput and Cost
19:30 - Questions & Close
19:40 - Cocktails & Networking

2. Europe Systems Technical University - https://www-03.ibm.com/services/learning/ites.wss/zz-en?pageType=page&c=Q424461P26867Y60
22-26 October 2018, Rome, Italy
Rome Marriott Park Hotel

Date: Wednesday 24th October 2018
Time: 17:00 - 19:30
Room: Bernini 4

Agenda:
17:00 - Welcome & Introductions
17:10 - IBM Spectrum Scale Enhancements and CORAL
18:00 - Spectrum Scale Use Cases
18:30 - Client Presentation - Why Spectrum Scale?
18:50 - AI with ESS: Better Inference, Throughput and Cost
19:20 - Questions & Close
19:30 - Cocktails & Networking

Bob Oesterlin
Sr Principal Storage Engineer, Nuance
507-269-0413

From valdis.kletnieks at vt.edu Fri Sep 28 23:18:16 2018
From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu)
Date: Fri, 28 Sep 2018 18:18:16 -0400
Subject: [gpfsug-discuss] mmapplypolicy - sql math on crack?
Message-ID: <34863.1538173096@turing-police.cc.vt.edu>

OK, so we're running ltfs/ee for archiving to tape, and currently we migrate based purely on "largest file first". I'm trying to cook up something that does "largest LRU first" (basically, large unaccessed files get moved earlier than large used files). So I need to do some testing to see what function works best. First cut was "file size times number of months idle". And the policy looks like this (yes, I know FILE_SIZE instead of KB_ALLOCATED when you have terabyte files is silly.. bear with me...):
--
define(user_exclude_list,(PATH_NAME LIKE '/gpfs/archive/.ltfsee/%'
  OR PATH_NAME LIKE '/gpfs/archive/.SpaceMan/%'
  OR PATH_NAME LIKE '/gpfs/archive/ces/%'
  OR PATH_NAME LIKE '/gpfs/archive/config/%'))

define(is_premigrated,(MISC_ATTRIBUTES LIKE '%M%' AND MISC_ATTRIBUTES NOT LIKE '%V%'))
define(is_migrated,(MISC_ATTRIBUTES LIKE '%V%'))
define(is_resident,(NOT MISC_ATTRIBUTES LIKE '%M%'))

define(months_old,((DAYS(CURRENT_TIMESTAMP)-DAYS(ACCESS_TIME))/30))
define(STALE,(months_old * FILE_SIZE))

define(attr,varchar($1) || ' ')
define(all_attrs,attr(USER_ID) ||
  attr(GROUP_ID) ||
  attr(FILE_SIZE) ||
  attr(STALE) ||
  attr(months_old) ||
  attr(DAYS(CURRENT_TIMESTAMP)-DAYS(ACCESS_TIME)) ||
  attr(ACCESS_TIME))

RULE 'SYSTEM_POOL_PLACEMENT_RULE' SET POOL 'system'

RULE EXTERNAL LIST 'testdrive_files' EXEC ''

RULE 'FILES_PRUNE' LIST 'testdrive_files'
  THRESHOLD(85,83)
  WEIGHT(FILE_SIZE)
  SHOW('candidate ' || all_attrs)
  WHERE is_premigrated AND (LENGTH(PATH_NAME) < 200) AND NOT user_exclude_list
--

And something odd happens (tossing out files where 'months_old' is 0 and thus STALE is as well):

grep -v ' 0 0 ' ../list.testdrive_files | head -10 | cut -f1-4 -d/ | sed 's/$/(...)/'

528469735 1844029648 0 candidate 21008 675 132918210560 -897871872 23 714 2016-10-14 07:00:52.000000 -- /gpfs/archive/vbi(...)
528469499 452828309 0 candidate 21008 675 128930785280 1880627200 23 714 2016-10-14 05:51:37.000000 -- /gpfs/archive/vbi(...)
528994319 1729658954 0 candidate 21008 675 122826721280 -1073891328 23 714 2016-10-14 08:03:26.000000 -- /gpfs/archive/vbi(...)
521263267 1704365147 0 candidate 1691616 1691616 111187014003 111187014003 1 44 2018-08-15 19:15:18.534360 -- /gpfs/archive/arc(...)
529340511 975566740 0 candidate 21008 675 92028549120 -762247168 23 715 2016-10-13 20:02:50.000000 -- /gpfs/archive/vbi(...)
528289152 1660557219 0 candidate 21008 675 91083468800 -1024258048 23 714 2016-10-14 05:32:01.000000 -- /gpfs/archive/vbi(...)
-- this pair particularly interesting
513739717 1569434383 0 candidate 1691616 1691616 83291991689 -919741166 2 63 2018-07-27 20:34:49.532723 -- /gpfs/archive/arc(...)
513739667 229891659 0 candidate 1691616 1691616 82634076418 2059395588 2 63 2018-07-27 20:27:23.532723 -- /gpfs/archive/arc(...)
--
8007095 9763066 0 candidate 501 502 79531799574 -1823771078 119 3595 2008-11-24 17:00:25.000000 -- /gpfs/archive/edisc(...)
527263956 829892493 0 candidate 1921090 1921090 73720064000 73720064000 1 47 2018-08-12 02:18:07.534040 -- /gpfs/archive/arc(...)

It *looks* like "if you multiply by 0, you get zero, multiply by 1 (as in the last line), you get full precision out to at least 2**38, but multiply by anything else you get the value mod 2**32 and treated as a signed integer"....

Is that the intended/documented behavior of "(value * value)" in a policy statement?
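A quick arithmetic check of that hypothesis against the listing itself: for the first line, 23 * 132918210560 = 3057118842880, and 3057118842880 mod 2**32 = 3397095424, which read as a signed 32-bit integer is -897871872, exactly the STALE value shown. The second line works out the same way (23 * 128930785280 mod 2**32 = 1880627200), so the output is consistent with the multiplication being carried out in 32-bit signed integer arithmetic.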
From makaplan at us.ibm.com Sat Sep 29 16:23:08 2018
From: makaplan at us.ibm.com (Marc A Kaplan)
Date: Sat, 29 Sep 2018 11:23:08 -0400
Subject: [gpfsug-discuss] mmapplypolicy - sql math on crack? - Please try a workaround
In-Reply-To: <34863.1538173096@turing-police.cc.vt.edu>
References: <34863.1538173096@turing-police.cc.vt.edu>
Message-ID:

This may be a bug and/or a peculiarity of the SQL type system. A proper investigation and full explanation will take more time than I have right now.

In the meanwhile, please try forcing the computation/arithmetic to use floating point by changing your "30" to "30.0", and let us know if that helps you move along in your task at hand.
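To make the suggested experiment concrete, a sketch of the only change being proposed to the policy earlier in the thread (everything else stays as posted):

define(months_old,((DAYS(CURRENT_TIMESTAMP)-DAYS(ACCESS_TIME))/30.0))
define(STALE,(months_old * FILE_SIZE))

The intent, per the suggestion above, is that dividing by 30.0 pushes the months_old/STALE computation into floating point rather than 32-bit integer arithmetic; note it also makes months_old a fractional value rather than a whole number of months, which may or may not matter for the weighting experiment.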
It *looks* like "if you multiply by 0, you get zero, multiply by 1 (as in the last line), you get full precision out to at least 2**38, but multiply by anything else you get the value mod 2**32 and treated as a signed integer".... Is that the intended/documented behavior of "(value * value)" in a policy statement? [attachment "attwost2.dat" deleted by Marc A Kaplan/Watson/IBM] _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From valdis.kletnieks at vt.edu Sat Sep 29 21:52:22 2018 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Sat, 29 Sep 2018 16:52:22 -0400 Subject: [gpfsug-discuss] mmapplypolicy - sql math on crack? - Please try a workaround In-Reply-To: References: <34863.1538173096@turing-police.cc.vt.edu> Message-ID: <191368.1538254342@turing-police.cc.vt.edu> On Sat, 29 Sep 2018 11:23:08 -0400, "Marc A Kaplan" said: > This may be a bug and/or a peculiarity of the SQL type system. A proper > investigation and full explanation will take more time than I have right > now. > > In the meanwhile please try forcing the computation/arithmetic to use > floating point by changing your "30" to "30.0" > and let us know if that helps you move along in your task at hand. Oh, once I figured out what was going on, I changed from FILE_SIZE "bytes" to KB_ALLOCATED / 1024 "megabytes" and knocked enough powers of two off so things don't go wonky till that one researcher has a 79 terabyte file he doesn't look at for many decades. :) I'll probably take my original email and stuff it into a PMR later today just so the weirdness doesn't get lost....