From kenneth.waegeman at ugent.be Mon Sep 3 16:06:28 2018 From: kenneth.waegeman at ugent.be (Kenneth Waegeman) Date: Mon, 3 Sep 2018 17:06:28 +0200 Subject: [gpfsug-discuss] system.log pool on client nodes for HAWC In-Reply-To: References: Message-ID: <85d5591a-cf74-f55e-1802-e3e14983abbf@ugent.be> Thank you Vasily and Simon for the clarification! I was looking further into it, and I got stuck with more questions :) - In https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_hawc_tuning.htm I read: ??? HAWC does not change the following behaviors: ??????? write behavior of small files when the data is placed in the inode itself ??????? write behavior of directory blocks or other metadata I wondered why? Is the metadata not logged in the (same) recovery logs? (It seemed by reading https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.ins.doc/bl1ins_logfile.htm it does ) - Would there be a way to estimate how much of the write requests on a running cluster would benefit from enabling HAWC ? Thanks again! Kenneth On 31/08/18 19:49, Vasily Tarasov wrote: > That is correct. The blocks of each recovery log are striped across > the devices in the system.log pool (if it is defined). As a result, > even when all clients have a local device in the system.log pool, many > writes to the recovery log will go to remote devices. For a client > that lacks a local device in the system.log pool, log writes will > always be remote. > Notice, that typically in such a setup you would enable log > replication for HA. Otherwise, if a single client fails (and its > recover log is lost) the whole cluster fails as there is no log? to > recover FS to consistent state. Therefore, at least one remote write > is essential. > HTH, > -- > Vasily Tarasov, > Research Staff Member, > Storage Systems Research, > IBM Research - Almaden > > ----- Original message ----- > From: Kenneth Waegeman > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > Cc: > Subject: [gpfsug-discuss] system.log pool on client nodes for HAWC > Date: Tue, Aug 28, 2018 5:31 AM > Hi all, > > I was looking into HAWC , using the 'distributed fast storage in > client > nodes' method ( > https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_hawc_using.htm > > ) > > This is achieved by putting? a local device on the clients in the > system.log pool. Reading another article > (https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_syslogpool.htm > > ) this would now be used for ALL File system recovery logs. > > Does this mean that if you have a (small) subset of clients with fast > local devices added in the system.log pool, all other clients will use > these too instead of the central system pool? > > Thank you! > > Kenneth > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 

From oehmes at gmail.com Mon Sep 3 16:32:11 2018
From: oehmes at gmail.com (Sven Oehme)
Date: Mon, 3 Sep 2018 08:32:11 -0700
Subject: [gpfsug-discuss] system.log pool on client nodes for HAWC
In-Reply-To: <85d5591a-cf74-f55e-1802-e3e14983abbf@ugent.be>
References: <85d5591a-cf74-f55e-1802-e3e14983abbf@ugent.be>
Message-ID: 

Hi Ken,

What the document is saying (or trying to say) is that the behavior of data-in-inode and of metadata operations is not changed when HAWC is enabled: if the data fits into the inode it will be placed there directly, instead of writing the data i/o into a data recovery log record (which is what HAWC uses) and later destaging it to wherever the data blocks of a given file will eventually be written. That also means that if all your application does is create small files that fit into the inode, HAWC will not be able to improve performance.

It's unfortunately not so simple to say whether HAWC will help or not, but here are a couple of thoughts on where it will and will not help.

Where it won't help:
1. if your storage device has a very large, or even better a log-structured, write cache
2. if the majority of your files are very small
3. if your files will almost always be accessed sequentially
4. if your storage is primarily flash based

Where it most likely will help:
1. the majority of your storage is direct-attached HDD (e.g. FPO) with a small SSD pool for metadata and HAWC
2. your ratio of clients to storage devices is very high (think hundreds of clients and only 1 storage array)
3. your workload is primarily virtual machines or databases

As always there are lots of exceptions and corner cases, but this is the best list I could come up with.

On how to find out if HAWC could help, there are 2 ways of doing this. First, look at mmfsadm dump iocounters; you see the average size of i/os and you can check whether a lot of small write operations are being done. A more involved but more accurate way would be to take a trace with trace level trace=io; that will generate a very lightweight trace of only the most relevant io layers of GPFS, and you can then post-process the operations' performance. The data is not the simplest to understand for somebody with little knowledge of filesystems, but if you stare at it for a while it might make some sense to you.

Sven

On Mon, Sep 3, 2018 at 4:06 PM Kenneth Waegeman wrote:

> Thank you Vasily and Simon for the clarification!
>
> I was looking further into it, and I got stuck with more questions :)
>
> - In
> https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_hawc_tuning.htm
> I read:
> HAWC does not change the following behaviors:
> write behavior of small files when the data is placed in the inode itself
> write behavior of directory blocks or other metadata
>
> I wondered why? Is the metadata not logged in the (same) recovery logs?
> (It seemed by reading
> https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.ins.doc/bl1ins_logfile.htm
> it does )
>
> - Would there be a way to estimate how much of the write requests on a
> running cluster would benefit from enabling HAWC ?
>
> Thanks again!
>
> Kenneth
>
> On 31/08/18 19:49, Vasily Tarasov wrote:
>
> That is correct. The blocks of each recovery log are striped across the
> devices in the system.log pool (if it is defined). As a result, even when
> all clients have a local device in the system.log pool, many writes to the
> recovery log will go to remote devices.
For a client that lacks a local > device in the system.log pool, log writes will always be remote. > > Notice, that typically in such a setup you would enable log replication > for HA. Otherwise, if a single client fails (and its recover log is lost) > the whole cluster fails as there is no log to recover FS to consistent > state. Therefore, at least one remote write is essential. > > HTH, > -- > Vasily Tarasov, > Research Staff Member, > Storage Systems Research, > IBM Research - Almaden > > > > ----- Original message ----- > From: Kenneth Waegeman > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > > Cc: > Subject: [gpfsug-discuss] system.log pool on client nodes for HAWC > Date: Tue, Aug 28, 2018 5:31 AM > > Hi all, > > I was looking into HAWC , using the 'distributed fast storage in client > nodes' method ( > > https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_hawc_using.htm > > ) > > This is achieved by putting a local device on the clients in the > system.log pool. Reading another article > ( > https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_syslogpool.htm > > ) this would now be used for ALL File system recovery logs. > > Does this mean that if you have a (small) subset of clients with fast > local devices added in the system.log pool, all other clients will use > these too instead of the central system pool? > > Thank you! > > Kenneth > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.orghttp://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Tue Sep 4 09:44:59 2018 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 4 Sep 2018 08:44:59 +0000 Subject: [gpfsug-discuss] CES file authentication - bind account deleted? Message-ID: Hi all, I don't like using long subject lines as a rule so it probably doesn't make sense, but consider: FILE access configuration : AD PARAMETERS VALUES ------------------------------------------------- ENABLE_NFS_KERBEROS true SERVERS domaincontroller.ic.ac.uk USER_NAME joebloggs at IC.AC.UK NETBIOS_NAME store IDMAP_ROLE master IDMAP_RANGE 10000000-299999999 IDMAP_RANGE_SIZE 1000000 UNIXMAP_DOMAINS IC(500 - 2000000) LDAPMAP_DOMAINS none If "joebloggs" was to leave the organization and that account deleted from Active Directory, what is the impact on file authentication in CES? Thanks Richard -------------- next part -------------- An HTML attachment was scrubbed... URL: From z.han at imperial.ac.uk Tue Sep 4 14:03:52 2018 From: z.han at imperial.ac.uk (z.han at imperial.ac.uk) Date: Tue, 4 Sep 2018 14:03:52 +0100 (BST) Subject: [gpfsug-discuss] CES file authentication - bind account deleted? In-Reply-To: References: Message-ID: Files owned by "joebloggs" will be owned by the user's uid and gid. Assuming those ids aren't recycled, then there shouldn't be any impact on file authentication, right? It's a different matter if the ids are recycled by AD. 
Kind regards,

Zong-Pei

--------------------------------------------
Zong-Pei Han (BSc MSc PhD)
UK MED-BIO Data Systems Administrator
Room 126, Sir Alexander Fleming Building
South Kensington Campus
Imperial College London, SW7 2AZ
--------------------------------------------

On Tue, 4 Sep 2018, Sobey, Richard A wrote:

> Date: Tue, 4 Sep 2018 08:44:59 +0000
> From: "Sobey, Richard A"
> Reply-To: gpfsug main discussion list
> To: "'gpfsug-discuss at spectrumscale.org'"
> Subject: [gpfsug-discuss] CES file authentication - bind account deleted?
>
> Hi all,
>
> I don't like using long subject lines as a rule so it probably doesn't make sense, but consider:
>
> FILE access configuration : AD
> PARAMETERS               VALUES
> -------------------------------------------------
> ENABLE_NFS_KERBEROS      true
> SERVERS                  domaincontroller.ic.ac.uk
> USER_NAME                joebloggs at IC.AC.UK
> NETBIOS_NAME             store
> IDMAP_ROLE               master
> IDMAP_RANGE              10000000-299999999
> IDMAP_RANGE_SIZE         1000000
> UNIXMAP_DOMAINS          IC(500 - 2000000)
> LDAPMAP_DOMAINS          none
>
> If "joebloggs" was to leave the organization and that account deleted from Active Directory, what is the impact on file
> authentication in CES?
>
> Thanks
>
> Richard
>

From abeattie at au1.ibm.com Tue Sep 4 14:17:56 2018
From: abeattie at au1.ibm.com (Andrew Beattie)
Date: Tue, 4 Sep 2018 13:17:56 +0000
Subject: [gpfsug-discuss] CES file authentication - bind account deleted?
In-Reply-To: 
References: 
Message-ID: 

An HTML attachment was scrubbed...
URL: 

From rohwedder at de.ibm.com Tue Sep 4 14:40:33 2018
From: rohwedder at de.ibm.com (Markus Rohwedder)
Date: Tue, 4 Sep 2018 15:40:33 +0200
Subject: [gpfsug-discuss] CES file authentication - bind account deleted?
In-Reply-To: 
References: 
Message-ID: 

Hello,

the user name should not matter for operations beyond the domain join.

From the mmuserauth man page:

--user-name userName
....
In case of --type ad with --data-access-method file, the specified user name is used to join the cluster to the AD domain. It results in creating a machine account for the cluster based on the --netbios-name specified in the command. After successful configuration, the cluster connects with its machine account, and not the user used during the domain join.

So after the domain join the specified user name has no role to play in communication with the AD domain controller and can even be deleted from the AD server. The cluster can still keep using AD for authentication via the machine account created.

Mit freundlichen Grüßen / Kind regards

Dr. Markus Rohwedder
Spectrum Scale GUI Development
Phone: +49 7034 6430190, IBM Deutschland Research & Development
E-Mail: rohwedder at de.ibm.com
Am Weiher 24, 65451 Kelsterbach, Germany

From: "Andrew Beattie"
To: gpfsug-discuss at spectrumscale.org
Cc: gpfsug-discuss at spectrumscale.org
Date: 04.09.2018 15:18
Subject: Re: [gpfsug-discuss] CES file authentication - bind account deleted?
Sent by: gpfsug-discuss-bounces at spectrumscale.org

Hi Richard,

If you are setting up Protocol authentication against the active directory, would you not choose to use a service account that isn't going to get deleted?
If you choose to use an user account of a Sys Admin who has Domain admin privileges and they leave the company and their account is deleted, you would almost certainly have issues with the Scale cluster trying to validate users permissions and having scale get an error from AD when the credentials that it uses are no longer valid. Andrew Beattie Software Defined Storage - IT Specialist Phone: 614-2133-7927 E-mail: abeattie at au1.ibm.com ----- Original message ----- From: "Sobey, Richard A" Sent by: gpfsug-discuss-bounces at spectrumscale.org To: "'gpfsug-discuss at spectrumscale.org'" Cc: Subject: [gpfsug-discuss] CES file authentication - bind account deleted? Date: Tue, Sep 4, 2018 8:45 AM Hi all, I don?t like using long subject lines as a rule so it probably doesn?t make sense, but consider: FILE access configuration : AD PARAMETERS VALUES ------------------------------------------------- ENABLE_NFS_KERBEROS true SERVERS domaincontroller.ic.ac.uk USER_NAME joebloggs at IC.AC.UK NETBIOS_NAME store IDMAP_ROLE master IDMAP_RANGE 10000000-299999999 IDMAP_RANGE_SIZE 1000000 UNIXMAP_DOMAINS IC(500 - 2000000) LDAPMAP_DOMAINS none If ?joebloggs? was to leave the organization and that account deleted from Active Directory, what is the impact on file authentication in CES? Thanks Richard _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1A629793.gif Type: image/gif Size: 4659 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From r.sobey at imperial.ac.uk Tue Sep 4 14:44:28 2018 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 4 Sep 2018 13:44:28 +0000 Subject: [gpfsug-discuss] CES file authentication - bind account deleted? In-Reply-To: References: Message-ID: Ah, thanks Markus, that?s what I was looking for. Andrew yes, the service account has been created now, I am more interested in the ?what if? we didn?t change things. I suppose this is the result of ~4 years of technical debt on our part! Thanks, Richard From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Markus Rohwedder Sent: 04 September 2018 14:41 To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org Subject: Re: [gpfsug-discuss] CES file authentication - bind account deleted? Hello. the user name should not matter for operations beyon domain join. mmuserauth man page: --user-name userName .... In case of --type ad with --data-access-method file, the specified username is used to join the cluster to AD domain. It results in creating a machine account for the cluster based on the --netbios-name specified in the command. After successful configuration, the cluster connects with its machine account, and not the user used during the domain join. 
So the specified username after domain join has no role to play in communication with the AD domain controller and can be even deleted from the AD server. The cluster can still keep using AD for authentication via the machine account created. Mit freundlichen Gr??en / Kind regards Dr. Markus Rohwedder Spectrum Scale GUI Development ________________________________ Phone: +49 7034 6430190 IBM Deutschland Research & Development [cid:image002.png at 01D4445D.C716BB30] E-Mail: rohwedder at de.ibm.com Am Weiher 24 65451 Kelsterbach Germany ________________________________ [Inactive hide details for "Andrew Beattie" ---04.09.2018 15:18:43---Hi Richard,]"Andrew Beattie" ---04.09.2018 15:18:43---Hi Richard, From: "Andrew Beattie" > To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Date: 04.09.2018 15:18 Subject: Re: [gpfsug-discuss] CES file authentication - bind account deleted? Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi Richard, If you are setting up Protocol authentication against the active directory, would you not choose to use a service account that isn't going to get deleted? If you choose to use an user account of a Sys Admin who has Domain admin privileges and they leave the company and their account is deleted, you would almost certainly have issues with the Scale cluster trying to validate users permissions and having scale get an error from AD when the credentials that it uses are no longer valid. Andrew Beattie Software Defined Storage - IT Specialist Phone: 614-2133-7927 E-mail: abeattie at au1.ibm.com ----- Original message ----- From: "Sobey, Richard A" > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: "'gpfsug-discuss at spectrumscale.org'" > Cc: Subject: [gpfsug-discuss] CES file authentication - bind account deleted? Date: Tue, Sep 4, 2018 8:45 AM Hi all, I don?t like using long subject lines as a rule so it probably doesn?t make sense, but consider: FILE access configuration : AD PARAMETERS VALUES ------------------------------------------------- ENABLE_NFS_KERBEROS true SERVERS domaincontroller.ic.ac.uk USER_NAME joebloggs at IC.AC.UK NETBIOS_NAME store IDMAP_ROLE master IDMAP_RANGE 10000000-299999999 IDMAP_RANGE_SIZE 1000000 UNIXMAP_DOMAINS IC(500 - 2000000) LDAPMAP_DOMAINS none If ?joebloggs? was to leave the organization and that account deleted from Active Directory, what is the impact on file authentication in CES? Thanks Richard _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 166 bytes Desc: image001.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.png Type: image/png Size: 4659 bytes Desc: image002.png URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: image003.gif Type: image/gif Size: 105 bytes Desc: image003.gif URL: From vtarasov at us.ibm.com Tue Sep 4 16:57:37 2018 From: vtarasov at us.ibm.com (Vasily Tarasov) Date: Tue, 4 Sep 2018 15:57:37 +0000 Subject: [gpfsug-discuss] system.log pool on client nodes for HAWC In-Reply-To: References: , <85d5591a-cf74-f55e-1802-e3e14983abbf@ugent.be> Message-ID: An HTML attachment was scrubbed... URL: From stijn.deweirdt at ugent.be Tue Sep 4 20:23:35 2018 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Tue, 4 Sep 2018 21:23:35 +0200 Subject: [gpfsug-discuss] system.log pool on client nodes for HAWC In-Reply-To: References: <85d5591a-cf74-f55e-1802-e3e14983abbf@ugent.be> Message-ID: <29b9209e-d17b-f109-983a-c14c6e0966ef@ugent.be> hi vasily, sven, and is there any advantage in moving the system.log pool to faster storage (like nvdimm) or increasing its default size when HAWC is not used (ie write-cache-threshold kept to 0). (i remember the (very creative) logtip placement on the gss boxes ;) thanks a lot for the detailed answer stijn On 09/04/2018 05:57 PM, Vasily Tarasov wrote: > Let me add just one more item to Sven's detailed reply: HAWC is especially > helpful to decrease the latencies of small synchronous I/Os that come in > *bursts*. If your workload contains a sustained high rate of writes, the > recovery log will get full very quickly, and HAWC won't help much (or can even > decrease performance). Making the recovery log larger allows to adsorb longer > I/O bursts. The specific amount of improvements depends on the workload (how > long/high are bursts, e.g.) and hardware. > Best, > Vasily > -- > Vasily Tarasov, > Research Staff Member, > Storage Systems Research, > IBM Research - Almaden > > ----- Original message ----- > From: Sven Oehme > To: gpfsug main discussion list > Cc: Vasily Tarasov > Subject: Re: [gpfsug-discuss] system.log pool on client nodes for HAWC > Date: Mon, Sep 3, 2018 8:32 AM > Hi Ken, > what the documents is saying (or try to) is that the behavior of data in > inode or metadata operations are not changed if HAWC is enabled, means if > the data fits into the inode it will be placed there directly instead of > writing the data i/o into a data recovery log record (which is what HAWC > uses) and then later destage it where ever the data blocks of a given file > eventually will be written. that also means if all your application does is > creating small files that fit into the inode, HAWC will not be able to > improve performance. > its unfortunate not so simple to say if HAWC will help or not, but here are > a couple of thoughts where HAWC will not help and help : > on the where it won't help : > 1. if you have storage device which has very large or even better are log > structured write cache. > 2. if majority of your files are very small > 3. if your files will almost always be accesses sequentially > 4. your storage is primarily flash based > where it most likely will help : > 1. your majority of storage is direct attached HDD (e.g. FPO) with a small > SSD pool for metadata and HAWC > 2. your ratio of clients to storage devices is very high (think hundreds of > clients and only 1 storage array) > 3. your workload is primarily virtual machines or databases > as always there are lots of exceptions and corner cases, but is the best > list i could come up with. 
> on how to find out if HAWC could help, there are 2 ways of doing this > first, look at mmfsadm dump iocounters , you see the average size of i/os > and you could check if there is a lot of small write operations done. > a more involved but more accurate way would be to take a trace with trace > level trace=io , that will generate a very lightweight trace of only the > most relevant io layers of GPFS, you could then post process the operations > performance, but the data is not the simplest to understand for somebody > with low knowledge of filesystems, but if you stare at it for a while it > might make some sense to you. > Sven > On Mon, Sep 3, 2018 at 4:06 PM Kenneth Waegeman > wrote: > > Thank you Vasily and Simon for the clarification! > > I was looking further into it, and I got stuck with more questions :) > > > - In > https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_hawc_tuning.htm > I read: > HAWC does not change the following behaviors: > write behavior of small files when the data is placed in the > inode itself > write behavior of directory blocks or other metadata > > I wondered why? Is the metadata not logged in the (same) recovery logs? > (It seemed by reading > https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.ins.doc/bl1ins_logfile.htm > it does ) > > > - Would there be a way to estimate how much of the write requests on a > running cluster would benefit from enabling HAWC ? > > > Thanks again! > > > Kenneth > On 31/08/18 19:49, Vasily Tarasov wrote: >> That is correct. The blocks of each recovery log are striped across >> the devices in the system.log pool (if it is defined). As a result, >> even when all clients have a local device in the system.log pool, many >> writes to the recovery log will go to remote devices. For a client >> that lacks a local device in the system.log pool, log writes will >> always be remote. >> Notice, that typically in such a setup you would enable log >> replication for HA. Otherwise, if a single client fails (and its >> recover log is lost) the whole cluster fails as there is no log to >> recover FS to consistent state. Therefore, at least one remote write >> is essential. >> HTH, >> -- >> Vasily Tarasov, >> Research Staff Member, >> Storage Systems Research, >> IBM Research - Almaden >> >> ----- Original message ----- >> From: Kenneth Waegeman >> >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> >> To: gpfsug main discussion list >> >> Cc: >> Subject: [gpfsug-discuss] system.log pool on client nodes for HAWC >> Date: Tue, Aug 28, 2018 5:31 AM >> Hi all, >> >> I was looking into HAWC , using the 'distributed fast storage in >> client >> nodes' method ( >> https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_hawc_using.htm >> >> ) >> >> This is achieved by putting a local device on the clients in the >> system.log pool. Reading another article >> (https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_syslogpool.htm >> >> ) this would now be used for ALL File system recovery logs. >> >> Does this mean that if you have a (small) subset of clients with fast >> local devices added in the system.log pool, all other clients will use >> these too instead of the central system pool? >> >> Thank you! 
>>
>> Kenneth
>>
>> _______________________________________________
>> gpfsug-discuss mailing list
>> gpfsug-discuss at spectrumscale.org
>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>>
>> _______________________________________________
>> gpfsug-discuss mailing list
>> gpfsug-discuss at spectrumscale.org
>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss

From anobre at br.ibm.com Tue Sep 4 20:40:06 2018
From: anobre at br.ibm.com (Anderson Ferreira Nobre)
Date: Tue, 4 Sep 2018 19:40:06 +0000
Subject: [gpfsug-discuss] Top files on GPFS filesystem
In-Reply-To: 
References: 
Message-ID: 

An HTML attachment was scrubbed...
URL: 

From vtarasov at us.ibm.com Wed Sep 5 01:19:15 2018
From: vtarasov at us.ibm.com (Vasily Tarasov)
Date: Wed, 5 Sep 2018 00:19:15 +0000
Subject: [gpfsug-discuss] system.log pool on client nodes for HAWC
In-Reply-To: <29b9209e-d17b-f109-983a-c14c6e0966ef@ugent.be>
References: <29b9209e-d17b-f109-983a-c14c6e0966ef@ugent.be>, <85d5591a-cf74-f55e-1802-e3e14983abbf@ugent.be>
Message-ID: 

An HTML attachment was scrubbed...
URL: 

From sven.siebler at urz.uni-heidelberg.de Wed Sep 5 08:13:47 2018
From: sven.siebler at urz.uni-heidelberg.de (Sven Siebler)
Date: Wed, 5 Sep 2018 09:13:47 +0200
Subject: [gpfsug-discuss] Getting inode information with REST API V2
Message-ID: <0975dcd6-a665-31f8-6070-a73b414f3d25@urz.uni-heidelberg.de>

Hi all,

I just started to use the REST API for our monitoring, and my question is about how I can get information on allocated inodes with REST API V2.

Up to now I use "mmlsfileset" directly, which gives me information on maximum and allocated inodes (mmdf for total/free/allocated inodes of the filesystem).

If I use the REST API V2 with "filesystems/{filesystemName}/filesets?fields=:all:", I get all information except the allocated inodes.

In the documentation
(https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adm_apiv2getfilesystemfilesets.htm)
I found:

> "inodeSpace": "Inodes"
> The number of inodes that are allocated for use by the fileset.

but to me inodeSpace looks more like the ID of the inode space, rather than the number of allocated inodes.

In the documentation example the API can give output like this:

"filesetName" : "root",
"filesystemName" : "gpfs0",
"usage" : {
    "allocatedInodes" : 100000,
    "inodeSpaceFreeInodes" : 95962,
    "inodeSpaceUsedInodes" : 4038,
    "usedBytes" : 0,
    "usedInodes" : 4038
}

but I could not retrieve such usage fields in my queries.

The only way for me to get inode information with REST is to use V1:

https://REST_API_host:port/scalemgmt/v1/filesets?filesystemName=FileSystemName

which gives exactly the information of "mmlsfileset". But because V1 is deprecated I want to use V2 for rewriting our tools...

Thanks,

Sven

--
Sven Siebler
Servicebereich Future IT - Research & Education (FIRE)

Tel. +49 6221 54 20032
sven.siebler at urz.uni-heidelberg.de
Universität Heidelberg
Universitätsrechenzentrum (URZ)
Im Neuenheimer Feld 293, D-69120 Heidelberg
http://www.urz.uni-heidelberg.de
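For reference, a minimal sketch in Python of the kind of v2 call being discussed (the GUI host name, credentials and the top-level "filesets" wrapper key are illustrative assumptions, not details taken from the posts in this thread):

import requests

# Hypothetical GUI/REST endpoint and credentials -- adjust to your environment.
BASE = "https://gui-node.example.com:443/scalemgmt/v2"
AUTH = ("admin", "secret")

# Request all fields so that the "usage" block (allocatedInodes etc.) is included.
resp = requests.get(BASE + "/filesystems/gpfs0/filesets",
                    params={"fields": ":all:"},
                    auth=AUTH,
                    verify=False)  # only acceptable for self-signed test certificates
resp.raise_for_status()

for fset in resp.json().get("filesets", []):  # assumed wrapper key in the response
    usage = fset.get("usage", {})
    print(fset.get("filesetName"),
          "allocatedInodes:", usage.get("allocatedInodes"),
          "usedInodes:", usage.get("usedInodes"))

Whether the "usage" block is actually populated depends on the data sources discussed further down in the thread (mmlsfileset output and the Zimon fileset metrics).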
Up to now i use "mmlsfileset" directly, which gives me information on maximum and allocated inodes (mmdf for total/free/allocated inodes of the filesystem) If i use the REST API V2 with "filesystems//filesets?fields=:all:", i get all information except the allocated inodes. On the documentation (https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adm_apiv2getfilesystemfilesets.htm) i found: > "inodeSpace": "Inodes" > The number of inodes that are allocated for use by the fileset. but for me the inodeSpace looks more like the ID of the inodespace, instead of the number of allocated inodes. In the documentation example the API can give output like this: "filesetName" : "root", ?????? "filesystemName" : "gpfs0", ?????? "usage" : { ? ? ?? ??? "allocatedInodes" : 100000, ????? ? ?? "inodeSpaceFreeInodes" : 95962, ?????????? "inodeSpaceUsedInodes" : 4038, ????? ? ?? "usedBytes" : 0, ?????? ? ? "usedInodes" : 4038 } but i could not retrieve such usage-fields in my queries. The only way for me to get inode information with REST is the usage of V1: https://REST_API_host:port/scalemgmt/v1/filesets?filesystemName=FileSystemName which gives exact the information of "mmlsfileset". But because V1 is deprecated i want to use V2 for rewriting our tools... Thanks, Sven -- Sven Siebler Servicebereich Future IT - Research & Education (FIRE) Tel. +49 6221 54 20032 sven.siebler at urz.uni-heidelberg.de Universit?t Heidelberg Universit?tsrechenzentrum (URZ) Im Neuenheimer Feld 293, D-69120 Heidelberg http://www.urz.uni-heidelberg.de -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5437 bytes Desc: S/MIME Cryptographic Signature URL: From andreas.koeninger at de.ibm.com Wed Sep 5 10:13:19 2018 From: andreas.koeninger at de.ibm.com (Andreas Koeninger) Date: Wed, 5 Sep 2018 09:13:19 +0000 Subject: [gpfsug-discuss] Getting inode information with REST API V2 In-Reply-To: <0975dcd6-a665-31f8-6070-a73b414f3d25@urz.uni-heidelberg.de> References: <0975dcd6-a665-31f8-6070-a73b414f3d25@urz.uni-heidelberg.de> Message-ID: An HTML attachment was scrubbed... URL: From sven.siebler at urz.uni-heidelberg.de Wed Sep 5 12:44:32 2018 From: sven.siebler at urz.uni-heidelberg.de (Sven Siebler) Date: Wed, 5 Sep 2018 13:44:32 +0200 Subject: [gpfsug-discuss] Getting inode information with REST API V2 In-Reply-To: References: <0975dcd6-a665-31f8-6070-a73b414f3d25@urz.uni-heidelberg.de> Message-ID: Hi Andreas, i've forgotten to mention that we are currently using ISS v4.2.1, not v5.0.0. Invastigating the command i got the following: # /usr/lpp/mmfs/gui/cli/runtask FILESETS --debug debug: locale=en_US debug: Running 'mmlsfileset 'lsdf02' -di -Y ' on node localhost debug: Raising event: inode_normal debug: Running 'mmsysmonc event 'filesystem' 'inode_normal' 'lsdf02/sd17e005' 'lsdf02/sd17e005,' ' on node localhost debug: Raising event: inode_normal debug: Running 'mmsysmonc event 'filesystem' 'inode_normal' 'lsdf02/sd17g004' 'lsdf02/sd17g004,' ' on node localhost [...] debug: perf: Executing mmhealth node show --verbose -N 'llsdf02e4' -Y? took 1330ms [...] debug: Inserting 0 new informational HealthEvents for node llsdf02e4 debug: perf: processInfoEvents() with 2 events took 5ms debug: perf: Parsing 23 state rows took 9ms debug: Deleted 0 orphaned states. debug: Loaded list of state changing HealthEvent objects. 
Size: 4 debug: Inserting 0 new state changing HealthEvents in the history table for node llsdf02e4 debug: perf: processStateChangingEvents() with 3 events took 2ms debug: perf: pool-90578-thread-1 - Processing 5 eventlog rows of node llsdf02e4 took 10ms in total debug: Deleted 0 orphaned states from history. debug: Loaded list of state changing HealthEvent objects. Size: 281 debug: Inserting 0 new state changing HealthEvents for node llsdf02e4 debug: perf: Processing 23 state rows took 59ms in total The command takes very long due to the -di option. I tried also your posted zimon command: #? echo "get -a metrics max(gpfs_fset_maxInodes),max(gpfs_fset_freeInodes),max(gpfs_fset_allocInodes) from gpfs_fs_name=lsdf02 group_by gpfs_fset_name last 13 bucket_size 300" | /opt/IBM/zimon/zc 127.0.0.1 Error: No data available for query: 6396075 In the Admin GUI i noticed that the Information in "Files -> Filesets -> -> Details" shows inconsistent inode information, e.g. ? in Overview: ????? Inodes: 76M ????? Max Inodes: 315M ? in Properties: ? ?? Inodes:??? ??? 1 ???? Max inodes:??? ??? 314572800 thanks, Sven On 05.09.2018 11:13, Andreas Koeninger wrote: > Hi Sven, > the REST API v2 provides similar information to what v1 provided. See > an example from my system below: > /scalemgmt/v2/filesystems/gpfs0/filesets?fields=:all: > [...] > ??? "filesetName" : "fset1", > ??? "filesystemName" : "gpfs0", > ??? "usage" : { > ????? "allocatedInodes" : 51232, > ????? "inodeSpaceFreeInodes" : 51231, > ????? "inodeSpaceUsedInodes" : 1, > ????? "usedBytes" : 0, > ????? "usedInodes" : 1 > ??? } > ? } ], > *In 5.0.0 there are two sources for the inode information: the first > one is mmlsfileset and the second one is the data collected by Zimon.* > Depending on the availability of the data either one is used. > > To debug what's happening on your system you can *execute the FILESETS > task on the GUI node* manually with the --debug flag. The output is > then showing the exact queries that are used to retrieve the data: > *[root at os-11 ~]# /usr/lpp/mmfs/gui/cli/runtask FILESETS --debug* > debug: locale=en_US > debug: Running 'mmlsfileset 'gpfs0' -Y ' on node localhost > debug: Running zimon query: 'get -ja metrics > max(gpfs_fset_maxInodes),max(gpfs_fset_freeInodes),max(gpfs_fset_allocInodes),max(gpfs_rq_blk_current),max(gpfs_rq_file_current) > from gpfs_fs_name=gpfs0 group_by gpfs_fset_name last 13 bucket_size 300' > debug: Running 'mmlsfileset 'objfs' -Y ' on node localhost > debug: Running zimon query: 'get -ja metrics > max(gpfs_fset_maxInodes),max(gpfs_fset_freeInodes),max(gpfs_fset_allocInodes),max(gpfs_rq_blk_current),max(gpfs_rq_file_current) > from gpfs_fs_name=objfs group_by gpfs_fset_name last 13 bucket_size 300' > EFSSG1000I The command completed successfully. 
> *As a start I suggest running the displayed Zimon queries manually to > see what's returned there, e.g.:* > /(Removed -j for better readability)/ > > *[root at os-11 ~]# echo "get -a metrics > max(gpfs_fset_maxInodes),max(gpfs_fset_freeInodes),max(gpfs_fset_allocInodes),max(gpfs_rq_blk_current),max(gpfs_rq_file_current) > from gpfs_fs_name=gpfs0 group_by gpfs_fset_name last 13 bucket_size > 300" | /opt/IBM/zimon/zc 127.0.0.1* > 1: > ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|.audit_log|gpfs_fset_maxInodes > 2: ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|fset1|gpfs_fset_maxInodes > 3: ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|root|gpfs_fset_maxInodes > 4: > ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|.audit_log|gpfs_fset_freeInodes > 5: ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|fset1|gpfs_fset_freeInodes > 6: ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|root|gpfs_fset_freeInodes > 7: > ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|.audit_log|gpfs_fset_allocInodes > 8: ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|fset1|gpfs_fset_allocInodes > 9: ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|root|gpfs_fset_allocInodes > Row?? ?Timestamp?? ??? ?max(gpfs_fset_maxInodes) > ?max(gpfs_fset_maxInodes)?? ?max(gpfs_fset_maxInodes) > ?max(gpfs_fset_freeInodes)?? ?max(gpfs_fset_freeInodes) > ?max(gpfs_fset_freeInodes)?? ?max(gpfs_fset_allocInodes) > ?max(gpfs_fset_allocInodes)?? ?max(gpfs_fset_allocInodes) > 1?? ?2018-09-05 10:10:00?? ?100000?? ?620640?? ?65792 ?65795?? > ?51231?? ?61749?? ?65824?? ?51232?? ?65792 > 2?? ?2018-09-05 10:15:00?? ?100000?? ?620640?? ?65792 ?65795?? > ?51231?? ?61749?? ?65824?? ?51232?? ?65792 > 3?? ?2018-09-05 10:20:00?? ?100000?? ?620640?? ?65792 ?65795?? > ?51231?? ?61749?? ?65824?? ?51232?? ?65792 > 4?? ?2018-09-05 10:25:00?? ?100000?? ?620640?? ?65792 ?65795?? > ?51231?? ?61749?? ?65824?? ?51232?? ?65792 > 5?? ?2018-09-05 10:30:00?? ?100000?? ?620640?? ?65792 ?65795?? > ?51231?? ?61749?? ?65824?? ?51232?? ?65792 > 6?? ?2018-09-05 10:35:00?? ?100000?? ?620640?? ?65792 ?65795?? > ?51231?? ?61749?? ?65824?? ?51232?? ?65792 > 7?? ?2018-09-05 10:40:00?? ?100000?? ?620640?? ?65792 ?65795?? > ?51231?? ?61749?? ?65824?? ?51232?? ?65792 > 8?? ?2018-09-05 10:45:00?? ?100000?? ?620640?? ?65792 ?65795?? > ?51231?? ?61749?? ?65824?? ?51232?? ?65792 > 9?? ?2018-09-05 10:50:00?? ?100000?? ?620640?? ?65792 ?65795?? > ?51231?? ?61749?? ?65824?? ?51232?? ?65792 > 10?? ?2018-09-05 10:55:00?? ?100000?? ?620640?? ?65792 ?65795?? > ?51231?? ?61749?? ?65824?? ?51232?? ?65792 > 11?? ?2018-09-05 11:00:00?? ?100000?? ?620640?? ?65792 ?65795?? > ?51231?? ?61749?? ?65824?? ?51232?? ?65792 > 12?? ?2018-09-05 11:05:00?? ?100000?? ?620640?? ?65792 ?65795?? > ?51231?? ?61749?? ?65824?? ?51232?? ?65792 > 13?? ?2018-09-05 11:10:00?? ?100000?? ?620640?? ?65792 ?65795?? > ?51231?? ?61749?? ?65824?? ?51232?? ?65792 > . 
> > Mit freundlichen Gr??en / Kind regards > > Andreas Koeninger > Scrum Master and Software Developer / Spectrum Scale GUI and REST API > IBM Systems &Technology Group, Integrated Systems Development / M069 > ------------------------------------------------------------------------------------------------------------------------------------------- > IBM Deutschland > Am Weiher 24 > 65451 Kelsterbach > Phone: +49-7034-643-0867 > Mobile: +49-7034-643-0867 > E-Mail: andreas.koeninger at de.ibm.com > ------------------------------------------------------------------------------------------------------------------------------------------- > IBM Deutschland Research & Development GmbH / Vorsitzende des > Aufsichtsrats: Martina Koederitz > Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / > Registergericht: Amtsgericht Stuttgart, HRB 243294 > > ----- Original message ----- > From: Sven Siebler > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug-discuss at spectrumscale.org > Cc: > Subject: [gpfsug-discuss] Getting inode information with REST API V2 > Date: Wed, Sep 5, 2018 9:37 AM > Hi all, > > i just started to use the REST API for our monitoring and my > question is > concerning about how can i get information about allocated inodes with > REST API V2 ? > > Up to now i use "mmlsfileset" directly, which gives me information on > maximum and allocated inodes (mmdf for total/free/allocated inodes of > the filesystem) > > If i use the REST API V2 with > "filesystems//filesets?fields=:all:", i get all > information except the allocated inodes. > > On the documentation > (https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adm_apiv2getfilesystemfilesets.htm) > i found: > > ?> "inodeSpace": "Inodes" > ?> The number of inodes that are allocated for use by the fileset. > > but for me the inodeSpace looks more like the ID of the inodespace, > instead of the number of allocated inodes. > > In the documentation example the API can give output like this: > > "filesetName" : "root", > ??????? "filesystemName" : "gpfs0", > ??????? "usage" : { > ?? ? ?? ??? "allocatedInodes" : 100000, > ?????? ? ?? "inodeSpaceFreeInodes" : 95962, > ??????????? "inodeSpaceUsedInodes" : 4038, > ?????? ? ?? "usedBytes" : 0, > ??????? ? ? "usedInodes" : 4038 > } > > but i could not retrieve such usage-fields in my queries. > > The only way for me to get inode information with REST is the > usage of V1: > > https://REST_API_host:port/scalemgmt/v1/filesets?filesystemName=FileSystemName > > which gives exact the information of "mmlsfileset". > > But because V1 is deprecated i want to use V2 for rewriting our > tools... > > Thanks, > > Sven > > > -- > Sven Siebler > Servicebereich Future IT - Research & Education (FIRE) > > Tel. +49 6221 54 20032 > sven.siebler at urz.uni-heidelberg.de > Universit?t Heidelberg > Universit?tsrechenzentrum (URZ) > Im Neuenheimer Feld 293, D-69120 Heidelberg > http://www.urz.uni-heidelberg.de > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -- Sven Siebler Servicebereich Future IT - Research & Education (FIRE) Tel. +49 6221 54 20032 sven.siebler at urz.uni-heidelberg.de Universit?t Heidelberg Universit?tsrechenzentrum (URZ) Im Neuenheimer Feld 293, D-69120 Heidelberg http://www.urz.uni-heidelberg.de -------------- next part -------------- A non-text attachment was scrubbed... 
Name: smime.p7s Type: application/pkcs7-signature Size: 5437 bytes Desc: S/MIME Cryptographic Signature URL: From andreas.koeninger at de.ibm.com Wed Sep 5 14:42:00 2018 From: andreas.koeninger at de.ibm.com (Andreas Koeninger) Date: Wed, 5 Sep 2018 13:42:00 +0000 Subject: [gpfsug-discuss] Getting inode information with REST API V2 In-Reply-To: References: , <0975dcd6-a665-31f8-6070-a73b414f3d25@urz.uni-heidelberg.de> Message-ID: An HTML attachment was scrubbed... URL: From sven.siebler at urz.uni-heidelberg.de Wed Sep 5 15:17:03 2018 From: sven.siebler at urz.uni-heidelberg.de (Sven Siebler) Date: Wed, 5 Sep 2018 16:17:03 +0200 Subject: [gpfsug-discuss] Getting inode information with REST API V2 In-Reply-To: References: Message-ID: Hi Andreas, you are right ... our Storage Cluster is on v4.2.1 at the moment, while the CES/GUI Nodes running on 4.2.3.6. The GPFSFilesetQuota Sensor is enabled and restricted to the GUI Node, due to the performance impact: { ??????? name = "GPFSFilesetQuota" ??????? period = 3600 ??????? restrict = "llsdf02e4" }, { ??????? name = "GPFSDiskCap" ??????? period = 10800 ??????? restrict = "llsdf02e4" }, thanks, Sven On 05.09.2018 15:42, gpfsug-discuss-request at spectrumscale.org wrote: > Send gpfsug-discuss mailing list submissions to > gpfsug-discuss at spectrumscale.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > or, via email, send a message with subject or body 'help' to > gpfsug-discuss-request at spectrumscale.org > > You can reach the person managing the list at > gpfsug-discuss-owner at spectrumscale.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of gpfsug-discuss digest..." > > > Today's Topics: > > 1. Re: Getting inode information with REST API V2 (Sven Siebler) > 2. Re: Getting inode information with REST API V2 (Andreas Koeninger) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Wed, 5 Sep 2018 13:44:32 +0200 > From: Sven Siebler > To: Andreas Koeninger > Cc: gpfsug-discuss at spectrumscale.org > Subject: Re: [gpfsug-discuss] Getting inode information with REST API > V2 > Message-ID: > > Content-Type: text/plain; charset="utf-8"; Format="flowed" > > Hi Andreas, > > i've forgotten to mention that we are currently using ISS v4.2.1, not > v5.0.0. > > Invastigating the command i got the following: > > # /usr/lpp/mmfs/gui/cli/runtask FILESETS --debug > debug: locale=en_US > debug: Running 'mmlsfileset 'lsdf02' -di -Y ' on node localhost > > debug: Raising event: inode_normal > debug: Running 'mmsysmonc event 'filesystem' 'inode_normal' > 'lsdf02/sd17e005' 'lsdf02/sd17e005,' ' on node localhost > debug: Raising event: inode_normal > debug: Running 'mmsysmonc event 'filesystem' 'inode_normal' > 'lsdf02/sd17g004' 'lsdf02/sd17g004,' ' on node localhost > [...] > debug: perf: Executing mmhealth node show --verbose -N 'llsdf02e4' -Y? > took 1330ms > [...] > debug: Inserting 0 new informational HealthEvents for node llsdf02e4 > debug: perf: processInfoEvents() with 2 events took 5ms > debug: perf: Parsing 23 state rows took 9ms > debug: Deleted 0 orphaned states. > debug: Loaded list of state changing HealthEvent objects. 
Size: 4 > debug: Inserting 0 new state changing HealthEvents in the history table > for node llsdf02e4 > debug: perf: processStateChangingEvents() with 3 events took 2ms > debug: perf: pool-90578-thread-1 - Processing 5 eventlog rows of node > llsdf02e4 took 10ms in total > debug: Deleted 0 orphaned states from history. > debug: Loaded list of state changing HealthEvent objects. Size: 281 > debug: Inserting 0 new state changing HealthEvents for node llsdf02e4 > debug: perf: Processing 23 state rows took 59ms in total > > The command takes very long due to the -di option. > > I tried also your posted zimon command: > > #? echo "get -a metrics > max(gpfs_fset_maxInodes),max(gpfs_fset_freeInodes),max(gpfs_fset_allocInodes) > from gpfs_fs_name=lsdf02 group_by gpfs_fset_name last 13 bucket_size > 300" | /opt/IBM/zimon/zc 127.0.0.1 > > Error: No data available for query: 6396075 > > In the Admin GUI i noticed that the Information in "Files -> Filesets -> > -> Details" shows inconsistent inode information, e.g. > > ? in Overview: > ????? Inodes: 76M > ????? Max Inodes: 315M > > ? in Properties: > ? ?? Inodes:??? ??? 1 > ???? Max inodes:??? ??? 314572800 > > thanks, > Sven > > > > On 05.09.2018 11:13, Andreas Koeninger wrote: >> Hi Sven, >> the REST API v2 provides similar information to what v1 provided. See >> an example from my system below: >> /scalemgmt/v2/filesystems/gpfs0/filesets?fields=:all: >> [...] >> ??? "filesetName" : "fset1", >> ??? "filesystemName" : "gpfs0", >> ??? "usage" : { >> ????? "allocatedInodes" : 51232, >> ????? "inodeSpaceFreeInodes" : 51231, >> ????? "inodeSpaceUsedInodes" : 1, >> ????? "usedBytes" : 0, >> ????? "usedInodes" : 1 >> ??? } >> ? } ], >> *In 5.0.0 there are two sources for the inode information: the first >> one is mmlsfileset and the second one is the data collected by Zimon.* >> Depending on the availability of the data either one is used. >> >> To debug what's happening on your system you can *execute the FILESETS >> task on the GUI node* manually with the --debug flag. The output is >> then showing the exact queries that are used to retrieve the data: >> *[root at os-11 ~]# /usr/lpp/mmfs/gui/cli/runtask FILESETS --debug* >> debug: locale=en_US >> debug: Running 'mmlsfileset 'gpfs0' -Y ' on node localhost >> debug: Running zimon query: 'get -ja metrics >> max(gpfs_fset_maxInodes),max(gpfs_fset_freeInodes),max(gpfs_fset_allocInodes),max(gpfs_rq_blk_current),max(gpfs_rq_file_current) >> from gpfs_fs_name=gpfs0 group_by gpfs_fset_name last 13 bucket_size 300' >> debug: Running 'mmlsfileset 'objfs' -Y ' on node localhost >> debug: Running zimon query: 'get -ja metrics >> max(gpfs_fset_maxInodes),max(gpfs_fset_freeInodes),max(gpfs_fset_allocInodes),max(gpfs_rq_blk_current),max(gpfs_rq_file_current) >> from gpfs_fs_name=objfs group_by gpfs_fset_name last 13 bucket_size 300' >> EFSSG1000I The command completed successfully. 
>> *As a start I suggest running the displayed Zimon queries manually to >> see what's returned there, e.g.:* >> /(Removed -j for better readability)/ >> >> *[root at os-11 ~]# echo "get -a metrics >> max(gpfs_fset_maxInodes),max(gpfs_fset_freeInodes),max(gpfs_fset_allocInodes),max(gpfs_rq_blk_current),max(gpfs_rq_file_current) >> from gpfs_fs_name=gpfs0 group_by gpfs_fset_name last 13 bucket_size >> 300" | /opt/IBM/zimon/zc 127.0.0.1* >> 1: >> ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|.audit_log|gpfs_fset_maxInodes >> 2: ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|fset1|gpfs_fset_maxInodes >> 3: ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|root|gpfs_fset_maxInodes >> 4: >> ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|.audit_log|gpfs_fset_freeInodes >> 5: ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|fset1|gpfs_fset_freeInodes >> 6: ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|root|gpfs_fset_freeInodes >> 7: >> ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|.audit_log|gpfs_fset_allocInodes >> 8: ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|fset1|gpfs_fset_allocInodes >> 9: ?gpfs-cluster-1.novalocal|GPFSFileset|gpfs0|root|gpfs_fset_allocInodes >> Row?? ?Timestamp?? ??? ?max(gpfs_fset_maxInodes) >> ?max(gpfs_fset_maxInodes)?? ?max(gpfs_fset_maxInodes) >> ?max(gpfs_fset_freeInodes)?? ?max(gpfs_fset_freeInodes) >> ?max(gpfs_fset_freeInodes)?? ?max(gpfs_fset_allocInodes) >> ?max(gpfs_fset_allocInodes)?? ?max(gpfs_fset_allocInodes) >> 1?? ?2018-09-05 10:10:00?? ?100000?? ?620640?? ?65792 ?65795?? >> ?51231?? ?61749?? ?65824?? ?51232?? ?65792 >> 2?? ?2018-09-05 10:15:00?? ?100000?? ?620640?? ?65792 ?65795?? >> ?51231?? ?61749?? ?65824?? ?51232?? ?65792 >> 3?? ?2018-09-05 10:20:00?? ?100000?? ?620640?? ?65792 ?65795?? >> ?51231?? ?61749?? ?65824?? ?51232?? ?65792 >> 4?? ?2018-09-05 10:25:00?? ?100000?? ?620640?? ?65792 ?65795?? >> ?51231?? ?61749?? ?65824?? ?51232?? ?65792 >> 5?? ?2018-09-05 10:30:00?? ?100000?? ?620640?? ?65792 ?65795?? >> ?51231?? ?61749?? ?65824?? ?51232?? ?65792 >> 6?? ?2018-09-05 10:35:00?? ?100000?? ?620640?? ?65792 ?65795?? >> ?51231?? ?61749?? ?65824?? ?51232?? ?65792 >> 7?? ?2018-09-05 10:40:00?? ?100000?? ?620640?? ?65792 ?65795?? >> ?51231?? ?61749?? ?65824?? ?51232?? ?65792 >> 8?? ?2018-09-05 10:45:00?? ?100000?? ?620640?? ?65792 ?65795?? >> ?51231?? ?61749?? ?65824?? ?51232?? ?65792 >> 9?? ?2018-09-05 10:50:00?? ?100000?? ?620640?? ?65792 ?65795?? >> ?51231?? ?61749?? ?65824?? ?51232?? ?65792 >> 10?? ?2018-09-05 10:55:00?? ?100000?? ?620640?? ?65792 ?65795?? >> ?51231?? ?61749?? ?65824?? ?51232?? ?65792 >> 11?? ?2018-09-05 11:00:00?? ?100000?? ?620640?? ?65792 ?65795?? >> ?51231?? ?61749?? ?65824?? ?51232?? ?65792 >> 12?? ?2018-09-05 11:05:00?? ?100000?? ?620640?? ?65792 ?65795?? >> ?51231?? ?61749?? ?65824?? ?51232?? ?65792 >> 13?? ?2018-09-05 11:10:00?? ?100000?? ?620640?? ?65792 ?65795?? >> ?51231?? ?61749?? ?65824?? ?51232?? ?65792 >> . 
>> >> Mit freundlichen Gr??en / Kind regards >> >> Andreas Koeninger >> Scrum Master and Software Developer / Spectrum Scale GUI and REST API >> IBM Systems &Technology Group, Integrated Systems Development / M069 >> ------------------------------------------------------------------------------------------------------------------------------------------- >> IBM Deutschland >> Am Weiher 24 >> 65451 Kelsterbach >> Phone: +49-7034-643-0867 >> Mobile: +49-7034-643-0867 >> E-Mail: andreas.koeninger at de.ibm.com >> ------------------------------------------------------------------------------------------------------------------------------------------- >> IBM Deutschland Research & Development GmbH / Vorsitzende des >> Aufsichtsrats: Martina Koederitz >> Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / >> Registergericht: Amtsgericht Stuttgart, HRB 243294 >> >> ----- Original message ----- >> From: Sven Siebler >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> To: gpfsug-discuss at spectrumscale.org >> Cc: >> Subject: [gpfsug-discuss] Getting inode information with REST API V2 >> Date: Wed, Sep 5, 2018 9:37 AM >> Hi all, >> >> i just started to use the REST API for our monitoring and my >> question is >> concerning about how can i get information about allocated inodes with >> REST API V2 ? >> >> Up to now i use "mmlsfileset" directly, which gives me information on >> maximum and allocated inodes (mmdf for total/free/allocated inodes of >> the filesystem) >> >> If i use the REST API V2 with >> "filesystems//filesets?fields=:all:", i get all >> information except the allocated inodes. >> >> On the documentation >> (https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adm_apiv2getfilesystemfilesets.htm) >> i found: >> >> ?> "inodeSpace": "Inodes" >> ?> The number of inodes that are allocated for use by the fileset. >> >> but for me the inodeSpace looks more like the ID of the inodespace, >> instead of the number of allocated inodes. >> >> In the documentation example the API can give output like this: >> >> "filesetName" : "root", >> ??????? "filesystemName" : "gpfs0", >> ??????? "usage" : { >> ?? ? ?? ??? "allocatedInodes" : 100000, >> ?????? ? ?? "inodeSpaceFreeInodes" : 95962, >> ??????????? "inodeSpaceUsedInodes" : 4038, >> ?????? ? ?? "usedBytes" : 0, >> ??????? ? ? "usedInodes" : 4038 >> } >> >> but i could not retrieve such usage-fields in my queries. >> >> The only way for me to get inode information with REST is the >> usage of V1: >> >> https://REST_API_host:port/scalemgmt/v1/filesets?filesystemName=FileSystemName >> >> which gives exact the information of "mmlsfileset". >> >> But because V1 is deprecated i want to use V2 for rewriting our >> tools... >> >> Thanks, >> >> Sven >> >> >> -- >> Sven Siebler >> Servicebereich Future IT - Research & Education (FIRE) >> >> Tel. +49 6221 54 20032 >> sven.siebler at urz.uni-heidelberg.de >> Universit?t Heidelberg >> Universit?tsrechenzentrum (URZ) >> Im Neuenheimer Feld 293, D-69120 Heidelberg >> http://www.urz.uni-heidelberg.de >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> -- Sven Siebler Servicebereich Future IT - Research & Education (FIRE) Tel. 
+49 6221 54 20032
sven.siebler at urz.uni-heidelberg.de
Universität Heidelberg
Universitätsrechenzentrum (URZ)
Im Neuenheimer Feld 293, D-69120 Heidelberg
http://www.urz.uni-heidelberg.de

From olaf.weiser at de.ibm.com Wed Sep 5 16:01:23 2018
From: olaf.weiser at de.ibm.com (Olaf Weiser)
Date: Wed, 5 Sep 2018 17:01:23 +0200
Subject: [gpfsug-discuss] Top files on GPFS filesystem
In-Reply-To: 
References: 
Message-ID: 

An HTML attachment was scrubbed...
URL: 

From anobre at br.ibm.com Wed Sep 5 16:14:24 2018
From: anobre at br.ibm.com (Anderson Ferreira Nobre)
Date: Wed, 5 Sep 2018 15:14:24 +0000
Subject: [gpfsug-discuss] Top files on GPFS filesystem
In-Reply-To: 
References: 
Message-ID: 

An HTML attachment was scrubbed...
URL: 

From Kevin.Buterbaugh at Vanderbilt.Edu Wed Sep 5 16:34:51 2018
From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L)
Date: Wed, 5 Sep 2018 15:34:51 +0000
Subject: [gpfsug-discuss] RAID type for system pool
Message-ID: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu>

Hi All,

We are in the process of finalizing the purchase of some new storage arrays (so no sales people who might be monitoring this list need contact me) to life-cycle some older hardware. One of the things we are considering is the purchase of some new SSD's for our "/home" filesystem, and I have a question or two related to that.

Currently, the existing home filesystem has its metadata on SSD's - two RAID 1 mirrors and metadata replication set to two. However, the filesystem itself is old enough that it uses 512 byte inodes. We have analyzed our users' files and know that if we create a new filesystem with 4K inodes, a very significant portion of the files would now have their _data_ stored in the inode as well, due to the files being 3.5K or smaller (currently all data is on spinning HD RAID 1 mirrors). Of course, if we increase the size of the inodes by a factor of 8 then we also need 8 times as much space to store those inodes. Given that Enterprise class SSDs are still very expensive and our budget is not unlimited, we're trying to get the best bang for the buck.

We have always - even back in the day when our metadata was on spinning disk and not SSD - used RAID 1 mirrors and metadata replication of two. However, we are wondering if it might be possible to switch to RAID 5? Specifically, what we are considering doing is buying 8 new SSDs and creating two 3+1P RAID 5 LUNs (metadata replication would stay at two). That would give us 50% more usable space than if we configured those same 8 drives as four RAID 1 mirrors. Unfortunately, unless I'm misunderstanding something, that would mean that the RAID stripe size and the GPFS block size could not match. Therefore, even though we don't need the space, would we be much better off to buy 10 SSDs and create two 4+1P RAID 5 LUNs? I've searched the mailing list archives and scanned the DeveloperWorks wiki and even glanced at the GPFS documentation and haven't found anything that says "bad idea, Kevin"... ;-)

Expanding on this further ... if we just present those two RAID 5 LUNs to GPFS as NSDs then we can only have two NSD servers as primary for them. So another thing we're considering is to take those RAID 5 LUNs and further sub-divide them into a total of 8 logical volumes, each of which could be a GPFS NSD, and therefore would allow us to have each of our 8 NSD servers be primary for one of them. Even worse idea?!? Good idea? Anybody have any better ideas??? ;-)

Oh, and currently we're on GPFS 4.2.3-10, but are also planning on moving to GPFS 5.0.1-x before creating the new filesystem.

Thanks much...

Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633
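As a rough illustration of the stripe-alignment concern above, here is a small sketch (the 256 KiB per-disk segment size is an assumed example value, not a number from the post):

# Illustrative arithmetic only: can a full RAID 5 stripe line up with a
# power-of-two GPFS block size?
KiB = 1024
segment = 256 * KiB                 # assumed per-disk strip size on the RAID controller
for data_disks in (3, 4):           # 3+1P vs. 4+1P RAID 5
    stripe = data_disks * segment   # size of one full-stripe write
    power_of_two = (stripe & (stripe - 1)) == 0
    print("%d+1P: full stripe = %d KiB (%s)" % (
        data_disks, stripe // KiB,
        "power of two, can match a GPFS block size" if power_of_two
        else "not a power of two, cannot match"))

With 3 data disks the full stripe is 768 KiB, which no power-of-two GPFS block size can equal; with 4 data disks it is 1 MiB, which can.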
filesystem and I have a question or two related to that. Currently, the existing home filesystem has it?s metadata on SSD?s ? two RAID 1 mirrors and metadata replication set to two. However, the filesystem itself is old enough that it uses 512 byte inodes. We have analyzed our users files and know that if we create a new filesystem with 4K inodes that a very significant portion of the files would now have their _data_ stored in the inode as well due to the files being 3.5K or smaller (currently all data is on spinning HD RAID 1 mirrors). Of course, if we increase the size of the inodes by a factor of 8 then we also need 8 times as much space to store those inodes. Given that Enterprise class SSDs are still very expensive and our budget is not unlimited, we?re trying to get the best bang for the buck. We have always - even back in the day when our metadata was on spinning disk and not SSD - used RAID 1 mirrors and metadata replication of two. However, we are wondering if it might be possible to switch to RAID 5? Specifically, what we are considering doing is buying 8 new SSDs and creating two 3+1P RAID 5 LUNs (metadata replication would stay at two). That would give us 50% more usable space than if we configured those same 8 drives as four RAID 1 mirrors. Unfortunately, unless I?m misunderstanding something, mean that the RAID stripe size and the GPFS block size could not match. Therefore, even though we don?t need the space, would we be much better off to buy 10 SSDs and create two 4+1P RAID 5 LUNs? I?ve searched the mailing list archives and scanned the DeveloperWorks wiki and even glanced at the GPFS documentation and haven?t found anything that says ?bad idea, Kevin?? ;-) Expanding on this further ? if we just present those two RAID 5 LUNs to GPFS as NSDs then we can only have two NSD servers as primary for them. So another thing we?re considering is to take those RAID 5 LUNs and further sub-divide them into a total of 8 logical volumes, each of which could be a GPFS NSD and therefore would allow us to have each of our 8 NSD servers be primary for one of them. Even worse idea?!? Good idea? Anybody have any better ideas??? ;-) Oh, and currently we?re on GPFS 4.2.3-10, but are also planning on moving to GPFS 5.0.1-x before creating the new filesystem. Thanks much? ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From anobre at br.ibm.com Wed Sep 5 17:50:45 2018 From: anobre at br.ibm.com (Anderson Ferreira Nobre) Date: Wed, 5 Sep 2018 16:50:45 +0000 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> Message-ID: An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Wed Sep 5 18:20:00 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 5 Sep 2018 13:20:00 -0400 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> Message-ID: It's good to try to reason and think this out... 
But there's a good likelihood that we don't understand ALL the details, some of which may negatively impact performance - so no matter what scheme you come up with - test, test, and re-test before deploying and depending on it in production. Having said that, I'm pretty sure that old "spinning" RAID 5 implementations had horrible performance for GPFS metadata/system pool. Why? Among other things, the large stripe size vs the almost random small writes directed to system pool. That random-small-writes pattern won't change when we go to SSD RAID 5 - so you'd have to see if the SSD implementation is somehow smarter than an old fashioned RAID 5 implementation which I believe requires several physical reads and writes, for each "small" logical write. (Top decent google result I found quickly http://rickardnobel.se/raid-5-write-penalty/ But you will probably want to do more research!) Consider GPFS small write performance for: inode updates, log writes, small files (possibly in inode), directory updates, allocation map updates, index of indirect blocks. From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 09/05/2018 11:36 AM Subject: [gpfsug-discuss] RAID type for system pool Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, We are in the process of finalizing the purchase of some new storage arrays (so no sales people who might be monitoring this list need contact me) to life-cycle some older hardware. One of the things we are considering is the purchase of some new SSD?s for our ?/home? filesystem and I have a question or two related to that. Currently, the existing home filesystem has it?s metadata on SSD?s ? two RAID 1 mirrors and metadata replication set to two. However, the filesystem itself is old enough that it uses 512 byte inodes. We have analyzed our users files and know that if we create a new filesystem with 4K inodes that a very significant portion of the files would now have their _data_ stored in the inode as well due to the files being 3.5K or smaller (currently all data is on spinning HD RAID 1 mirrors). Of course, if we increase the size of the inodes by a factor of 8 then we also need 8 times as much space to store those inodes. Given that Enterprise class SSDs are still very expensive and our budget is not unlimited, we?re trying to get the best bang for the buck. We have always - even back in the day when our metadata was on spinning disk and not SSD - used RAID 1 mirrors and metadata replication of two. However, we are wondering if it might be possible to switch to RAID 5? Specifically, what we are considering doing is buying 8 new SSDs and creating two 3+1P RAID 5 LUNs (metadata replication would stay at two). That would give us 50% more usable space than if we configured those same 8 drives as four RAID 1 mirrors. Unfortunately, unless I?m misunderstanding something, mean that the RAID stripe size and the GPFS block size could not match. Therefore, even though we don?t need the space, would we be much better off to buy 10 SSDs and create two 4+1P RAID 5 LUNs? I?ve searched the mailing list archives and scanned the DeveloperWorks wiki and even glanced at the GPFS documentation and haven?t found anything that says ?bad idea, Kevin?? ;-) Expanding on this further ? if we just present those two RAID 5 LUNs to GPFS as NSDs then we can only have two NSD servers as primary for them. 
So another thing we?re considering is to take those RAID 5 LUNs and further sub-divide them into a total of 8 logical volumes, each of which could be a GPFS NSD and therefore would allow us to have each of our 8 NSD servers be primary for one of them. Even worse idea?!? Good idea? Anybody have any better ideas??? ;-) Oh, and currently we?re on GPFS 4.2.3-10, but are also planning on moving to GPFS 5.0.1-x before creating the new filesystem. Thanks much? ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Wed Sep 5 18:33:03 2018 From: bbanister at jumptrading.com (Bryan Banister) Date: Wed, 5 Sep 2018 17:33:03 +0000 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> Message-ID: <4bebae105b37448eab6226a68a23b47d@jumptrading.com> I agree with Anderson on his thoughts, mainly that if you want to go with RAID5 then you should analyze your current workload to see if it is mostly read operations or if you have more of a heavy write situation. Read-modify-write penalties and write amplification wearing problems on SSDs will become an issue for performance and life of the SSDs if you have a heavy metadata write workload. This also applies to the data in inode situation. The current workload can be inspected with standard iostat, mmdiag --iohist, mmpmon, and the GPFS perfmon stuff. We have SSDs in both RAID1 (metadata) and RAID5 configurations (data). We?re using the RAID controllers to split up the RAID sets into multiple virtual volumes so that we can have more NSD servers hosting the storage and increase the number of I/O commands (aka queue depth x N LUNs > queue depth x 1 LUN) being sent to the storage. Since there isn?t a seek penalty this is working well for us. As mentioned below, be sure to round-robin the ServerList for the NSDs to spread the load across servers. Hope that helps! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Anderson Ferreira Nobre Sent: Wednesday, September 5, 2018 11:51 AM To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] RAID type for system pool Note: External Email ________________________________ Hi Kevin, RAID5 is good when the read ratio of I/Os is 70% or more. About creating two RAID5 you need to consider the size of the disks and the time to rebuild the RAID in case of failure. Maybe a single RAID5 would be better because you have more disks working in the backend for a single RAID. I think since if you are using SSD disks the time to rebuild the RAID will always be fast. So you wouldn't need a RAID6. Maybe it's a good idea to read the manual of SAS RAID controller to see how long takes to rebuild the RAID in case of a failure. About the stripe size of controller vs block size in GPFS. This is just a guess, and you would need to do some performance test to make sure. You could consider the stripe width of RAID to be the block size of metadata. I think this is the best you can do. Break in several LUNs I consider a good idea for you don't have large queue length in the LUNs. 
Especially if the I/O profile is many I/Os with a small block size. Balancing the LUNs over the NSD servers is a best practice. Do not leave all the LUNs pointing to the first node. Just remember that when you create the NSDs, the device always corresponds to the first node in the server list. This can be laborious work. So to make things easier I create two NSD stanza files. The first one points everything at the first node, like this:

%nsd device=/dev/mapper/mpatha nsd=nsd001 servers=host1,host2,host3,host4 usage=metadataOnly failureGroup=1 pool=system
%nsd device=/dev/mapper/mpathb nsd=nsd002 servers=host1,host2,host3,host4 usage=metadataOnly failureGroup=1 pool=system

Then I use this stanza file to create the NSDs. And create a second stanza file that rotates the server list:

%nsd nsd=nsd001 servers=host1,host2,host3,host4 usage=metadataOnly failureGroup=1 pool=system
%nsd nsd=nsd002 servers=host2,host3,host4,host1 usage=metadataOnly failureGroup=1 pool=system

And apply the change with mmchnsd.

Abraços / Regards / Saludos,

Anderson Nobre
AIX & Power Consultant
Master Certified IT Specialist
IBM Systems Hardware Client Technical Team - IBM Systems Lab Services

[community_general_lab_services]
________________________________
Phone: 55-19-2132-4317
E-mail: anobre at br.ibm.com [IBM]

----- Original message -----
From: "Buterbaugh, Kevin L" >
Sent by: gpfsug-discuss-bounces at spectrumscale.org
To: gpfsug main discussion list >
Cc:
Subject: [gpfsug-discuss] RAID type for system pool
Date: Wed, Sep 5, 2018 12:35 PM

Hi All,

We are in the process of finalizing the purchase of some new storage arrays (so no sales people who might be monitoring this list need contact me) to life-cycle some older hardware. One of the things we are considering is the purchase of some new SSD's for our "/home" filesystem and I have a question or two related to that.

Currently, the existing home filesystem has its metadata on SSD's - two RAID 1 mirrors and metadata replication set to two. However, the filesystem itself is old enough that it uses 512 byte inodes. We have analyzed our users' files and know that if we create a new filesystem with 4K inodes that a very significant portion of the files would now have their _data_ stored in the inode as well due to the files being 3.5K or smaller (currently all data is on spinning HD RAID 1 mirrors).

Of course, if we increase the size of the inodes by a factor of 8 then we also need 8 times as much space to store those inodes. Given that Enterprise class SSDs are still very expensive and our budget is not unlimited, we're trying to get the best bang for the buck.

We have always - even back in the day when our metadata was on spinning disk and not SSD - used RAID 1 mirrors and metadata replication of two. However, we are wondering if it might be possible to switch to RAID 5? Specifically, what we are considering doing is buying 8 new SSDs and creating two 3+1P RAID 5 LUNs (metadata replication would stay at two). That would give us 50% more usable space than if we configured those same 8 drives as four RAID 1 mirrors.

Unfortunately, unless I'm misunderstanding something, that would mean that the RAID stripe size and the GPFS block size could not match. Therefore, even though we don't need the space, would we be much better off to buy 10 SSDs and create two 4+1P RAID 5 LUNs?

I've searched the mailing list archives and scanned the DeveloperWorks wiki and even glanced at the GPFS documentation and haven't found anything that says "bad idea, Kevin!" ;-)

Expanding on this further ...
if we just present those two RAID 5 LUNs to GPFS as NSDs then we can only have two NSD servers as primary for them. So another thing we?re considering is to take those RAID 5 LUNs and further sub-divide them into a total of 8 logical volumes, each of which could be a GPFS NSD and therefore would allow us to have each of our 8 NSD servers be primary for one of them. Even worse idea?!? Good idea? Anybody have any better ideas??? ;-) Oh, and currently we?re on GPFS 4.2.3-10, but are also planning on moving to GPFS 5.0.1-x before creating the new filesystem. Thanks much? ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential, or privileged information and/or personal data. If you are not the intended recipient, you are hereby notified that any review, dissemination, or copying of this email is strictly prohibited, and requested to notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request, or solicitation of any kind to buy, sell, subscribe, redeem, or perform any type of transaction of a financial product. Personal data, as defined by applicable data privacy laws, contained in this email may be processed by the Company, and any of its affiliated or related companies, for potential ongoing compliance and/or business-related purposes. You may have rights regarding your personal data; for information on exercising these rights or the Company?s treatment of personal data, please email datarequests at jumptrading.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: From stockf at us.ibm.com Wed Sep 5 18:37:24 2018 From: stockf at us.ibm.com (Frederick Stock) Date: Wed, 5 Sep 2018 13:37:24 -0400 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> Message-ID: Another option for saving space is to not keep 2 copies of the metadata within GPFS. The SSDs are mirrored so you have two copies though very likely they share a possible single point of failure and that could be a deal breaker. I have my doubts that RAID5 will perform well for the reasons Marc described but worth testing to see how it does perform. If you do test I presume you would also run equivalent tests with a RAID1 (mirrored) configuration. Regarding your point about making multiple volumes that would become GPFS NSDs for metadata. It has been my experience that for traditional RAID systems it is better to have many small metadata LUNs (more IO paths) then a few large metadata LUNs. This becomes less of an issue with ESS, i.e. there you can have a few metadata NSDs yet still get very good performance. 
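One way to make those RAID5 versus RAID1 tests comparable is to drive the same metadata-heavy workload against a scratch file system built on each candidate layout. A rough sketch, assuming mdtest and MPI are available and /gpfs/test is a throwaway file system on the LUNs under test (the directory name, process count and flags here are illustrative, so check them against the mdtest documentation for your build):

   # create/stat/remove many small files and directories per task
   mpirun -np 8 mdtest -d /gpfs/test/mdt -n 20000 -i 3 -u
   # repeat with file contents small enough to live in a 4K inode
   mpirun -np 8 mdtest -d /gpfs/test/mdt -n 20000 -i 3 -u -F -w 3072

Running the identical commands against both layouts and comparing the create/stat/unlink rates gives a reasonable proxy for the small-write behavior discussed above.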
Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Marc A Kaplan" To: gpfsug main discussion list Date: 09/05/2018 01:22 PM Subject: Re: [gpfsug-discuss] RAID type for system pool Sent by: gpfsug-discuss-bounces at spectrumscale.org It's good to try to reason and think this out... But there's a good likelihood that we don't understand ALL the details, some of which may negatively impact performance - so no matter what scheme you come up with - test, test, and re-test before deploying and depending on it in production. Having said that, I'm pretty sure that old "spinning" RAID 5 implementations had horrible performance for GPFS metadata/system pool. Why? Among other things, the large stripe size vs the almost random small writes directed to system pool. That random-small-writes pattern won't change when we go to SSD RAID 5 - so you'd have to see if the SSD implementation is somehow smarter than an old fashioned RAID 5 implementation which I believe requires several physical reads and writes, for each "small" logical write. (Top decent google result I found quickly http://rickardnobel.se/raid-5-write-penalty/But you will probably want to do more research!) Consider GPFS small write performance for: inode updates, log writes, small files (possibly in inode), directory updates, allocation map updates, index of indirect blocks. From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 09/05/2018 11:36 AM Subject: [gpfsug-discuss] RAID type for system pool Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, We are in the process of finalizing the purchase of some new storage arrays (so no sales people who might be monitoring this list need contact me) to life-cycle some older hardware. One of the things we are considering is the purchase of some new SSD?s for our ?/home? filesystem and I have a question or two related to that. Currently, the existing home filesystem has it?s metadata on SSD?s ? two RAID 1 mirrors and metadata replication set to two. However, the filesystem itself is old enough that it uses 512 byte inodes. We have analyzed our users files and know that if we create a new filesystem with 4K inodes that a very significant portion of the files would now have their _data_ stored in the inode as well due to the files being 3.5K or smaller (currently all data is on spinning HD RAID 1 mirrors). Of course, if we increase the size of the inodes by a factor of 8 then we also need 8 times as much space to store those inodes. Given that Enterprise class SSDs are still very expensive and our budget is not unlimited, we?re trying to get the best bang for the buck. We have always - even back in the day when our metadata was on spinning disk and not SSD - used RAID 1 mirrors and metadata replication of two. However, we are wondering if it might be possible to switch to RAID 5? Specifically, what we are considering doing is buying 8 new SSDs and creating two 3+1P RAID 5 LUNs (metadata replication would stay at two). That would give us 50% more usable space than if we configured those same 8 drives as four RAID 1 mirrors. Unfortunately, unless I?m misunderstanding something, mean that the RAID stripe size and the GPFS block size could not match. Therefore, even though we don?t need the space, would we be much better off to buy 10 SSDs and create two 4+1P RAID 5 LUNs? 
I've searched the mailing list archives and scanned the DeveloperWorks wiki and even glanced at the GPFS documentation and haven't found anything that says "bad idea, Kevin!" ;-)

Expanding on this further ... if we just present those two RAID 5 LUNs to GPFS as NSDs then we can only have two NSD servers as primary for them. So another thing we're considering is to take those RAID 5 LUNs and further sub-divide them into a total of 8 logical volumes, each of which could be a GPFS NSD and therefore would allow us to have each of our 8 NSD servers be primary for one of them. Even worse idea?!? Good idea?

Anybody have any better ideas??? ;-)

Oh, and currently we're on GPFS 4.2.3-10, but are also planning on moving to GPFS 5.0.1-x before creating the new filesystem.

Thanks much...

--
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From makaplan at us.ibm.com Wed Sep 5 19:05:35 2018
From: makaplan at us.ibm.com (Marc A Kaplan)
Date: Wed, 5 Sep 2018 14:05:35 -0400
Subject: [gpfsug-discuss] RAID type for system pool
In-Reply-To:
References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu>
Message-ID:

OR don't do RAID replication, but use GPFS triple replication.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/gif
Size: 21994 bytes
Desc: not available
URL:

From Rafael.Cezario at ibm.com Wed Sep 5 21:03:12 2018
From: Rafael.Cezario at ibm.com (Rafael Cezario)
Date: Wed, 5 Sep 2018 20:03:12 +0000
Subject: [gpfsug-discuss] mmbackup failed
Message-ID:

Hi All,

I have a filesystem "/dados" with 900TB of data. I have a backup routine with mmbackup and I receive several errors because of incorrect values in the file .mmbackupShadow.1. The problem was resolved after I removed the offending lines from the file.

Does anyone have a tool or utility to help me check the .mmbackupShadow file for incorrect rows?

Rafael Cezario
IBM Power
rafael.cezario at ibm.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From UWEFALKE at de.ibm.com Wed Sep 5 21:07:22 2018
From: UWEFALKE at de.ibm.com (Uwe Falke)
Date: Wed, 5 Sep 2018 22:07:22 +0200
Subject: [gpfsug-discuss] RAID type for system pool
In-Reply-To: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu>
References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu>
Message-ID:

Hi,

just think that your RAID controller on parity-backed redundancy needs to read the full stripe, modify it, and write it back (including parity) - the infamous Read-Modify-Write penalty. Even if your users don't bulk-create inodes and only amend some metadata now and then (create a file sometimes, e.g.), the writing of a 4k inode or the update of a 32k dir block causes your controller to read a full block (let's say you use 1MiB on MD) and write back the full block plus parity (on 4+1p RAID 5 at 1MiB that'll be 1.25MiB): overhead two orders of magnitude above the payload.
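To put rough numbers on that write amplification, here is a back-of-envelope sketch only; it assumes 256 KiB strips so that four data strips line up with a 1 MiB metadata block, and it ignores any help from controller write cache:

   payload_kib=4                                # one 4 KiB inode update
   strip_kib=256                                # 4 data strips of 256 KiB = 1 MiB stripe
   naive_kib=$(( 4*strip_kib + 5*strip_kib ))   # read the data stripe, write data + parity back
   smart_kib=$(( 2*strip_kib + 2*strip_kib ))   # read old data strip + parity strip, write both back
   echo "full-stripe RMW moves $(( naive_kib / payload_kib ))x the payload"
   echo "strip-level RMW moves $(( smart_kib / payload_kib ))x the payload"

Either way the array moves a few hundred times more data than the 4 KiB that actually changed, which is the penalty a mirrored layout avoids.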
SSDs have become better now and expensive enterprise SSDs will endure quite a lot of full rewrites, but you need to estimate the MD change rate, apply the RMW overhead and see where you end WRT lifetime (and performance). Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 05/09/2018 17:35 Subject: [gpfsug-discuss] RAID type for system pool Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, We are in the process of finalizing the purchase of some new storage arrays (so no sales people who might be monitoring this list need contact me) to life-cycle some older hardware. One of the things we are considering is the purchase of some new SSD?s for our ?/home? filesystem and I have a question or two related to that. Currently, the existing home filesystem has it?s metadata on SSD?s ? two RAID 1 mirrors and metadata replication set to two. However, the filesystem itself is old enough that it uses 512 byte inodes. We have analyzed our users files and know that if we create a new filesystem with 4K inodes that a very significant portion of the files would now have their _data_ stored in the inode as well due to the files being 3.5K or smaller (currently all data is on spinning HD RAID 1 mirrors). Of course, if we increase the size of the inodes by a factor of 8 then we also need 8 times as much space to store those inodes. Given that Enterprise class SSDs are still very expensive and our budget is not unlimited, we?re trying to get the best bang for the buck. We have always - even back in the day when our metadata was on spinning disk and not SSD - used RAID 1 mirrors and metadata replication of two. However, we are wondering if it might be possible to switch to RAID 5? Specifically, what we are considering doing is buying 8 new SSDs and creating two 3+1P RAID 5 LUNs (metadata replication would stay at two). That would give us 50% more usable space than if we configured those same 8 drives as four RAID 1 mirrors. Unfortunately, unless I?m misunderstanding something, mean that the RAID stripe size and the GPFS block size could not match. Therefore, even though we don?t need the space, would we be much better off to buy 10 SSDs and create two 4+1P RAID 5 LUNs? I?ve searched the mailing list archives and scanned the DeveloperWorks wiki and even glanced at the GPFS documentation and haven?t found anything that says ?bad idea, Kevin?? ;-) Expanding on this further ? if we just present those two RAID 5 LUNs to GPFS as NSDs then we can only have two NSD servers as primary for them. So another thing we?re considering is to take those RAID 5 LUNs and further sub-divide them into a total of 8 logical volumes, each of which could be a GPFS NSD and therefore would allow us to have each of our 8 NSD servers be primary for one of them. Even worse idea?!? 
Good idea? Anybody have any better ideas??? ;-) Oh, and currently we?re on GPFS 4.2.3-10, but are also planning on moving to GPFS 5.0.1-x before creating the new filesystem. Thanks much? ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From alex at calicolabs.com Wed Sep 5 21:13:17 2018 From: alex at calicolabs.com (Alex Chekholko) Date: Wed, 5 Sep 2018 13:13:17 -0700 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> Message-ID: Hi Kevin, Why not do single SSD devices and then just use -m DefaultMetadataReplicas = 3 and -M MaxMetadataReplicas = 3 for your mmcrfs ? And maybe you can even get away with -m 2 -M 3. You will get higher performance overall by having more devices. You will get good redundancy with GPFS replicas (just make sure your failure groups make sense). Maybe you can split your SSDs across different shelves or RAID controllers or something. In any case, if you are creating a new filesystem, you can test all this out. Regards, Alex On Wed, Sep 5, 2018 at 1:07 PM Uwe Falke wrote: > Hi, > > just think that your RAID controller on parity-backed redundancy needs to > read the full stripe, modify it, and write it back (including parity) - > the infamous Read-Modify-Write penalty. > As long as your users don't bulk-create inodes and doo amend some > metadata, (create a file sometimes, e.g.) The writing of a 4k inode, or > the update of a 32k dir block causes your controller to read a full block > (let's say you use 1MiB on MD) and write back the full block plus parity > (on 4+1p RAID 5 at 1MiB that'll be 1.25MiB. Overhead two orders of > magnitude above the payload. > SSDs have become better now and expensive enterprise SSDs will endure > quite a lot of full rewrites, but you need to estimate the MD change rate, > apply the RMW overhead and see where you end WRT lifetime (and > performance). > > > > > Mit freundlichen Gr??en / Kind regards > > > Dr. Uwe Falke > > IT Specialist > High Performance Computing Services / Integrated Technology Services / > Data Center Services > > ------------------------------------------------------------------------------------------------------------------------------------------- > IBM Deutschland > Rathausstr. 7 > 09111 Chemnitz > Phone: +49 371 6978 2165 > Mobile: +49 175 575 2877 > E-Mail: uwefalke at de.ibm.com > > ------------------------------------------------------------------------------------------------------------------------------------------- > IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: > Thomas Wolter, Sven Schoo? > Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, > HRB 17122 > > > > > From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 05/09/2018 17:35 > Subject: [gpfsug-discuss] RAID type for system pool > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Hi All, > > We are in the process of finalizing the purchase of some new storage > arrays (so no sales people who might be monitoring this list need contact > me) to life-cycle some older hardware. One of the things we are > considering is the purchase of some new SSD?s for our ?/home? 
filesystem > and I have a question or two related to that. > > Currently, the existing home filesystem has it?s metadata on SSD?s ? two > RAID 1 mirrors and metadata replication set to two. However, the > filesystem itself is old enough that it uses 512 byte inodes. We have > analyzed our users files and know that if we create a new filesystem with > 4K inodes that a very significant portion of the files would now have > their _data_ stored in the inode as well due to the files being 3.5K or > smaller (currently all data is on spinning HD RAID 1 mirrors). > > Of course, if we increase the size of the inodes by a factor of 8 then we > also need 8 times as much space to store those inodes. Given that > Enterprise class SSDs are still very expensive and our budget is not > unlimited, we?re trying to get the best bang for the buck. > > We have always - even back in the day when our metadata was on spinning > disk and not SSD - used RAID 1 mirrors and metadata replication of two. > However, we are wondering if it might be possible to switch to RAID 5? > Specifically, what we are considering doing is buying 8 new SSDs and > creating two 3+1P RAID 5 LUNs (metadata replication would stay at two). > That would give us 50% more usable space than if we configured those same > 8 drives as four RAID 1 mirrors. > > Unfortunately, unless I?m misunderstanding something, mean that the RAID > stripe size and the GPFS block size could not match. Therefore, even > though we don?t need the space, would we be much better off to buy 10 SSDs > and create two 4+1P RAID 5 LUNs? > > I?ve searched the mailing list archives and scanned the DeveloperWorks > wiki and even glanced at the GPFS documentation and haven?t found anything > that says ?bad idea, Kevin?? ;-) > > Expanding on this further ? if we just present those two RAID 5 LUNs to > GPFS as NSDs then we can only have two NSD servers as primary for them. So > another thing we?re considering is to take those RAID 5 LUNs and further > sub-divide them into a total of 8 logical volumes, each of which could be > a GPFS NSD and therefore would allow us to have each of our 8 NSD servers > be primary for one of them. Even worse idea?!? Good idea? > > Anybody have any better ideas??? ;-) > > Oh, and currently we?re on GPFS 4.2.3-10, but are also planning on moving > to GPFS 5.0.1-x before creating the new filesystem. > > Thanks much? > > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and > Education > Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stockf at us.ibm.com Wed Sep 5 21:20:51 2018 From: stockf at us.ibm.com (Frederick Stock) Date: Wed, 5 Sep 2018 16:20:51 -0400 Subject: [gpfsug-discuss] mmbackup failed In-Reply-To: References: Message-ID: There are options in the mmbackup command to rebuild the shadowDB file from data kept in TSM. Be aware that using this option will take time to rebuild the shadowDB file, i.e. it is not a fast procedure. 
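If I recall the options correctly it is something along these lines (the /dados file system name is taken from your note; the exact semantics of the two flags are worth double-checking in the mmbackup man page for your release before running them):

   # rebuild the shadow database from the Spectrum Protect / TSM server inventory
   mmbackup /dados --rebuild
   # or resynchronize the shadow database by querying the server as part of the next incremental
   mmbackup /dados -t incremental -q

On a 900TB file system expect the query of the server inventory to run for quite a while.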
Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Rafael Cezario" To: gpfsug-discuss at spectrumscale.org Date: 09/05/2018 04:04 PM Subject: [gpfsug-discuss] mmbackup failed Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, I have a filesystem ?/dados? with 900TB of data. I have a backup routine with mmbackup and I receive several errors because incorrect values in the file .mmbackupShadow.1. The problem was resolved after I removed the lines of the file. Anyone had any a tool or utility to help me check the file .mmbackupShadow looking incorrect rows? Rafael Cezario IBM Power rafael.cezario at ibm.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From ulmer at ulmer.org Wed Sep 5 21:33:55 2018 From: ulmer at ulmer.org (Stephen Ulmer) Date: Wed, 5 Sep 2018 16:33:55 -0400 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> Message-ID: > On Sep 5, 2018, at 11:34 AM, Buterbaugh, Kevin L > wrote: > > [?] > Of course, if we increase the size of the inodes by a factor of 8 then we also need 8 times as much space to store those inodes. Given that Enterprise class SSDs are still very expensive and our budget is not unlimited, we?re trying to get the best bang for the buck. > Nobody has gone in this direction yet, so I?ll play devil?s advocate: Are you sure you need enterprise class SSDs? The only practical difference between enterprise class SSDs and "read intensive" SSDs is the "endurance" in DWPD[1]. Read-intensive SSDs usually have a DWPD of 1-ish. Enterprise SSDs can have a DWPD as high as 30. So, how many times do you think you?ll actually write all of the data on the SSDs per day? I don?t know how much (meta)data you?ve got, but maybe consider buying the "cheap" SSDs (which will be *much* larger for your dollar) and just use fractions of them with GPFS replication[2] or maybe some vendor?s {distributed, de-clustererd} RAID. Keep some spares. This is probably bad advice, but the thought exercise will let you find the edges of what you meant. :) [1] DWPD = Drive Writes Per Day ? write all of the cells on the entire storage device every 24 hours. [2] Okay, somebody already said to use GPFS replication. ;) -- Stephen > We have always - even back in the day when our metadata was on spinning disk and not SSD - used RAID 1 mirrors and metadata replication of two. However, we are wondering if it might be possible to switch to RAID 5? Specifically, what we are considering doing is buying 8 new SSDs and creating two 3+1P RAID 5 LUNs (metadata replication would stay at two). That would give us 50% more usable space than if we configured those same 8 drives as four RAID 1 mirrors. > > Unfortunately, unless I?m misunderstanding something, mean that the RAID stripe size and the GPFS block size could not match. Therefore, even though we don?t need the space, would we be much better off to buy 10 SSDs and create two 4+1P RAID 5 LUNs? > > I?ve searched the mailing list archives and scanned the DeveloperWorks wiki and even glanced at the GPFS documentation and haven?t found anything that says ?bad idea, Kevin?? ;-) > > Expanding on this further ? 
if we just present those two RAID 5 LUNs to GPFS as NSDs then we can only have two NSD servers as primary for them. So another thing we?re considering is to take those RAID 5 LUNs and further sub-divide them into a total of 8 logical volumes, each of which could be a GPFS NSD and therefore would allow us to have each of our 8 NSD servers be primary for one of them. Even worse idea?!? Good idea? > > Anybody have any better ideas??? ;-) > > Oh, and currently we?re on GPFS 4.2.3-10, but are also planning on moving to GPFS 5.0.1-x before creating the new filesystem. > > Thanks much? > > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and Education > Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Wed Sep 5 23:42:05 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Wed, 5 Sep 2018 18:42:05 -0400 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> Message-ID: <5b172b0f-8a35-ec63-f122-7728aab25564@nasa.gov> I've heard it highly recommended (and have been *really* glad at times to have it) to have at least 2 replicas of metadata to help maintain fs consistency in the event of fs issues or hardware bugs (e.g. a torn write). -Aaron On 9/5/18 1:37 PM, Frederick Stock wrote: > Another option for saving space is to not keep 2 copies of the metadata > within GPFS. ?The SSDs are mirrored so you have two copies though very > likely they share a possible single point of failure and that could be a > deal breaker. ?I have my doubts that RAID5 will perform well for the > reasons Marc described but worth testing to see how it does perform. ?If > you do test I presume you would also run equivalent tests with a RAID1 > (mirrored) configuration. > > Regarding your point about making multiple volumes that would become > GPFS NSDs for metadata. ?It has been my experience that for traditional > RAID systems it is better to have many small metadata LUNs (more IO > paths) then a few large metadata LUNs. ?This becomes less of an issue > with ESS, i.e. there you can have a few metadata NSDs yet still get very > good performance. > > Fred > __________________________________________________ > Fred Stock | IBM Pittsburgh Lab | 720-430-8821 > stockf at us.ibm.com > > > > From: "Marc A Kaplan" > To: gpfsug main discussion list > Date: 09/05/2018 01:22 PM > Subject: Re: [gpfsug-discuss] RAID type for system pool > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------------------------------------------------ > > > > It's good to try to reason and think this out... But there's a good > likelihood that we don't understand ALL the details, some of which may > negatively impact performance - so no matter what scheme you come up > with - test, test, and re-test before deploying and depending on it in > production. > > Having said that, I'm pretty sure that old "spinning" RAID 5 > implementations had horrible performance for GPFS metadata/system pool. > Why? Among other things, the large stripe size vs the almost random > small writes directed to system pool. 
> > That random-small-writes pattern won't change when we go to SSD RAID 5 - > so you'd have to see if the SSD implementation is somehow smarter than > an old fashioned RAID 5 implementation which I believe requires several > physical reads and writes, for each "small" logical write. > (Top decent google result I found quickly > _http://rickardnobel.se/raid-5-write-penalty/_But you will probably want > to do more research!) > > Consider GPFS small write performance for: ?inode updates, log writes, > small files (possibly in inode), directory updates, allocation map > updates, index of indirect blocks. > > > > From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 09/05/2018 11:36 AM > Subject: [gpfsug-discuss] RAID type for system pool > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------------------------------------------------ > > > > Hi All, > > We are in the process of finalizing the purchase of some new storage > arrays (so no sales people who might be monitoring this list need > contact me) to life-cycle some older hardware. ?One of the things we are > considering is the purchase of some new SSD?s for our ?/home? filesystem > and I have a question or two related to that. > > Currently, the existing home filesystem has it?s metadata on SSD?s ? two > RAID 1 mirrors and metadata replication set to two. ?However, the > filesystem itself is old enough that it uses 512 byte inodes. ?We have > analyzed our users files and know that if we create a new filesystem > with 4K inodes that a very significant portion of the files would now > have their _data_ stored in the inode as well due to the files being > 3.5K or smaller (currently all data is on spinning HD RAID 1 mirrors). > > Of course, if we increase the size of the inodes by a factor of 8 then > we also need 8 times as much space to store those inodes. ?Given that > Enterprise class SSDs are still very expensive and our budget is not > unlimited, we?re trying to get the best bang for the buck. > > We have always - even back in the day when our metadata was on spinning > disk and not SSD - used RAID 1 mirrors and metadata replication of two. > ?However, we are wondering if it might be possible to switch to RAID 5? > ?Specifically, what we are considering doing is buying 8 new SSDs and > creating two 3+1P RAID 5 LUNs (metadata replication would stay at two). > ?That would give us 50% more usable space than if we configured those > same 8 drives as four RAID 1 mirrors. > > Unfortunately, unless I?m misunderstanding something, mean that the RAID > stripe size and the GPFS block size could not match. ?Therefore, even > though we don?t need the space, would we be much better off to buy 10 > SSDs and create two 4+1P RAID 5 LUNs? > > I?ve searched the mailing list archives and scanned the DeveloperWorks > wiki and even glanced at the GPFS documentation and haven?t found > anything that says ?bad idea, Kevin?? ;-) > > Expanding on this further ? if we just present those two RAID 5 LUNs to > GPFS as NSDs then we can only have two NSD servers as primary for them. > ?So another thing we?re considering is to take those RAID 5 LUNs and > further sub-divide them into a total of 8 logical volumes, each of which > could be a GPFS NSD and therefore would allow us to have each of our 8 > NSD servers be primary for one of them. ?Even worse idea?!? ?Good idea? > > Anybody have any better ideas??? 
?;-) > > Oh, and currently we?re on GPFS 4.2.3-10, but are also planning on > moving to GPFS 5.0.1-x before creating the new filesystem. > > Thanks much? > > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and > Education_ > __Kevin.Buterbaugh at vanderbilt.edu_ > - (615)875-9633 > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org_ > __http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From Achim.Rehor at de.ibm.com Thu Sep 6 09:15:58 2018 From: Achim.Rehor at de.ibm.com (Achim Rehor) Date: Thu, 6 Sep 2018 08:15:58 +0000 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> Message-ID: Hi Kevin, as you already pointed out, having a RAID stripe size (or a multiple of it) not matching GPFS blocksize, is a bad idea. Every write would cause a read-modify-write operation to keep the parity. So for data LUNs RAID5 with 4+P or 8+P is fully ok. For metadata, if you are keen on performance, I would stay with RAID1, or even RAID0, so you aren?t affected by possible RAID rebuild performance drops. Regards, Achim > Am 05.09.2018 um 17:35 schrieb Buterbaugh, Kevin L : > > Hi All, > > We are in the process of finalizing the purchase of some new storage arrays (so no sales people who might be monitoring this list need contact me) to life-cycle some older hardware. One of the things we are considering is the purchase of some new SSD?s for our ?/home? filesystem and I have a question or two related to that. > > Currently, the existing home filesystem has it?s metadata on SSD?s ? two RAID 1 mirrors and metadata replication set to two. However, the filesystem itself is old enough that it uses 512 byte inodes. We have analyzed our users files and know that if we create a new filesystem with 4K inodes that a very significant portion of the files would now have their _data_ stored in the inode as well due to the files being 3.5K or smaller (currently all data is on spinning HD RAID 1 mirrors). > > Of course, if we increase the size of the inodes by a factor of 8 then we also need 8 times as much space to store those inodes. Given that Enterprise class SSDs are still very expensive and our budget is not unlimited, we?re trying to get the best bang for the buck. > > We have always - even back in the day when our metadata was on spinning disk and not SSD - used RAID 1 mirrors and metadata replication of two. However, we are wondering if it might be possible to switch to RAID 5? Specifically, what we are considering doing is buying 8 new SSDs and creating two 3+1P RAID 5 LUNs (metadata replication would stay at two). That would give us 50% more usable space than if we configured those same 8 drives as four RAID 1 mirrors. > > Unfortunately, unless I?m misunderstanding something, mean that the RAID stripe size and the GPFS block size could not match. 
Therefore, even though we don?t need the space, would we be much better off to buy 10 SSDs and create two 4+1P RAID 5 LUNs? > > I?ve searched the mailing list archives and scanned the DeveloperWorks wiki and even glanced at the GPFS documentation and haven?t found anything that says ?bad idea, Kevin?? ;-) > > Expanding on this further ? if we just present those two RAID 5 LUNs to GPFS as NSDs then we can only have two NSD servers as primary for them. So another thing we?re considering is to take those RAID 5 LUNs and further sub-divide them into a total of 8 logical volumes, each of which could be a GPFS NSD and therefore would allow us to have each of our 8 NSD servers be primary for one of them. Even worse idea?!? Good idea? > > Anybody have any better ideas??? ;-) > > Oh, and currently we?re on GPFS 4.2.3-10, but are also planning on moving to GPFS 5.0.1-x before creating the new filesystem. > > Thanks much? > > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and Education > Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From luis.bolinches at fi.ibm.com Thu Sep 6 09:32:11 2018 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Thu, 6 Sep 2018 08:32:11 +0000 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Thu Sep 6 11:45:39 2018 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Thu, 06 Sep 2018 11:45:39 +0100 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> Message-ID: <1536230739.17046.18.camel@strath.ac.uk> On Wed, 2018-09-05 at 13:37 -0400, Frederick Stock wrote: > Another option for saving space is to not keep 2 copies of the > metadata within GPFS. ?The SSDs are mirrored so you have two copies > though very likely they share a possible single point of failure and > that could be a deal breaker. ?I have my doubts that RAID5 will > perform well for the reasons Marc described but worth testing to see > how it does perform. ?If you do test I presume you would also run > equivalent tests with a RAID1 (mirrored) configuration. > When you have been on the wrong end of a double disk failure in a RAID1 when the second disk failed during the rebuild you probably want to steer clear of such recklessness on a multi TB GPFS file system :-) JAB. -- Jonathan A. Buzzard?????????????????????????Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From Kevin.Buterbaugh at Vanderbilt.Edu Thu Sep 6 15:58:39 2018 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Thu, 6 Sep 2018 14:58:39 +0000 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> Message-ID: <3B1C05ED-3058-4644-BC54-5BD1ED583C88@vanderbilt.edu> Hi All, Wow - my query got more responses than I expected and my sincere thanks to all who took the time to respond! At this point in time we do have two GPFS filesystems ? one which is basically ?/home? and some software installations and the other which is ?/scratch? and ?/data? (former backed up, latter not). Both of them have their metadata on SSDs set up as RAID 1 mirrors and replication set to two. 
But at this point in time all of the SSDs are in a single storage array (albeit with dual redundant controllers) ? so the storage array itself is my only SPOF. As part of the hardware purchase we are in the process of making we will be buying a 2nd storage array that can house 2.5? SSDs. Therefore, we will be splitting our SSDs between chassis and eliminating that last SPOF. Of course, this includes the new SSDs we are getting for our new /home filesystem. Our plan right now is to buy 10 SSDs, which will allow us to test 3 configurations: 1) two 4+1P RAID 5 LUNs split up into a total of 8 LV?s (with each of my 8 NSD servers as primary for one of those LV?s and the other 7 as backups) and GPFS metadata replication set to 2. 2) four RAID 1 mirrors (which obviously leaves 2 SSDs unused) and GPFS metadata replication set to 2. This would mean that only 4 of my 8 NSD servers would be a primary. 3) nine RAID 0 / bare drives with GPFS metadata replication set to 3 (which leaves 1 SSD unused). All 8 NSD servers primary for one SSD and 1 serving up two. The responses I received concerning RAID 5 and performance were not a surprise to me. The main advantage that option gives is the most usable storage space for the money (in fact, it gives us way more storage space than we currently need) ? but if it tanks performance, then that?s a deal breaker. Personally, I like the four RAID 1 mirrors config like we?ve been using for years, but it has the disadvantage of giving us the least usable storage space ? that config would give us the minimum we need for right now, but doesn?t really allow for much future growth. I have no experience with metadata replication of 3 (but had actually thought of that option, so feel good that others suggested it) so option 3 will be a brand new experience for us. It is the most optimal in terms of meeting current needs plus allowing for future growth without giving us way more space than we are likely to need). I will be curious to see how long it takes GPFS to re-replicate the data when we simulate a drive failure as opposed to how long a RAID rebuild takes. I am a big believer in Murphy?s Law (Sunday I paid off a bill, Wednesday my refrigerator died!) ? and also believe that the definition of a pessimist is ?someone with experience? ? so we will definitely not set GPFS metadata replication to less than two, nor will we use non-Enterprise class SSDs for metadata ? but I do still appreciate the suggestions. If there is interest, I will report back on our findings. If anyone has any additional thoughts or suggestions, I?d also appreciate hearing them. Again, thank you! Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Thu Sep 6 16:20:43 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Thu, 6 Sep 2018 11:20:43 -0400 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> Message-ID: Perhaps repeating myself, but consider no-RAID or RAID "0" and -M MaxMetadataReplicas Specifies the default maximum number of copies of inodes, directories, and indirect blocks for a file. Valid values are 1, 2, and 3. This value cannot be less than the value of DefaultMetadataReplicas. The default is 2. SO you can have triple redundancy with no shared physical point of failure. 
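A minimal sketch of what that looks like at file system creation time (the NSD names, device paths and server names below are made up; the point is simply that each bare SSD sits in its own failure group, ideally behind a different enclosure and server pair, so the three metadata copies never share a physical point of failure):

   %nsd: device=/dev/sdx nsd=md_ssd01 servers=nsd01,nsd02 usage=metadataOnly failureGroup=101 pool=system
   %nsd: device=/dev/sdy nsd=md_ssd02 servers=nsd03,nsd04 usage=metadataOnly failureGroup=102 pool=system
   %nsd: device=/dev/sdz nsd=md_ssd03 servers=nsd05,nsd06 usage=metadataOnly failureGroup=103 pool=system

   mmcrnsd -F md_nsd.stanza
   mmcrfs home -F md_nsd.stanza -m 3 -M 3 -r 1 -R 2 ...

Here -m/-M set the default and maximum metadata replicas and -r/-R do the same for data; the data NSDs and the remaining mmcrfs options are omitted from the sketch.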
When you depend a particular RAID controller to do replication and subsequent recovery for you, then you are depending on that RAID controller. Of course, when you take this point of view to the extreme, you realize that for any individual datum you are depending on the single generator or source of that datum being correct, the OS and filesystem software and CPU, etc, etc.... Until you get to the point just beyond where the datum is replicated... -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Thu Sep 6 17:09:10 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Thu, 6 Sep 2018 12:09:10 -0400 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> Message-ID: A somewhat smarter RAID controller will "only" need to read the old values of the single changed segment of data and the corresponding parity segment, and know the new value of the data block. Then it can compute the new parity segment value. Not necessarily the entire stripe. Still 2 reads and 2 writes + access delay times ( guaranteed more than one full rotation time when on spinning disks, average something like 1.7x rotation time ). From: "Uwe Falke" To: gpfsug main discussion list Date: 09/05/2018 04:07 PM Subject: Re: [gpfsug-discuss] RAID type for system pool Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, just think that your RAID controller on parity-backed redundancy needs to read the full stripe, modify it, and write it back (including parity) - the infamous Read-Modify-Write penalty. As long as your users don't bulk-create inodes and doo amend some metadata, (create a file sometimes, e.g.) The writing of a 4k inode, or the update of a 32k dir block causes your controller to read a full block (let's say you use 1MiB on MD) and write back the full block plus parity (on 4+1p RAID 5 at 1MiB that'll be 1.25MiB. Overhead two orders of magnitude above the payload. SSDs have become better now and expensive enterprise SSDs will endure quite a lot of full rewrites, but you need to estimate the MD change rate, apply the RMW overhead and see where you end WRT lifetime (and performance). Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 05/09/2018 17:35 Subject: [gpfsug-discuss] RAID type for system pool Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, We are in the process of finalizing the purchase of some new storage arrays (so no sales people who might be monitoring this list need contact me) to life-cycle some older hardware. One of the things we are considering is the purchase of some new SSD?s for our ?/home? filesystem and I have a question or two related to that. 
Currently, the existing home filesystem has its metadata on SSDs - two RAID 1 mirrors and metadata replication set to two. However, the filesystem itself is old enough that it uses 512 byte inodes. We have analyzed our users' files and know that if we create a new filesystem with 4K inodes, a very significant portion of the files would now have their _data_ stored in the inode as well, due to the files being 3.5K or smaller (currently all data is on spinning HD RAID 1 mirrors). Of course, if we increase the size of the inodes by a factor of 8, then we also need 8 times as much space to store those inodes. Given that Enterprise class SSDs are still very expensive and our budget is not unlimited, we're trying to get the best bang for the buck.

We have always - even back in the day when our metadata was on spinning disk and not SSD - used RAID 1 mirrors and metadata replication of two. However, we are wondering if it might be possible to switch to RAID 5? Specifically, what we are considering doing is buying 8 new SSDs and creating two 3+1P RAID 5 LUNs (metadata replication would stay at two). That would give us 50% more usable space than if we configured those same 8 drives as four RAID 1 mirrors. Unfortunately, unless I'm misunderstanding something, that would mean that the RAID stripe size and the GPFS block size could not match. Therefore, even though we don't need the space, would we be much better off to buy 10 SSDs and create two 4+1P RAID 5 LUNs? I've searched the mailing list archives, scanned the DeveloperWorks wiki, and even glanced at the GPFS documentation, and haven't found anything that says "bad idea, Kevin!" ;-)

Expanding on this further - if we just present those two RAID 5 LUNs to GPFS as NSDs, then we can only have two NSD servers as primary for them. So another thing we're considering is to take those RAID 5 LUNs and further sub-divide them into a total of 8 logical volumes, each of which could be a GPFS NSD, which would allow us to have each of our 8 NSD servers be primary for one of them. Even worse idea?!? Good idea? Anybody have any better ideas??? ;-)

Oh, and currently we're on GPFS 4.2.3-10, but are also planning on moving to GPFS 5.0.1-x before creating the new filesystem.

Thanks much...

--
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From bbanister at jumptrading.com Thu Sep 6 17:19:00 2018
From: bbanister at jumptrading.com (Bryan Banister)
Date: Thu, 6 Sep 2018 16:19:00 +0000
Subject: [gpfsug-discuss] RAID type for system pool
In-Reply-To: <3B1C05ED-3058-4644-BC54-5BD1ED583C88@vanderbilt.edu>
References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> <3B1C05ED-3058-4644-BC54-5BD1ED583C88@vanderbilt.edu>
Message-ID: <269a9d7aeb3f4597adb22dfb3c2d8365@jumptrading.com>

I have questions about how the GPFS metadata replication of 3 works.

1. Is it basically the same as replication of 2 but just have one more copy, making recovery much more likely?

2.
If there is nothing that is checking that the data was correctly read off of the device (e.g. CRC checking ON READS like the DDNs do, T10PI or Data Integrity Field), then how does GPFS handle a corrupted read of the data? - unlikely with SSD, but a head could be off on a NLSAS read, no errors, but you get some garbage instead, plus no auto retries

3. Does GPFS read at least two of the three replicas and compare them to ensure the data is correct? - expensive operation, so very unlikely

4. If not reading multiple replicas for comparison, are reads round robin across all three copies?

5. If one replica is corrupted (bad blocks), what does GPFS do to recover this metadata copy? Is this automatic or does this require a manual `mmrestripefs -c` operation or something? - If not, seems like a pretty simple idea and maybe an RFE worthy submission

6. Would the idea of an option to run "background scrub/verifies" of the data/metadata be worthwhile to ensure no hidden bad blocks? - Using QoS this should be relatively painless

7. With a drive failure do you have to delete the NSD from the file system and cluster, recreate the NSD, add it back to the FS, then again run the `mmrestripefs -c` operation to restore the replication? - As Kevin mentions, this will end up being a FULL file system scan vs. a block-based scan and replication. That could take a long time depending on the number of inodes and type of storage!

Thanks for any insight,
-Bryan

From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Buterbaugh, Kevin L
Sent: Thursday, September 6, 2018 9:59 AM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] RAID type for system pool

Note: External Email
________________________________

Hi All,

Wow - my query got more responses than I expected and my sincere thanks to all who took the time to respond!

At this point in time we do have two GPFS filesystems - one which is basically "/home" and some software installations, and the other which is "/scratch" and "/data" (former backed up, latter not). Both of them have their metadata on SSDs set up as RAID 1 mirrors and replication set to two. But at this point in time all of the SSDs are in a single storage array (albeit with dual redundant controllers) - so the storage array itself is my only SPOF.

As part of the hardware purchase we are in the process of making, we will be buying a 2nd storage array that can house 2.5" SSDs. Therefore, we will be splitting our SSDs between chassis and eliminating that last SPOF. Of course, this includes the new SSDs we are getting for our new /home filesystem.

Our plan right now is to buy 10 SSDs, which will allow us to test 3 configurations:

1) two 4+1P RAID 5 LUNs split up into a total of 8 LVs (with each of my 8 NSD servers as primary for one of those LVs and the other 7 as backups) and GPFS metadata replication set to 2.

2) four RAID 1 mirrors (which obviously leaves 2 SSDs unused) and GPFS metadata replication set to 2. This would mean that only 4 of my 8 NSD servers would be a primary.

3) nine RAID 0 / bare drives with GPFS metadata replication set to 3 (which leaves 1 SSD unused). All 8 NSD servers primary for one SSD and 1 serving up two.

The responses I received concerning RAID 5 and performance were not a surprise to me. The main advantage that option gives is the most usable storage space for the money (in fact, it gives us way more storage space than we currently need) - but if it tanks performance, then that's a deal breaker.
Personally, I like the four RAID 1 mirrors config like we?ve been using for years, but it has the disadvantage of giving us the least usable storage space ? that config would give us the minimum we need for right now, but doesn?t really allow for much future growth. I have no experience with metadata replication of 3 (but had actually thought of that option, so feel good that others suggested it) so option 3 will be a brand new experience for us. It is the most optimal in terms of meeting current needs plus allowing for future growth without giving us way more space than we are likely to need). I will be curious to see how long it takes GPFS to re-replicate the data when we simulate a drive failure as opposed to how long a RAID rebuild takes. I am a big believer in Murphy?s Law (Sunday I paid off a bill, Wednesday my refrigerator died!) ? and also believe that the definition of a pessimist is ?someone with experience? ? so we will definitely not set GPFS metadata replication to less than two, nor will we use non-Enterprise class SSDs for metadata ? but I do still appreciate the suggestions. If there is interest, I will report back on our findings. If anyone has any additional thoughts or suggestions, I?d also appreciate hearing them. Again, thank you! Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential, or privileged information and/or personal data. If you are not the intended recipient, you are hereby notified that any review, dissemination, or copying of this email is strictly prohibited, and requested to notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request, or solicitation of any kind to buy, sell, subscribe, redeem, or perform any type of transaction of a financial product. Personal data, as defined by applicable data privacy laws, contained in this email may be processed by the Company, and any of its affiliated or related companies, for potential ongoing compliance and/or business-related purposes. You may have rights regarding your personal data; for information on exercising these rights or the Company?s treatment of personal data, please email datarequests at jumptrading.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Thu Sep 6 18:06:17 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Thu, 6 Sep 2018 13:06:17 -0400 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: <269a9d7aeb3f4597adb22dfb3c2d8365@jumptrading.com> References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> <3B1C05ED-3058-4644-BC54-5BD1ED583C88@vanderbilt.edu> <269a9d7aeb3f4597adb22dfb3c2d8365@jumptrading.com> Message-ID: Answers inline based on my recollection of experiences we've had here: On 9/6/18 12:19 PM, Bryan Banister wrote: > I have questions about how the GPFS metadata replication of 3 works. > > 1. Is it basically the same as replication of 2 but just have one more > copy, making recovery much more likely? 
That's my understanding. > 2. If there is nothing that is checking that the data was correctly > read off of the device (e.g. CRC checking ON READS like the DDNs do, > T10PI or Data Integrity Field) then how does GPFS handle a corrupted > read of the data? > - unlikely with SSD but head could be off on a NLSAS read, no > errors, but you get some garbage instead, plus no auto retries The inode itself is checksummed: # /usr/lpp/mmfs/bin/tsdbfs mysuperawesomespacefs Enter command or null to read next sector. Type ? for help. inode 20087366 Inode 20087366 [20087366] snap 0 (index 582 in block 9808): Inode address: 30:263275078 32:263264838 size 512 nAddrs 32 indirectionLevel=3 status=USERFILE objectVersion=49352 generation=0x2B519B3 nlink=1 owner uid=8675309 gid=999 mode=0200100600: -rw------- blocksize code=5 (32 subblocks) lastBlockSubblocks=1 checksum=0xF2EF3427 is Valid ... Disk pointers [32]: 0: 31:217629376 1: 30:217632960 2: (null) ... 31: (null) as are indirect blocks (I'm sure that's not an exhaustive list of checksummed metadata structures): ind 31:217629376 Indirect block starting in sector 31:217629376: magic=0x112DF307 generation=0x2B519B3 blockNum=0 inodeNum=20087366 indirection level=2 checksum=0x6BDAA92A CalcChecksum(0x5B6DC9FC000, 32768, 20)=0x6BDAA92A Data pointers: > 3. Does GPFS read at least two of the three replicas and compares them > to ensure the data is correct? > - expensive operation, so very unlikely I don't know, but I do know it verifies the checksum and I believe if that's wrong it will try another replica. > 4. If not reading multiple replicas for comparison, are reads round > robin across all three copies? I feel like we see pretty even distribution of reads across all replicas of our metadata LUNs, although this is looking overall at the array level so it may be a red herring. > 5. If one replica is corrupted (bad blocks) what does GPFS do to > recover this metadata copy?? Is this automatic or does this require > a manual `mmrestripefs -c` operation or something? > - If not, seems like a pretty simple idea and maybe an RFE worthy > submission My experience has been it will attempt to correct it (and maybe log an fsstruct error?). This was in the 3.5 days, though. > 6. Would the idea of an option to run ?background scrub/verifies? of > the data/metadata be worthwhile to ensure no hidden bad blocks? > - Using QoS this should be relatively painless If you don't have array-level background scrubbing, this is what I'd suggest. (e.g. mmrestripefs -c --metadata-only). > 7. With a drive failure do you have to delete the NSD from the file > system and cluster, recreate the NSD, add it back to the FS, then > again run the `mmrestripefs -c` operation to restore the replication? > - As Kevin mentions this will end up being a FULL file system scan > vs. a block-based scan and replication.? That could take a long time > depending on number of inodes and type of storage! > > Thanks for any insight, > > -Bryan > > *From:* gpfsug-discuss-bounces at spectrumscale.org > *On Behalf Of *Buterbaugh, > Kevin L > *Sent:* Thursday, September 6, 2018 9:59 AM > *To:* gpfsug main discussion list > *Subject:* Re: [gpfsug-discuss] RAID type for system pool > > /Note: External Email/ > > ------------------------------------------------------------------------ > > Hi All, > > Wow - my query got more responses than I expected and my sincere thanks > to all who took the time to respond! > > At this point in time we do have two GPFS filesystems ? one which is > basically ?/home? 
and some software installations and the other which is > ?/scratch? and ?/data? (former backed up, latter not). ?Both of them > have their metadata on SSDs set up as RAID 1 mirrors and replication set > to two. ?But at this point in time all of the SSDs are in a single > storage array (albeit with dual redundant controllers) ? so the storage > array itself is my only SPOF. > > As part of the hardware purchase we are in the process of making we will > be buying a 2nd storage array that can house 2.5? SSDs. ?Therefore, we > will be splitting our SSDs between chassis and eliminating that last > SPOF. ?Of course, this includes the new SSDs we are getting for our new > /home filesystem. > > Our plan right now is to buy 10 SSDs, which will allow us to test 3 > configurations: > > 1) two 4+1P RAID 5 LUNs split up into a total of 8 LV?s (with each of my > 8 NSD servers as primary for one of those LV?s and the other 7 as > backups) and GPFS metadata replication set to 2. > > 2) four RAID 1 mirrors (which obviously leaves 2 SSDs unused) and GPFS > metadata replication set to 2. ?This would mean that only 4 of my 8 NSD > servers would be a primary. > > 3) nine RAID 0 / bare drives with GPFS metadata replication set to 3 > (which leaves 1 SSD unused). ?All 8 NSD servers primary for one SSD and > 1 serving up two. > > The responses I received concerning RAID 5 and performance were not a > surprise to me. ?The main advantage that option gives is the most usable > storage space for the money (in fact, it gives us way more storage space > than we currently need) ? but if it tanks performance, then that?s a > deal breaker. > > Personally, I like the four RAID 1 mirrors config like we?ve been using > for years, but it has the disadvantage of giving us the least usable > storage space ? that config would give us the minimum we need for right > now, but doesn?t really allow for much future growth. > > I have no experience with metadata replication of 3 (but had actually > thought of that option, so feel good that others suggested it) so option > 3 will be a brand new experience for us. ?It is the most optimal in > terms of meeting current needs plus allowing for future growth without > giving us way more space than we are likely to need). ?I will be curious > to see how long it takes GPFS to re-replicate the data when we simulate > a drive failure as opposed to how long a RAID rebuild takes. > > I am a big believer in Murphy?s Law (Sunday I paid off a bill, Wednesday > my refrigerator died!) ? and also believe that the definition of a > pessimist is ?someone with experience? ? so we will definitely > not set GPFS metadata replication to less than two, nor will we use > non-Enterprise class SSDs for metadata ? but I do still appreciate the > suggestions. > > If there is interest, I will report back on our findings. ?If anyone has > any additional thoughts or suggestions, I?d also appreciate hearing > them. ?Again, thank you! > > Kevin > > ? > > Kevin Buterbaugh - Senior System Administrator > > Vanderbilt University - Advanced Computing Center for Research and Education > > Kevin.Buterbaugh at vanderbilt.edu > ?- (615)875-9633 > > > ------------------------------------------------------------------------ > > Note: This email is for the confidential use of the named addressee(s) > only and may contain proprietary, confidential, or privileged > information and/or personal data. 
If you are not the intended recipient, > you are hereby notified that any review, dissemination, or copying of > this email is strictly prohibited, and requested to notify the sender > immediately and destroy this email and any attachments. Email > transmission cannot be guaranteed to be secure or error-free. The > Company, therefore, does not make any guarantees as to the completeness > or accuracy of this email or any attachments. This email is for > informational purposes only and does not constitute a recommendation, > offer, request, or solicitation of any kind to buy, sell, subscribe, > redeem, or perform any type of transaction of a financial product. > Personal data, as defined by applicable data privacy laws, contained in > this email may be processed by the Company, and any of its affiliated or > related companies, for potential ongoing compliance and/or > business-related purposes. You may have rights regarding your personal > data; for information on exercising these rights or the Company?s > treatment of personal data, please email datarequests at jumptrading.com. > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From S.J.Thompson at bham.ac.uk Thu Sep 6 18:49:25 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Thu, 6 Sep 2018 17:49:25 +0000 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> <3B1C05ED-3058-4644-BC54-5BD1ED583C88@vanderbilt.edu> <269a9d7aeb3f4597adb22dfb3c2d8365@jumptrading.com>, Message-ID: I thought reads were always round robin's (in some form) unless you set readreplicapolicy. And I thought with fsstruct you had to use mmfsck offline to fix. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [aaron.s.knister at nasa.gov] Sent: 06 September 2018 18:06 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] RAID type for system pool Answers inline based on my recollection of experiences we've had here: On 9/6/18 12:19 PM, Bryan Banister wrote: > I have questions about how the GPFS metadata replication of 3 works. > > 1. Is it basically the same as replication of 2 but just have one more > copy, making recovery much more likely? That's my understanding. > 2. If there is nothing that is checking that the data was correctly > read off of the device (e.g. CRC checking ON READS like the DDNs do, > T10PI or Data Integrity Field) then how does GPFS handle a corrupted > read of the data? > - unlikely with SSD but head could be off on a NLSAS read, no > errors, but you get some garbage instead, plus no auto retries The inode itself is checksummed: # /usr/lpp/mmfs/bin/tsdbfs mysuperawesomespacefs Enter command or null to read next sector. Type ? for help. inode 20087366 Inode 20087366 [20087366] snap 0 (index 582 in block 9808): Inode address: 30:263275078 32:263264838 size 512 nAddrs 32 indirectionLevel=3 status=USERFILE objectVersion=49352 generation=0x2B519B3 nlink=1 owner uid=8675309 gid=999 mode=0200100600: -rw------- blocksize code=5 (32 subblocks) lastBlockSubblocks=1 checksum=0xF2EF3427 is Valid ... Disk pointers [32]: 0: 31:217629376 1: 30:217632960 2: (null) ... 
31: (null) as are indirect blocks (I'm sure that's not an exhaustive list of checksummed metadata structures): ind 31:217629376 Indirect block starting in sector 31:217629376: magic=0x112DF307 generation=0x2B519B3 blockNum=0 inodeNum=20087366 indirection level=2 checksum=0x6BDAA92A CalcChecksum(0x5B6DC9FC000, 32768, 20)=0x6BDAA92A Data pointers: > 3. Does GPFS read at least two of the three replicas and compares them > to ensure the data is correct? > - expensive operation, so very unlikely I don't know, but I do know it verifies the checksum and I believe if that's wrong it will try another replica. > 4. If not reading multiple replicas for comparison, are reads round > robin across all three copies? I feel like we see pretty even distribution of reads across all replicas of our metadata LUNs, although this is looking overall at the array level so it may be a red herring. > 5. If one replica is corrupted (bad blocks) what does GPFS do to > recover this metadata copy? Is this automatic or does this require > a manual `mmrestripefs -c` operation or something? > - If not, seems like a pretty simple idea and maybe an RFE worthy > submission My experience has been it will attempt to correct it (and maybe log an fsstruct error?). This was in the 3.5 days, though. > 6. Would the idea of an option to run ?background scrub/verifies? of > the data/metadata be worthwhile to ensure no hidden bad blocks? > - Using QoS this should be relatively painless If you don't have array-level background scrubbing, this is what I'd suggest. (e.g. mmrestripefs -c --metadata-only). > 7. With a drive failure do you have to delete the NSD from the file > system and cluster, recreate the NSD, add it back to the FS, then > again run the `mmrestripefs -c` operation to restore the replication? > - As Kevin mentions this will end up being a FULL file system scan > vs. a block-based scan and replication. That could take a long time > depending on number of inodes and type of storage! > > Thanks for any insight, > > -Bryan > > *From:* gpfsug-discuss-bounces at spectrumscale.org > *On Behalf Of *Buterbaugh, > Kevin L > *Sent:* Thursday, September 6, 2018 9:59 AM > *To:* gpfsug main discussion list > *Subject:* Re: [gpfsug-discuss] RAID type for system pool > > /Note: External Email/ > > ------------------------------------------------------------------------ > > Hi All, > > Wow - my query got more responses than I expected and my sincere thanks > to all who took the time to respond! > > At this point in time we do have two GPFS filesystems ? one which is > basically ?/home? and some software installations and the other which is > ?/scratch? and ?/data? (former backed up, latter not). Both of them > have their metadata on SSDs set up as RAID 1 mirrors and replication set > to two. But at this point in time all of the SSDs are in a single > storage array (albeit with dual redundant controllers) ? so the storage > array itself is my only SPOF. > > As part of the hardware purchase we are in the process of making we will > be buying a 2nd storage array that can house 2.5? SSDs. Therefore, we > will be splitting our SSDs between chassis and eliminating that last > SPOF. Of course, this includes the new SSDs we are getting for our new > /home filesystem. 
> > Our plan right now is to buy 10 SSDs, which will allow us to test 3 > configurations: > > 1) two 4+1P RAID 5 LUNs split up into a total of 8 LV?s (with each of my > 8 NSD servers as primary for one of those LV?s and the other 7 as > backups) and GPFS metadata replication set to 2. > > 2) four RAID 1 mirrors (which obviously leaves 2 SSDs unused) and GPFS > metadata replication set to 2. This would mean that only 4 of my 8 NSD > servers would be a primary. > > 3) nine RAID 0 / bare drives with GPFS metadata replication set to 3 > (which leaves 1 SSD unused). All 8 NSD servers primary for one SSD and > 1 serving up two. > > The responses I received concerning RAID 5 and performance were not a > surprise to me. The main advantage that option gives is the most usable > storage space for the money (in fact, it gives us way more storage space > than we currently need) ? but if it tanks performance, then that?s a > deal breaker. > > Personally, I like the four RAID 1 mirrors config like we?ve been using > for years, but it has the disadvantage of giving us the least usable > storage space ? that config would give us the minimum we need for right > now, but doesn?t really allow for much future growth. > > I have no experience with metadata replication of 3 (but had actually > thought of that option, so feel good that others suggested it) so option > 3 will be a brand new experience for us. It is the most optimal in > terms of meeting current needs plus allowing for future growth without > giving us way more space than we are likely to need). I will be curious > to see how long it takes GPFS to re-replicate the data when we simulate > a drive failure as opposed to how long a RAID rebuild takes. > > I am a big believer in Murphy?s Law (Sunday I paid off a bill, Wednesday > my refrigerator died!) ? and also believe that the definition of a > pessimist is ?someone with experience? ? so we will definitely > not set GPFS metadata replication to less than two, nor will we use > non-Enterprise class SSDs for metadata ? but I do still appreciate the > suggestions. > > If there is interest, I will report back on our findings. If anyone has > any additional thoughts or suggestions, I?d also appreciate hearing > them. Again, thank you! > > Kevin > > ? > > Kevin Buterbaugh - Senior System Administrator > > Vanderbilt University - Advanced Computing Center for Research and Education > > Kevin.Buterbaugh at vanderbilt.edu > ?- (615)875-9633 > > > ------------------------------------------------------------------------ > > Note: This email is for the confidential use of the named addressee(s) > only and may contain proprietary, confidential, or privileged > information and/or personal data. If you are not the intended recipient, > you are hereby notified that any review, dissemination, or copying of > this email is strictly prohibited, and requested to notify the sender > immediately and destroy this email and any attachments. Email > transmission cannot be guaranteed to be secure or error-free. The > Company, therefore, does not make any guarantees as to the completeness > or accuracy of this email or any attachments. This email is for > informational purposes only and does not constitute a recommendation, > offer, request, or solicitation of any kind to buy, sell, subscribe, > redeem, or perform any type of transaction of a financial product. 
> Personal data, as defined by applicable data privacy laws, contained in > this email may be processed by the Company, and any of its affiliated or > related companies, for potential ongoing compliance and/or > business-related purposes. You may have rights regarding your personal > data; for information on exercising these rights or the Company?s > treatment of personal data, please email datarequests at jumptrading.com. > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From chair at spectrumscale.org Fri Sep 7 11:00:08 2018 From: chair at spectrumscale.org (Simon Thompson (Spectrum Scale User Group Chair)) Date: Fri, 07 Sep 2018 11:00:08 +0100 Subject: [gpfsug-discuss] Request for Enhancements (RFE) Forum - Submission Deadline October 1 In-Reply-To: <263e53c18647421f8b3cd936da0075df@jumptrading.com> References: <52220937-CE0A-4949-89A0-6EA41D5ECF93@lbl.gov> <263e53c18647421f8b3cd936da0075df@jumptrading.com> Message-ID: <0341213A-6CB7-434F-A575-9099C2D0D703@spectrumscale.org> GPFS/Spectrum Scale Users, Here?s a long-ish note about our plans to try and improve the RFE process. We?ve tried to include a tl;dr version if you just read the headers. You?ll find the details underneath ;-) and reading to the end is ideal. IMPROVING THE RFE PROCESS As you?ve heard on the list, and at some of the in-person User Group events, we?ve been talking about ways we can improve the RFE process. We?d like to begin having an RFE forum, and have it be de-coupled from the in-person events because we know not everyone can travel. LIGHTNING PRESENTATIONS ON-LINE In general terms, we?d have regular on-line events, where RFEs could be very briefly (5 minutes, lightning talk) presented by the requester. There would then be time for brief follow-on discussion and questions. The session would be recorded to deal with large time zone differences. The live meeting is planned for October 10th 2018, at 4PM BST (that should be 11am EST if we worked is out right!) FOLLOW UP POLL A poll, independent of current RFE voting, would be conducted a couple days after the recording was available to gather votes and feedback on the RFEs submitted ?we may collect site name, to see how many votes are coming from a certain site. MAY NOT GET IT RIGHT THE FIRST TIME We view this supplemental RFE process as organic, that is, we?ll learn as we go and make modifications. The overall goal here is to highlight the RFEs that matter the most to the largest number of UG members by providing a venue for people to speak about their RFEs and collect feedback from fellow community members. RFE PRESENTERS WANTED, SUBMISSION DEADLINE OCTOBER 1ST We?d like to guide a small handful of RFE submitters through this process the first time around, so if you?re interested in being a presenter, let us know now. We?re planning on doing the online meeting and poll for the first time in mid-October, so the submission deadline for your RFE is October 1st. If it?s useful, when you?re drafting your RFE feel free to use the list as a sounding board for feedback. 
Often sites have similar needs and you may find someone to collaborate with on your RFE to make it useful to more sites, and thereby get more votes. Some guidelines are here: https://drive.google.com/file/d/1o8nN39DTU32qj_EFia5wRhnvfvNfr3cI/view?usp=sharing You can submit you RFE by email to: rfe at spectrumscaleug.org PARTICIPANTS (AKA YOU!!), VIEW AND VOTE We are seeking very good participation in the RFE on-line events needed to make this an effective method of Spectrum Scale Community and IBM Developer collaboration. It is to your benefit to participate and help set priorities on Spectrum Scale enhancements!! We want to make this process light lifting for you as a participant. We will limit the duration of the meeting to 1 hour to minimize the use of your valuable time. Please register for the online meeting via Eventbrite (https://www.eventbrite.com/e/spectrum-scale-request-for-enhancements-voting-tickets-49979954389) ? we?ll send details of how to join the online meeting nearer the time. Thanks! Simon, Kristy, Bob, Bryan and Carl! -------------- next part -------------- An HTML attachment was scrubbed... URL: From Matthias.Knigge at rohde-schwarz.com Fri Sep 7 12:51:15 2018 From: Matthias.Knigge at rohde-schwarz.com (Matthias Knigge) Date: Fri, 7 Sep 2018 11:51:15 +0000 Subject: [gpfsug-discuss] Problem with mmlscluster and callback scripts Message-ID: Hello together, I am using the version 5.0.2.0 of GPFS and have problems with the command mmlscluster and callback-scripts. It is a small cluster of two nodes only. If I shutdown one of the nodes sometimes mmlscluster reports the following output: [root at gpfs-tier1 gpfs5.2]# mmgetstate Node number Node name GPFS state ------------------------------------------- 1 gpfs-tier1 arbitrating [root at gpfs-tier1 gpfs5.2]# mmlscluster ssh: connect to host gpfs-tier2 port 22: No route to host mmlscluster: Unable to retrieve GPFS cluster files from node gpfs-tier2 mmlscluster: Command failed. Examine previous error messages to determine cause. Normally the output is like this: [root at gpfs-tier1 gpfs5.2]# mmlscluster GPFS cluster information ======================== GPFS cluster name: TIERCLUSTER.gpfs-tier1 GPFS cluster id: 12458173498278694815 GPFS UID domain: TIERCLUSTER.gpfs-tier1 Remote shell command: /usr/bin/ssh Remote file copy command: /usr/bin/scp Repository type: server-based GPFS cluster configuration servers: ----------------------------------- Primary server: gpfs-tier2 Secondary server: gpfs-tier1 Node Daemon node name IP address Admin node name Designation ---------------------------------------------------------------------- 1 gpfs-tier1 192.168.178.10 gpfs-tier1 quorum-manager 2 gpfs-tier2 192.168.178.11 gpfs-tier2 quorum-manager [root at gpfs-tier1 gpfs5.2]# mmlscallback NodeDownCallback command = /var/mmfs/rs/nodedown.ksh priority = 1 event = quorumNodeLeave parms = %eventNode %quorumNodes NodeUpCallback command = /var/mmfs/rs/nodeup.ksh priority = 1 event = quorumNodeJoin parms = %eventNode %quorumNodes If I shutdown the filesystem via mmshutdown the callback script works but if I shutdown the whole node the scripts does not run. The latest log-entry in mmfs.log.latest shows only this information: 2018-09-07_13:12:36.724+0200: [I] Cluster Manager connection broke. Probing cluster TIERCLUSTER.gpfs-tier1 2018-09-07_13:12:37.226+0200: [E] Unable to contact enough other quorum nodes during cluster probe. 2018-09-07_13:12:37.226+0200: [E] Lost membership in cluster TIERCLUSTER.gpfs-tier1. Unmounting file systems. 
2018-09-07_13:12:38.448+0200: [N] Connecting to 192.168.178.11 gpfs-tier2 Could anybody help me in this case? I want to try to start a script if one node goes down or up to change the roles for starting the filesystem. The callback event NodeLeave and NodeJoin do not run too. Any more information required? If yes, please let me know! Many thanks in advance and a nice weekend! Matthias Best Regards Matthias Knigge R&D File Based Media Solutions Rohde & Schwarz GmbH & Co. KG Hanomaghof 1 30449 Hannover Telefon +49 511 67 80 7 213 Fax +49 511 37 19 74 Internet: Matthias.Knigge at rohde-schwarz.com ------------------------------------------------------------ Gesch?ftsf?hrung / Executive Board: Christian Leicher (Vorsitzender / Chairman), Peter Riedel, Sitz der Gesellschaft / Company's Place of Business: M?nchen, Registereintrag / Commercial Register No.: HRA 16 270, Pers?nlich haftender Gesellschafter / Personally Liable Partner: RUSEG Verwaltungs-GmbH, Sitz der Gesellschaft / Company's Place of Business: M?nchen, Registereintrag / Commercial Register No.: HRB 7 534, Umsatzsteuer-Identifikationsnummer (USt-IdNr.) / VAT Identification No.: DE 130 256 683, Elektro-Altger?te Register (EAR) / WEEE Register No.: DE 240 437 86 -------------- next part -------------- An HTML attachment was scrubbed... URL: From stockf at us.ibm.com Fri Sep 7 14:19:51 2018 From: stockf at us.ibm.com (Frederick Stock) Date: Fri, 7 Sep 2018 09:19:51 -0400 Subject: [gpfsug-discuss] Problem with mmlscluster and callback scripts In-Reply-To: References: Message-ID: Are you really running version 5.0.2? If so then I presume you have a beta version since it has not yet been released. For beta problems there is a specific feedback mechanism that should be used to report problems. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: Matthias Knigge To: "gpfsug-discuss at spectrumscale.org" Date: 09/07/2018 08:08 AM Subject: [gpfsug-discuss] Problem with mmlscluster and callback scripts Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello together, I am using the version 5.0.2.0 of GPFS and have problems with the command mmlscluster and callback-scripts. It is a small cluster of two nodes only. If I shutdown one of the nodes sometimes mmlscluster reports the following output: [root at gpfs-tier1 gpfs5.2]# mmgetstate Node number Node name GPFS state ------------------------------------------- 1 gpfs-tier1 arbitrating [root at gpfs-tier1 gpfs5.2]# mmlscluster ssh: connect to host gpfs-tier2 port 22: No route to host mmlscluster: Unable to retrieve GPFS cluster files from node gpfs-tier2 mmlscluster: Command failed. Examine previous error messages to determine cause. 
Normally the output is like this: [root at gpfs-tier1 gpfs5.2]# mmlscluster GPFS cluster information ======================== GPFS cluster name: TIERCLUSTER.gpfs-tier1 GPFS cluster id: 12458173498278694815 GPFS UID domain: TIERCLUSTER.gpfs-tier1 Remote shell command: /usr/bin/ssh Remote file copy command: /usr/bin/scp Repository type: server-based GPFS cluster configuration servers: ----------------------------------- Primary server: gpfs-tier2 Secondary server: gpfs-tier1 Node Daemon node name IP address Admin node name Designation ---------------------------------------------------------------------- 1 gpfs-tier1 192.168.178.10 gpfs-tier1 quorum-manager 2 gpfs-tier2 192.168.178.11 gpfs-tier2 quorum-manager [root at gpfs-tier1 gpfs5.2]# mmlscallback NodeDownCallback command = /var/mmfs/rs/nodedown.ksh priority = 1 event = quorumNodeLeave parms = %eventNode %quorumNodes NodeUpCallback command = /var/mmfs/rs/nodeup.ksh priority = 1 event = quorumNodeJoin parms = %eventNode %quorumNodes If I shutdown the filesystem via mmshutdown the callback script works but if I shutdown the whole node the scripts does not run. The latest log-entry in mmfs.log.latest shows only this information: 2018-09-07_13:12:36.724+0200: [I] Cluster Manager connection broke. Probing cluster TIERCLUSTER.gpfs-tier1 2018-09-07_13:12:37.226+0200: [E] Unable to contact enough other quorum nodes during cluster probe. 2018-09-07_13:12:37.226+0200: [E] Lost membership in cluster TIERCLUSTER.gpfs-tier1. Unmounting file systems. 2018-09-07_13:12:38.448+0200: [N] Connecting to 192.168.178.11 gpfs-tier2 Could anybody help me in this case? I want to try to start a script if one node goes down or up to change the roles for starting the filesystem. The callback event NodeLeave and NodeJoin do not run too. Any more information required? If yes, please let me know! Many thanks in advance and a nice weekend! Matthias Best Regards Matthias Knigge R&D File Based Media Solutions Rohde & Schwarz GmbH & Co. KG Hanomaghof 1 30449 Hannover Telefon +49 511 67 80 7 213 Fax +49 511 37 19 74 Internet: Matthias.Knigge at rohde-schwarz.com ------------------------------------------------------------ Gesch?ftsf?hrung / Executive Board: Christian Leicher (Vorsitzender / Chairman), Peter Riedel, Sitz der Gesellschaft / Company's Place of Business: M?nchen, Registereintrag / Commercial Register No.: HRA 16 270, Pers?nlich haftender Gesellschafter / Personally Liable Partner: RUSEG Verwaltungs-GmbH, Sitz der Gesellschaft / Company's Place of Business: M?nchen, Registereintrag / Commercial Register No.: HRB 7 534, Umsatzsteuer-Identifikationsnummer (USt-IdNr.) / VAT Identification No.: DE 130 256 683, Elektro-Altger?te Register (EAR) / WEEE Register No.: DE 240 437 86 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Fri Sep 7 14:35:24 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 7 Sep 2018 09:35:24 -0400 Subject: [gpfsug-discuss] Problem with mmlscluster and callback scripts In-Reply-To: References: Message-ID: <44c29793-ef1a-8b58-2ad0-75c8328d9364@nasa.gov> Hi Matthias, Looks like you lost quorum in the cluster (you've got to have (n/2+1) quorum nodes up if you're using node-based quorum). Do you have a tiebreaker disk defined? (i.e. mmlsconfig tiebreakerdisk). 
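If not, something along these lines is the usual approach for a two-node cluster (the NSD name below is invented, and depending on the code level you may need GPFS down on the quorum nodes while you change it - check the mmchconfig documentation for your release first):

  # see whether tiebreaker disks are already defined
  mmlsconfig tiebreakerDisks

  # nominate one (or three) NSDs that both nodes can see as tiebreakers,
  # so one surviving quorum node plus the disk(s) can keep quorum
  mmchconfig tiebreakerDisks="tier_nsd_01"

  # to go back to plain node-based quorum later
  mmchconfig tiebreakerDisks=no

With only two quorum nodes and no tiebreaker, losing either node means losing quorum, which would explain why the surviving node sits in "arbitrating" and might also explain why your quorumNodeLeave callback never seems to get a chance to run.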
-Aaron On 9/7/18 7:51 AM, Matthias Knigge wrote: > Hello together, > > I am using the version 5.0.2.0 of GPFS and have problems with the > command mmlscluster and callback-scripts. It is a small cluster of two > nodes only. If I shutdown one of the nodes sometimes mmlscluster reports > the following output: > > [root at gpfs-tier1 gpfs5.2]# mmgetstate > > Node number? Node name??????? GPFS state > > ------------------------------------------- > > ?????? 1????? gpfs-tier1?????? arbitrating > > [root at gpfs-tier1 gpfs5.2]# mmlscluster > > ssh: connect to host gpfs-tier2 port 22: No route to host > > mmlscluster: Unable to retrieve GPFS cluster files from node gpfs-tier2 > > mmlscluster: Command failed. Examine previous error messages to > determine cause. > > Normally the output is like this: > > [root at gpfs-tier1 gpfs5.2]# mmlscluster > > GPFS cluster information > > ======================== > > ? GPFS cluster name:???????? TIERCLUSTER.gpfs-tier1 > > ? GPFS cluster id:?????????? 12458173498278694815 > > ? GPFS UID domain:?????????? TIERCLUSTER.gpfs-tier1 > > ? Remote shell command:????? /usr/bin/ssh > > ? Remote file copy command:? /usr/bin/scp > > ? Repository type:?????????? server-based > > GPFS cluster configuration servers: > > ----------------------------------- > > ? Primary server:??? gpfs-tier2 > > ? Secondary server:? gpfs-tier1 > > Node? Daemon node name? IP address????? Admin node name? Designation > > ---------------------------------------------------------------------- > > ?? 1?? gpfs-tier1??????? 192.168.178.10? gpfs-tier1?????? quorum-manager > > ?? 2?? gpfs-tier2??????? 192.168.178.11? gpfs-tier2?????? quorum-manager > > [root at gpfs-tier1 gpfs5.2]# mmlscallback > > NodeDownCallback > > ??????? command?????? = /var/mmfs/rs/nodedown.ksh > > ??????? priority????? = 1 > > ??????? event???????? = quorumNodeLeave > > ??????? parms???????? = %eventNode %quorumNodes > > NodeUpCallback > > ??????? command?????? = /var/mmfs/rs/nodeup.ksh > > ??????? priority????? = 1 > > ??????? event???????? = quorumNodeJoin > > ??????? parms???????? = %eventNode %quorumNodes > > If I shutdown the filesystem via mmshutdown the callback script works > but if I shutdown the whole node the scripts does not run. > > The latest log-entry in mmfs.log.latest shows only this information: > > 2018-09-07_13:12:36.724+0200: [I] Cluster Manager connection broke. > Probing cluster TIERCLUSTER.gpfs-tier1 > > 2018-09-07_13:12:37.226+0200: [E] Unable to contact enough other quorum > nodes during cluster probe. > > 2018-09-07_13:12:37.226+0200: [E] Lost membership in cluster > TIERCLUSTER.gpfs-tier1. Unmounting file systems. > > 2018-09-07_13:12:38.448+0200: [N] Connecting to 192.168.178.11 > gpfs-tier2 > > Could anybody help me in this case? I want to try to start a script if > one node goes down or up to change the roles for starting the > filesystem. The callback event NodeLeave and NodeJoin do not run too. > > Any more information required? If yes, please let me know! > > Many thanks in advance and a nice weekend! > > Matthias > > Best Regards > > Matthias Knigge > R&D File Based Media Solutions > > Rohde & Schwarz > GmbH & Co. 
KG > Hanomaghof 1 > 30449 Hannover > Telefon +49 511 67 80 7 213 > Fax +49 511 37 19 74 > Internet: Matthias.Knigge at rohde-schwarz.com > ------------------------------------------------------------ > Gesch?ftsf?hrung / Executive Board: Christian Leicher (Vorsitzender / > Chairman), Peter Riedel, Sitz der Gesellschaft / Company's Place of > Business: M?nchen, Registereintrag / Commercial Register No.: HRA 16 > 270, Pers?nlich haftender Gesellschafter / Personally Liable Partner: > RUSEG Verwaltungs-GmbH, Sitz der Gesellschaft / Company's Place of > Business: M?nchen, Registereintrag / Commercial Register No.: HRB 7 534, > Umsatzsteuer-Identifikationsnummer (USt-IdNr.) / VAT Identification No.: > DE 130 256 683, Elektro-Altger?te Register (EAR) / WEEE Register No.: DE > 240 437 86 > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.knister at gmail.com Fri Sep 7 18:27:04 2018 From: aaron.knister at gmail.com (Aaron Knister) Date: Fri, 7 Sep 2018 13:27:04 -0400 Subject: [gpfsug-discuss] mmfsadm dump condvar event blocks Message-ID: Looking at the output of mmfsadm dump condvar I see that the various condvar entries are grouped into event blocks. I?m curious of the significance of that. If you?ve got say two sets of condvars in the same event block what does that mean? Is there necessarily any relation between them? -Aaron From ty.tran at applieddatasystems.com Fri Sep 7 18:34:06 2018 From: ty.tran at applieddatasystems.com (Ty Tran) Date: Fri, 7 Sep 2018 17:34:06 +0000 Subject: [gpfsug-discuss] Spectrum V5.0.0 and CentOS 7.5 Message-ID: Good Morning ? We have been trying to install V5.0.0 and CentOS 7.5 but it doesn?t seem to like the new Kernel. Does anyone have this running and do we need to do anything special? Or must we go to V5.0.1? TQT Ty Q. Tran Managing Partner Applied Data Systems 12180 Dearborn Place Poway, CA 92064 (714) 392- 6690 (Cell) (844) 371- 4949 x100 (Work) (858) 842- 4678 (Fax) ty.tran at applieddatasystems.com www.applieddatasystems.com [cid:image001.png at 01D44696.4DB62B50] -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 18825 bytes Desc: image001.png URL: From knop at us.ibm.com Fri Sep 7 23:08:02 2018 From: knop at us.ibm.com (Felipe Knop) Date: Fri, 7 Sep 2018 18:08:02 -0400 Subject: [gpfsug-discuss] Spectrum V5.0.0 and CentOS 7.5 In-Reply-To: References: Message-ID: Ty, For queries on Scale versions and specific distros, please refer to the FAQ: https://www.ibm.com/support/knowledgecenter/en/STXKQY/gpfsclustersfaq.html Table 33. 
IBM Spectrum Scale for Linux RedHat kernel support - the RHEL 7.5 row (kernel 3.10.0-862.el7) lists support from V4.1.1.20 in the 4.1.1 release, from V4.2.3.9 in the 4.2 release, and from V5.0.1.1 in the 5.0 release.

Assuming the levels of CentOS and RHEL are the same (they are supposed to be?), then CentOS 7.5 should work with Scale V5.0.1.1 or later.

Felipe

----
Felipe Knop knop at us.ibm.com
GPFS Development and Security
IBM Systems
IBM Building 008
2455 South Rd, Poughkeepsie, NY 12601
(845) 433-9314 T/L 293-9314

From: Ty Tran
To: "gpfsug-discuss at spectrumscale.org"
Date: 09/07/2018 05:05 PM
Subject: [gpfsug-discuss] Spectrum V5.0.0 and CentOS 7.5
Sent by: gpfsug-discuss-bounces at spectrumscale.org

Good Morning -

We have been trying to install V5.0.0 and CentOS 7.5 but it doesn't seem to like the new Kernel. Does anyone have this running and do we need to do anything special? Or must we go to V5.0.1?

TQT

Ty Q. Tran
Managing Partner
Applied Data Systems
12180 Dearborn Place
Poway, CA 92064
(714) 392-6690 (Cell)
(844) 371-4949 x100 (Work)
(858) 842-4678 (Fax)
ty.tran at applieddatasystems.com
www.applieddatasystems.com

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: graycol.gif
Type: image/gif
Size: 105 bytes
Desc: not available
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 1B307238.gif
Type: image/gif
Size: 18825 bytes
Desc: not available
URL: 

From kkr at lbl.gov Fri Sep 7 23:13:48 2018
From: kkr at lbl.gov (Kristy Kallback-Rose)
Date: Fri, 7 Sep 2018 15:13:48 -0700
Subject: [gpfsug-discuss] SC18 Planning
Message-ID: <26F104F4-C367-4F77-938D-BFB2937FBB2D@lbl.gov>

Hi all,

If you're planning on going to SC18 in November, we'd love to hear how you're using Scale (GPFS) at your site. If you'd be willing to give a 20-30 minute user talk about something you're doing at your site, please let us know. We'll be working to fill up the agenda soon.

Thanks,
Kristy

From cblack at nygenome.org Sat Sep 8 01:09:00 2018
From: cblack at nygenome.org (Christopher Black)
Date: Sat, 8 Sep 2018 00:09:00 +0000
Subject: [gpfsug-discuss] Spectrum V5.0.0 and CentOS 7.5
In-Reply-To: 
References: 
Message-ID: 

I can confirm gpfs 5.0.1.1 works with CentOS 7.5 for us (kernel package version 3.10.0-862.el7.x86_64).

Best,
Chris

From: on behalf of Felipe Knop
Reply-To: gpfsug main discussion list
Date: Friday, September 7, 2018 at 6:08 PM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] Spectrum V5.0.0 and CentOS 7.5

Ty,

For queries on Scale versions and specific distros, please refer to the FAQ: https://www.ibm.com/support/knowledgecenter/en/STXKQY/gpfsclustersfaq.html

Table 33.
IBM Spectrum Scale for Linux RedHat kernel support 7.5 3.10.0-862.el7 3.10.0-862.el7 From V4.1.1.20 in the 4.1.1 release From V4.2.3.9 in the 4.2 release From V5.0.1.1 in the 5.0 release From V4.1.1.20 in the 4.1.1 release From V4.2.3.9 in the 4.2 release From V5.0.1.1 in the 5.0 release Assuming the levels of CentOS and RHEL are the same (they are supposed to be?), then CentOS 7.5 should work with Scale V5.0.1.1 or later. Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 [Inactive hide details for Ty Tran ---09/07/2018 05:05:52 PM---Good Morning ? We have been trying to install V5.0.0 and CentOS]Ty Tran ---09/07/2018 05:05:52 PM---Good Morning ? We have been trying to install V5.0.0 and CentOS 7.5 but it doesn?t seem to like the From: Ty Tran To: "gpfsug-discuss at spectrumscale.org" Date: 09/07/2018 05:05 PM Subject: [gpfsug-discuss] Spectrum V5.0.0 and CentOS 7.5 Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Good Morning ? We have been trying to install V5.0.0 and CentOS 7.5 but it doesn?t seem to like the new Kernel. Does anyone have this running and do we need to do anything special? Or must we go to V5.0.1? TQT Ty Q. Tran Managing Partner Applied Data Systems 12180 Dearborn Place Poway, CA 92064 (714) 392- 6690 (Cell) (844) 371- 4949 x100 (Work) (858) 842- 4678 (Fax) ty.tran at applieddatasystems.com www.applieddatasystems.com [cid:2__=8FBB0992DFEAA5C28f9e8a93df938690918c8FB@] _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ This message is for the recipient?s use only, and may contain confidential, privileged or protected information. Any unauthorized use or dissemination of this communication is prohibited. If you received this message in error, please immediately notify the sender and destroy all copies of this message. The recipient should check this email and any attachments for the presence of viruses, as we accept no liability for any damage caused by any virus transmitted by this email. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 106 bytes Desc: image001.gif URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.png Type: image/png Size: 18826 bytes Desc: image002.png URL: From novosirj at rutgers.edu Sat Sep 8 03:13:18 2018 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Sat, 8 Sep 2018 02:13:18 +0000 Subject: [gpfsug-discuss] Spectrum V5.0.0 and CentOS 7.5 In-Reply-To: References: Message-ID: Someone asked me this the other day and I wasn?t quite sure of the answer: how likely is it that we will ever see/have we ever seen a kernel update (eg. 862.9.1 to 862.11.6) that breaks GPFS compatibility, or can one generally expect it will continue to work for 862*? > On Sep 7, 2018, at 6:08 PM, Felipe Knop wrote: > > Ty, > > For queries on Scale versions and specific distros, please refer to the FAQ: > > https://www.ibm.com/support/knowledgecenter/en/STXKQY/gpfsclustersfaq.html > > Table 33. 
IBM Spectrum Scale for Linux RedHat kernel support > > > 7.5 3.10.0-862.el7 3.10.0-862.el7 From V4.1.1.20 in the 4.1.1 release > From V4.2.3.9 in the 4.2 release > > From V5.0.1.1 in the 5.0 release > > From V4.1.1.20 in the 4.1.1 release > From V4.2.3.9 in the 4.2 release > > From V5.0.1.1 in the 5.0 release > > > Assuming the levels of CentOS and RHEL are the same (they are supposed to be?), then CentOS 7.5 should work with Scale V5.0.1.1 or later. > > Felipe > > ---- > Felipe Knop knop at us.ibm.com > GPFS Development and Security > IBM Systems > IBM Building 008 > 2455 South Rd, Poughkeepsie, NY 12601 > (845) 433-9314 T/L 293-9314 > > > > Ty Tran ---09/07/2018 05:05:52 PM---Good Morning ? We have been trying to install V5.0.0 and CentOS 7.5 but it doesn?t seem to like the > > From: Ty Tran > To: "gpfsug-discuss at spectrumscale.org" > Date: 09/07/2018 05:05 PM > Subject: [gpfsug-discuss] Spectrum V5.0.0 and CentOS 7.5 > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Good Morning ? > > We have been trying to install V5.0.0 and CentOS 7.5 but it doesn?t seem to like the new Kernel. Does anyone have this running and do we need to do anything special? Or must we go to V5.0.1? > > TQT > > Ty Q. Tran > Managing Partner > Applied Data Systems > 12180 Dearborn Place > Poway, CA 92064 > (714) 392- 6690 (Cell) > (844) 371- 4949 x100 (Work) > (858) 842- 4678 (Fax) > ty.tran at applieddatasystems.com > www.applieddatasystems.com > > <1B307238.gif> > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From bachmann.f at gmail.com Sat Sep 8 15:54:07 2018 From: bachmann.f at gmail.com (Florian Bachmann) Date: Sat, 8 Sep 2018 16:54:07 +0200 Subject: [gpfsug-discuss] Spectrum V5.0.0 and CentOS 7.5 In-Reply-To: References: Message-ID: <73cf0385-a410-bba6-5338-71cab7ffe34f@gmail.com> From my experience you are better off with locking kernel packages at a known-to-work version in production (e.g. install yum-plugin-versionlock and do a yum versionlock "kernel*") and test new kernel versions in a test environment. You cannot rely on made up rules like "minor version updates will never break GPFS" or similiar; Linux kernel developers do not care if GPFS works or not. Kind Regards Florian On 08.09.2018 04:13, Ryan Novosielski wrote: > Someone asked me this the other day and I wasn?t quite sure of the answer: how likely is it that we will ever see/have we ever seen a kernel update (eg. 862.9.1 to 862.11.6) that breaks GPFS compatibility, or can one generally expect it will continue to work for 862*? From UWEFALKE at de.ibm.com Mon Sep 10 00:04:12 2018 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Mon, 10 Sep 2018 01:04:12 +0200 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> Message-ID: Hi, Marc, I was clearly unaware of that function. If my understanding of parity-based redundancy is about correct, then that method would only work with RAID 5, because that is a simple XOR-based hash, but RAID 6, if used, would not allow that stripped-down RMW. Is that correct? Mit freundlichen Gr??en / Kind regards Dr. 
Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Marc A Kaplan" To: gpfsug main discussion list Date: 06/09/2018 18:09 Subject: Re: [gpfsug-discuss] RAID type for system pool Sent by: gpfsug-discuss-bounces at spectrumscale.org A somewhat smarter RAID controller will "only" need to read the old values of the single changed segment of data and the corresponding parity segment, and know the new value of the data block. Then it can compute the new parity segment value. Not necessarily the entire stripe. Still 2 reads and 2 writes + access delay times ( guaranteed more than one full rotation time when on spinning disks, average something like 1.7x rotation time ). From: "Uwe Falke" To: gpfsug main discussion list Date: 09/05/2018 04:07 PM Subject: Re: [gpfsug-discuss] RAID type for system pool Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, just think that your RAID controller on parity-backed redundancy needs to read the full stripe, modify it, and write it back (including parity) - the infamous Read-Modify-Write penalty. As long as your users don't bulk-create inodes and doo amend some metadata, (create a file sometimes, e.g.) The writing of a 4k inode, or the update of a 32k dir block causes your controller to read a full block (let's say you use 1MiB on MD) and write back the full block plus parity (on 4+1p RAID 5 at 1MiB that'll be 1.25MiB. Overhead two orders of magnitude above the payload. SSDs have become better now and expensive enterprise SSDs will endure quite a lot of full rewrites, but you need to estimate the MD change rate, apply the RMW overhead and see where you end WRT lifetime (and performance). Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 05/09/2018 17:35 Subject: [gpfsug-discuss] RAID type for system pool Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, We are in the process of finalizing the purchase of some new storage arrays (so no sales people who might be monitoring this list need contact me) to life-cycle some older hardware. One of the things we are considering is the purchase of some new SSD?s for our ?/home? 
filesystem and I have a question or two related to that.

Currently, the existing home filesystem has its metadata on SSDs - two RAID 1 mirrors and metadata replication set to two. However, the filesystem itself is old enough that it uses 512 byte inodes. We have analyzed our users' files and know that if we create a new filesystem with 4K inodes, a very significant portion of the files would then have their _data_ stored in the inode as well, due to the files being 3.5K or smaller (currently all data is on spinning HD RAID 1 mirrors). Of course, if we increase the size of the inodes by a factor of 8 then we also need 8 times as much space to store those inodes. Given that Enterprise class SSDs are still very expensive and our budget is not unlimited, we're trying to get the best bang for the buck.

We have always - even back in the day when our metadata was on spinning disk and not SSD - used RAID 1 mirrors and metadata replication of two. However, we are wondering if it might be possible to switch to RAID 5? Specifically, what we are considering doing is buying 8 new SSDs and creating two 3+1P RAID 5 LUNs (metadata replication would stay at two). That would give us 50% more usable space than if we configured those same 8 drives as four RAID 1 mirrors. Unfortunately, unless I'm misunderstanding something, that would mean that the RAID stripe size and the GPFS block size could not match. Therefore, even though we don't need the space, would we be much better off to buy 10 SSDs and create two 4+1P RAID 5 LUNs? I've searched the mailing list archives, scanned the DeveloperWorks wiki, and even glanced at the GPFS documentation, and haven't found anything that says "bad idea, Kevin!" ;-)

Expanding on this further - if we just present those two RAID 5 LUNs to GPFS as NSDs then we can only have two NSD servers as primary for them. So another thing we're considering is to take those RAID 5 LUNs and further sub-divide them into a total of 8 logical volumes, each of which could be a GPFS NSD, which would allow each of our 8 NSD servers to be primary for one of them. Even worse idea?!? Good idea? Anybody have any better ideas??? ;-)

Oh, and currently we're on GPFS 4.2.3-10, but are also planning on moving to GPFS 5.0.1-x before creating the new filesystem.

Thanks much...

Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

From spectrumscale at kiranghag.com Mon Sep 10 09:21:48 2018
From: spectrumscale at kiranghag.com (KG)
Date: Mon, 10 Sep 2018 13:51:48 +0530
Subject: [gpfsug-discuss] Spectrum V5.0.0 and CentOS 7.5
In-Reply-To:
References:
Message-ID:

If the release level is supported then all patch levels should work; you just need to run mmbuildgpl to re-compile the portability layer for the new kernel revision.
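As a minimal sketch (paths assume the default /usr/lpp/mmfs install location), the per-node steps after booting into the updated kernel are roughly:

# rebuild the GPL portability layer against the now-running kernel
# (needs the matching kernel-devel headers and a compiler on the node)
/usr/lpp/mmfs/bin/mmbuildgpl
# then bring GPFS back up on that node
/usr/lpp/mmfs/bin/mmstartup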
KG On Sat, Sep 8, 2018 at 7:43 AM, Ryan Novosielski wrote: > Someone asked me this the other day and I wasn?t quite sure of the answer: > how likely is it that we will ever see/have we ever seen a kernel update > (eg. 862.9.1 to 862.11.6) that breaks GPFS compatibility, or can one > generally expect it will continue to work for 862*? > > > On Sep 7, 2018, at 6:08 PM, Felipe Knop wrote: > > > > Ty, > > > > For queries on Scale versions and specific distros, please refer to the > FAQ: > > > > https://www.ibm.com/support/knowledgecenter/en/STXKQY/ > gpfsclustersfaq.html > > > > Table 33. IBM Spectrum Scale for Linux RedHat kernel support > > > > > > 7.5 3.10.0-862.el7 3.10.0-862.el7 From V4.1.1.20 in the 4.1.1 release > > From V4.2.3.9 in the 4.2 release > > > > From V5.0.1.1 in the 5.0 release > > > > From V4.1.1.20 in the 4.1.1 release > > From V4.2.3.9 in the 4.2 release > > > > From V5.0.1.1 in the 5.0 release > > > > > > Assuming the levels of CentOS and RHEL are the same (they are supposed > to be?), then CentOS 7.5 should work with Scale V5.0.1.1 or later. > > > > Felipe > > > > ---- > > Felipe Knop knop at us.ibm.com > > GPFS Development and Security > > IBM Systems > > IBM Building 008 > > 2455 South Rd, Poughkeepsie, NY 12601 > > (845) 433-9314 T/L 293-9314 > > > > > > > > Ty Tran ---09/07/2018 05:05:52 PM---Good Morning ? We have > been trying to install V5.0.0 and CentOS 7.5 but it doesn?t seem to like the > > > > From: Ty Tran > > To: "gpfsug-discuss at spectrumscale.org" org> > > Date: 09/07/2018 05:05 PM > > Subject: [gpfsug-discuss] Spectrum V5.0.0 and CentOS 7.5 > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > > > > Good Morning ? > > > > We have been trying to install V5.0.0 and CentOS 7.5 but it doesn?t seem > to like the new Kernel. Does anyone have this running and do we need to do > anything special? Or must we go to V5.0.1? > > > > TQT > > > > Ty Q. Tran > > Managing Partner > > Applied Data Systems > > 12180 Dearborn Place > > Poway, CA 92064 > > (714) 392- 6690 (Cell) > > (844) 371- 4949 x100 (Work) > > (858) 842- 4678 (Fax) > > ty.tran at applieddatasystems.com > > www.applieddatasystems.com > > > > <1B307238.gif> > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Matthias.Knigge at rohde-schwarz.com Mon Sep 10 12:19:14 2018 From: Matthias.Knigge at rohde-schwarz.com (Matthias Knigge) Date: Mon, 10 Sep 2018 11:19:14 +0000 Subject: [gpfsug-discuss] [Newsletter] Re: Problem with mmlscluster and callback scripts Message-ID: <685e1f5c26c548ec85046e761563f583@rohde-schwarz.com> Hi Araon, in my setup I have no chance to define a tiebreaker disk. So if one node goes down I would change the role if this node. mmchnode --nonquorum -N nodename --force After that I can start the filesystem and mount it. Thanks, Matthias Best Regards Matthias Knigge R&D File Based Media Solutions Rohde & Schwarz GmbH & Co. 
KG Hanomaghof 1 30449 Hannover Telefon +49 511 67 80 7 213 Fax +49 511 37 19 74 Internet: Matthias.Knigge at rohde-schwarz.com ------------------------------------------------------------ Gesch?ftsf?hrung / Executive Board: Christian Leicher (Vorsitzender / Chairman), Peter Riedel, Sitz der Gesellschaft / Company's Place of Business: M?nchen, Registereintrag / Commercial Register No.: HRA 16 270, Pers?nlich haftender Gesellschafter / Personally Liable Partner: RUSEG Verwaltungs-GmbH, Sitz der Gesellschaft / Company's Place of Business: M?nchen, Registereintrag / Commercial Register No.: HRB 7 534, Umsatzsteuer-Identifikationsnummer (USt-IdNr.) / VAT Identification No.: DE 130 256 683, Elektro-Altger?te Register (EAR) / WEEE Register No.: DE 240 437 86 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Aaron Knister Sent: Friday, September 07, 2018 3:35 PM To: gpfsug-discuss at spectrumscale.org Subject: *EXT* [Newsletter] Re: [gpfsug-discuss] Problem with mmlscluster and callback scripts Hi Matthias, Looks like you lost quorum in the cluster (you've got to have (n/2+1) quorum nodes up if you're using node-based quorum). Do you have a tiebreaker disk defined? (i.e. mmlsconfig tiebreakerdisk). -Aaron On 9/7/18 7:51 AM, Matthias Knigge wrote: > Hello together, > > I am using the version 5.0.2.0 of GPFS and have problems with the > command mmlscluster and callback-scripts. It is a small cluster of two > nodes only. If I shutdown one of the nodes sometimes mmlscluster > reports the following output: > > [root at gpfs-tier1 gpfs5.2]# mmgetstate > > Node number? Node name??????? GPFS state > > ------------------------------------------- > > ?????? 1????? gpfs-tier1?????? arbitrating > > [root at gpfs-tier1 gpfs5.2]# mmlscluster > > ssh: connect to host gpfs-tier2 port 22: No route to host > > mmlscluster: Unable to retrieve GPFS cluster files from node > gpfs-tier2 > > mmlscluster: Command failed. Examine previous error messages to > determine cause. > > Normally the output is like this: > > [root at gpfs-tier1 gpfs5.2]# mmlscluster > > GPFS cluster information > > ======================== > > ? GPFS cluster name:???????? TIERCLUSTER.gpfs-tier1 > > ? GPFS cluster id:?????????? 12458173498278694815 > > ? GPFS UID domain:?????????? TIERCLUSTER.gpfs-tier1 > > ? Remote shell command:????? /usr/bin/ssh > > ? Remote file copy command:? /usr/bin/scp > > ? Repository type:?????????? server-based > > GPFS cluster configuration servers: > > ----------------------------------- > > ? Primary server:??? gpfs-tier2 > > ? Secondary server:? gpfs-tier1 > > Node? Daemon node name? IP address????? Admin node name? Designation > > ---------------------------------------------------------------------- > > ?? 1?? gpfs-tier1??????? 192.168.178.10? gpfs-tier1?????? > quorum-manager > > ?? 2?? gpfs-tier2??????? 192.168.178.11? gpfs-tier2?????? > quorum-manager > > [root at gpfs-tier1 gpfs5.2]# mmlscallback > > NodeDownCallback > > ??????? command?????? = /var/mmfs/rs/nodedown.ksh > > ??????? priority????? = 1 > > ??????? event???????? = quorumNodeLeave > > ??????? parms???????? = %eventNode %quorumNodes > > NodeUpCallback > > ??????? command?????? = /var/mmfs/rs/nodeup.ksh > > ??????? priority????? = 1 > > ??????? event???????? = quorumNodeJoin > > ??????? parms???????? = %eventNode %quorumNodes > > If I shutdown the filesystem via mmshutdown the callback script works > but if I shutdown the whole node the scripts does not run. 
> > The latest log-entry in mmfs.log.latest shows only this information: > > 2018-09-07_13:12:36.724+0200: [I] Cluster Manager connection broke. > Probing cluster TIERCLUSTER.gpfs-tier1 > > 2018-09-07_13:12:37.226+0200: [E] Unable to contact enough other > quorum nodes during cluster probe. > > 2018-09-07_13:12:37.226+0200: [E] Lost membership in cluster > TIERCLUSTER.gpfs-tier1. Unmounting file systems. > > 2018-09-07_13:12:38.448+0200: [N] Connecting to 192.168.178.11 > gpfs-tier2 > > Could anybody help me in this case? I want to try to start a script if > one node goes down or up to change the roles for starting the > filesystem. The callback event NodeLeave and NodeJoin do not run too. > > Any more information required? If yes, please let me know! > > Many thanks in advance and a nice weekend! > > Matthias > > Best Regards > > Matthias Knigge > R&D File Based Media Solutions > > Rohde & Schwarz > GmbH & Co. KG > Hanomaghof 1 > 30449 Hannover > Telefon +49 511 67 80 7 213 > Fax +49 511 37 19 74 > Internet: Matthias.Knigge at rohde-schwarz.com > ------------------------------------------------------------ > Gesch?ftsf?hrung / Executive Board: Christian Leicher (Vorsitzender / > Chairman), Peter Riedel, Sitz der Gesellschaft / Company's Place of > Business: M?nchen, Registereintrag / Commercial Register No.: HRA 16 > 270, Pers?nlich haftender Gesellschafter / Personally Liable Partner: > RUSEG Verwaltungs-GmbH, Sitz der Gesellschaft / Company's Place of > Business: M?nchen, Registereintrag / Commercial Register No.: HRB 7 > 534, Umsatzsteuer-Identifikationsnummer (USt-IdNr.) / VAT Identification No.: > DE 130 256 683, Elektro-Altger?te Register (EAR) / WEEE Register No.: > DE > 240 437 86 > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From Matthias.Knigge at rohde-schwarz.com Mon Sep 10 12:21:21 2018 From: Matthias.Knigge at rohde-schwarz.com (Matthias Knigge) Date: Mon, 10 Sep 2018 11:21:21 +0000 Subject: [gpfsug-discuss] [Newsletter] Re: Problem with mmlscluster and callback scripts Message-ID: Hi Fred, I have the same problem with the version 5.0.1.0. Thanks, Matthias Best Regards Matthias Knigge R&D File Based Media Solutions Rohde & Schwarz GmbH & Co. KG Hanomaghof 1 30449 Hannover Telefon +49 511 67 80 7 213 Fax +49 511 37 19 74 Internet: Matthias.Knigge at rohde-schwarz.com ------------------------------------------------------------ Gesch?ftsf?hrung / Executive Board: Christian Leicher (Vorsitzender / Chairman), Peter Riedel, Sitz der Gesellschaft / Company's Place of Business: M?nchen, Registereintrag / Commercial Register No.: HRA 16 270, Pers?nlich haftender Gesellschafter / Personally Liable Partner: RUSEG Verwaltungs-GmbH, Sitz der Gesellschaft / Company's Place of Business: M?nchen, Registereintrag / Commercial Register No.: HRB 7 534, Umsatzsteuer-Identifikationsnummer (USt-IdNr.) 
/ VAT Identification No.: DE 130 256 683, Elektro-Altger?te Register (EAR) / WEEE Register No.: DE 240 437 86 From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Frederick Stock Sent: Friday, September 07, 2018 3:20 PM To: gpfsug main discussion list Subject: *EXT* [Newsletter] Re: [gpfsug-discuss] Problem with mmlscluster and callback scripts Are you really running version 5.0.2? If so then I presume you have a beta version since it has not yet been released. For beta problems there is a specific feedback mechanism that should be used to report problems. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: Matthias Knigge > To: "gpfsug-discuss at spectrumscale.org" > Date: 09/07/2018 08:08 AM Subject: [gpfsug-discuss] Problem with mmlscluster and callback scripts Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hello together, I am using the version 5.0.2.0 of GPFS and have problems with the command mmlscluster and callback-scripts. It is a small cluster of two nodes only. If I shutdown one of the nodes sometimes mmlscluster reports the following output: [root at gpfs-tier1 gpfs5.2]# mmgetstate Node number Node name GPFS state ------------------------------------------- 1 gpfs-tier1 arbitrating [root at gpfs-tier1 gpfs5.2]# mmlscluster ssh: connect to host gpfs-tier2 port 22: No route to host mmlscluster: Unable to retrieve GPFS cluster files from node gpfs-tier2 mmlscluster: Command failed. Examine previous error messages to determine cause. Normally the output is like this: [root at gpfs-tier1 gpfs5.2]# mmlscluster GPFS cluster information ======================== GPFS cluster name: TIERCLUSTER.gpfs-tier1 GPFS cluster id: 12458173498278694815 GPFS UID domain: TIERCLUSTER.gpfs-tier1 Remote shell command: /usr/bin/ssh Remote file copy command: /usr/bin/scp Repository type: server-based GPFS cluster configuration servers: ----------------------------------- Primary server: gpfs-tier2 Secondary server: gpfs-tier1 Node Daemon node name IP address Admin node name Designation ---------------------------------------------------------------------- 1 gpfs-tier1 192.168.178.10 gpfs-tier1 quorum-manager 2 gpfs-tier2 192.168.178.11 gpfs-tier2 quorum-manager [root at gpfs-tier1 gpfs5.2]# mmlscallback NodeDownCallback command = /var/mmfs/rs/nodedown.ksh priority = 1 event = quorumNodeLeave parms = %eventNode %quorumNodes NodeUpCallback command = /var/mmfs/rs/nodeup.ksh priority = 1 event = quorumNodeJoin parms = %eventNode %quorumNodes If I shutdown the filesystem via mmshutdown the callback script works but if I shutdown the whole node the scripts does not run. The latest log-entry in mmfs.log.latest shows only this information: 2018-09-07_13:12:36.724+0200: [I] Cluster Manager connection broke. Probing cluster TIERCLUSTER.gpfs-tier1 2018-09-07_13:12:37.226+0200: [E] Unable to contact enough other quorum nodes during cluster probe. 2018-09-07_13:12:37.226+0200: [E] Lost membership in cluster TIERCLUSTER.gpfs-tier1. Unmounting file systems. 2018-09-07_13:12:38.448+0200: [N] Connecting to 192.168.178.11 gpfs-tier2 Could anybody help me in this case? I want to try to start a script if one node goes down or up to change the roles for starting the filesystem. The callback event NodeLeave and NodeJoin do not run too. Any more information required? If yes, please let me know! Many thanks in advance and a nice weekend! 
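P.S. For reference, the two callbacks were registered more or less like this - reconstructed from the mmlscallback output above, so treat it as illustrative only:

mmaddcallback NodeDownCallback --command /var/mmfs/rs/nodedown.ksh \
    --event quorumNodeLeave --priority 1 --parms "%eventNode %quorumNodes"
mmaddcallback NodeUpCallback --command /var/mmfs/rs/nodeup.ksh \
    --event quorumNodeJoin --priority 1 --parms "%eventNode %quorumNodes"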
Matthias Best Regards Matthias Knigge R&D File Based Media Solutions Rohde & Schwarz GmbH & Co. KG Hanomaghof 1 30449 Hannover Telefon +49 511 67 80 7 213 Fax +49 511 37 19 74 Internet: Matthias.Knigge at rohde-schwarz.com ------------------------------------------------------------ Gesch?ftsf?hrung / Executive Board: Christian Leicher (Vorsitzender / Chairman), Peter Riedel, Sitz der Gesellschaft / Company's Place of Business: M?nchen, Registereintrag / Commercial Register No.: HRA 16 270, Pers?nlich haftender Gesellschafter / Personally Liable Partner: RUSEG Verwaltungs-GmbH, Sitz der Gesellschaft / Company's Place of Business: M?nchen, Registereintrag / Commercial Register No.: HRB 7 534, Umsatzsteuer-Identifikationsnummer (USt-IdNr.) / VAT Identification No.: DE 130 256 683, Elektro-Altger?te Register (EAR) / WEEE Register No.: DE 240 437 86 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From abeattie at au1.ibm.com Mon Sep 10 13:19:31 2018 From: abeattie at au1.ibm.com (Andrew Beattie) Date: Mon, 10 Sep 2018 12:19:31 +0000 Subject: [gpfsug-discuss] [Newsletter] Re: Problem with mmlscluster and callback scripts In-Reply-To: Message-ID: An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Mon Sep 10 15:36:52 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Mon, 10 Sep 2018 10:36:52 -0400 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> Message-ID: No, but of course for a RAID with additional parity (error correction) bits the controller needs to read and write more. So if, for example, n+2 sub-blocks per stripe = n data and 2 error correction... Then the smallest update requires read-compute-write on 1 data and 2 ecc. = 3 reads and 3 writes. The calculation each parity block of the requires "subtracting" out the contribution of the old data value and adding in the contribution of the new data value Ref: http://igoro.com/archive/how-raid-6-dual-parity-calculation-works/ Look at it this way: The k'th parity value is Parityk= Ak*(data1) + Bk*(data2) + Ck*(data3) + ... (Ak, Bk, Ck, ... are coefficients for the computation of the k'th parity value) When updating data2 to data2x we update Parityk to Paritykx with Paritykx = Pariktyk - Bk*(data2) + Bk*(data2x) (Arithmetic done in a Galois Field chosen to make error correction practical.) From: "Uwe Falke" Hi, Marc, I was clearly unaware of that function. If my understanding of parity-based redundancy is about correct, then that method would only work with RAID 5, because that is a simple XOR-based hash, but RAID 6, if used, would not allow that stripped-down RMW. Is that correct? -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Mon Sep 10 19:26:34 2018 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Mon, 10 Sep 2018 18:26:34 +0000 Subject: [gpfsug-discuss] RAID type for system pool Message-ID: <6232FD43-12A5-49BA-84DA-B3801F42EAF6@vanderbilt.edu> Hi All, So while I?m waiting for the purchase of new hardware to go thru, I?m trying to gather more data about the current workload. One of the things I?m trying to do is get a handle on the ratio of reads versus writes for my metadata. I?m using ?mmdiag ?iohist? ? in this case ?dm-12? 
is one of my metadataOnly disks and I?m running this on the primary NSD server for that NSD. I?m seeing output like: 11:22:13.931117 W inode 4:299844163 1 0.448 srv dm-12 11:22:13.932344 R metadata 4:36659676 4 0.307 srv dm-12 11:22:13.932005 W logData 4:49676176 1 0.726 srv dm-12 And I?m confused as to the difference between ?inode? and ?metadata? (I at least _think_ I understand ?logData?)?!? The man page for mmdiag doesn?t help and I?ve not found anything useful yet in my Googling. This is on a filesystem that currently uses 512 byte inodes, if that matters. Thanks? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Mon Sep 10 17:37:26 2018 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Mon, 10 Sep 2018 16:37:26 +0000 Subject: [gpfsug-discuss] RAID type for system pool References: Message-ID: From: gpfsug-discuss-owner at spectrumscale.org Subject: Re: [gpfsug-discuss] RAID type for system pool Date: September 10, 2018 at 11:35:05 AM CDT To: klb at accre.vanderbilt.edu Hi All, So while I?m waiting for the purchase of new hardware to go thru, I?m trying to gather more data about the current workload. One of the things I?m trying to do is get a handle on the ratio of reads versus writes for my metadata. I?m using ?mmdiag ?iohist? ? in this case ?dm-12? is one of my metadataOnly disks and I?m running this on the primary NSD server for that NSD. I?m seeing output like: 11:22:13.931117 W inode 4:299844163 1 0.448 srv dm-12 11:22:13.932344 R metadata 4:36659676 4 0.307 srv dm-12 11:22:13.932005 W logData 4:49676176 1 0.726 srv dm-12 And I?m confused as to the difference between ?inode? and ?metadata? (I at least _think_ I understand ?logData?)?!? The man page for mmdiag doesn?t help and I?ve not found anything useful yet in my Googling. This is on a filesystem that currently uses 512 byte inodes, if that matters. Thanks? Kevin -------------- next part -------------- An HTML attachment was scrubbed... URL: From stockf at us.ibm.com Mon Sep 10 20:49:36 2018 From: stockf at us.ibm.com (Frederick Stock) Date: Mon, 10 Sep 2018 15:49:36 -0400 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: <6232FD43-12A5-49BA-84DA-B3801F42EAF6@vanderbilt.edu> References: <6232FD43-12A5-49BA-84DA-B3801F42EAF6@vanderbilt.edu> Message-ID: My guess is that the "metadata" IO is for either for directory data since directories are considered metadata, or fileset metadata. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 09/10/2018 02:27 PM Subject: [gpfsug-discuss] RAID type for system pool Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, So while I?m waiting for the purchase of new hardware to go thru, I?m trying to gather more data about the current workload. One of the things I?m trying to do is get a handle on the ratio of reads versus writes for my metadata. I?m using ?mmdiag ?iohist? ? in this case ?dm-12? is one of my metadataOnly disks and I?m running this on the primary NSD server for that NSD. 
I?m seeing output like: 11:22:13.931117 W inode 4:299844163 1 0.448 srv dm-12 11:22:13.932344 R metadata 4:36659676 4 0.307 srv dm-12 11:22:13.932005 W logData 4:49676176 1 0.726 srv dm-12 And I?m confused as to the difference between ?inode? and ?metadata? (I at least _think_ I understand ?logData?)?!? The man page for mmdiag doesn?t help and I?ve not found anything useful yet in my Googling. This is on a filesystem that currently uses 512 byte inodes, if that matters. Thanks? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From alvise.dorigo at psi.ch Tue Sep 11 10:05:23 2018 From: alvise.dorigo at psi.ch (Dorigo Alvise (PSI)) Date: Tue, 11 Sep 2018 09:05:23 +0000 Subject: [gpfsug-discuss] mmperfmon report some "null" data Message-ID: <83A6EEB0EC738F459A39439733AE80452675425E@MBX214.d.ethz.ch> Dear experts, during a intensive writing into a GPFS FS (~9.5 GB/s), if I run mmperfmon to collect performance data I get many "null" strings instead of real data:: [root at sf-dss-1 ~]# date;mmperfmon query 'sf-dssio-.*.psi.ch|GPFSNSDFS|RAW|gpfs_nsdfs_bytes_written' --short --number-buckets 10 -b 1 Tue Sep 11 10:57:06 CEST 2018 Legend: 1: sf-dssio-1.psi.ch|GPFSNSDFS|RAW|gpfs_nsdfs_bytes_written 2: sf-dssio-2.psi.ch|GPFSNSDFS|RAW|gpfs_nsdfs_bytes_written Row Timestamp _1 _2 1 2018-09-11-10:56:57 4135583744 4329193472 2 2018-09-11-10:56:58 4799332352 4697755648 3 2018-09-11-10:56:59 4799332352 4697755648 4 2018-09-11-10:57:00 null null 5 2018-09-11-10:57:01 null null 6 2018-09-11-10:57:02 null null 7 2018-09-11-10:57:03 null null 8 2018-09-11-10:57:04 null null 9 2018-09-11-10:57:05 null null 10 2018-09-11-10:57:06 null null Even worse if I reduce the number of buckets: [root at sf-dss-1 ~]# date;mmperfmon query 'sf-dssio-.*.psi.ch|GPFSNSDFS|RAW|gpfs_nsdfs_bytes_written' --short --number-buckets 5 -b 1 Tue Sep 11 10:59:26 CEST 2018 Legend: 1: sf-dssio-1.psi.ch|GPFSNSDFS|RAW|gpfs_nsdfs_bytes_written 2: sf-dssio-2.psi.ch|GPFSNSDFS|RAW|gpfs_nsdfs_bytes_written Row Timestamp _1 _2 1 2018-09-11-10:59:21 null null 2 2018-09-11-10:59:22 null null 3 2018-09-11-10:59:23 null null 4 2018-09-11-10:59:24 null null 5 2018-09-11-10:59:25 null null To get real data the number of buckets must be at least 6, but sometime it is better to set it to 10 otherwise there's the risk to get only "null" data anyway. The question is: which particular configuration can be wrong in my mmperfmon's configuration file (see below for the dump of "config show") that produces those null data ? My system is a Lenovo DSS-G220 updated to version dss-g-2.0a (gpfs version 4.2.3-7). 
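For now I work around it by querying a longer window with coarser buckets, which would fit the idea that the newest seconds simply have not been reported by the sensors yet - but that is only my guess at the cause, not a fix:

# same query, 5-second buckets over a minute, so the newest unreported seconds
# do not blank out the whole result
mmperfmon query 'sf-dssio-.*.psi.ch|GPFSNSDFS|RAW|gpfs_nsdfs_bytes_written' \
    --short --number-buckets 12 -b 5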
thanks, Alvise ------------------------------------ cephMon = "/opt/IBM/zimon/CephMonProxy" cephRados = "/opt/IBM/zimon/CephRadosProxy" colCandidates = "sf-dss-1", "daas-mon.psi.ch" colRedundancy = 2 collectors = { host = "" port = "4739" } config = "/opt/IBM/zimon/ZIMonSensors.cfg" ctdbstat = "" daemonize = T hostname = "" ipfixinterface = "0.0.0.0" logfile = "/var/log/zimon/ZIMonSensors.log" loglevel = "info" mmcmd = "/opt/IBM/zimon/MMCmdProxy" mmdfcmd = "/opt/IBM/zimon/MMDFProxy" mmpmon = "/opt/IBM/zimon/MmpmonSockProxy" piddir = "/var/run" release = "4.2.3-4" sensors = { name = "CPU" period = 5 }, { name = "Load" period = 5 }, { name = "Memory" period = 5 }, { name = "Network" period = 1 }, { name = "Netstat" period = 0 }, { name = "Diskstat" period = 0 }, { name = "DiskFree" period = 60 restrict = "sf-dss-1.psi.ch" }, { name = "Infiniband" period = 1 }, { name = "GPFSDisk" period = 1 restrict = "nsdNodes" }, { name = "GPFSFilesystem" period = 1 }, { name = "GPFSNSDDisk" period = 1 restrict = "nsdNodes" }, { name = "GPFSNSDFS" period = 1 restrict = "nsdNodes" }, { name = "GPFSPoolIO" period = 1 }, { name = "GPFSVFS" period = 1 }, { name = "GPFSIOC" period = 1 }, { name = "GPFSVIO" period = 1 }, { name = "GPFSPDDisk" period = 1 restrict = "nsdNodes" }, { name = "GPFSvFLUSH" period = 1 }, { name = "GPFSNode" period = 1 }, { name = "GPFSNodeAPI" period = 1 }, { name = "GPFSFilesystemAPI" period = 1 }, { name = "GPFSLROC" period = 1 }, { name = "GPFSCHMS" period = 1 }, { name = "GPFSAFM" period = 5 }, { name = "GPFSAFMFS" period = 5 }, { name = "GPFSAFMFSET" period = 5 }, { name = "GPFSRPCS" period = 1 }, { name = "GPFSWaiters" period = 5 }, { name = "GPFSFilesetQuota" period = 60 restrict = "sf-dss-1" }, { name = "GPFSFileset" period = 60 restrict = "sf-dss-1" }, { name = "GPFSPool" period = 60 restrict = "sf-dss-1" }, { name = "GPFSDiskCap" period = 0 } smbstat = "" -------------- next part -------------- An HTML attachment was scrubbed... URL: From Michael.Dutchak at ibm.com Tue Sep 11 14:20:19 2018 From: Michael.Dutchak at ibm.com (Michael Dutchak) Date: Tue, 11 Sep 2018 09:20:19 -0400 Subject: [gpfsug-discuss] Optimal range on inode count for a single folder Message-ID: I would like to find out what the limitation, or optimal range on inode count for a single folder is in GPFS. We have several users that have caused issues with our current files system by adding up to a million small files (1 ~ 40k) to a single directory. This causes issues during system remount where restarting the system can take excessive amounts of time. -------------- next part -------------- An HTML attachment was scrubbed... URL: From stockf at us.ibm.com Tue Sep 11 15:04:33 2018 From: stockf at us.ibm.com (Frederick Stock) Date: Tue, 11 Sep 2018 10:04:33 -0400 Subject: [gpfsug-discuss] Optimal range on inode count for a single folder In-Reply-To: References: Message-ID: I am not sure I can provide you an optimal range but I can list some factors to consider. In general the guideline is to keep directories to 500K files or so. Keeping your metadata on separate NSDs, and preferably fast NSDs, helps especially with directory listings. And running the latest version of Scale also helps. It is unclear to me why the number of files in a directory would impact remount unless these are exported directories and the remount is occurring on a user node that also attempts to scan through the directory. 
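If you want to flag offending directories before they become a problem, a crude sweep along these lines would do it (untested sketch; adjust the mount point and the 500K threshold, and note that on a very large tree a policy-engine scan would be the faster way to gather the same numbers):

# count the direct entries of every directory and report any over the guideline
find /gpfs/fs1 -xdev -type d -print0 | while IFS= read -r -d '' d; do
    n=$(find "$d" -mindepth 1 -maxdepth 1 | wc -l)
    [ "$n" -gt 500000 ] && echo "$n $d"
done | sort -rn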
Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Michael Dutchak" To: gpfsug-discuss at spectrumscale.org Date: 09/11/2018 09:21 AM Subject: [gpfsug-discuss] Optimal range on inode count for a single folder Sent by: gpfsug-discuss-bounces at spectrumscale.org I would like to find out what the limitation, or optimal range on inode count for a single folder is in GPFS. We have several users that have caused issues with our current files system by adding up to a million small files (1 ~ 40k) to a single directory. This causes issues during system remount where restarting the system can take excessive amounts of time. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Tue Sep 11 15:03:24 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 11 Sep 2018 10:03:24 -0400 Subject: [gpfsug-discuss] Optimal range on inode count for a single folder In-Reply-To: References: Message-ID: There is no single "optimal" number of files per directory. GPFS can handle millions of files in a directory, rather efficiently. It uses fairly modern extensible hashing and caching techniques that makes lookup, insertions and deletions go fast. But of course, reading or "listing" all directory entries is going to require reading all the disk sectors that contain the directory... "during system remount... restarting the system" -- NO! There is no relation between directory sizes and mount and startup times... If you are experiencing long mount times, something else is happening. IF restart is after a crash of some kind, then it is possible GPFS may need to process many log entries -- but that would be proportional to the number of directory updates "in flight" at the time of the crash... Having said that there are some changeover conditions in the way directories are stored, as one adds more and more entries. Since directory entries are of variable size, varying with the size of the file names, the exact numbers depend on file name length, inode size and (meta)data block size: A) All directory entries fit in the directory inode. Best performance! But I do not recommend deliberately changing apps to avoid spilling to ... B) All directory entries fit in one metadata block. C) Directory entries are spread over several blocks. You can determine how much storage a directory is using by a `stat /path` command or equivalent. From: "Michael Dutchak" To: gpfsug-discuss at spectrumscale.org Date: 09/11/2018 09:21 AM Subject: [gpfsug-discuss] Optimal range on inode count for a single folder Sent by: gpfsug-discuss-bounces at spectrumscale.org I would like to find out what the limitation, or optimal range on inode count for a single folder is in GPFS. We have several users that have caused issues with our current files system by adding up to a million small files (1 ~ 40k) to a single directory. This causes issues during system remount where restarting the system can take excessive amounts of time. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/gif Size: 21994 bytes Desc: not available URL: From makaplan at us.ibm.com Tue Sep 11 15:12:39 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 11 Sep 2018 10:12:39 -0400 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: Message-ID: Metadata is anything besides the data contents of your files. Inodes, directories, indirect blocks, allocation maps, log data ... are the biggies. Apparently, --iohist may sometimes distinguish some metadata as "inode", "logData", ... that doesn't mean those aren't metadata also. From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 09/10/2018 03:12 PM Subject: [gpfsug-discuss] RAID type for system pool Sent by: gpfsug-discuss-bounces at spectrumscale.org From: gpfsug-discuss-owner at spectrumscale.org Subject: Re: [gpfsug-discuss] RAID type for system pool Date: September 10, 2018 at 11:35:05 AM CDT To: klb at accre.vanderbilt.edu Hi All, So while I?m waiting for the purchase of new hardware to go thru, I?m trying to gather more data about the current workload. One of the things I?m trying to do is get a handle on the ratio of reads versus writes for my metadata. I?m using ?mmdiag ?iohist? ? in this case ?dm-12? is one of my metadataOnly disks and I?m running this on the primary NSD server for that NSD. I?m seeing output like: 11:22:13.931117 W inode 4:299844163 1 0.448 srv dm-12 11:22:13.932344 R metadata 4:36659676 4 0.307 srv dm-12 11:22:13.932005 W logData 4:49676176 1 0.726 srv dm-12 And I?m confused as to the difference between ?inode? and ?metadata? (I at least _think_ I understand ?logData?)?!? The man page for mmdiag doesn?t help and I?ve not found anything useful yet in my Googling. This is on a filesystem that currently uses 512 byte inodes, if that matters. Thanks? Kevin _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Tue Sep 11 16:48:51 2018 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Tue, 11 Sep 2018 15:48:51 +0000 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: Message-ID: Hi Marc, Understood ? I?m just trying to understand why some I/O?s are flagged as metadata, while others are flagged as inode?!? Since this filesystem uses 512 byte inodes, there is no data content from any files involved (for a metadata only disk), correct? Thanks? Kevin On Sep 11, 2018, at 9:12 AM, Marc A Kaplan > wrote: Metadata is anything besides the data contents of your files. Inodes, directories, indirect blocks, allocation maps, log data ... are the biggies. Apparently, --iohist may sometimes distinguish some metadata as "inode", "logData", ... that doesn't mean those aren't metadata also. From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 09/10/2018 03:12 PM Subject: [gpfsug-discuss] RAID type for system pool Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ From: gpfsug-discuss-owner at spectrumscale.org Subject: Re: [gpfsug-discuss] RAID type for system pool Date: September 10, 2018 at 11:35:05 AM CDT To: klb at accre.vanderbilt.edu Hi All, So while I?m waiting for the purchase of new hardware to go thru, I?m trying to gather more data about the current workload. 
One of the things I?m trying to do is get a handle on the ratio of reads versus writes for my metadata. I?m using ?mmdiag ?iohist? ? in this case ?dm-12? is one of my metadataOnly disks and I?m running this on the primary NSD server for that NSD. I?m seeing output like: 11:22:13.931117 W inode 4:299844163 1 0.448 srv dm-12 11:22:13.932344 R metadata 4:36659676 4 0.307 srv dm-12 11:22:13.932005 W logData 4:49676176 1 0.726 srv dm-12 And I?m confused as to the difference between ?inode? and ?metadata? (I at least _think_ I understand ?logData?)?!? The man page for mmdiag doesn?t help and I?ve not found anything useful yet in my Googling. This is on a filesystem that currently uses 512 byte inodes, if that matters. Thanks? Kevin _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C2dbbb1fe9f5a4b80aa6b08d617f0a664%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636722719686369131&sdata=uOPawxUhx4Wvxja5%2FLvJJMpAHj3uRb0Q1eiogmRXGgw%3D&reserved=0 -------------- next part -------------- An HTML attachment was scrubbed... URL: From valdis.kletnieks at vt.edu Tue Sep 11 17:31:46 2018 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Tue, 11 Sep 2018 12:31:46 -0400 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: <6232FD43-12A5-49BA-84DA-B3801F42EAF6@vanderbilt.edu> Message-ID: <9333.1536683506@turing-police.cc.vt.edu> On Mon, 10 Sep 2018 15:49:36 -0400, "Frederick Stock" said: > My guess is that the "metadata" IO is for either for directory data since > directories are considered metadata, or fileset metadata. Plus things like free block lists, etc... -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From makaplan at us.ibm.com Tue Sep 11 19:00:15 2018 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 11 Sep 2018 14:00:15 -0400 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: Message-ID: 1. (Guessing) perhaps it was considered useful to distinguish inode traffic from log traffic and just lump other metadata together. 2. A 512 byte inode has space for up to 384 bytes of data-in-inode data. (Says tsdbfs ) From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 09/11/2018 11:49 AM Subject: Re: [gpfsug-discuss] RAID type for system pool Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Marc, Understood ? I?m just trying to understand why some I/O?s are flagged as metadata, while others are flagged as inode?!? Since this filesystem uses 512 byte inodes, there is no data content from any files involved (for a metadata only disk), correct? Thanks? Kevin On Sep 11, 2018, at 9:12 AM, Marc A Kaplan wrote: Metadata is anything besides the data contents of your files. Inodes, directories, indirect blocks, allocation maps, log data ... are the biggies. Apparently, --iohist may sometimes distinguish some metadata as "inode", "logData", ... that doesn't mean those aren't metadata also. 
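If you want to estimate how many of your files would end up with data-in-inode after a move to 4K inodes, a quick policy scan along these lines counts the candidates. This is an illustrative, untested sketch: the file system name is a placeholder, and the 3500-byte cutoff is only an approximation of the usable data-in-inode space of a 4K inode.

cat > /tmp/small.pol <<'EOF'
RULE EXTERNAL LIST 'small' EXEC ''
RULE 'countsmall' LIST 'small' WHERE FILE_SIZE <= 3500
EOF
# deferred execution just writes the candidate list instead of calling an external program
mmapplypolicy yourfs -P /tmp/small.pol -I defer -f /tmp/small
# the deferred candidate list lands in a prefix.list.<listname> file under /tmp
wc -l /tmp/small.list.small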
From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 09/10/2018 03:12 PM Subject: [gpfsug-discuss] RAID type for system pool Sent by: gpfsug-discuss-bounces at spectrumscale.org From: gpfsug-discuss-owner at spectrumscale.org Subject: Re: [gpfsug-discuss] RAID type for system pool Date: September 10, 2018 at 11:35:05 AM CDT To: klb at accre.vanderbilt.edu Hi All, So while I?m waiting for the purchase of new hardware to go thru, I?m trying to gather more data about the current workload. One of the things I?m trying to do is get a handle on the ratio of reads versus writes for my metadata. I?m using ?mmdiag ?iohist? ? in this case ?dm-12? is one of my metadataOnly disks and I?m running this on the primary NSD server for that NSD. I?m seeing output like: 11:22:13.931117 W inode 4:299844163 1 0.448 srv dm-12 11:22:13.932344 R metadata 4:36659676 4 0.307 srv dm-12 11:22:13.932005 W logData 4:49676176 1 0.726 srv dm-12 And I?m confused as to the difference between ?inode? and ?metadata? (I at least _think_ I understand ?logData?)?!? The man page for mmdiag doesn?t help and I?ve not found anything useful yet in my Googling. This is on a filesystem that currently uses 512 byte inodes, if that matters. Thanks? Kevin _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C2dbbb1fe9f5a4b80aa6b08d617f0a664%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636722719686369131&sdata=uOPawxUhx4Wvxja5%2FLvJJMpAHj3uRb0Q1eiogmRXGgw%3D&reserved=0 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From ewahl at osc.edu Tue Sep 11 19:48:26 2018 From: ewahl at osc.edu (Ed Wahl) Date: Tue, 11 Sep 2018 14:48:26 -0400 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: Message-ID: <1f36431b-823b-aeae-205d-f213cca7c737@osc.edu> That isn't necessarily true.? The ONLY way to ensure you aren't saving tiny little <~400 byte files is to have encryption enabled with the advanced license.?? There are QUITE a few types of MD that can be saved, just look at the output from a file using mmlsattr -L... Ed On 09/11/2018 11:48 AM, Buterbaugh, Kevin L wrote: > Hi Marc, > > Understood ? I?m just trying to understand why some I/O?s are flagged > as metadata, while others are flagged as inode?!? ?Since this > filesystem uses 512 byte inodes, there is no data content from any > files involved (for a metadata only disk), correct? ?Thanks? > > Kevin > >> On Sep 11, 2018, at 9:12 AM, Marc A Kaplan > > wrote: >> >> Metadata is anything besides the data contents of your files. >> Inodes, directories, indirect blocks, allocation maps, log data ... >> ?are the biggies. >> >> Apparently, --iohist may sometimes distinguish some metadata as >> "inode", "logData", ... ?that doesn't mean those aren't metadata also. 
>> >> >> >> >> From: "Buterbaugh, Kevin L" > > >> To: gpfsug main discussion list > > >> Date: 09/10/2018 03:12 PM >> Subject: [gpfsug-discuss] ?RAID type for system pool >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> >> ------------------------------------------------------------------------ >> >> >> >> >> *From: * _gpfsug-discuss-owner at spectrumscale.org_ >> >> *Subject: Re: [gpfsug-discuss] RAID type for system pool* >> *Date: * September 10, 2018 at 11:35:05 AM CDT >> *To: * _klb at accre.vanderbilt.edu_ >> >> Hi All, >> >> So while I?m waiting for the purchase of new hardware to go thru, I?m >> trying to gather more data about the current workload. ?One of the >> things I?m trying to do is get a handle on the ratio of reads versus >> writes for my metadata. >> >> I?m using ?mmdiag ?iohist? ? in this case ?dm-12? is one of my >> metadataOnly disks and I?m running this on the primary NSD server for >> that NSD. ?I?m seeing output like: >> >> 11:22:13.931117 ?W ? ? ? inode ? ?4:299844163 ? ? ?1 ? ?0.448 ?srv ? >> dm-12 >> 11:22:13.932344 ?R ? ?metadata ? ?4:36659676 ? ? ? 4 ? ?0.307 ?srv ? >> dm-12 >> 11:22:13.932005 ?W ? ? logData ? ?4:49676176 ? ? ? 1 ? ?0.726 ?srv ? >> dm-12 >> >> And I?m confused as to the difference between ?inode? and ?metadata? >> (I at least _think_ I understand ?logData?)?!? ?The man page for >> mmdiag doesn?t help and I?ve not found anything useful yet in my >> Googling. >> >> This is on a filesystem that currently uses 512 byte inodes, if that >> matters. ?Thanks? >> >> Kevin >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C2dbbb1fe9f5a4b80aa6b08d617f0a664%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636722719686369131&sdata=uOPawxUhx4Wvxja5%2FLvJJMpAHj3uRb0Q1eiogmRXGgw%3D&reserved=0 > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From aaron.s.knister at nasa.gov Wed Sep 12 23:23:58 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Wed, 12 Sep 2018 18:23:58 -0400 Subject: [gpfsug-discuss] RAID type for system pool In-Reply-To: References: <92924CCC-FF4A-4F14-BA19-D35BC93CE69F@vanderbilt.edu> <3B1C05ED-3058-4644-BC54-5BD1ED583C88@vanderbilt.edu> <269a9d7aeb3f4597adb22dfb3c2d8365@jumptrading.com> Message-ID: <9d8d93a6-9bbd-fbc0-b8f7-2f651ca64312@nasa.gov> It's a good question, Simon. I don't know the answer. At least, when I started composing this e-mail what, 5 days ago now, I didn't. I did a little test using dd to write directly to the NSD (not in production just to be clear...I've got co-workers on this list ;-) ). 
Here's a partial dump of the inode prior: # /usr/lpp/mmfs/bin/tsdbfs fs1 inode 23808 Inode 23808 [23808] snap 0 (index 1280 in block 11): Inode address: 1:4207872 2:4207872 size 512 nAddrs 25 indirectionLevel=INDIRECT status=USERFILE objectVersion=103 generation=0x6E256E16 nlink=1 owner uid=0 gid=0 mode=0200100644: -rw-r--r-- blocksize code=5 (32 subblocks) lastBlockSubblocks=32 checksum=0xF74A31AA is Valid This is me writing junk to that sector of the NSD: # dd if=/dev/urandom bs=512 of=/dev/sda seek=4207872 count=1 Post-junkifying: # /usr/lpp/mmfs/bin/tsdbfs fs1 sector 1:4207872 Contents of 1 sector(s) from 1:4207872 = 0x1:403500, width1 0000000000000000: 4FA27C86 5D2076BB 6CD011DE D582F7CE *O.|.].v.l.......* 0000000000000010: 60A708F1 A3C60FCD 7D796E3D CC97F586 *`.......}yn=....* 0000000000000020: 57B643A7 FABD7235 A2BD9B75 6DDA0771 *W.C...r5...um..q* 0000000000000030: 6A818411 0D59D1D3 2C4C7F39 2B2B529D *j....Y..,L.9++R.* 0000000000000040: 9AE06C7D A8FB1DC9 7E783DB4 90A9E9E4 *..l}....~x=.....* 0000000000000050: B2D0E9C9 CC7FEBC0 85F23DF8 F18D19C0 *..........=.....* 0000000000000060: DA9C817C D20C0FB2 F30AAF55 C86D4155 *...|.......U.mAU* Dump of the inode post-junkifying: # /usr/lpp/mmfs/bin/tsdbfs fs1 inode 23808 Inode 23808 [23808] snap 0 (index 1280 in block 11): Inode address: 1:4207872 2:4207872 size 512 nAddrs 0 indirectionLevel=13 status=4 objectVersion=5738285791753303739 generation=0x9AE06C7D nlink=3955281023 owner uid=2121809332 gid=-1867912732 mode=025076616711: prws--s--x flags set: exposed illCompressed dataUpdateMissRRPlus metaUpdateMiss blocksize code=8 (256 subblocks) lastBlockSubblocks=15582 checksum=0xD582F7CE is INVALID (computed checksum=0x2A2FA283) Attempts to access the file succeed but I get an fsstruct error: # /usr/lpp/mmfs/samples/debugtools/fsstructlx.awk /var/log/messages 09/12 at 17:38:03 gpfs-adm1 FSSTRUCT fs1 108 FSErrValidate type=inode da=00000001:0000000000403500(1:4207872) sectors=0001 repda=[nVal=2 00000001:0000000000403500(1:4207872) 00000002:0000000000403500(2:4207872)] data=(len=00000200) 4FA27C86 5D2076BB 6CD011DE D582F7CE 60A708F1 A3C60FCD 7D796E3D CC97F586 57B643A7 FABD7235 A2BD9B75 6DDA0771 6A818411 0D59D1D3 2C4C7F39 2B2B529D 9AE06C7D A8FB1DC9 7E783DB4 90A9E9E4 B2D0E9C9 CC7FEBC0 85F23DF8 F18D19C0 DA9C817C D20C0FB2 F30AAF55 It *didn't* automatically repair it, it seems. The restripe did pick it up: # /usr/lpp/mmfs/bin/mmrestripefs fs1 -c --read-only --metadata-only Scanning file system metadata, phase 1 ... Inode 0 [fileset 0, snapshot 0 ] has mismatch in replicated disk address 1:4206592 2:4206592 Scan completed successfully. Scanning file system metadata, phase 2 ... Scan completed successfully. Scanning file system metadata, phase 3 ... Scan completed successfully. Scanning file system metadata, phase 4 ... Scan completed successfully. Scanning user file metadata ... 100.00 % complete on Sun Aug 26 18:10:36 2018 ( 69632 inodes with total 406 MB data processed) Scan completed successfully. I ran this to fix it: # /usr/lpp/mmfs/bin/mmrestripefs fs1 -c --metadata-only And things appear better afterwards: # /usr/lpp/mmfs/bin/tsdbfs fs1 inode 23808 Inode 23808 [23808] snap 0 (index 1280 in block 11): Inode address: 1:4207872 2:4207872 size 512 nAddrs 25 indirectionLevel=INDIRECT status=USERFILE objectVersion=103 generation=0x6E256E16 nlink=1 owner uid=0 gid=0 mode=0200100644: -rw-r--r-- blocksize code=5 (32 subblocks) lastBlockSubblocks=32 checksum=0xF74A31AA is Valid This is with 4.2.3-10. 
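On the background-scrub idea (Bryan's question 6 below), something like this out of cron is roughly what I had in mind. Sketch only: the QoS syntax is from memory for 4.2.x, and the IOPS cap is just a placeholder to tune for your arrays.

# throttle maintenance-class commands so the verify doesn't trample production I/O
mmchqos fs1 --enable pool=system,maintenance=300IOPS,other=unlimited
# periodic read-only metadata verify; reports replica mismatches without changing anything
mmrestripefs fs1 -c --read-only --metadata-only --qos maintenance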
-Aaron On 9/6/18 1:49 PM, Simon Thompson wrote: > I thought reads were always round robin's (in some form) unless you set readreplicapolicy. > > And I thought with fsstruct you had to use mmfsck offline to fix. > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [aaron.s.knister at nasa.gov] > Sent: 06 September 2018 18:06 > To: gpfsug-discuss at spectrumscale.org > Subject: Re: [gpfsug-discuss] RAID type for system pool > > Answers inline based on my recollection of experiences we've had here: > > On 9/6/18 12:19 PM, Bryan Banister wrote: >> I have questions about how the GPFS metadata replication of 3 works. >> >> 1. Is it basically the same as replication of 2 but just have one more >> copy, making recovery much more likely? > > That's my understanding. > >> 2. If there is nothing that is checking that the data was correctly >> read off of the device (e.g. CRC checking ON READS like the DDNs do, >> T10PI or Data Integrity Field) then how does GPFS handle a corrupted >> read of the data? >> - unlikely with SSD but head could be off on a NLSAS read, no >> errors, but you get some garbage instead, plus no auto retries > > The inode itself is checksummed: > > # /usr/lpp/mmfs/bin/tsdbfs mysuperawesomespacefs > Enter command or null to read next sector. Type ? for help. > inode 20087366 > Inode 20087366 [20087366] snap 0 (index 582 in block 9808): > Inode address: 30:263275078 32:263264838 size 512 nAddrs 32 > indirectionLevel=3 status=USERFILE > objectVersion=49352 generation=0x2B519B3 nlink=1 > owner uid=8675309 gid=999 mode=0200100600: -rw------- > blocksize code=5 (32 subblocks) > lastBlockSubblocks=1 > checksum=0xF2EF3427 is Valid > ... > Disk pointers [32]: > 0: 31:217629376 1: 30:217632960 2: (null) ... > 31: (null) > > as are indirect blocks (I'm sure that's not an exhaustive list of > checksummed metadata structures): > > ind 31:217629376 > Indirect block starting in sector 31:217629376: > magic=0x112DF307 generation=0x2B519B3 blockNum=0 inodeNum=20087366 > indirection level=2 > checksum=0x6BDAA92A > CalcChecksum(0x5B6DC9FC000, 32768, 20)=0x6BDAA92A > Data pointers: > >> 3. Does GPFS read at least two of the three replicas and compares them >> to ensure the data is correct? >> - expensive operation, so very unlikely > > I don't know, but I do know it verifies the checksum and I believe if > that's wrong it will try another replica. > >> 4. If not reading multiple replicas for comparison, are reads round >> robin across all three copies? > > I feel like we see pretty even distribution of reads across all replicas > of our metadata LUNs, although this is looking overall at the array > level so it may be a red herring. > >> 5. If one replica is corrupted (bad blocks) what does GPFS do to >> recover this metadata copy? Is this automatic or does this require >> a manual `mmrestripefs -c` operation or something? >> - If not, seems like a pretty simple idea and maybe an RFE worthy >> submission > > My experience has been it will attempt to correct it (and maybe log an > fsstruct error?). This was in the 3.5 days, though. > >> 6. Would the idea of an option to run ?background scrub/verifies? of >> the data/metadata be worthwhile to ensure no hidden bad blocks? >> - Using QoS this should be relatively painless > > If you don't have array-level background scrubbing, this is what I'd > suggest. (e.g. mmrestripefs -c --metadata-only). > >> 7. 
With a drive failure do you have to delete the NSD from the file >> system and cluster, recreate the NSD, add it back to the FS, then >> again run the `mmrestripefs -c` operation to restore the replication? >> - As Kevin mentions this will end up being a FULL file system scan >> vs. a block-based scan and replication. That could take a long time >> depending on number of inodes and type of storage! >> >> Thanks for any insight, >> >> -Bryan >> >> *From:* gpfsug-discuss-bounces at spectrumscale.org >> *On Behalf Of *Buterbaugh, >> Kevin L >> *Sent:* Thursday, September 6, 2018 9:59 AM >> *To:* gpfsug main discussion list >> *Subject:* Re: [gpfsug-discuss] RAID type for system pool >> >> /Note: External Email/ >> >> ------------------------------------------------------------------------ >> >> Hi All, >> >> Wow - my query got more responses than I expected and my sincere thanks >> to all who took the time to respond! >> >> At this point in time we do have two GPFS filesystems ? one which is >> basically ?/home? and some software installations and the other which is >> ?/scratch? and ?/data? (former backed up, latter not). Both of them >> have their metadata on SSDs set up as RAID 1 mirrors and replication set >> to two. But at this point in time all of the SSDs are in a single >> storage array (albeit with dual redundant controllers) ? so the storage >> array itself is my only SPOF. >> >> As part of the hardware purchase we are in the process of making we will >> be buying a 2nd storage array that can house 2.5? SSDs. Therefore, we >> will be splitting our SSDs between chassis and eliminating that last >> SPOF. Of course, this includes the new SSDs we are getting for our new >> /home filesystem. >> >> Our plan right now is to buy 10 SSDs, which will allow us to test 3 >> configurations: >> >> 1) two 4+1P RAID 5 LUNs split up into a total of 8 LV?s (with each of my >> 8 NSD servers as primary for one of those LV?s and the other 7 as >> backups) and GPFS metadata replication set to 2. >> >> 2) four RAID 1 mirrors (which obviously leaves 2 SSDs unused) and GPFS >> metadata replication set to 2. This would mean that only 4 of my 8 NSD >> servers would be a primary. >> >> 3) nine RAID 0 / bare drives with GPFS metadata replication set to 3 >> (which leaves 1 SSD unused). All 8 NSD servers primary for one SSD and >> 1 serving up two. >> >> The responses I received concerning RAID 5 and performance were not a >> surprise to me. The main advantage that option gives is the most usable >> storage space for the money (in fact, it gives us way more storage space >> than we currently need) ? but if it tanks performance, then that?s a >> deal breaker. >> >> Personally, I like the four RAID 1 mirrors config like we?ve been using >> for years, but it has the disadvantage of giving us the least usable >> storage space ? that config would give us the minimum we need for right >> now, but doesn?t really allow for much future growth. >> >> I have no experience with metadata replication of 3 (but had actually >> thought of that option, so feel good that others suggested it) so option >> 3 will be a brand new experience for us. It is the most optimal in >> terms of meeting current needs plus allowing for future growth without >> giving us way more space than we are likely to need). I will be curious >> to see how long it takes GPFS to re-replicate the data when we simulate >> a drive failure as opposed to how long a RAID rebuild takes. 
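For readers who want to see what option 3 above (bare SSDs with GPFS metadata replication set to 3) translates to on the command line, here is a rough sketch. The NSD names, device paths, server names, failure groups, and stanza file names are placeholders invented for illustration; only the stanza layout and the -m/-M replication flags are the point.

# Sketch: one metadata-only NSD per bare SSD, each in its own failure group
# so the three metadata copies land on different drives/servers (names made up).
cat > meta-nsds.stanza <<'EOF'
%nsd: nsd=md_ssd01 device=/dev/sdb servers=nsd01 usage=metadataOnly failureGroup=101 pool=system
%nsd: nsd=md_ssd02 device=/dev/sdb servers=nsd02 usage=metadataOnly failureGroup=102 pool=system
%nsd: nsd=md_ssd03 device=/dev/sdb servers=nsd03 usage=metadataOnly failureGroup=103 pool=system
EOF
/usr/lpp/mmfs/bin/mmcrnsd -F meta-nsds.stanza

# all-nsds.stanza stands in for the full NSD list (metadata plus data NSDs).
# -m/-M set default/maximum metadata replicas, -r/-R default/maximum data replicas.
/usr/lpp/mmfs/bin/mmcrfs home -F all-nsds.stanza -m 3 -M 3 -r 2 -R 2 -A yes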
>> >> I am a big believer in Murphy?s Law (Sunday I paid off a bill, Wednesday >> my refrigerator died!) ? and also believe that the definition of a >> pessimist is ?someone with experience? ? so we will definitely >> not set GPFS metadata replication to less than two, nor will we use >> non-Enterprise class SSDs for metadata ? but I do still appreciate the >> suggestions. >> >> If there is interest, I will report back on our findings. If anyone has >> any additional thoughts or suggestions, I?d also appreciate hearing >> them. Again, thank you! >> >> Kevin >> >> ? >> >> Kevin Buterbaugh - Senior System Administrator >> >> Vanderbilt University - Advanced Computing Center for Research and Education >> >> Kevin.Buterbaugh at vanderbilt.edu >> ?- (615)875-9633 >> >> >> ------------------------------------------------------------------------ >> >> Note: This email is for the confidential use of the named addressee(s) >> only and may contain proprietary, confidential, or privileged >> information and/or personal data. If you are not the intended recipient, >> you are hereby notified that any review, dissemination, or copying of >> this email is strictly prohibited, and requested to notify the sender >> immediately and destroy this email and any attachments. Email >> transmission cannot be guaranteed to be secure or error-free. The >> Company, therefore, does not make any guarantees as to the completeness >> or accuracy of this email or any attachments. This email is for >> informational purposes only and does not constitute a recommendation, >> offer, request, or solicitation of any kind to buy, sell, subscribe, >> redeem, or perform any type of transaction of a financial product. >> Personal data, as defined by applicable data privacy laws, contained in >> this email may be processed by the Company, and any of its affiliated or >> related companies, for potential ongoing compliance and/or >> business-related purposes. You may have rights regarding your personal >> data; for information on exercising these rights or the Company?s >> treatment of personal data, please email datarequests at jumptrading.com. >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From bbanister at jumptrading.com Fri Sep 14 00:33:08 2018 From: bbanister at jumptrading.com (Bryan Banister) Date: Thu, 13 Sep 2018 23:33:08 +0000 Subject: [gpfsug-discuss] replicating ACLs across GPFS's? Message-ID: I'm checking in on this thread. Is this patch still working for people with the latest rsync releases? https://github.com/gpfsug/gpfsug-tools/tree/master/bin/rsync Thanks! -Bryan ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential, or privileged information and/or personal data. 
If you are not the intended recipient, you are hereby notified that any review, dissemination, or copying of this email is strictly prohibited, and requested to notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request, or solicitation of any kind to buy, sell, subscribe, redeem, or perform any type of transaction of a financial product. Personal data, as defined by applicable data privacy laws, contained in this email may be processed by the Company, and any of its affiliated or related companies, for potential ongoing compliance and/or business-related purposes. You may have rights regarding your personal data; for information on exercising these rights or the Company's treatment of personal data, please email datarequests at jumptrading.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Fri Sep 14 09:41:38 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Fri, 14 Sep 2018 08:41:38 +0000 Subject: [gpfsug-discuss] replicating ACLs across GPFS's? Message-ID: Last time I built was still against 3.0.9, note there is also a PR in there which fixes the bug with symlinks. If anyone wants to rebase the patches against 3.1.3, I?ll happily take a PR ? Simon From: on behalf of "bbanister at jumptrading.com" Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Friday, 14 September 2018 at 00:33 To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] replicating ACLs across GPFS's? I?m checking in on this thread. Is this patch still working for people with the latest rsync releases? https://github.com/gpfsug/gpfsug-tools/tree/master/bin/rsync Thanks! -Bryan ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential, or privileged information and/or personal data. If you are not the intended recipient, you are hereby notified that any review, dissemination, or copying of this email is strictly prohibited, and requested to notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request, or solicitation of any kind to buy, sell, subscribe, redeem, or perform any type of transaction of a financial product. Personal data, as defined by applicable data privacy laws, contained in this email may be processed by the Company, and any of its affiliated or related companies, for potential ongoing compliance and/or business-related purposes. You may have rights regarding your personal data; for information on exercising these rights or the Company?s treatment of personal data, please email datarequests at jumptrading.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: From kevindjo at us.ibm.com Fri Sep 14 14:24:04 2018 From: kevindjo at us.ibm.com (Kevin D Johnson) Date: Fri, 14 Sep 2018 13:24:04 +0000 Subject: [gpfsug-discuss] replicating ACLs across GPFS's? 
In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Fri Sep 14 14:36:38 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Fri, 14 Sep 2018 13:36:38 +0000 Subject: [gpfsug-discuss] replicating ACLs across GPFS's? In-Reply-To: References: Message-ID: <14417376-CF84-45F8-9461-DBE1A86777D8@bham.ac.uk> Oh I also heard a rumour of some sort of mmcopy type sample script, but I can?t see it in samples on 5.0.1-2? Simon From: on behalf of Simon Thompson Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Friday, 14 September 2018 at 09:41 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] replicating ACLs across GPFS's? Last time I built was still against 3.0.9, note there is also a PR in there which fixes the bug with symlinks. If anyone wants to rebase the patches against 3.1.3, I?ll happily take a PR ? Simon From: on behalf of "bbanister at jumptrading.com" Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Friday, 14 September 2018 at 00:33 To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] replicating ACLs across GPFS's? I?m checking in on this thread. Is this patch still working for people with the latest rsync releases? https://github.com/gpfsug/gpfsug-tools/tree/master/bin/rsync Thanks! -Bryan ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential, or privileged information and/or personal data. If you are not the intended recipient, you are hereby notified that any review, dissemination, or copying of this email is strictly prohibited, and requested to notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request, or solicitation of any kind to buy, sell, subscribe, redeem, or perform any type of transaction of a financial product. Personal data, as defined by applicable data privacy laws, contained in this email may be processed by the Company, and any of its affiliated or related companies, for potential ongoing compliance and/or business-related purposes. You may have rights regarding your personal data; for information on exercising these rights or the Company?s treatment of personal data, please email datarequests at jumptrading.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Fri Sep 14 20:36:57 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 14 Sep 2018 19:36:57 +0000 Subject: [gpfsug-discuss] FYI - Spectrum Scale 5.0.2 Available Message-ID: <2F377221-7DF6-4C71-892B-7FE8B5D4AFE4@nuance.com> I did not see this on Fix Central yet (I?m assuming you can upgrade from 5.0.1,X) , but it is on Passport Advantage. Looks like a nice set of changes, which is summarized here: https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.2/com.ibm.spectrum.scale.v5r02.doc/bl1xx_soc.htm Here are few highlights I think are worth noting (not all of them) ? I beta tested watch folders and It?s a great addition to Scale. 
File system core improvements * Combined gpfs.base and gpfs.ext into a single package on Linux * File system maintenance mode provides a safe access window for file system maintenance * The maxActiveIallocSegs attribute improves the performance of deletes and unlinks * The mmnetverify command checks the connectivity of remote clusters * The GPFS portability layer (GPL) can be rebuilt automatically * You can now configure a cluster to automatically rebuild the GPL whenever a new level of the Linux kernel is installed or whenever a new level of IBM Spectrum Scale is installed. For more information, see the description of the autoBuildGPL attribute in the topic mmchconfig command. * Two features cope with long I/O waits on directly attached disks * The diskIOHang callback event allows you to add notification and data collection scripts to analyze the cause of a local I/O request that has been pending in the node kernel for more than 5 minutes. For more information, see mmaddcallback command. * The panicOnIOHang attribute controls whether the GPFS daemon panics the node kernel when a local I/O request has been pending in the kernel for more than five minutes. For more information, see mmchconfig command. Watch folder Watch folder is a flexible API that allows programmatic actions to be taken based on file system events. For more information, see Introduction to watch folder. It has the following features: * Watch folder can be run against folders, filesets, and inode spaces. * Watch folder is modeled after Linux inotify, but works with clustered file systems and supports recursive watches for filesets and inode spaces. * Watch folder has two primary components: o The GPFS programming interfaces, which are included within . For more information, see Watch folder API. o The mmwatch command, which provides information for all of the watches running within a cluster. For more information, see mmwatch command. * A watch folder application uses the API to run on a node within an IBM Spectrum Scale cluster. o It utilizes the message queue to receive events from multiple nodes and consume from the node that is running the application. o Lightweight events come in from all eligible nodes within a cluster and from accessing clusters. * Watch folder is integrated into call home, IBM Spectrum Scale snap log collection, and IBM Spectrum Scale trace. Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 -------------- next part -------------- An HTML attachment was scrubbed... URL: From mutantllama at gmail.com Mon Sep 17 05:04:03 2018 From: mutantllama at gmail.com (Carl) Date: Mon, 17 Sep 2018 14:04:03 +1000 Subject: [gpfsug-discuss] SC18 UG meeting date Message-ID: Hi folks, Im just starting the process of obtaining the various travel approvals that I need to attend SC this year. Does anyone have the dates/times for the usergroup meeting? Thanks, Carl. -------------- next part -------------- An HTML attachment was scrubbed... URL: From kkr at lbl.gov Mon Sep 17 07:31:59 2018 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Sun, 16 Sep 2018 23:31:59 -0700 Subject: [gpfsug-discuss] SC18 UG meeting date In-Reply-To: References: Message-ID: Date will be Sunday the 11th. Will check on time. Kristy Sent from my iPhone > On Sep 16, 2018, at 9:04 PM, Carl wrote: > > Hi folks, > > Im just starting the process of obtaining the various travel approvals that I need to attend SC this year. > > Does anyone have the dates/times for the usergroup meeting? > > Thanks, > > Carl. 
> > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Mon Sep 17 11:10:16 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Mon, 17 Sep 2018 10:10:16 +0000 Subject: [gpfsug-discuss] FYI - Spectrum Scale 5.0.2 Available Message-ID: <17CD664C-6034-4D39-B612-3FE862C45051@bham.ac.uk> Looks like its also moved to a unified installer ? i.e. no separate download for protocols anymore. Simon From: on behalf of "Robert.Oesterlin at nuance.com" Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Friday, 14 September 2018 at 20:37 To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] FYI - Spectrum Scale 5.0.2 Available I did not see this on Fix Central yet (I?m assuming you can upgrade from 5.0.1,X) , but it is on Passport Advantage. Looks like a nice set of changes, which is summarized here: https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.2/com.ibm.spectrum.scale.v5r02.doc/bl1xx_soc.htm Here are few highlights I think are worth noting (not all of them) ? I beta tested watch folders and It?s a great addition to Scale. File system core improvements * Combined gpfs.base and gpfs.ext into a single package on Linux * File system maintenance mode provides a safe access window for file system maintenance * The maxActiveIallocSegs attribute improves the performance of deletes and unlinks * The mmnetverify command checks the connectivity of remote clusters * The GPFS portability layer (GPL) can be rebuilt automatically * You can now configure a cluster to automatically rebuild the GPL whenever a new level of the Linux kernel is installed or whenever a new level of IBM Spectrum Scale is installed. For more information, see the description of the autoBuildGPL attribute in the topic mmchconfig command. * Two features cope with long I/O waits on directly attached disks * The diskIOHang callback event allows you to add notification and data collection scripts to analyze the cause of a local I/O request that has been pending in the node kernel for more than 5 minutes. For more information, see mmaddcallback command. * The panicOnIOHang attribute controls whether the GPFS daemon panics the node kernel when a local I/O request has been pending in the kernel for more than five minutes. For more information, see mmchconfig command. Watch folder Watch folder is a flexible API that allows programmatic actions to be taken based on file system events. For more information, see Introduction to watch folder. It has the following features: * Watch folder can be run against folders, filesets, and inode spaces. * Watch folder is modeled after Linux inotify, but works with clustered file systems and supports recursive watches for filesets and inode spaces. * Watch folder has two primary components: o The GPFS programming interfaces, which are included within . For more information, see Watch folder API. o The mmwatch command, which provides information for all of the watches running within a cluster. For more information, see mmwatch command. * A watch folder application uses the API to run on a node within an IBM Spectrum Scale cluster. o It utilizes the message queue to receive events from multiple nodes and consume from the node that is running the application. o Lightweight events come in from all eligible nodes within a cluster and from accessing clusters. 
* Watch folder is integrated into call home, IBM Spectrum Scale snap log collection, and IBM Spectrum Scale trace.

Bob Oesterlin
Sr Principal Storage Engineer, Nuance
507-269-0413

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From Robert.Oesterlin at nuance.com  Mon Sep 17 13:02:49 2018
From: Robert.Oesterlin at nuance.com (Oesterlin, Robert)
Date: Mon, 17 Sep 2018 12:02:49 +0000
Subject: [gpfsug-discuss] SC18 UG meeting date
Message-ID: <21EBAFDD-A1A5-454D-B103-8745EEAF9CF1@nuance.com>

As Kristy stated, it's Sunday November 11th. Approximate times are 12PM-5PM (we're still working on the schedule details).

Bob Oesterlin
Sr Principal Storage Engineer, Nuance
507-269-0413

From: on behalf of Carl
Reply-To: gpfsug main discussion list
Date: Sunday, September 16, 2018 at 11:04 PM
To: gpfsug main discussion list
Subject: [EXTERNAL] [gpfsug-discuss] SC18 UG meeting date

Hi folks,

I'm just starting the process of obtaining the various travel approvals that I need to attend SC this year.

Does anyone have the dates/times for the usergroup meeting?

Thanks,

Carl.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From carlz at us.ibm.com  Mon Sep 17 16:46:56 2018
From: carlz at us.ibm.com (Carl Zetie)
Date: Mon, 17 Sep 2018 15:46:56 +0000
Subject: [gpfsug-discuss] FYI - Spectrum Scale 5.0.2 Available
Message-ID: 

ST> Looks like it's also moved to a unified installer - i.e. no separate download for protocols anymore.

Correct. And to be absolutely clear, you still have the choice whether or not to install protocols. We just simplified it all to a common download image. (We did a bit of surveying beforehand, including the UG, and responses ranged from disinterest to strong enthusiasm.)

regards,

Carl Zetie
Program Director
Offering Management for Spectrum Scale, IBM
----
(540) 882 9353 ][ Research Triangle Park
carlz at us.ibm.com

From xhejtman at ics.muni.cz  Mon Sep 17 16:54:26 2018
From: xhejtman at ics.muni.cz (Lukas Hejtmanek)
Date: Mon, 17 Sep 2018 17:54:26 +0200
Subject: [gpfsug-discuss] mmfsd and oom settings
Message-ID: <20180917155426.fi54lkmduizegpow@ics.muni.cz>

Hello,

I accidentally had mmfsd killed by the OOM killer. The pagepool size is normally fine, but a memory leak in an smbd process caused the system to run out of memory (a 64 GB pagepool plus two 32 GB smbd processes).

Shouldn't the GPFS startup script set oom_score_adj to a sensible value so that mmfsd never gets killed? Losing mmfsd basically ruins everything.

-- 
Lukáš Hejtmánek

Linux Administrator only because
  Full Time Multitasking Ninja
  is not an official job title

From Robert.Oesterlin at nuance.com  Mon Sep 17 17:59:35 2018
From: Robert.Oesterlin at nuance.com (Oesterlin, Robert)
Date: Mon, 17 Sep 2018 16:59:35 +0000
Subject: [gpfsug-discuss] Scale 5.0.2.0 - Now available on Fix Central
Message-ID: <576A5F3D-E867-48C0-B95F-AF5ECB68BE99@nuance.com>

A follow-up to my previous posting - the update images are now available on Fix Central - as pointed out by Simon and Carl, packages now reflect a unified install.

Bob Oesterlin
Sr Principal Storage Engineer, Nuance

-------------- next part --------------
An HTML attachment was scrubbed...
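One way to address what Lukas describes is to pin mmfsd's OOM score from outside of GPFS itself. The sketch below shows two variants for a systemd host; the gpfs.service unit name and the -1000 value are assumptions for illustration rather than an IBM-documented setting.

# Variant 1: systemd drop-in so the daemon always starts with a protected OOM score
# (assumes the GPFS daemon is managed by a unit called gpfs.service on this distro)
mkdir -p /etc/systemd/system/gpfs.service.d
cat > /etc/systemd/system/gpfs.service.d/oom.conf <<'EOF'
[Service]
# -1000 makes the process effectively exempt from the OOM killer
OOMScoreAdjust=-1000
EOF
systemctl daemon-reload

# Variant 2: adjust the running daemon directly; this has to be reapplied after
# every mmfsd restart, e.g. from a startup callback or rc.local
for pid in $(pidof mmfsd); do
    echo -1000 > /proc/$pid/oom_score_adj
done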
URL: From valleru at cbio.mskcc.org Tue Sep 18 18:13:44 2018 From: valleru at cbio.mskcc.org (valleru at cbio.mskcc.org) Date: Tue, 18 Sep 2018 13:13:44 -0400 Subject: [gpfsug-discuss] GPFS, Pagepool and Block size -> Perfomance reduces with larger block size In-Reply-To: References: Message-ID: <6bb509b7-b7c5-422d-8e27-599333b6b7c4@Spark> Hello All, This is a continuation to the previous discussion that i had with Sven. However against what i had mentioned previously - i realize that this is ?not? related to mmap, and i see it when doing random freads. I see that block-size of the filesystem matters when reading from Page pool. I see a major difference in performance when compared 1M to 16M, when doing lot of random small freads with all of the data in pagepool. Performance for 1M is a magnitude ?more? than the performance that i see for 16M. The GPFS that we have currently is : Version :?5.0.1-0.5 Filesystem version:?19.01 (5.0.1.0) Block-size : 16M I had made the filesystem block-size to be 16M, thinking that i would get the most performance for both random/sequential reads from 16M than the smaller block-sizes. With GPFS 5.0, i made use the 1024 sub-blocks instead of 32 and thus not loose lot of storage space even with 16M. I had run few benchmarks and i did see that 16M was performing better ?when hitting storage/disks? with respect to bandwidth for random/sequential on small/large reads. However, with this particular workload - where it freads a chunk of data randomly from hundreds of files -> I see that the number of page-faults increase with block-size and actually reduce the performance. 1M performs a lot better than 16M, and may be i will get better performance with less than 1M. It gives the best performance when reading from local disk, with 4K block size filesystem. What i mean by performance when it comes to this workload - is not the bandwidth but the amount of time that it takes to do each iteration/read batch of data. I figure what is happening is: fread is trying to read a full block size of 16M - which is good in a way, when it hits the hard disk. But the application could be using just a small part of that 16M. Thus when randomly reading(freads) lot of data of 16M chunk size - it is page faulting a lot more and causing the performance to drop . I could try to make the application do read instead of freads, but i fear that could be bad too since it might be hitting the disk with a very small block size and that is not good. With the way i see things now - I believe it could be best if the application does random reads of 4k/1M from pagepool but some how does 16M from rotating disks. I don?t see any way of doing the above other than following a different approach where i create a filesystem with a smaller block size ( 1M or less than 1M ), on SSDs as a tier. May i please ask for advise, if what i am understanding/seeing is right and the best solution possible for the above scenario. Regards, Lohit On Apr 11, 2018, 10:36 AM -0400, Lohit Valleru , wrote: > Hey Sven, > > This is regarding mmap issues and GPFS. > We had discussed previously of experimenting with GPFS 5. > > I now have upgraded all of compute nodes and NSD nodes to GPFS 5.0.0.2 > > I am yet to experiment with mmap performance, but before that - I am seeing weird hangs with GPFS 5 and I think it could be related to mmap. > > Have you seen GPFS ever hang on this syscall? 
> [Tue Apr 10 04:20:13 2018] [] _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26] > > I see the above ,when kernel hangs and throws out a series of trace calls. > > I somehow think the above trace is related to processes hanging on GPFS forever. There are no errors in GPFS however. > > Also, I think the above happens only when the mmap threads go above a particular number. > > We had faced a similar issue in 4.2.3 and it was resolved in a patch to 4.2.3.2 . At that time , the issue happened when mmap threads go more than worker1threads. According to the ticket - it was a mmap race condition that GPFS was not handling well. > > I am not sure if this issue is a repeat and I am yet to isolate the incident and test with increasing number of mmap threads. > > I am not 100 percent sure if this is related to mmap yet but just wanted to ask you if you have seen anything like above. > > Thanks, > > Lohit > > On Feb 22, 2018, 3:59 PM -0500, Sven Oehme , wrote: > > Hi Lohit, > > > > i am working with ray on a mmap performance improvement right now, which most likely has the same root cause as yours , see -->??http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html > > the thread above is silent after a couple of back and rorth, but ray and i have active communication in the background and will repost as soon as there is something new to share. > > i am happy to look at this issue after we finish with ray's workload if there is something missing, but first let's finish his, get you try the same fix and see if there is something missing. > > > > btw. if people would share their use of MMAP , what applications they use (home grown, just use lmdb which uses mmap under the cover, etc) please let me know so i get a better picture on how wide the usage is with GPFS. i know a lot of the ML/DL workloads are using it, but i would like to know what else is out there i might not think about. feel free to drop me a personal note, i might not reply to it right away, but eventually. > > > > thx. sven > > > > > > > On Thu, Feb 22, 2018 at 12:33 PM wrote: > > > > Hi all, > > > > > > > > I wanted to know, how does mmap interact with GPFS pagepool with respect to filesystem block-size? > > > > Does the efficiency depend on the mmap read size and the block-size of the filesystem even if all the data is cached in pagepool? > > > > > > > > GPFS 4.2.3.2 and CentOS7. > > > > > > > > Here is what i observed: > > > > > > > > I was testing a user script that uses mmap to read from 100M to 500MB files. > > > > > > > > The above files are stored on 3 different filesystems. > > > > > > > > Compute nodes - 10G pagepool and 5G seqdiscardthreshold. > > > > > > > > 1. 4M block size GPFS filesystem, with separate metadata and data. Data on Near line and metadata on SSDs > > > > 2. 1M block size GPFS filesystem as a AFM cache cluster, "with all the required files fully cached" from the above GPFS cluster as home. Data and Metadata together on SSDs > > > > 3. 16M block size GPFS filesystem, with separate metadata and data. Data on Near line and metadata on SSDs > > > > > > > > When i run the script first time for ?each" filesystem: > > > > I see that GPFS reads from the files, and caches into the pagepool as it reads, from mmdiag -- iohist > > > > > > > > When i run the second time, i see that there are no IO requests from the compute node to GPFS NSD servers, which is expected since all the data from the 3 filesystems is cached. 
> > > >
> > > > However - the time taken for the script to run for the files in the 3
> > > > different filesystems is different - although i know that they are just
> > > > "mmapping"/reading from pagepool/cache and not from disk.
> > > >
> > > > Here is the difference in time, for IO just from pagepool:
> > > >
> > > > 20s 4M block size
> > > > 15s 1M block size
> > > > 40s 16M block size.
> > > >
> > > > Why do i see a difference when trying to mmap reads from different block-size filesystems, although i see that the IO requests are not hitting disks and just the pagepool?
> > > >
> > > > I am willing to share the strace output and mmdiag outputs if needed.
> > > >
> > > > Thanks,
> > > > Lohit
> > > >
> > > > _______________________________________________
> > > > gpfsug-discuss mailing list
> > > > gpfsug-discuss at spectrumscale.org
> > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> > _______________________________________________
> > gpfsug-discuss mailing list
> > gpfsug-discuss at spectrumscale.org
> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From oehmes at gmail.com  Tue Sep 18 22:23:09 2018
From: oehmes at gmail.com (Sven Oehme)
Date: Tue, 18 Sep 2018 14:23:09 -0700
Subject: [gpfsug-discuss] GPFS, Pagepool and Block size -> Perfomance reduces with larger block size
In-Reply-To: <6bb509b7-b7c5-422d-8e27-599333b6b7c4@Spark>
References: <6bb509b7-b7c5-422d-8e27-599333b6b7c4@Spark>
Message-ID: 

Hi,

Taking a trace would tell for sure, but I suspect you might be hitting one or even multiple issues that have similar negative performance impacts but different root causes.

1. This could be serialization around buffer locks. The larger your blocksize gets, the larger the amount of data each pagepool buffer maintains. If there is a lot of concurrency on a smaller amount of data, more threads potentially compete for the same buffer lock to copy data in and out of a particular buffer, so things go slower compared to the same amount of data spread across more buffers, each of smaller size.

2. Your data set is small-ish, say a couple of times bigger than the pagepool, and you access it randomly with multiple threads. Because it doesn't fit into the cache it will be read from the backend. If multiple threads hit the same 16 MB block at once with multiple 4k random reads, the whole 16 MB block is read in because GPFS expects to benefit from it later out of cache, but because the access is fully random the same happens with the next block and the next and so on, and before you get back to this block it has been pushed out of the cache for lack of enough pagepool.

I could think of multiple other scenarios, which is why it's so hard to accurately benchmark an application: you design a benchmark to test an application, but the application almost always behaves differently than you think it does :-)

So the best approach is to run the real application and see under which configuration it works best.

You could also take a trace with trace=io and then look at

TRACE_VNOP: READ:
TRACE_VNOP: WRITE:

and compare them to

TRACE_IO: QIO: read
TRACE_IO: QIO: write

and see if the numbers summed up for both are somewhat equal.
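As a rough way to do that comparison, the following could be run against a formatted trace report; the report file name is a placeholder, and simply counting matching records (rather than summing their sizes) is my simplification.

# Sketch: compare application-level reads/writes (VNOP) with back-end queued
# I/Os (QIO) in a formatted report from a trace=io capture.
RPT=trcrpt.out   # placeholder name for the formatted trace report

vnop_reads=$(grep -c 'TRACE_VNOP: READ:'   "$RPT")
qio_reads=$(grep -c 'TRACE_IO: QIO: read'  "$RPT")
vnop_writes=$(grep -c 'TRACE_VNOP: WRITE:' "$RPT")
qio_writes=$(grep -c 'TRACE_IO: QIO: write' "$RPT")

echo "reads:  VNOP=$vnop_reads  QIO=$qio_reads"
echo "writes: VNOP=$vnop_writes  QIO=$qio_writes"
# QIO far above VNOP suggests the node does much more back-end I/O than the
# application asked for, i.e. prefetch/read amplification.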
if TRACE_VNOP is significant smaller than TRACE_IO you most likely do more i/o than you should and turning prefetching off might actually make things faster . keep in mind i am no longer working for IBM so all i say might be obsolete by now, i no longer have access to the one and only truth aka the source code ... but if i am wrong i am sure somebody will point this out soon ;-) sven On Tue, Sep 18, 2018 at 10:31 AM wrote: > Hello All, > > This is a continuation to the previous discussion that i had with Sven. > However against what i had mentioned previously - i realize that this is > ?not? related to mmap, and i see it when doing random freads. > > I see that block-size of the filesystem matters when reading from Page > pool. > I see a major difference in performance when compared 1M to 16M, when > doing lot of random small freads with all of the data in pagepool. > > Performance for 1M is a magnitude ?more? than the performance that i see > for 16M. > > The GPFS that we have currently is : > Version : 5.0.1-0.5 > Filesystem version: 19.01 (5.0.1.0) > Block-size : 16M > > I had made the filesystem block-size to be 16M, thinking that i would get > the most performance for both random/sequential reads from 16M than the > smaller block-sizes. > With GPFS 5.0, i made use the 1024 sub-blocks instead of 32 and thus not > loose lot of storage space even with 16M. > I had run few benchmarks and i did see that 16M was performing better > ?when hitting storage/disks? with respect to bandwidth for > random/sequential on small/large reads. > > However, with this particular workload - where it freads a chunk of data > randomly from hundreds of files -> I see that the number of page-faults > increase with block-size and actually reduce the performance. > 1M performs a lot better than 16M, and may be i will get better > performance with less than 1M. > It gives the best performance when reading from local disk, with 4K block > size filesystem. > > What i mean by performance when it comes to this workload - is not the > bandwidth but the amount of time that it takes to do each iteration/read > batch of data. > > I figure what is happening is: > fread is trying to read a full block size of 16M - which is good in a way, > when it hits the hard disk. > But the application could be using just a small part of that 16M. Thus > when randomly reading(freads) lot of data of 16M chunk size - it is page > faulting a lot more and causing the performance to drop . > I could try to make the application do read instead of freads, but i fear > that could be bad too since it might be hitting the disk with a very small > block size and that is not good. > > With the way i see things now - > I believe it could be best if the application does random reads of 4k/1M > from pagepool but some how does 16M from rotating disks. > > I don?t see any way of doing the above other than following a different > approach where i create a filesystem with a smaller block size ( 1M or less > than 1M ), on SSDs as a tier. > > May i please ask for advise, if what i am understanding/seeing is right > and the best solution possible for the above scenario. > > Regards, > Lohit > > On Apr 11, 2018, 10:36 AM -0400, Lohit Valleru , > wrote: > > Hey Sven, > > This is regarding mmap issues and GPFS. > We had discussed previously of experimenting with GPFS 5. 
> > I now have upgraded all of compute nodes and NSD nodes to GPFS 5.0.0.2 > > I am yet to experiment with mmap performance, but before that - I am > seeing weird hangs with GPFS 5 and I think it could be related to mmap. > > Have you seen GPFS ever hang on this syscall? > [Tue Apr 10 04:20:13 2018] [] > _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26] > > I see the above ,when kernel hangs and throws out a series of trace calls. > > I somehow think the above trace is related to processes hanging on GPFS > forever. There are no errors in GPFS however. > > Also, I think the above happens only when the mmap threads go above a > particular number. > > We had faced a similar issue in 4.2.3 and it was resolved in a patch to > 4.2.3.2 . At that time , the issue happened when mmap threads go more than > worker1threads. According to the ticket - it was a mmap race condition that > GPFS was not handling well. > > I am not sure if this issue is a repeat and I am yet to isolate the > incident and test with increasing number of mmap threads. > > I am not 100 percent sure if this is related to mmap yet but just wanted > to ask you if you have seen anything like above. > > Thanks, > > Lohit > > On Feb 22, 2018, 3:59 PM -0500, Sven Oehme , wrote: > > Hi Lohit, > > i am working with ray on a mmap performance improvement right now, which > most likely has the same root cause as yours , see --> > http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html > the thread above is silent after a couple of back and rorth, but ray and i > have active communication in the background and will repost as soon as > there is something new to share. > i am happy to look at this issue after we finish with ray's workload if > there is something missing, but first let's finish his, get you try the > same fix and see if there is something missing. > > btw. if people would share their use of MMAP , what applications they use > (home grown, just use lmdb which uses mmap under the cover, etc) please let > me know so i get a better picture on how wide the usage is with GPFS. i > know a lot of the ML/DL workloads are using it, but i would like to know > what else is out there i might not think about. feel free to drop me a > personal note, i might not reply to it right away, but eventually. > > thx. sven > > > On Thu, Feb 22, 2018 at 12:33 PM wrote: > >> Hi all, >> >> I wanted to know, how does mmap interact with GPFS pagepool with respect >> to filesystem block-size? >> Does the efficiency depend on the mmap read size and the block-size of >> the filesystem even if all the data is cached in pagepool? >> >> GPFS 4.2.3.2 and CentOS7. >> >> Here is what i observed: >> >> I was testing a user script that uses mmap to read from 100M to 500MB >> files. >> >> The above files are stored on 3 different filesystems. >> >> Compute nodes - 10G pagepool and 5G seqdiscardthreshold. >> >> 1. 4M block size GPFS filesystem, with separate metadata and data. Data >> on Near line and metadata on SSDs >> 2. 1M block size GPFS filesystem as a AFM cache cluster, "with all the >> required files fully cached" from the above GPFS cluster as home. Data and >> Metadata together on SSDs >> 3. 16M block size GPFS filesystem, with separate metadata and data. 
Data >> on Near line and metadata on SSDs >> >> When i run the script first time for ?each" filesystem: >> I see that GPFS reads from the files, and caches into the pagepool as it >> reads, from mmdiag -- iohist >> >> When i run the second time, i see that there are no IO requests from the >> compute node to GPFS NSD servers, which is expected since all the data from >> the 3 filesystems is cached. >> >> However - the time taken for the script to run for the files in the 3 >> different filesystems is different - although i know that they are just >> "mmapping"/reading from pagepool/cache and not from disk. >> >> Here is the difference in time, for IO just from pagepool: >> >> 20s 4M block size >> 15s 1M block size >> 40S 16M block size. >> >> Why do i see a difference when trying to mmap reads from different >> block-size filesystems, although i see that the IO requests are not hitting >> disks and just the pagepool? >> >> I am willing to share the strace output and mmdiag outputs if needed. >> >> Thanks, >> Lohit >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From valleru at cbio.mskcc.org Wed Sep 19 19:05:24 2018 From: valleru at cbio.mskcc.org (valleru at cbio.mskcc.org) Date: Wed, 19 Sep 2018 14:05:24 -0400 Subject: [gpfsug-discuss] GPFS, Pagepool and Block size -> Perfomance reduces with larger block size In-Reply-To: References: <6bb509b7-b7c5-422d-8e27-599333b6b7c4@Spark> Message-ID: Thank you Sven. I mostly think it could be 1. or some other issue. I don?t think it could be 2. , because i can replicate this issue no matter what is the size of the dataset. It happens for few files that could easily fit in the page pool too. I do see a lot more page faults for 16M compared to 1M, so it could be related to many threads trying to compete for the same buffer space. I will try to take the trace with trace=io option and see if can find something. How do i turn of prefetching? Can i turn it off for a single node/client? Regards, Lohit On Sep 18, 2018, 5:23 PM -0400, Sven Oehme , wrote: > Hi, > > taking a trace would tell for sure, but i suspect what you might be hitting one or even multiple issues which have similar negative performance impacts but different root causes. > > 1. this could be serialization around buffer locks. as larger your blocksize gets as larger is the amount of data one of this pagepool buffers will maintain, if there is a lot of concurrency on smaller amount of data more threads potentially compete for the same buffer lock to copy stuff in and out of a particular buffer, hence things go slower compared to the same amount of data spread across more buffers, each of smaller size. > > 2. your data set is small'ish, lets say a couple of time bigger than the pagepool and you random access it with multiple threads. 
what will happen is that because it doesn't fit into the cache it will be read from the backend. if multiple threads hit the same 16 mb block at once with multiple 4k random reads, it will read the whole 16mb block because it thinks it will benefit from it later on out of cache, but because it fully random the same happens with the next block and the next and so on and before you get back to this block it was pushed out of the cache because of lack of enough pagepool. > > i could think?of multiple other scenarios , which is why its so hard to accurately benchmark an application because you will design a benchmark to test an application, but it actually almost always behaves different then you think it does :-) > > so best is to run the real application and see under which configuration it works best. > > you could also take a trace with trace=io and then look at > > TRACE_VNOP: READ: > TRACE_VNOP: WRITE: > > and compare them to > > TRACE_IO: QIO: read > TRACE_IO: QIO: write > > and see if the numbers summed up for both are somewhat equal. if TRACE_VNOP is significant smaller than TRACE_IO you most likely do more i/o than you should and turning prefetching off might actually make things faster . > > keep in mind i am no longer working for IBM so all i say might be obsolete by now, i no longer have access to the one and only truth aka the source code ... but if i am wrong i am sure somebody will point this out soon ;-) > > sven > > > > > > On Tue, Sep 18, 2018 at 10:31 AM wrote: > > > Hello All, > > > > > > This is a continuation to the previous discussion that i had with Sven. > > > However against what i had mentioned previously - i realize that this is ?not? related to mmap, and i see it when doing random freads. > > > > > > I see that block-size of the filesystem matters when reading from Page pool. > > > I see a major difference in performance when compared 1M to 16M, when doing lot of random small freads with all of the data in pagepool. > > > > > > Performance for 1M is a magnitude ?more? than the performance that i see for 16M. > > > > > > The GPFS that we have currently is : > > > Version :?5.0.1-0.5 > > > Filesystem version:?19.01 (5.0.1.0) > > > Block-size : 16M > > > > > > I had made the filesystem block-size to be 16M, thinking that i would get the most performance for both random/sequential reads from 16M than the smaller block-sizes. > > > With GPFS 5.0, i made use the 1024 sub-blocks instead of 32 and thus not loose lot of storage space even with 16M. > > > I had run few benchmarks and i did see that 16M was performing better ?when hitting storage/disks? with respect to bandwidth for random/sequential on small/large reads. > > > > > > However, with this particular workload - where it freads a chunk of data randomly from hundreds of files -> I see that the number of page-faults increase with block-size and actually reduce the performance. > > > 1M performs a lot better than 16M, and may be i will get better performance with less than 1M. > > > It gives the best performance when reading from local disk, with 4K block size filesystem. > > > > > > What i mean by performance when it comes to this workload - is not the bandwidth but the amount of time that it takes to do each iteration/read batch of data. > > > > > > I figure what is happening is: > > > fread is trying to read a full block size of 16M - which is good in a way, when it hits the hard disk. > > > But the application could be using just a small part of that 16M. 
Thus when randomly reading(freads) lot of data of 16M chunk size - it is page faulting a lot more and causing the performance to drop . > > > I could try to make the application do read instead of freads, but i fear that could be bad too since it might be hitting the disk with a very small block size and that is not good. > > > > > > With the way i see things now - > > > I believe it could be best if the application does random reads of 4k/1M from pagepool but some how does 16M from rotating disks. > > > > > > I don?t see any way of doing the above other than following a different approach where i create a filesystem with a smaller block size ( 1M or less than 1M ), on SSDs as a tier. > > > > > > May i please ask for advise, if what i am understanding/seeing is right and the best solution possible for the above scenario. > > > > > > Regards, > > > Lohit > > > > > > On Apr 11, 2018, 10:36 AM -0400, Lohit Valleru , wrote: > > > > Hey Sven, > > > > > > > > This is regarding mmap issues and GPFS. > > > > We had discussed previously of experimenting with GPFS 5. > > > > > > > > I now have upgraded all of compute nodes and NSD nodes to GPFS 5.0.0.2 > > > > > > > > I am yet to experiment with mmap performance, but before that - I am seeing weird hangs with GPFS 5 and I think it could be related to mmap. > > > > > > > > Have you seen GPFS ever hang on this syscall? > > > > [Tue Apr 10 04:20:13 2018] [] _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26] > > > > > > > > I see the above ,when kernel hangs and throws out a series of trace calls. > > > > > > > > I somehow think the above trace is related to processes hanging on GPFS forever. There are no errors in GPFS however. > > > > > > > > Also, I think the above happens only when the mmap threads go above a particular number. > > > > > > > > We had faced a similar issue in 4.2.3 and it was resolved in a patch to 4.2.3.2 . At that time , the issue happened when mmap threads go more than worker1threads. According to the ticket - it was a mmap race condition that GPFS was not handling well. > > > > > > > > I am not sure if this issue is a repeat and I am yet to isolate the incident and test with increasing number of mmap threads. > > > > > > > > I am not 100 percent sure if this is related to mmap yet but just wanted to ask you if you have seen anything like above. > > > > > > > > Thanks, > > > > > > > > Lohit > > > > > > > > On Feb 22, 2018, 3:59 PM -0500, Sven Oehme , wrote: > > > > > Hi Lohit, > > > > > > > > > > i am working with ray on a mmap performance improvement right now, which most likely has the same root cause as yours , see -->??http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html > > > > > the thread above is silent after a couple of back and rorth, but ray and i have active communication in the background and will repost as soon as there is something new to share. > > > > > i am happy to look at this issue after we finish with ray's workload if there is something missing, but first let's finish his, get you try the same fix and see if there is something missing. > > > > > > > > > > btw. if people would share their use of MMAP , what applications they use (home grown, just use lmdb which uses mmap under the cover, etc) please let me know so i get a better picture on how wide the usage is with GPFS. i know a lot of the ML/DL workloads are using it, but i would like to know what else is out there i might not think about. feel free to drop me a personal note, i might not reply to it right away, but eventually. 
> > > > > > > > > > thx. sven > > > > > > > > > > > > > > > > On Thu, Feb 22, 2018 at 12:33 PM wrote: > > > > > > > Hi all, > > > > > > > > > > > > > > I wanted to know, how does mmap interact with GPFS pagepool with respect to filesystem block-size? > > > > > > > Does the efficiency depend on the mmap read size and the block-size of the filesystem even if all the data is cached in pagepool? > > > > > > > > > > > > > > GPFS 4.2.3.2 and CentOS7. > > > > > > > > > > > > > > Here is what i observed: > > > > > > > > > > > > > > I was testing a user script that uses mmap to read from 100M to 500MB files. > > > > > > > > > > > > > > The above files are stored on 3 different filesystems. > > > > > > > > > > > > > > Compute nodes - 10G pagepool and 5G seqdiscardthreshold. > > > > > > > > > > > > > > 1. 4M block size GPFS filesystem, with separate metadata and data. Data on Near line and metadata on SSDs > > > > > > > 2. 1M block size GPFS filesystem as a AFM cache cluster, "with all the required files fully cached" from the above GPFS cluster as home. Data and Metadata together on SSDs > > > > > > > 3. 16M block size GPFS filesystem, with separate metadata and data. Data on Near line and metadata on SSDs > > > > > > > > > > > > > > When i run the script first time for ?each" filesystem: > > > > > > > I see that GPFS reads from the files, and caches into the pagepool as it reads, from mmdiag -- iohist > > > > > > > > > > > > > > When i run the second time, i see that there are no IO requests from the compute node to GPFS NSD servers, which is expected since all the data from the 3 filesystems is cached. > > > > > > > > > > > > > > However - the time taken for the script to run for the files in the 3 different filesystems is different - although i know that they are just "mmapping"/reading from pagepool/cache and not from disk. > > > > > > > > > > > > > > Here is the difference in time, for IO just from pagepool: > > > > > > > > > > > > > > 20s 4M block size > > > > > > > 15s 1M block size > > > > > > > 40S 16M block size. > > > > > > > > > > > > > > Why do i see a difference when trying to mmap reads from different block-size filesystems, although i see that the IO requests are not hitting disks and just the pagepool? > > > > > > > > > > > > > > I am willing to share the strace output and mmdiag outputs if needed. > > > > > > > > > > > > > > Thanks, > > > > > > > Lohit > > > > > > > > > > > > > > _______________________________________________ > > > > > > > gpfsug-discuss mailing list > > > > > > > gpfsug-discuss at spectrumscale.org > > > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > > > > > gpfsug-discuss mailing list > > > > > gpfsug-discuss at spectrumscale.org > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > > > > gpfsug-discuss mailing list > > > > gpfsug-discuss at spectrumscale.org > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From oehmes at gmail.com Wed Sep 19 19:11:42 2018 From: oehmes at gmail.com (Sven Oehme) Date: Wed, 19 Sep 2018 11:11:42 -0700 Subject: [gpfsug-discuss] GPFS, Pagepool and Block size -> Perfomance reduces with larger block size In-Reply-To: References: <6bb509b7-b7c5-422d-8e27-599333b6b7c4@Spark> Message-ID: seem like you never read my performance presentation from a few years ago ;-) you can control this on a per node basis , either for all i/o : prefetchAggressiveness = X or individual for reads or writes : prefetchAggressivenessRead = X prefetchAggressivenessWrite = X for a start i would turn it off completely via : mmchconfig prefetchAggressiveness=0 -I -N nodename that will turn it off only for that node and only until you restart the node. then see what happens sven On Wed, Sep 19, 2018 at 11:07 AM wrote: > Thank you Sven. > > I mostly think it could be 1. or some other issue. > I don?t think it could be 2. , because i can replicate this issue no > matter what is the size of the dataset. It happens for few files that could > easily fit in the page pool too. > > I do see a lot more page faults for 16M compared to 1M, so it could be > related to many threads trying to compete for the same buffer space. > > I will try to take the trace with trace=io option and see if can find > something. > > How do i turn of prefetching? Can i turn it off for a single node/client? > > Regards, > Lohit > > On Sep 18, 2018, 5:23 PM -0400, Sven Oehme , wrote: > > Hi, > > taking a trace would tell for sure, but i suspect what you might be > hitting one or even multiple issues which have similar negative performance > impacts but different root causes. > > 1. this could be serialization around buffer locks. as larger your > blocksize gets as larger is the amount of data one of this pagepool buffers > will maintain, if there is a lot of concurrency on smaller amount of data > more threads potentially compete for the same buffer lock to copy stuff in > and out of a particular buffer, hence things go slower compared to the same > amount of data spread across more buffers, each of smaller size. > > 2. your data set is small'ish, lets say a couple of time bigger than the > pagepool and you random access it with multiple threads. what will happen > is that because it doesn't fit into the cache it will be read from the > backend. if multiple threads hit the same 16 mb block at once with multiple > 4k random reads, it will read the whole 16mb block because it thinks it > will benefit from it later on out of cache, but because it fully random the > same happens with the next block and the next and so on and before you get > back to this block it was pushed out of the cache because of lack of enough > pagepool. > > i could think of multiple other scenarios , which is why its so hard to > accurately benchmark an application because you will design a benchmark to > test an application, but it actually almost always behaves different then > you think it does :-) > > so best is to run the real application and see under which configuration > it works best. > > you could also take a trace with trace=io and then look at > > TRACE_VNOP: READ: > TRACE_VNOP: WRITE: > > and compare them to > > TRACE_IO: QIO: read > TRACE_IO: QIO: write > > and see if the numbers summed up for both are somewhat equal. if > TRACE_VNOP is significant smaller than TRACE_IO you most likely do more i/o > than you should and turning prefetching off might actually make things > faster . 
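To confirm that the per-node override shown at the top of this reply took effect, and to undo it after testing, something like the following might work; whether mmdiag --config lists this particular tunable, and the DEFAULT reset keyword, are assumptions on my part.

# Check the value currently in effect on the client after the mmchconfig -I change
/usr/lpp/mmfs/bin/mmdiag --config | grep -i prefetchAggressiveness

# Put the node back on the cluster-wide default once the experiment is done
# (-I again keeps the change in memory only, so a daemon restart also clears it)
/usr/lpp/mmfs/bin/mmchconfig prefetchAggressiveness=DEFAULT -I -N nodename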
> > keep in mind i am no longer working for IBM so all i say might be obsolete > by now, i no longer have access to the one and only truth aka the source > code ... but if i am wrong i am sure somebody will point this out soon ;-) > > sven > > > > > On Tue, Sep 18, 2018 at 10:31 AM wrote: > >> Hello All, >> >> This is a continuation to the previous discussion that i had with Sven. >> However against what i had mentioned previously - i realize that this is >> ?not? related to mmap, and i see it when doing random freads. >> >> I see that block-size of the filesystem matters when reading from Page >> pool. >> I see a major difference in performance when compared 1M to 16M, when >> doing lot of random small freads with all of the data in pagepool. >> >> Performance for 1M is a magnitude ?more? than the performance that i see >> for 16M. >> >> The GPFS that we have currently is : >> Version : 5.0.1-0.5 >> Filesystem version: 19.01 (5.0.1.0) >> Block-size : 16M >> >> I had made the filesystem block-size to be 16M, thinking that i would get >> the most performance for both random/sequential reads from 16M than the >> smaller block-sizes. >> With GPFS 5.0, i made use the 1024 sub-blocks instead of 32 and thus not >> loose lot of storage space even with 16M. >> I had run few benchmarks and i did see that 16M was performing better >> ?when hitting storage/disks? with respect to bandwidth for >> random/sequential on small/large reads. >> >> However, with this particular workload - where it freads a chunk of data >> randomly from hundreds of files -> I see that the number of page-faults >> increase with block-size and actually reduce the performance. >> 1M performs a lot better than 16M, and may be i will get better >> performance with less than 1M. >> It gives the best performance when reading from local disk, with 4K block >> size filesystem. >> >> What i mean by performance when it comes to this workload - is not the >> bandwidth but the amount of time that it takes to do each iteration/read >> batch of data. >> >> I figure what is happening is: >> fread is trying to read a full block size of 16M - which is good in a >> way, when it hits the hard disk. >> But the application could be using just a small part of that 16M. Thus >> when randomly reading(freads) lot of data of 16M chunk size - it is page >> faulting a lot more and causing the performance to drop . >> I could try to make the application do read instead of freads, but i fear >> that could be bad too since it might be hitting the disk with a very small >> block size and that is not good. >> >> With the way i see things now - >> I believe it could be best if the application does random reads of 4k/1M >> from pagepool but some how does 16M from rotating disks. >> >> I don?t see any way of doing the above other than following a different >> approach where i create a filesystem with a smaller block size ( 1M or less >> than 1M ), on SSDs as a tier. >> >> May i please ask for advise, if what i am understanding/seeing is right >> and the best solution possible for the above scenario. >> >> Regards, >> Lohit >> >> On Apr 11, 2018, 10:36 AM -0400, Lohit Valleru , >> wrote: >> >> Hey Sven, >> >> This is regarding mmap issues and GPFS. >> We had discussed previously of experimenting with GPFS 5. >> >> I now have upgraded all of compute nodes and NSD nodes to GPFS 5.0.0.2 >> >> I am yet to experiment with mmap performance, but before that - I am >> seeing weird hangs with GPFS 5 and I think it could be related to mmap. 
>> >> Have you seen GPFS ever hang on this syscall? >> [Tue Apr 10 04:20:13 2018] [] >> _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26] >> >> I see the above ,when kernel hangs and throws out a series of trace calls. >> >> I somehow think the above trace is related to processes hanging on GPFS >> forever. There are no errors in GPFS however. >> >> Also, I think the above happens only when the mmap threads go above a >> particular number. >> >> We had faced a similar issue in 4.2.3 and it was resolved in a patch to >> 4.2.3.2 . At that time , the issue happened when mmap threads go more than >> worker1threads. According to the ticket - it was a mmap race condition that >> GPFS was not handling well. >> >> I am not sure if this issue is a repeat and I am yet to isolate the >> incident and test with increasing number of mmap threads. >> >> I am not 100 percent sure if this is related to mmap yet but just wanted >> to ask you if you have seen anything like above. >> >> Thanks, >> >> Lohit >> >> On Feb 22, 2018, 3:59 PM -0500, Sven Oehme , wrote: >> >> Hi Lohit, >> >> i am working with ray on a mmap performance improvement right now, which >> most likely has the same root cause as yours , see --> >> http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html >> the thread above is silent after a couple of back and rorth, but ray and >> i have active communication in the background and will repost as soon as >> there is something new to share. >> i am happy to look at this issue after we finish with ray's workload if >> there is something missing, but first let's finish his, get you try the >> same fix and see if there is something missing. >> >> btw. if people would share their use of MMAP , what applications they use >> (home grown, just use lmdb which uses mmap under the cover, etc) please let >> me know so i get a better picture on how wide the usage is with GPFS. i >> know a lot of the ML/DL workloads are using it, but i would like to know >> what else is out there i might not think about. feel free to drop me a >> personal note, i might not reply to it right away, but eventually. >> >> thx. sven >> >> >> On Thu, Feb 22, 2018 at 12:33 PM wrote: >> >>> Hi all, >>> >>> I wanted to know, how does mmap interact with GPFS pagepool with respect >>> to filesystem block-size? >>> Does the efficiency depend on the mmap read size and the block-size of >>> the filesystem even if all the data is cached in pagepool? >>> >>> GPFS 4.2.3.2 and CentOS7. >>> >>> Here is what i observed: >>> >>> I was testing a user script that uses mmap to read from 100M to 500MB >>> files. >>> >>> The above files are stored on 3 different filesystems. >>> >>> Compute nodes - 10G pagepool and 5G seqdiscardthreshold. >>> >>> 1. 4M block size GPFS filesystem, with separate metadata and data. Data >>> on Near line and metadata on SSDs >>> 2. 1M block size GPFS filesystem as a AFM cache cluster, "with all the >>> required files fully cached" from the above GPFS cluster as home. Data and >>> Metadata together on SSDs >>> 3. 16M block size GPFS filesystem, with separate metadata and data. Data >>> on Near line and metadata on SSDs >>> >>> When i run the script first time for ?each" filesystem: >>> I see that GPFS reads from the files, and caches into the pagepool as it >>> reads, from mmdiag -- iohist >>> >>> When i run the second time, i see that there are no IO requests from the >>> compute node to GPFS NSD servers, which is expected since all the data from >>> the 3 filesystems is cached. 
>>> >>> However - the time taken for the script to run for the files in the 3 >>> different filesystems is different - although i know that they are just >>> "mmapping"/reading from pagepool/cache and not from disk. >>> >>> Here is the difference in time, for IO just from pagepool: >>> >>> 20s 4M block size >>> 15s 1M block size >>> 40S 16M block size. >>> >>> Why do i see a difference when trying to mmap reads from different >>> block-size filesystems, although i see that the IO requests are not hitting >>> disks and just the pagepool? >>> >>> I am willing to share the strace output and mmdiag outputs if needed. >>> >>> Thanks, >>> Lohit >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From valleru at cbio.mskcc.org Wed Sep 19 19:15:17 2018 From: valleru at cbio.mskcc.org (valleru at cbio.mskcc.org) Date: Wed, 19 Sep 2018 14:15:17 -0400 Subject: [gpfsug-discuss] GPFS, Pagepool and Block size -> Perfomance reduces with larger block size In-Reply-To: References: <6bb509b7-b7c5-422d-8e27-599333b6b7c4@Spark> Message-ID: <013aeb31-ebd2-4cc7-97d1-06883d9569f7@Spark> Thanks Sven. I will disable it completely and see how it behaves. Is this the presentation? http://files.gpfsug.org/presentations/2014/UG10_GPFS_Performance_Session_v10.pdf I guess i read it, but it did not strike me at this situation. I will try to read it again and see if i could make use of it. Regards, Lohit On Sep 19, 2018, 2:12 PM -0400, Sven Oehme , wrote: > seem like you never read my performance presentation from a few years ago ;-) > > you can control this on a per node basis , either for all i/o : > > ? ?prefetchAggressiveness = X > > or individual for reads or writes : > > ? ?prefetchAggressivenessRead = X > ? ?prefetchAggressivenessWrite = X > > for a start i would turn it off completely via : > > mmchconfig prefetchAggressiveness=0 -I -N nodename > > that will turn it off only for that node and only until you restart the node. > then see what happens > > sven > > > > On Wed, Sep 19, 2018 at 11:07 AM wrote: > > > Thank you Sven. > > > > > > I mostly think it could be 1. or some other issue. > > > I don?t think it could be 2. , because i can replicate this issue no matter what is the size of the dataset. It happens for few files that could easily fit in the page pool too. > > > > > > I do see a lot more page faults for 16M compared to 1M, so it could be related to many threads trying to compete for the same buffer space. 
> > > > > > I will try to take the trace with trace=io option and see if can find something. > > > > > > How do i turn of prefetching? Can i turn it off for a single node/client? > > > > > > Regards, > > > Lohit > > > > > > On Sep 18, 2018, 5:23 PM -0400, Sven Oehme , wrote: > > > > Hi, > > > > > > > > taking a trace would tell for sure, but i suspect what you might be hitting one or even multiple issues which have similar negative performance impacts but different root causes. > > > > > > > > 1. this could be serialization around buffer locks. as larger your blocksize gets as larger is the amount of data one of this pagepool buffers will maintain, if there is a lot of concurrency on smaller amount of data more threads potentially compete for the same buffer lock to copy stuff in and out of a particular buffer, hence things go slower compared to the same amount of data spread across more buffers, each of smaller size. > > > > > > > > 2. your data set is small'ish, lets say a couple of time bigger than the pagepool and you random access it with multiple threads. what will happen is that because it doesn't fit into the cache it will be read from the backend. if multiple threads hit the same 16 mb block at once with multiple 4k random reads, it will read the whole 16mb block because it thinks it will benefit from it later on out of cache, but because it fully random the same happens with the next block and the next and so on and before you get back to this block it was pushed out of the cache because of lack of enough pagepool. > > > > > > > > i could think?of multiple other scenarios , which is why its so hard to accurately benchmark an application because you will design a benchmark to test an application, but it actually almost always behaves different then you think it does :-) > > > > > > > > so best is to run the real application and see under which configuration it works best. > > > > > > > > you could also take a trace with trace=io and then look at > > > > > > > > TRACE_VNOP: READ: > > > > TRACE_VNOP: WRITE: > > > > > > > > and compare them to > > > > > > > > TRACE_IO: QIO: read > > > > TRACE_IO: QIO: write > > > > > > > > and see if the numbers summed up for both are somewhat equal. if TRACE_VNOP is significant smaller than TRACE_IO you most likely do more i/o than you should and turning prefetching off might actually make things faster . > > > > > > > > keep in mind i am no longer working for IBM so all i say might be obsolete by now, i no longer have access to the one and only truth aka the source code ... but if i am wrong i am sure somebody will point this out soon ;-) > > > > > > > > sven > > > > > > > > > > > > > > > > > > > > > On Tue, Sep 18, 2018 at 10:31 AM wrote: > > > > > > Hello All, > > > > > > > > > > > > This is a continuation to the previous discussion that i had with Sven. > > > > > > However against what i had mentioned previously - i realize that this is ?not? related to mmap, and i see it when doing random freads. > > > > > > > > > > > > I see that block-size of the filesystem matters when reading from Page pool. > > > > > > I see a major difference in performance when compared 1M to 16M, when doing lot of random small freads with all of the data in pagepool. > > > > > > > > > > > > Performance for 1M is a magnitude ?more? than the performance that i see for 16M. 
> > > > > > > > > > > > The GPFS that we have currently is : > > > > > > Version :?5.0.1-0.5 > > > > > > Filesystem version:?19.01 (5.0.1.0) > > > > > > Block-size : 16M > > > > > > > > > > > > I had made the filesystem block-size to be 16M, thinking that i would get the most performance for both random/sequential reads from 16M than the smaller block-sizes. > > > > > > With GPFS 5.0, i made use the 1024 sub-blocks instead of 32 and thus not loose lot of storage space even with 16M. > > > > > > I had run few benchmarks and i did see that 16M was performing better ?when hitting storage/disks? with respect to bandwidth for random/sequential on small/large reads. > > > > > > > > > > > > However, with this particular workload - where it freads a chunk of data randomly from hundreds of files -> I see that the number of page-faults increase with block-size and actually reduce the performance. > > > > > > 1M performs a lot better than 16M, and may be i will get better performance with less than 1M. > > > > > > It gives the best performance when reading from local disk, with 4K block size filesystem. > > > > > > > > > > > > What i mean by performance when it comes to this workload - is not the bandwidth but the amount of time that it takes to do each iteration/read batch of data. > > > > > > > > > > > > I figure what is happening is: > > > > > > fread is trying to read a full block size of 16M - which is good in a way, when it hits the hard disk. > > > > > > But the application could be using just a small part of that 16M. Thus when randomly reading(freads) lot of data of 16M chunk size - it is page faulting a lot more and causing the performance to drop . > > > > > > I could try to make the application do read instead of freads, but i fear that could be bad too since it might be hitting the disk with a very small block size and that is not good. > > > > > > > > > > > > With the way i see things now - > > > > > > I believe it could be best if the application does random reads of 4k/1M from pagepool but some how does 16M from rotating disks. > > > > > > > > > > > > I don?t see any way of doing the above other than following a different approach where i create a filesystem with a smaller block size ( 1M or less than 1M ), on SSDs as a tier. > > > > > > > > > > > > May i please ask for advise, if what i am understanding/seeing is right and the best solution possible for the above scenario. > > > > > > > > > > > > Regards, > > > > > > Lohit > > > > > > > > > > > > On Apr 11, 2018, 10:36 AM -0400, Lohit Valleru , wrote: > > > > > > > Hey Sven, > > > > > > > > > > > > > > This is regarding mmap issues and GPFS. > > > > > > > We had discussed previously of experimenting with GPFS 5. > > > > > > > > > > > > > > I now have upgraded all of compute nodes and NSD nodes to GPFS 5.0.0.2 > > > > > > > > > > > > > > I am yet to experiment with mmap performance, but before that - I am seeing weird hangs with GPFS 5 and I think it could be related to mmap. > > > > > > > > > > > > > > Have you seen GPFS ever hang on this syscall? > > > > > > > [Tue Apr 10 04:20:13 2018] [] _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26] > > > > > > > > > > > > > > I see the above ,when kernel hangs and throws out a series of trace calls. > > > > > > > > > > > > > > I somehow think the above trace is related to processes hanging on GPFS forever. There are no errors in GPFS however. > > > > > > > > > > > > > > Also, I think the above happens only when the mmap threads go above a particular number. 
> > > > > > > > > > > > > > We had faced a similar issue in 4.2.3 and it was resolved in a patch to 4.2.3.2 . At that time , the issue happened when mmap threads go more than worker1threads. According to the ticket - it was a mmap race condition that GPFS was not handling well. > > > > > > > > > > > > > > I am not sure if this issue is a repeat and I am yet to isolate the incident and test with increasing number of mmap threads. > > > > > > > > > > > > > > I am not 100 percent sure if this is related to mmap yet but just wanted to ask you if you have seen anything like above. > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > Lohit > > > > > > > > > > > > > > On Feb 22, 2018, 3:59 PM -0500, Sven Oehme , wrote: > > > > > > > > Hi Lohit, > > > > > > > > > > > > > > > > i am working with ray on a mmap performance improvement right now, which most likely has the same root cause as yours , see -->??http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html > > > > > > > > the thread above is silent after a couple of back and rorth, but ray and i have active communication in the background and will repost as soon as there is something new to share. > > > > > > > > i am happy to look at this issue after we finish with ray's workload if there is something missing, but first let's finish his, get you try the same fix and see if there is something missing. > > > > > > > > > > > > > > > > btw. if people would share their use of MMAP , what applications they use (home grown, just use lmdb which uses mmap under the cover, etc) please let me know so i get a better picture on how wide the usage is with GPFS. i know a lot of the ML/DL workloads are using it, but i would like to know what else is out there i might not think about. feel free to drop me a personal note, i might not reply to it right away, but eventually. > > > > > > > > > > > > > > > > thx. sven > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Feb 22, 2018 at 12:33 PM wrote: > > > > > > > > > > Hi all, > > > > > > > > > > > > > > > > > > > > I wanted to know, how does mmap interact with GPFS pagepool with respect to filesystem block-size? > > > > > > > > > > Does the efficiency depend on the mmap read size and the block-size of the filesystem even if all the data is cached in pagepool? > > > > > > > > > > > > > > > > > > > > GPFS 4.2.3.2 and CentOS7. > > > > > > > > > > > > > > > > > > > > Here is what i observed: > > > > > > > > > > > > > > > > > > > > I was testing a user script that uses mmap to read from 100M to 500MB files. > > > > > > > > > > > > > > > > > > > > The above files are stored on 3 different filesystems. > > > > > > > > > > > > > > > > > > > > Compute nodes - 10G pagepool and 5G seqdiscardthreshold. > > > > > > > > > > > > > > > > > > > > 1. 4M block size GPFS filesystem, with separate metadata and data. Data on Near line and metadata on SSDs > > > > > > > > > > 2. 1M block size GPFS filesystem as a AFM cache cluster, "with all the required files fully cached" from the above GPFS cluster as home. Data and Metadata together on SSDs > > > > > > > > > > 3. 16M block size GPFS filesystem, with separate metadata and data. 
Data on Near line and metadata on SSDs > > > > > > > > > > > > > > > > > > > > When i run the script first time for ?each" filesystem: > > > > > > > > > > I see that GPFS reads from the files, and caches into the pagepool as it reads, from mmdiag -- iohist > > > > > > > > > > > > > > > > > > > > When i run the second time, i see that there are no IO requests from the compute node to GPFS NSD servers, which is expected since all the data from the 3 filesystems is cached. > > > > > > > > > > > > > > > > > > > > However - the time taken for the script to run for the files in the 3 different filesystems is different - although i know that they are just "mmapping"/reading from pagepool/cache and not from disk. > > > > > > > > > > > > > > > > > > > > Here is the difference in time, for IO just from pagepool: > > > > > > > > > > > > > > > > > > > > 20s 4M block size > > > > > > > > > > 15s 1M block size > > > > > > > > > > 40S 16M block size. > > > > > > > > > > > > > > > > > > > > Why do i see a difference when trying to mmap reads from different block-size filesystems, although i see that the IO requests are not hitting disks and just the pagepool? > > > > > > > > > > > > > > > > > > > > I am willing to share the strace output and mmdiag outputs if needed. > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > Lohit > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > > > gpfsug-discuss mailing list > > > > > > > > > > gpfsug-discuss at spectrumscale.org > > > > > > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > _______________________________________________ > > > > > > > > gpfsug-discuss mailing list > > > > > > > > gpfsug-discuss at spectrumscale.org > > > > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > _______________________________________________ > > > > > > > gpfsug-discuss mailing list > > > > > > > gpfsug-discuss at spectrumscale.org > > > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > > > > > > gpfsug-discuss mailing list > > > > > > gpfsug-discuss at spectrumscale.org > > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > > > > gpfsug-discuss mailing list > > > > gpfsug-discuss at spectrumscale.org > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at gmail.com Wed Sep 19 20:11:07 2018 From: oehmes at gmail.com (Sven Oehme) Date: Wed, 19 Sep 2018 12:11:07 -0700 Subject: [gpfsug-discuss] GPFS, Pagepool and Block size -> Perfomance reduces with larger block size In-Reply-To: <013aeb31-ebd2-4cc7-97d1-06883d9569f7@Spark> References: <6bb509b7-b7c5-422d-8e27-599333b6b7c4@Spark> <013aeb31-ebd2-4cc7-97d1-06883d9569f7@Spark> Message-ID: the document primarily explains all performance specific knobs. general advice would be to longer set anything beside workerthreads, pagepool and filecache on 5.X systems as most other settings are no longer relevant (thats a client side statement) . 
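as a concrete example, that boils down to something like the line below (values are placeholders to show the syntax, not a recommendation; workerthreads and filecache map to the workerThreads and maxFilesToCache options, and those two only take full effect after a daemon restart on the nodes):

mmchconfig pagepool=16G,maxFilesToCache=1000000,workerThreads=512 -N clientNodes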
thats is true until you hit strange workloads , which is why all the knobs are still there :-) sven On Wed, Sep 19, 2018 at 11:17 AM wrote: > Thanks Sven. > I will disable it completely and see how it behaves. > > Is this the presentation? > > http://files.gpfsug.org/presentations/2014/UG10_GPFS_Performance_Session_v10.pdf > > I guess i read it, but it did not strike me at this situation. I will try > to read it again and see if i could make use of it. > > Regards, > Lohit > > On Sep 19, 2018, 2:12 PM -0400, Sven Oehme , wrote: > > seem like you never read my performance presentation from a few years ago > ;-) > > you can control this on a per node basis , either for all i/o : > > prefetchAggressiveness = X > > or individual for reads or writes : > > prefetchAggressivenessRead = X > prefetchAggressivenessWrite = X > > for a start i would turn it off completely via : > > mmchconfig prefetchAggressiveness=0 -I -N nodename > > that will turn it off only for that node and only until you restart the > node. > then see what happens > > sven > > > On Wed, Sep 19, 2018 at 11:07 AM wrote: > >> Thank you Sven. >> >> I mostly think it could be 1. or some other issue. >> I don?t think it could be 2. , because i can replicate this issue no >> matter what is the size of the dataset. It happens for few files that could >> easily fit in the page pool too. >> >> I do see a lot more page faults for 16M compared to 1M, so it could be >> related to many threads trying to compete for the same buffer space. >> >> I will try to take the trace with trace=io option and see if can find >> something. >> >> How do i turn of prefetching? Can i turn it off for a single node/client? >> >> Regards, >> Lohit >> >> On Sep 18, 2018, 5:23 PM -0400, Sven Oehme , wrote: >> >> Hi, >> >> taking a trace would tell for sure, but i suspect what you might be >> hitting one or even multiple issues which have similar negative performance >> impacts but different root causes. >> >> 1. this could be serialization around buffer locks. as larger your >> blocksize gets as larger is the amount of data one of this pagepool buffers >> will maintain, if there is a lot of concurrency on smaller amount of data >> more threads potentially compete for the same buffer lock to copy stuff in >> and out of a particular buffer, hence things go slower compared to the same >> amount of data spread across more buffers, each of smaller size. >> >> 2. your data set is small'ish, lets say a couple of time bigger than the >> pagepool and you random access it with multiple threads. what will happen >> is that because it doesn't fit into the cache it will be read from the >> backend. if multiple threads hit the same 16 mb block at once with multiple >> 4k random reads, it will read the whole 16mb block because it thinks it >> will benefit from it later on out of cache, but because it fully random the >> same happens with the next block and the next and so on and before you get >> back to this block it was pushed out of the cache because of lack of enough >> pagepool. >> >> i could think of multiple other scenarios , which is why its so hard to >> accurately benchmark an application because you will design a benchmark to >> test an application, but it actually almost always behaves different then >> you think it does :-) >> >> so best is to run the real application and see under which configuration >> it works best. 
>> >> you could also take a trace with trace=io and then look at >> >> TRACE_VNOP: READ: >> TRACE_VNOP: WRITE: >> >> and compare them to >> >> TRACE_IO: QIO: read >> TRACE_IO: QIO: write >> >> and see if the numbers summed up for both are somewhat equal. if >> TRACE_VNOP is significant smaller than TRACE_IO you most likely do more i/o >> than you should and turning prefetching off might actually make things >> faster . >> >> keep in mind i am no longer working for IBM so all i say might be >> obsolete by now, i no longer have access to the one and only truth aka the >> source code ... but if i am wrong i am sure somebody will point this out >> soon ;-) >> >> sven >> >> >> >> >> On Tue, Sep 18, 2018 at 10:31 AM wrote: >> >>> Hello All, >>> >>> This is a continuation to the previous discussion that i had with Sven. >>> However against what i had mentioned previously - i realize that this is >>> ?not? related to mmap, and i see it when doing random freads. >>> >>> I see that block-size of the filesystem matters when reading from Page >>> pool. >>> I see a major difference in performance when compared 1M to 16M, when >>> doing lot of random small freads with all of the data in pagepool. >>> >>> Performance for 1M is a magnitude ?more? than the performance that i see >>> for 16M. >>> >>> The GPFS that we have currently is : >>> Version : 5.0.1-0.5 >>> Filesystem version: 19.01 (5.0.1.0) >>> Block-size : 16M >>> >>> I had made the filesystem block-size to be 16M, thinking that i would >>> get the most performance for both random/sequential reads from 16M than the >>> smaller block-sizes. >>> With GPFS 5.0, i made use the 1024 sub-blocks instead of 32 and thus not >>> loose lot of storage space even with 16M. >>> I had run few benchmarks and i did see that 16M was performing better >>> ?when hitting storage/disks? with respect to bandwidth for >>> random/sequential on small/large reads. >>> >>> However, with this particular workload - where it freads a chunk of data >>> randomly from hundreds of files -> I see that the number of page-faults >>> increase with block-size and actually reduce the performance. >>> 1M performs a lot better than 16M, and may be i will get better >>> performance with less than 1M. >>> It gives the best performance when reading from local disk, with 4K >>> block size filesystem. >>> >>> What i mean by performance when it comes to this workload - is not the >>> bandwidth but the amount of time that it takes to do each iteration/read >>> batch of data. >>> >>> I figure what is happening is: >>> fread is trying to read a full block size of 16M - which is good in a >>> way, when it hits the hard disk. >>> But the application could be using just a small part of that 16M. Thus >>> when randomly reading(freads) lot of data of 16M chunk size - it is page >>> faulting a lot more and causing the performance to drop . >>> I could try to make the application do read instead of freads, but i >>> fear that could be bad too since it might be hitting the disk with a very >>> small block size and that is not good. >>> >>> With the way i see things now - >>> I believe it could be best if the application does random reads of 4k/1M >>> from pagepool but some how does 16M from rotating disks. >>> >>> I don?t see any way of doing the above other than following a different >>> approach where i create a filesystem with a smaller block size ( 1M or less >>> than 1M ), on SSDs as a tier. 
>>> >>> May i please ask for advise, if what i am understanding/seeing is right >>> and the best solution possible for the above scenario. >>> >>> Regards, >>> Lohit >>> >>> On Apr 11, 2018, 10:36 AM -0400, Lohit Valleru , >>> wrote: >>> >>> Hey Sven, >>> >>> This is regarding mmap issues and GPFS. >>> We had discussed previously of experimenting with GPFS 5. >>> >>> I now have upgraded all of compute nodes and NSD nodes to GPFS 5.0.0.2 >>> >>> I am yet to experiment with mmap performance, but before that - I am >>> seeing weird hangs with GPFS 5 and I think it could be related to mmap. >>> >>> Have you seen GPFS ever hang on this syscall? >>> [Tue Apr 10 04:20:13 2018] [] >>> _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26] >>> >>> I see the above ,when kernel hangs and throws out a series of trace >>> calls. >>> >>> I somehow think the above trace is related to processes hanging on GPFS >>> forever. There are no errors in GPFS however. >>> >>> Also, I think the above happens only when the mmap threads go above a >>> particular number. >>> >>> We had faced a similar issue in 4.2.3 and it was resolved in a patch to >>> 4.2.3.2 . At that time , the issue happened when mmap threads go more than >>> worker1threads. According to the ticket - it was a mmap race condition that >>> GPFS was not handling well. >>> >>> I am not sure if this issue is a repeat and I am yet to isolate the >>> incident and test with increasing number of mmap threads. >>> >>> I am not 100 percent sure if this is related to mmap yet but just wanted >>> to ask you if you have seen anything like above. >>> >>> Thanks, >>> >>> Lohit >>> >>> On Feb 22, 2018, 3:59 PM -0500, Sven Oehme , wrote: >>> >>> Hi Lohit, >>> >>> i am working with ray on a mmap performance improvement right now, which >>> most likely has the same root cause as yours , see --> >>> http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html >>> the thread above is silent after a couple of back and rorth, but ray and >>> i have active communication in the background and will repost as soon as >>> there is something new to share. >>> i am happy to look at this issue after we finish with ray's workload if >>> there is something missing, but first let's finish his, get you try the >>> same fix and see if there is something missing. >>> >>> btw. if people would share their use of MMAP , what applications they >>> use (home grown, just use lmdb which uses mmap under the cover, etc) please >>> let me know so i get a better picture on how wide the usage is with GPFS. i >>> know a lot of the ML/DL workloads are using it, but i would like to know >>> what else is out there i might not think about. feel free to drop me a >>> personal note, i might not reply to it right away, but eventually. >>> >>> thx. sven >>> >>> >>> On Thu, Feb 22, 2018 at 12:33 PM wrote: >>> >>>> Hi all, >>>> >>>> I wanted to know, how does mmap interact with GPFS pagepool with >>>> respect to filesystem block-size? >>>> Does the efficiency depend on the mmap read size and the block-size of >>>> the filesystem even if all the data is cached in pagepool? >>>> >>>> GPFS 4.2.3.2 and CentOS7. >>>> >>>> Here is what i observed: >>>> >>>> I was testing a user script that uses mmap to read from 100M to 500MB >>>> files. >>>> >>>> The above files are stored on 3 different filesystems. >>>> >>>> Compute nodes - 10G pagepool and 5G seqdiscardthreshold. >>>> >>>> 1. 4M block size GPFS filesystem, with separate metadata and data. Data >>>> on Near line and metadata on SSDs >>>> 2. 
1M block size GPFS filesystem as a AFM cache cluster, "with all the >>>> required files fully cached" from the above GPFS cluster as home. Data and >>>> Metadata together on SSDs >>>> 3. 16M block size GPFS filesystem, with separate metadata and data. >>>> Data on Near line and metadata on SSDs >>>> >>>> When i run the script first time for ?each" filesystem: >>>> I see that GPFS reads from the files, and caches into the pagepool as >>>> it reads, from mmdiag -- iohist >>>> >>>> When i run the second time, i see that there are no IO requests from >>>> the compute node to GPFS NSD servers, which is expected since all the data >>>> from the 3 filesystems is cached. >>>> >>>> However - the time taken for the script to run for the files in the 3 >>>> different filesystems is different - although i know that they are just >>>> "mmapping"/reading from pagepool/cache and not from disk. >>>> >>>> Here is the difference in time, for IO just from pagepool: >>>> >>>> 20s 4M block size >>>> 15s 1M block size >>>> 40S 16M block size. >>>> >>>> Why do i see a difference when trying to mmap reads from different >>>> block-size filesystems, although i see that the IO requests are not hitting >>>> disks and just the pagepool? >>>> >>>> I am willing to share the strace output and mmdiag outputs if needed. >>>> >>>> Thanks, >>>> Lohit >>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Thu Sep 20 18:20:01 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Thu, 20 Sep 2018 17:20:01 +0000 Subject: [gpfsug-discuss] CES: ldap_down event Message-ID: <72E63738-2F05-4DE5-BCC6-7AA4543D6161@nuance.com> What caused this error, and how do clear this error? The LDAP server is fine as far as I can tell? 
Scale version 5.0.1-2

Event        Parameter  Severity  Active Since  Event Message
-----------------------------------------------------------------------------------------------------------------
ldap_down    AUTH       ERROR     1 hour ago    external LDAP server ldap://10.30.21.11 is unresponsive

Bob Oesterlin
Sr Principal Storage Engineer, Nuance

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From S.J.Thompson at bham.ac.uk  Thu Sep 20 18:29:05 2018
From: S.J.Thompson at bham.ac.uk (Simon Thompson)
Date: Thu, 20 Sep 2018 17:29:05 +0000
Subject: [gpfsug-discuss] Metadata with GNR code
Message-ID:

Just wondering if anyone has any strong views/recommendations with metadata when using GNR code?

I know in 'SAN' based GPFS, there is a recommendation to have data and metadata split, with the metadata on SSD.

I've also heard that with GNR there isn't much difference in splitting data and metadata.

We're looking at two systems and want to replicate metadata, but not data (mostly), between them, so I'm not really sure how we'd do this without having a separate system pool (and then NSDs in different failure groups)...

If we used 8+2P vdisks for metadata only, would we still see no difference in performance compared to mixed (I guess the 8+2P is still spread over a DA, so we'd get half the drives in the GNR system active?).

Or should we stick SSD based storage in as well for the metadata pool? (Which brings an interesting question about RAID code related to the recent discussions on mirroring vs RAID5?)

Thoughts welcome!

Simon

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From will.schmied at stjude.org  Thu Sep 20 18:33:59 2018
From: will.schmied at stjude.org (Schmied, Will)
Date: Thu, 20 Sep 2018 17:33:59 +0000
Subject: [gpfsug-discuss] CES: ldap_down event
Message-ID: <025BC4D1-1D71-430F-B278-ED1D10FB5386@stjude.org>

You can either reboot that node, or try:

# systemctl restart gpfs-winbind.service
# systemctl status gpfs-winbind.service

The LDAP server was found unresponsive at some point. Why, that's the million dollar question as always. What's the output of 'mmces state show -a' look like?

Thanks,
Will

From:  on behalf of "Oesterlin, Robert"
Reply-To: gpfsug main discussion list
Date: Thursday, September 20, 2018 at 12:20 PM
To: gpfsug main discussion list
Subject: [gpfsug-discuss] CES: ldap_down event

What caused this error, and how do I clear this error? The LDAP server is fine as far as I can tell?

Scale version 5.0.1-2

Event        Parameter  Severity  Active Since  Event Message
-----------------------------------------------------------------------------------------------------------------
ldap_down    AUTH       ERROR     1 hour ago    external LDAP server ldap://10.30.21.11 is unresponsive

Bob Oesterlin
Sr Principal Storage Engineer, Nuance

________________________________
Email Disclaimer: www.stjude.org/emaildisclaimer
Consultation Disclaimer: www.stjude.org/consultationdisclaimer

-------------- next part --------------
An HTML attachment was scrubbed...
URL: From Robert.Oesterlin at nuance.com  Thu Sep 20 19:05:43 2018
From: Robert.Oesterlin at nuance.com (Oesterlin, Robert)
Date: Thu, 20 Sep 2018 18:05:43 +0000
Subject: [gpfsug-discuss] CES: ldap_down event
In-Reply-To: <025BC4D1-1D71-430F-B278-ED1D10FB5386@stjude.org>
References: <025BC4D1-1D71-430F-B278-ED1D10FB5386@stjude.org>
Message-ID: <0B7FA2C9-8645-42AF-953D-AEFCB114B081@nuance.com>

We're not using LDAP for SMB, we're using it for NFS, so gpfs.winbind isn't active.
As for the state of things:

[root at nrg-ces01 ~]# mmces state show -a
NODE                                AUTH      BLOCK     NETWORK   AUTH_OBJ  NFS       OBJ       SMB       CES
xxxxxxxxx.xxxx.us.grid.nuance.com   DEGRADED  DISABLED  HEALTHY   DISABLED  DEGRADED  DISABLED  DISABLED  DEGRADED
yyyyyyyyy.yyyy.us.grid.nuance.com   HEALTHY   DISABLED  HEALTHY   DISABLED  HEALTHY   DISABLED  DISABLED  HEALTHY

NFS is showing degraded because AUTH is degraded due to the LDAP error. However, I think the error is erroneous because the NFS service on that node is fine.
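For anyone else chasing this: the event detail and history, plus a basic reachability check of the server named in the event, can be pulled from the CES node itself with something along these lines (exact options from memory, so treat it as a sketch rather than a recipe):

mmhealth node show AUTH -v
mmhealth node eventlog | grep -i ldap
ldapsearch -x -H ldap://10.30.21.11 -s base -b "" "(objectclass=*)" namingContexts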
Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 From: on behalf of "Schmied, Will" Reply-To: gpfsug main discussion list Date: Thursday, September 20, 2018 at 12:46 PM To: gpfsug main discussion list Subject: [EXTERNAL] Re: [gpfsug-discuss] CES: ldap_down event You can either reboot that node, or try: # systemctl restart gpfs-winbind.service # systemctl status gpfs-winbind.service The LDAP server was found unresponsive at some point. Why, that?s the million dollar question as always. What?s the output of ?mmces state show -a? look like? Thanks, Will From: on behalf of "Oesterlin, Robert" Reply-To: gpfsug main discussion list Date: Thursday, September 20, 2018 at 12:20 PM To: gpfsug main discussion list Subject: [gpfsug-discuss] CES: ldap_down event What caused this error, and how do clear this error? The LDAP server is fine as far as I can tell? Scale version 5.0.1-2 Event Parameter Severity Active Since Event Message ----------------------------------------------------------------------------------------------------------------- ldap_down AUTH ERROR 1 hour ago external LDAP server ldap://10.30.21.11 is unresponsive Bob Oesterlin Sr Principal Storage Engineer, Nuance ________________________________ Email Disclaimer: www.stjude.org/emaildisclaimer Consultation Disclaimer: www.stjude.org/consultationdisclaimer -------------- next part -------------- An HTML attachment was scrubbed... URL: From abeattie at au1.ibm.com Fri Sep 21 01:33:48 2018 From: abeattie at au1.ibm.com (Andrew Beattie) Date: Fri, 21 Sep 2018 00:33:48 +0000 Subject: [gpfsug-discuss] Metadata with GNR code In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Fri Sep 21 09:22:27 2018 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Fri, 21 Sep 2018 10:22:27 +0200 Subject: [gpfsug-discuss] Metadata with GNR code In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From janfrode at tanso.net Fri Sep 21 11:13:51 2018 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Fri, 21 Sep 2018 12:13:51 +0200 Subject: [gpfsug-discuss] Metadata with GNR code In-Reply-To: References: Message-ID: That reminds me of a point Sven made when I was trying to optimize mdtest results with metadata on FlashSystem... He sent me the following: -- started at 11/15/2015 15:20:39 -- mdtest-1.9.3 was launched with 138 total task(s) on 23 node(s) Command line used: /ghome/oehmes/mpi/bin/mdtest-pcmpi9131-existingdir -d /ibm/fs2-4m-02/shared/mdtest-ec -i 1 -n 70000 -F -i 1 -w 0 -Z -u Path: /ibm/fs2-4m-02/ sharedFS: 32.0 TiB Used FS: 6.7% Inodes: 145.4 Mi Used Inodes: 22.0% 138 tasks, 9660000 files SUMMARY: (of 1 iterations) Operation Max Min Mean Std Dev --------- --- --- ---- ------- File creation : 650440.486 650440.486 650440.486 0.000 File stat : 23599134.618 23599134.618 23599134.618 0.000 File read : 2171391.097 2171391.097 2171391.097 0.000 File removal : 1007566.981 1007566.981 1007566.981 0.000 Tree creation : 3.072 3.072 3.072 0.000 Tree removal : 1.471 1.471 1.471 0.000 -- finished at 11/15/2015 15:21:10 -- from a GL6 -- only spinning disks -- pointing out that mdtest doesn't really require Flash/SSD. The key to good results are a) large GPFS log ( mmchfs -L 128m) b) high maxfilestocache (you need to be able to cache all entries , so for 10 million across 20 nodes you need to have at least 750k per node) c) fast network, thats key to handle the token requests and metadata operations that need to get over the network. 
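Translated into commands, a) and b) look roughly like this -- file system name and node class are placeholders, the bigger log typically only takes effect once the file system is remounted, and maxFilesToCache needs the daemon restarted on those clients:

mmchfs gpfs01 -L 128m
mmchconfig maxFilesToCache=750000 -N clientNodes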
-jf On Fri, Sep 21, 2018 at 10:22 AM Olaf Weiser wrote: > see a mdtest for a default block size file system ... > 4 MB blocksize.. > mdata is on SSD > data is on HDD ... which is not really relevant for this mdtest ;-) > > > -- started at 09/07/2018 06:54:54 -- > > mdtest-1.9.3 was launched with 40 total task(s) on 20 node(s) > Command line used: mdtest -n 25000 -i 3 -u -d > /homebrewed/gh24_4m_4m/mdtest > Path: /homebrewed/gh24_4m_4m > FS: 10.0 TiB Used FS: 0.0% Inodes: 12.0 Mi Used Inodes: 2.3% > > 40 tasks, 1000000 files/directories > > SUMMARY: (of 3 iterations) > Operation Max Min Mean > Std Dev > --------- --- --- ---- > ------- > Directory creation: 449160.409 430869.822 437002.187 > 8597.272 > Directory stat : 6664420.560 5785712.544 6324276.731 > 385192.527 > Directory removal : 398360.058 351503.369 371630.648 > 19690.580 > File creation : 288985.217 270550.129 279096.800 > 7585.659 > File stat : 6720685.117 6641301.499 6674123.407 > 33833.182 > File read : 3055661.372 2871044.881 2945513.966 > 79479.638 > File removal : 215187.602 146639.435 179898.441 > 28021.467 > Tree creation : 10.215 3.165 6.603 > 2.881 > Tree removal : 5.484 0.880 2.418 > 2.168 > > -- finished at 09/07/2018 06:55:42 -- > > > > > Mit freundlichen Gr??en / Kind regards > > > Olaf Weiser > > EMEA Storage Competence Center Mainz, German / IBM Systems, Storage > Platform, > > ------------------------------------------------------------------------------------------------------------------------------------------- > IBM Deutschland > IBM Allee 1 > 71139 Ehningen > Phone: +49-170-579-44-66 > E-Mail: olaf.weiser at de.ibm.com > > ------------------------------------------------------------------------------------------------------------------------------------------- > IBM Deutschland GmbH / Vorsitzender des Aufsichtsrats: Martin Jetter > Gesch?ftsf?hrung: Martina Koederitz (Vorsitzende), Susanne Peter, Norbert > Janzen, Dr. Christian Keller, Ivo Koerner, Markus Koerner > Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, > HRB 14562 / WEEE-Reg.-Nr. DE 99369940 > > > > From: "Andrew Beattie" > To: gpfsug-discuss at spectrumscale.org > Date: 09/21/2018 02:34 AM > Subject: Re: [gpfsug-discuss] Metadata with GNR code > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------ > > > > Simon, > > My recommendation is still very much to use SSD for Metadata and NL-SAS > for data and > the GH14 / GH24 Building blocks certainly make this much easier. > > Unless your filesystem is massive (Summit sized) you will typically still > continue to benefit from the Random IO performance of SSD (even RI SSD) in > comparison to NL-SAS. > > It still makes more sense to me to continue to use 2 copy or 3 copy for > Metadata even in ESS / GNR style environments. The read performance for > metadata using 3copy is still significantly better than any other scenario. > > As with anything there are exceptions to the rule, but my experiences with > ESS and ESS with SSD so far still maintain that the standard thoughts on > managing Metadata and Small file IO remain the same -- even with the > improvements around sub blocks with Scale V5. 
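To make the 'replicate metadata but not data between the two systems' part of Simon's question concrete -- purely a sketch, NSD and server names are made up, and on ESS/GNR the NSDs would of course come out of vdisks created with mmvdisk/mmcrvdisk rather than plain disks:

%nsd: nsd=site1_md_001 usage=metadataOnly pool=system failureGroup=1 servers=ess01a,ess01b
%nsd: nsd=site2_md_001 usage=metadataOnly pool=system failureGroup=2 servers=ess02a,ess02b
%nsd: nsd=site1_data_001 usage=dataOnly pool=data failureGroup=1 servers=ess01a,ess01b

mmcrfs fs1 -F stanzas.txt -m 2 -M 2 -r 1 -R 2

i.e. metadata-only NSDs in one failure group per building block, default metadata replication of 2 and data replication of 1.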
> > MDtest is still the typical benchmark for this comparison and MDTest shows > some very clear differences even on SSD when you use a large filesystem > block size with more sub blocks vs a smaller block size with 1/32 subblocks > > This only gets worse if you change the storage media from SSD to NL-SAS > *Andrew Beattie* > *Software Defined Storage - IT Specialist* > *Phone: *614-2133-7927 > *E-mail: **abeattie at au1.ibm.com* > > > ----- Original message ----- > From: Simon Thompson > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: "gpfsug-discuss at spectrumscale.org" > Cc: > Subject: [gpfsug-discuss] Metadata with GNR code > Date: Fri, Sep 21, 2018 3:29 AM > > Just wondering if anyone has any strong views/recommendations with > metadata when using GNR code? > > > > I know in ?san? based GPFS, there is a recommendation to have data and > metadata split with the metadata on SSD. > > > > I?ve also heard that with GNR there isn?t much difference in splitting > data and metadata. > > > > We?re looking at two systems and want to replicate metadata, but not data > (mostly) between them, so I?m not really sure how we?d do this without > having separate system pool (and then NSDs in different failure groups)?. > > > > If we used 8+2P vdisks for metadata only, would we still see no difference > in performance compared to mixed (I guess the 8+2P is still spread over a > DA so we?d get half the drives in the GNR system active?). > > > > Or should we stick SSD based storage in as well for the metadata pool? > (Which brings an interesting question about RAID code related to the recent > discussions on mirroring vs RAID5?) > > > > Thoughts welcome! > > > > Simon > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at gmail.com Fri Sep 21 17:45:34 2018 From: oehmes at gmail.com (Sven Oehme) Date: Fri, 21 Sep 2018 09:45:34 -0700 Subject: [gpfsug-discuss] Metadata with GNR code In-Reply-To: References: Message-ID: somebody did listen and remembered to what i did say :-D ... and you are absolute correct, there is no need for SSD's to get great zero length mdtest results, most people don't know that create workloads unless carefully executed in general almost exclusively stresses a filesystem client and has almost no load to the storage or the server side UNLESS any of the following is true : 1. your mdtest runs longer than the OS sync interval, exact duration depends on the OS, but is typically 60 or 5 seconds. 2. the amount of files you create exceed your file cache setting 3. you haven't filed your filesystem recovery log to the point where log wrap kicks in there are other possible reasons, but the above is the top 3 list. the network is kind of critical, but not as critical as long as there are enough tasks and nodes in parallel, but for small number of nodes you need some fast (fast as in low latency not throughput) network to handle the parallel token traffic. 
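(before assuming any of 1-3 applies, a quick way to check where you stand -- fs name is a placeholder, flags from memory so double check on your level:

mmlsfs fs1 -L
mmlsconfig maxFilesToCache
mmlsconfig pagepool

the first shows the recovery log size, the other two the cache settings in play above)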
given that there is an unused inode/token prefetcher in Scale, means some inodes are already 'owned' by the client before you even create your first file network speed is less relevant as long as you stay under that limit. all above is obvious a burst, short period statement, if you have a sustained create workload then obvious all this needs to be written to the storage device in the right place, this is the point where the storage controller write cache followed by the sustained de-staging rate to media is the most critical piece, not the speed of the individual media e.g. NLSAS or SSD. as long as the write cache can coalesce data good enough and/or there is enough bandwidth to the media combined with the fact that Scale tries to align things in data and metadata blocks where possible NLSAS drives are just fine for many cases. But, there is one particular part where flash in any form is unbeatable and that is the main reason people should buy it for metadata - eventually one needs to read all that metadata back, does a directory listing, deletes a folder, etc. in all this cases you need to READ stuff back. while write caches and smart logic in a filesystem client can help significant with writes, on reads there is no magic that can be done (except massive caches again but thats unrealistic for larger systems), so you have to get the data from the media and now having 100 usec response time on flash vs 10 ms average will make a significant difference for real world applications. btw. i offered to speak at the SC18 event in Dallas about Scale related work, even i don't work for IBM anymore. if my talk gets accepted i will see some of you in Dallas :-D Sven On Fri, Sep 21, 2018 at 3:14 AM Jan-Frode Myklebust wrote: > That reminds me of a point Sven made when I was trying to optimize mdtest > results with metadata on FlashSystem... He sent me the following: > > -- started at 11/15/2015 15:20:39 -- > mdtest-1.9.3 was launched with 138 total task(s) on 23 node(s) > Command line used: /ghome/oehmes/mpi/bin/mdtest-pcmpi9131-existingdir -d > /ibm/fs2-4m-02/shared/mdtest-ec -i 1 -n 70000 -F -i 1 -w 0 -Z -u > Path: /ibm/fs2-4m-02/ > sharedFS: 32.0 TiB Used FS: 6.7% Inodes: 145.4 Mi Used Inodes: 22.0% > 138 tasks, 9660000 files > SUMMARY: (of 1 iterations) > Operation Max Min Mean > Std Dev > --------- --- --- ---- > ------- > File creation : 650440.486 650440.486 650440.486 > 0.000 > File stat : 23599134.618 23599134.618 23599134.618 > 0.000 > File read : 2171391.097 2171391.097 2171391.097 > 0.000 > File removal : 1007566.981 1007566.981 1007566.981 > 0.000 > Tree creation : 3.072 3.072 3.072 > 0.000 > Tree removal : 1.471 1.471 1.471 > 0.000 > -- finished at 11/15/2015 15:21:10 -- > > from a GL6 -- only spinning disks -- pointing out that mdtest doesn't > really require Flash/SSD. The key to good results are > > a) large GPFS log ( mmchfs -L 128m) > > b) high maxfilestocache (you need to be able to cache all entries , so for > 10 million across 20 nodes you need to have at least 750k per node) > > c) fast network, thats key to handle the token requests and metadata > operations that need to get over the network. > > > > -jf > > On Fri, Sep 21, 2018 at 10:22 AM Olaf Weiser > wrote: > >> see a mdtest for a default block size file system ... >> 4 MB blocksize.. >> mdata is on SSD >> data is on HDD ... 
>> which is not really relevant for this mdtest ;-)
>>
>> -- started at 09/07/2018 06:54:54 --
>>
>> mdtest-1.9.3 was launched with 40 total task(s) on 20 node(s)
>> Command line used: mdtest -n 25000 -i 3 -u -d /homebrewed/gh24_4m_4m/mdtest
>> Path: /homebrewed/gh24_4m_4m
>> FS: 10.0 TiB   Used FS: 0.0%   Inodes: 12.0 Mi   Used Inodes: 2.3%
>>
>> 40 tasks, 1000000 files/directories
>>
>> SUMMARY: (of 3 iterations)
>>    Operation             Max             Min            Mean        Std Dev
>>    ---------             ---             ---            ----        -------
>>    Directory creation:    449160.409      430869.822     437002.187    8597.272
>>    Directory stat    :   6664420.560     5785712.544    6324276.731  385192.527
>>    Directory removal :    398360.058      351503.369     371630.648   19690.580
>>    File creation     :    288985.217      270550.129     279096.800    7585.659
>>    File stat         :   6720685.117     6641301.499    6674123.407   33833.182
>>    File read         :   3055661.372     2871044.881    2945513.966   79479.638
>>    File removal      :    215187.602      146639.435     179898.441   28021.467
>>    Tree creation     :        10.215           3.165          6.603       2.881
>>    Tree removal      :         5.484           0.880          2.418       2.168
>>
>> -- finished at 09/07/2018 06:55:42 --
>>
>> Mit freundlichen Grüßen / Kind regards
>>
>> Olaf Weiser
>>
>> EMEA Storage Competence Center Mainz, Germany / IBM Systems, Storage Platform,
>> -------------------------------------------------------------------------------------------------------------------------------------------
>> IBM Deutschland
>> IBM Allee 1
>> 71139 Ehningen
>> Phone: +49-170-579-44-66
>> E-Mail: olaf.weiser at de.ibm.com
>> -------------------------------------------------------------------------------------------------------------------------------------------
>> IBM Deutschland GmbH / Vorsitzender des Aufsichtsrats: Martin Jetter
>> Geschäftsführung: Martina Koederitz (Vorsitzende), Susanne Peter, Norbert
>> Janzen, Dr. Christian Keller, Ivo Koerner, Markus Koerner
>> Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart,
>> HRB 14562 / WEEE-Reg.-Nr. DE 99369940
>>
>> From: "Andrew Beattie"
>> To: gpfsug-discuss at spectrumscale.org
>> Date: 09/21/2018 02:34 AM
>> Subject: Re: [gpfsug-discuss] Metadata with GNR code
>> Sent by: gpfsug-discuss-bounces at spectrumscale.org
>> ------------------------------
>>
>> Simon,
>>
>> My recommendation is still very much to use SSD for Metadata and NL-SAS
>> for data, and the GH14 / GH24 building blocks certainly make this much easier.
>>
>> Unless your filesystem is massive (Summit sized) you will typically still
>> continue to benefit from the random IO performance of SSD (even RI SSD) in
>> comparison to NL-SAS.
>>
>> It still makes more sense to me to continue to use 2-copy or 3-copy for
>> metadata even in ESS / GNR style environments. The read performance for
>> metadata using 3-copy is still significantly better than any other scenario.
>>
>> As with anything there are exceptions to the rule, but my experiences
>> with ESS and ESS with SSD so far still maintain that the standard thoughts
>> on managing metadata and small file IO remain the same -- even with the
>> improvements around sub-blocks with Scale V5.
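The comparison described above can be run directly against two filesystems that
differ only in block/sub-block geometry. A rough sketch; the filesystem names,
mount paths and MPI launcher are assumptions, and the mdtest flags mirror Olaf's
run above:

    # filesystem formatted with Scale 5 defaults: 4 MiB blocks, 8 KiB sub-blocks (512 per block)
    mpirun -np 40 mdtest -n 25000 -i 3 -u -d /gpfs_4m_new/mdtest
    # filesystem with the older 1/32 sub-block geometry: 4 MiB blocks, 128 KiB fragments
    mpirun -np 40 mdtest -n 25000 -i 3 -u -d /gpfs_4m_old/mdtest
    # confirm the block and sub-block size of each filesystem
    mmlsfs gpfs_4m_new -B -f
    mmlsfs gpfs_4m_old -B -f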
>> >> MDtest is still the typical benchmark for this comparison and MDTest >> shows some very clear differences even on SSD when you use a large >> filesystem block size with more sub blocks vs a smaller block size with >> 1/32 subblocks >> >> This only gets worse if you change the storage media from SSD to NL-SAS >> *Andrew Beattie* >> *Software Defined Storage - IT Specialist* >> *Phone: *614-2133-7927 >> *E-mail: **abeattie at au1.ibm.com* >> >> >> ----- Original message ----- >> From: Simon Thompson >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> To: "gpfsug-discuss at spectrumscale.org" >> Cc: >> Subject: [gpfsug-discuss] Metadata with GNR code >> Date: Fri, Sep 21, 2018 3:29 AM >> >> Just wondering if anyone has any strong views/recommendations with >> metadata when using GNR code? >> >> >> >> I know in ?san? based GPFS, there is a recommendation to have data and >> metadata split with the metadata on SSD. >> >> >> >> I?ve also heard that with GNR there isn?t much difference in splitting >> data and metadata. >> >> >> >> We?re looking at two systems and want to replicate metadata, but not data >> (mostly) between them, so I?m not really sure how we?d do this without >> having separate system pool (and then NSDs in different failure groups)?. >> >> >> >> If we used 8+2P vdisks for metadata only, would we still see no >> difference in performance compared to mixed (I guess the 8+2P is still >> spread over a DA so we?d get half the drives in the GNR system active?). >> >> >> >> Or should we stick SSD based storage in as well for the metadata pool? >> (Which brings an interesting question about RAID code related to the recent >> discussions on mirroring vs RAID5?) >> >> >> >> Thoughts welcome! >> >> >> >> Simon >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Mon Sep 24 00:19:03 2018 From: aaron.s.knister at nasa.gov (Knister, Aaron S. (GSFC-606.2)[InuTeq, LLC]) Date: Sun, 23 Sep 2018 23:19:03 +0000 Subject: [gpfsug-discuss] P_Key question Message-ID: <91BB1F6E-487A-4740-9787-9A688631D851@nasa.gov> Dear GPFS?ers (or I guess Scalers...), How do P_Keys work with remote clusters? E.g if I have a cluster using the default pkey and a remote cluster with verbsRdmaPkey set can the two communicate without using rdma cm? To take it a step further, what if cluster A has verbsRdmaCm disabled but the remote cluster has it enabled? Do I have any hope of making that work without changing the rdmaCm setting? Trying to secure cross cluster RDMA communication in any reasonable way is making my head hurt... -Aaron -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 

From aaron.s.knister at nasa.gov  Mon Sep 24 23:44:14 2018
From: aaron.s.knister at nasa.gov (Aaron Knister)
Date: Mon, 24 Sep 2018 18:44:14 -0400
Subject: [gpfsug-discuss] [non-nasa source] P_Key question
In-Reply-To: <91BB1F6E-487A-4740-9787-9A688631D851@nasa.gov>
References: <91BB1F6E-487A-4740-9787-9A688631D851@nasa.gov>
Message-ID: <610270d2-cbdf-7650-e70c-37fbde75f147@nasa.gov>

Answered my own question here. In both cases (at least with GPFS 4.2) the answer
is "it doesn't work", which is what I figured, certainly within a single cluster;
but between clusters I had hoped there would be some smarts to deal with
multi-cluster situations, given that all the plumbing seems to be there to deal
with it.

-Aaron

On 9/23/18 7:19 PM, Knister, Aaron S. (GSFC-606.2)[InuTeq, LLC] wrote:
> Dear GPFS'ers (or I guess Scalers...),
>
> How do P_Keys work with remote clusters? E.g. if I have a cluster using
> the default pkey and a remote cluster with verbsRdmaPkey set, can the two
> communicate without using RDMA CM? To take it a step further, what if
> cluster A has verbsRdmaCm disabled but the remote cluster has it
> enabled? Do I have any hope of making that work without changing the
> rdmaCm setting? Trying to secure cross-cluster RDMA communication in any
> reasonable way is making my head hurt...
>
> -Aaron
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>

-- 
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776

From bbanister at jumptrading.com  Tue Sep 25 15:43:16 2018
From: bbanister at jumptrading.com (Bryan Banister)
Date: Tue, 25 Sep 2018 14:43:16 +0000
Subject: [gpfsug-discuss] mmfsd and oom settings
In-Reply-To: <20180917155426.fi54lkmduizegpow@ics.muni.cz>
References: <20180917155426.fi54lkmduizegpow@ics.muni.cz>
Message-ID: <2db81e09701f42b98b92f5dc30d2b712@jumptrading.com>

The latest versions of the IBM-provided gpfs.service systemd unit do set the OOM
score so that the mmfsd process will not be killed. I would prefer to have a GPFS
config option for the OOM score adjustment which I could set on a per-node basis.
There are times when you may want to let GPFS be killed on a node in order to
preserve the health of the overall cluster, helping to prevent deadlocks on a
node that is under heavy memory pressure.

Hope that helps,
-Bryan

-----Original Message-----
From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Lukas Hejtmanek
Sent: Monday, September 17, 2018 10:54 AM
To: gpfsug-discuss at spectrumscale.org
Subject: [gpfsug-discuss] mmfsd and oom settings

[EXTERNAL EMAIL]

Hello,

I accidentally got mmfsd killed by the OOM killer because of the pagepool size,
which is normally OK, but there was a memory leak in an smbd process so the
system ran out of memory (a 64 GB pagepool but also two 32 GB smbd processes).

Shouldn't the GPFS startup script set oom_score_adj to some proper value so
that mmfsd never gets killed? It basically ruins things..

-- 
Lukáš Hejtmánek

Linux Administrator only because
Full Time Multitasking Ninja is not an official job title

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
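A minimal sketch of the per-node override Bryan describes above; the drop-in path
and the score values are assumptions, and recent gpfs.service units may already
ship an equivalent setting:

    # Pin mmfsd well below the OOM killer's preferred victims on this node
    mkdir -p /etc/systemd/system/gpfs.service.d
    cat > /etc/systemd/system/gpfs.service.d/50-oom.conf <<'EOF'
    [Service]
    OOMScoreAdjust=-1000
    EOF
    systemctl daemon-reload

    # Or, on a node where you would rather sacrifice GPFS first to keep the rest
    # of the cluster healthy, raise the score of the running daemon (value illustrative):
    echo 500 > /proc/$(pidof mmfsd)/oom_score_adj

The drop-in takes effect on the next GPFS restart; the /proc write only lasts
until mmfsd is restarted.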
From bbanister at jumptrading.com  Tue Sep 25 18:22:13 2018
From: bbanister at jumptrading.com (Bryan Banister)
Date: Tue, 25 Sep 2018 17:22:13 +0000
Subject: [gpfsug-discuss] replicating ACLs across GPFS's?
In-Reply-To: <14417376-CF84-45F8-9461-DBE1A86777D8@bham.ac.uk>
References: <14417376-CF84-45F8-9461-DBE1A86777D8@bham.ac.uk>
Message-ID: <3c036815d65d4ac0a061f3f1bec11159@jumptrading.com>

Thanks Simon,

I tried out the older patched version of rsync to see if that would work, but am
still not able to preserve ACLs from a non-GPFS source.
Simon From: > on behalf of Simon Thompson > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Friday, 14 September 2018 at 09:41 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] replicating ACLs across GPFS's? Last time I built was still against 3.0.9, note there is also a PR in there which fixes the bug with symlinks. If anyone wants to rebase the patches against 3.1.3, I?ll happily take a PR ? Simon From: > on behalf of "bbanister at jumptrading.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Friday, 14 September 2018 at 00:33 To: "gpfsug-discuss at spectrumscale.org" > Subject: [gpfsug-discuss] replicating ACLs across GPFS's? I?m checking in on this thread. Is this patch still working for people with the latest rsync releases? https://github.com/gpfsug/gpfsug-tools/tree/master/bin/rsync Thanks! -Bryan ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential, or privileged information and/or personal data. If you are not the intended recipient, you are hereby notified that any review, dissemination, or copying of this email is strictly prohibited, and requested to notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request, or solicitation of any kind to buy, sell, subscribe, redeem, or perform any type of transaction of a financial product. Personal data, as defined by applicable data privacy laws, contained in this email may be processed by the Company, and any of its affiliated or related companies, for potential ongoing compliance and/or business-related purposes. You may have rights regarding your personal data; for information on exercising these rights or the Company?s treatment of personal data, please email datarequests at jumptrading.com. ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential, or privileged information and/or personal data. If you are not the intended recipient, you are hereby notified that any review, dissemination, or copying of this email is strictly prohibited, and requested to notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request, or solicitation of any kind to buy, sell, subscribe, redeem, or perform any type of transaction of a financial product. Personal data, as defined by applicable data privacy laws, contained in this email may be processed by the Company, and any of its affiliated or related companies, for potential ongoing compliance and/or business-related purposes. You may have rights regarding your personal data; for information on exercising these rights or the Company?s treatment of personal data, please email datarequests at jumptrading.com. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From bbanister at jumptrading.com Tue Sep 25 19:05:37 2018 From: bbanister at jumptrading.com (Bryan Banister) Date: Tue, 25 Sep 2018 18:05:37 +0000 Subject: [gpfsug-discuss] replicating ACLs across GPFS's? In-Reply-To: <3c036815d65d4ac0a061f3f1bec11159@jumptrading.com> References: <14417376-CF84-45F8-9461-DBE1A86777D8@bham.ac.uk> <3c036815d65d4ac0a061f3f1bec11159@jumptrading.com> Message-ID: <83f8d6b783e74b9ea8268e6aca0b4a07@jumptrading.com> I have to correct myself, looks like using nfs4_getacl, nfs4_setfacl, nfs4_editfacl on the NFSv4 client mount of the GPFS file system from a CES protocol node is working. So could use that to basically crawl the file system, getting the ?outside source? NFSv4 ACL and then applying that to the file on the NFSv4 client mount of the GPFS file system. Sorry for confusion, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Bryan Banister Sent: Tuesday, September 25, 2018 12:22 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] replicating ACLs across GPFS's? [EXTERNAL EMAIL] Thanks Simon, I tried out the older patched version of rsync to see if that would work, but still not able to preserve ACLs from an non-GPFS source. There was another thread about this on the user group some time ago as well (2013!), but doesn?t look like any real solution was found (Copy ACLs from outside sources). I?ve also tried tar | tar, but not luck with that either. GPFS doesn?t support the nfs4_getacl, nfs4_setfacl, nfs4_editfacl suite of commands, but maybe that could be added?? I could maybe hack something up that would basically crawl the ?outside source? namespace, using the nfs4_getacl operation get the NFSv4 ACLs, parse that output, then attempt to use GPFS `mmputacl` to store the ACL again. This seems like a horrible way to go, likely prone to mistakes, tough to validate, nightmare to maintain. Anybody got better ideas? Thanks! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org > On Behalf Of Simon Thompson Sent: Friday, September 14, 2018 8:37 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] replicating ACLs across GPFS's? [EXTERNAL EMAIL] Oh I also heard a rumour of some sort of mmcopy type sample script, but I can?t see it in samples on 5.0.1-2? Simon From: > on behalf of Simon Thompson > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Friday, 14 September 2018 at 09:41 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] replicating ACLs across GPFS's? Last time I built was still against 3.0.9, note there is also a PR in there which fixes the bug with symlinks. If anyone wants to rebase the patches against 3.1.3, I?ll happily take a PR ? Simon From: > on behalf of "bbanister at jumptrading.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Friday, 14 September 2018 at 00:33 To: "gpfsug-discuss at spectrumscale.org" > Subject: [gpfsug-discuss] replicating ACLs across GPFS's? I?m checking in on this thread. Is this patch still working for people with the latest rsync releases? https://github.com/gpfsug/gpfsug-tools/tree/master/bin/rsync Thanks! -Bryan ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential, or privileged information and/or personal data. 
You may have rights regarding your personal data; for information on exercising these rights or the Company?s treatment of personal data, please email datarequests at jumptrading.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: From janfrode at tanso.net Tue Sep 25 20:18:01 2018 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Tue, 25 Sep 2018 21:18:01 +0200 Subject: [gpfsug-discuss] replicating ACLs across GPFS's? In-Reply-To: <83f8d6b783e74b9ea8268e6aca0b4a07@jumptrading.com> References: <14417376-CF84-45F8-9461-DBE1A86777D8@bham.ac.uk> <3c036815d65d4ac0a061f3f1bec11159@jumptrading.com> <83f8d6b783e74b9ea8268e6aca0b4a07@jumptrading.com> Message-ID: Not sure if better or worse idea, but I believe robocopy support syncing just the ACLs, so if you do SMB mounts from both sides, that might be an option. -jf tir. 25. sep. 2018 kl. 20:05 skrev Bryan Banister : > I have to correct myself, looks like using nfs4_getacl, nfs4_setfacl, > nfs4_editfacl on the NFSv4 client mount of the GPFS file system from a CES > protocol node is working. So could use that to basically crawl the file > system, getting the ?outside source? NFSv4 ACL and then applying that to > the file on the NFSv4 client mount of the GPFS file system. > > > > Sorry for confusion, > > -Bryan > > > > *From:* gpfsug-discuss-bounces at spectrumscale.org < > gpfsug-discuss-bounces at spectrumscale.org> *On Behalf Of *Bryan Banister > *Sent:* Tuesday, September 25, 2018 12:22 PM > > > *To:* gpfsug main discussion list > *Subject:* Re: [gpfsug-discuss] replicating ACLs across GPFS's? > > > > [EXTERNAL EMAIL] > > Thanks Simon, > > > > I tried out the older patched version of rsync to see if that would work, > but still not able to preserve ACLs from an non-GPFS source. There was > another thread about this on the user group some time ago as well (2013!), > but doesn?t look like any real solution was found (Copy ACLs from outside > sources > > ). > > > > I?ve also tried tar | tar, but not luck with that either. > > > > GPFS doesn?t support the nfs4_getacl, nfs4_setfacl, nfs4_editfacl suite of > commands, but maybe that could be added?? > > > > I could maybe hack something up that would basically crawl the ?outside > source? namespace, using the nfs4_getacl operation get the NFSv4 ACLs, > parse that output, then attempt to use GPFS `mmputacl` to store the ACL > again. This seems like a horrible way to go, likely prone to mistakes, > tough to validate, nightmare to maintain. > > > > Anybody got better ideas? > > > > Thanks! > > -Bryan > > > > *From:* gpfsug-discuss-bounces at spectrumscale.org < > gpfsug-discuss-bounces at spectrumscale.org> *On Behalf Of *Simon Thompson > *Sent:* Friday, September 14, 2018 8:37 AM > *To:* gpfsug main discussion list > *Subject:* Re: [gpfsug-discuss] replicating ACLs across GPFS's? > > > > [EXTERNAL EMAIL] > > Oh I also heard a rumour of some sort of mmcopy type sample script, but I > can?t see it in samples on 5.0.1-2? > > > > Simon > > > > *From: * on behalf of Simon > Thompson > *Reply-To: *"gpfsug-discuss at spectrumscale.org" < > gpfsug-discuss at spectrumscale.org> > *Date: *Friday, 14 September 2018 at 09:41 > *To: *"gpfsug-discuss at spectrumscale.org" > > *Subject: *Re: [gpfsug-discuss] replicating ACLs across GPFS's? > > > > Last time I built was still against 3.0.9, note there is also a PR in > there which fixes the bug with symlinks. > > > > If anyone wants to rebase the patches against 3.1.3, I?ll happily take a > PR ? 
> > > > Simon > > > > *From: * on behalf of " > bbanister at jumptrading.com" > *Reply-To: *"gpfsug-discuss at spectrumscale.org" < > gpfsug-discuss at spectrumscale.org> > *Date: *Friday, 14 September 2018 at 00:33 > *To: *"gpfsug-discuss at spectrumscale.org" > > *Subject: *[gpfsug-discuss] replicating ACLs across GPFS's? > > > > I?m checking in on this thread. Is this patch still working for people > with the latest rsync releases? > > https://github.com/gpfsug/gpfsug-tools/tree/master/bin/rsync > > > > > Thanks! > > -Bryan > > > ------------------------------ > > > Note: This email is for the confidential use of the named addressee(s) > only and may contain proprietary, confidential, or privileged information > and/or personal data. If you are not the intended recipient, you are hereby > notified that any review, dissemination, or copying of this email is > strictly prohibited, and requested to notify the sender immediately and > destroy this email and any attachments. Email transmission cannot be > guaranteed to be secure or error-free. The Company, therefore, does not > make any guarantees as to the completeness or accuracy of this email or any > attachments. This email is for informational purposes only and does not > constitute a recommendation, offer, request, or solicitation of any kind to > buy, sell, subscribe, redeem, or perform any type of transaction of a > financial product. Personal data, as defined by applicable data privacy > laws, contained in this email may be processed by the Company, and any of > its affiliated or related companies, for potential ongoing compliance > and/or business-related purposes. You may have rights regarding your > personal data; for information on exercising these rights or the Company?s > treatment of personal data, please email datarequests at jumptrading.com. > > > ------------------------------ > > > Note: This email is for the confidential use of the named addressee(s) > only and may contain proprietary, confidential, or privileged information > and/or personal data. If you are not the intended recipient, you are hereby > notified that any review, dissemination, or copying of this email is > strictly prohibited, and requested to notify the sender immediately and > destroy this email and any attachments. Email transmission cannot be > guaranteed to be secure or error-free. The Company, therefore, does not > make any guarantees as to the completeness or accuracy of this email or any > attachments. This email is for informational purposes only and does not > constitute a recommendation, offer, request, or solicitation of any kind to > buy, sell, subscribe, redeem, or perform any type of transaction of a > financial product. Personal data, as defined by applicable data privacy > laws, contained in this email may be processed by the Company, and any of > its affiliated or related companies, for potential ongoing compliance > and/or business-related purposes. You may have rights regarding your > personal data; for information on exercising these rights or the Company?s > treatment of personal data, please email datarequests at jumptrading.com. > > ------------------------------ > > Note: This email is for the confidential use of the named addressee(s) > only and may contain proprietary, confidential, or privileged information > and/or personal data. 
If you are not the intended recipient, you are hereby > notified that any review, dissemination, or copying of this email is > strictly prohibited, and requested to notify the sender immediately and > destroy this email and any attachments. Email transmission cannot be > guaranteed to be secure or error-free. The Company, therefore, does not > make any guarantees as to the completeness or accuracy of this email or any > attachments. This email is for informational purposes only and does not > constitute a recommendation, offer, request, or solicitation of any kind to > buy, sell, subscribe, redeem, or perform any type of transaction of a > financial product. Personal data, as defined by applicable data privacy > laws, contained in this email may be processed by the Company, and any of > its affiliated or related companies, for potential ongoing compliance > and/or business-related purposes. You may have rights regarding your > personal data; for information on exercising these rights or the Company?s > treatment of personal data, please email datarequests at jumptrading.com. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Tue Sep 25 22:40:50 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Tue, 25 Sep 2018 17:40:50 -0400 Subject: [gpfsug-discuss] replicating ACLs across GPFS's? In-Reply-To: <3c036815d65d4ac0a061f3f1bec11159@jumptrading.com> References: <14417376-CF84-45F8-9461-DBE1A86777D8@bham.ac.uk> <3c036815d65d4ac0a061f3f1bec11159@jumptrading.com> Message-ID: <317dae7d-fbfa-791b-78de-0d5eb2e6a490@nasa.gov> Just to clarify for myself, is it *all* ACLs that aren't being preserved or just NFS4 ACLs that aren't being preserved (e.g. POSIX ACLs work just fine). If it's just NFS4 ACLs, I suspect it might not be too hard to modify rsync based on the existing patches to translate the nfs4_getfacl output to a gpfs_acl_t struct and use gpfs_putacl to write it. https://www.ibm.com/support/knowledgecenter/SSFKCN_4.1.0.4/com.ibm.cluster.gpfs.v4r104.gpfs100.doc/bl1adm_gpfs_acl_t.htm Just bear in mind that, to the best of my knowledge, calls like gpfs_putacl can be vulnerable to symlink attacks. -Aaron On 9/25/18 1:22 PM, Bryan Banister wrote: > Thanks Simon, > > I tried out the older patched version of rsync to see if that would > work, but still not able to preserve ACLs from an non-GPFS source. > There was another thread about this on the user group some time ago as > well (2013!), but doesn?t look like any real solution was found (Copy > ACLs from outside sources > ). > > I?ve also tried tar | tar, but not luck with that either. > > GPFS doesn?t support the nfs4_getacl, nfs4_setfacl, nfs4_editfacl suite > of commands, but maybe that could be added?? > > I could maybe hack something up that would basically crawl the ?outside > source? namespace, using the nfs4_getacl operation get the NFSv4 ACLs, > parse that output, then attempt to use GPFS `mmputacl` to store the ACL > again.? This seems like a horrible way to go, likely prone to mistakes, > tough to validate, nightmare to maintain. > > Anybody got better ideas? > > Thanks! 
> > -Bryan > > *From:* gpfsug-discuss-bounces at spectrumscale.org > *On Behalf Of *Simon Thompson > *Sent:* Friday, September 14, 2018 8:37 AM > *To:* gpfsug main discussion list > *Subject:* Re: [gpfsug-discuss] replicating ACLs across GPFS's? > > [EXTERNAL EMAIL] > > Oh I also heard a rumour of some sort of mmcopy type sample script, but > I can?t see it in samples on 5.0.1-2? > > Simon > > *From: * > on behalf of Simon > Thompson > > *Reply-To: *"gpfsug-discuss at spectrumscale.org > " > > > *Date: *Friday, 14 September 2018 at 09:41 > *To: *"gpfsug-discuss at spectrumscale.org > " > > > *Subject: *Re: [gpfsug-discuss] replicating ACLs across GPFS's? > > Last time I built was still against 3.0.9, note there is also a PR in > there which fixes the bug with symlinks. > > If anyone wants to rebase the patches against 3.1.3, I?ll happily take a > PR ? > > Simon > > *From: * > on behalf of > "bbanister at jumptrading.com " > > > *Reply-To: *"gpfsug-discuss at spectrumscale.org > " > > > *Date: *Friday, 14 September 2018 at 00:33 > *To: *"gpfsug-discuss at spectrumscale.org > " > > > *Subject: *[gpfsug-discuss] replicating ACLs across GPFS's? > > I?m checking in on this thread.? Is this patch still working for people > with the latest rsync releases? > > https://github.com/gpfsug/gpfsug-tools/tree/master/bin/rsync > > Thanks! > > -Bryan > > ------------------------------------------------------------------------ > > > Note: This email is for the confidential use of the named addressee(s) > only and may contain proprietary, confidential, or privileged > information and/or personal data. If you are not the intended recipient, > you are hereby notified that any review, dissemination, or copying of > this email is strictly prohibited, and requested to notify the sender > immediately and destroy this email and any attachments. Email > transmission cannot be guaranteed to be secure or error-free. The > Company, therefore, does not make any guarantees as to the completeness > or accuracy of this email or any attachments. This email is for > informational purposes only and does not constitute a recommendation, > offer, request, or solicitation of any kind to buy, sell, subscribe, > redeem, or perform any type of transaction of a financial product. > Personal data, as defined by applicable data privacy laws, contained in > this email may be processed by the Company, and any of its affiliated or > related companies, for potential ongoing compliance and/or > business-related purposes. You may have rights regarding your personal > data; for information on exercising these rights or the Company?s > treatment of personal data, please email datarequests at jumptrading.com > . > > > ------------------------------------------------------------------------ > > Note: This email is for the confidential use of the named addressee(s) > only and may contain proprietary, confidential, or privileged > information and/or personal data. If you are not the intended recipient, > you are hereby notified that any review, dissemination, or copying of > this email is strictly prohibited, and requested to notify the sender > immediately and destroy this email and any attachments. Email > transmission cannot be guaranteed to be secure or error-free. The > Company, therefore, does not make any guarantees as to the completeness > or accuracy of this email or any attachments. 
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>

-- 
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776

From jonathan.buzzard at strath.ac.uk  Wed Sep 26 18:13:41 2018
From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard)
Date: Wed, 26 Sep 2018 18:13:41 +0100
Subject: [gpfsug-discuss] replicating ACLs across GPFS's?
In-Reply-To: <3c036815d65d4ac0a061f3f1bec11159@jumptrading.com>
References: <14417376-CF84-45F8-9461-DBE1A86777D8@bham.ac.uk>
	<3c036815d65d4ac0a061f3f1bec11159@jumptrading.com>
Message-ID: <1537982021.17046.48.camel@strath.ac.uk>

On Tue, 2018-09-25 at 17:22 +0000, Bryan Banister wrote:
> Thanks Simon,
>
> I tried out the older patched version of rsync to see if that would
> work, but still not able to preserve ACLs from a non-GPFS source.
> There was another thread about this on the user group some time ago
> as well (2013!), but doesn't look like any real solution was found
> (Copy ACLs from outside sources).
>
> I've also tried tar | tar, but no luck with that either.
>
> GPFS doesn't support the nfs4_getacl, nfs4_setfacl, nfs4_editfacl
> suite of commands, but maybe that could be added??
>

Well no, they work completely differently. However I did write about
this last month. You can do this by modifying just nfs4_acl_for_path.c
and nfs4_set_acl.c so they read/write the GPFS ACL struct and convert
between the GPFS representation and the internal data structure used by
the nfs4-acl-tools to hold NFSv4 ACL's. I have it working for
nfs4_getacl. Though this in of itself gets nothing over mmgetacl, other
than proving the concept valid. I don't have a test GPFS cluster these
days so I need to tread very lightly.

However I had some questions that I was hoping someone from IBM might
answer but didn't, and have been busy since. Namely

1. What's the purpose of a special flag to indicate that it is smbd
setting the ACL? Does this tie in with the undocumented "mmchfs -k
samba" feature?

2. There is a whole bunch of stuff in the documentation about v4.1
ACL's. How does one trigger that? All I seem to be able to do is
get POSIX and v4 ACL's. Do you get v4.1 ACL's if you set the file
system to "Samba" ACL's?

> I could maybe hack something up that would basically crawl the
> "outside source" namespace, using the nfs4_getacl operation to get the
> NFSv4 ACLs, parse that output, then attempt to use GPFS `mmputacl` to
> store the ACL again. This seems like a horrible way to go, likely
> prone to mistakes, tough to validate, nightmare to maintain.
>

I have said it before and will say it again, mmputacl is an
abomination that needs to be put down with extreme prejudice.
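A minimal sketch of the crawl Bryan describes, but using the route he reports
working earlier in the thread (applying the ACLs through an NFSv4 client mount of
the GPFS filesystem exported from a CES node, rather than mmputacl); the mount
points are assumptions:

    # SRC: the non-GPFS source tree, NFSv4-mounted.
    # DST: the GPFS filesystem, NFSv4-mounted from a CES protocol node.
    SRC=/mnt/source
    DST=/mnt/gpfs-nfs4
    tmp=$(mktemp)
    ( cd "$SRC" && find . -depth -print0 ) | while IFS= read -r -d '' f; do
        # read the NFSv4 ACL from the source and re-apply it to the copied file
        if nfs4_getfacl "$SRC/$f" > "$tmp" 2>/dev/null; then
            nfs4_setfacl -S "$tmp" "$DST/$f" || echo "ACL failed: $f" >&2
        fi
    done
    rm -f "$tmp"

This assumes both ends are mounted with NFS version 4 so the ACLs are visible to
the nfs4-acl-tools, and it sidesteps mmputacl entirely.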
I still think that longer term it would be better to modify?FreeBSD's setfacl/getfacl (say renamed to mmsetfacl and mmgetfacl) to do the job, on the basis that they handle both POSIX and NFSv4 ACL's in a? single command. Though strictly speaking you only need an mmsetfacl. Perhaps a RFE? JAB. -- Jonathan A. Buzzard?????????????????????????Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From S.J.Thompson at bham.ac.uk Wed Sep 26 18:40:26 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Wed, 26 Sep 2018 17:40:26 +0000 Subject: [gpfsug-discuss] replicating ACLs across GPFS's? In-Reply-To: <1537982021.17046.48.camel@strath.ac.uk> References: <14417376-CF84-45F8-9461-DBE1A86777D8@bham.ac.uk> <3c036815d65d4ac0a061f3f1bec11159@jumptrading.com>, <1537982021.17046.48.camel@strath.ac.uk> Message-ID: Don't forget we have the upcoming pitch you RFE online meeting. RFEs have not been flooding in and registrations for the pitch meeting are rather thin on the ground... Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jonathan Buzzard [jonathan.buzzard at strath.ac.uk] Sent: 26 September 2018 18:13 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] replicating ACLs across GPFS's? On Tue, 2018-09-25 at 17:22 +0000, Bryan Banister wrote: > Thanks Simon, > > I tried out the older patched version of rsync to see if that would > work, but still not able to preserve ACLs from an non-GPFS source. > There was another thread about this on the user group some time ago > as well (2013!), but doesn?t look like any real solution was found > (Copy ACLs from outside sources). > > I?ve also tried tar | tar, but not luck with that either. > > GPFS doesn?t support the nfs4_getacl, nfs4_setfacl, nfs4_editfacl > suite of commands, but maybe that coulnfs4_acl_for_path.d be added?? > Well no they work completely differently. However I did write about this last month. You can do this by modifying just nfs4_acl_for_path.c and nfs4_set_acl.c so they read/write the GPFS ACL struct and convert between the GPFS representation and the internal data structure used by the nfs4-acl-tools to hold NFSv4 ACL's. I have it working for nfs4_getacl. Though this in of itself gets nothing over mmgetacl, other than proving the concept valid. I don't have a test GPFS cluster these days so I need to tread very lightly. However I had some questions that I was hoping someone from IBM might answer but didn't and have been busy since. Namely 1. What's the purpose of a special flag to indicate that it is smbd setting the ACL? Does this tie in with the undocumented "mmchfs -k samba" feature? 2. There is a whole bunch of stuff in the documentation about v4.1 ACL's. How does one trigger that. All I seem to be able to do is get POSIX and v4 ACL's. Do you get v4.1 ACL's if you set the file system to "Samba" ACL's? > I could maybe hack something up that would basically crawl the > ?outside source? namespace, using the nfs4_getacl operation get the > NFSv4 ACLs, parse that output, then attempt to use GPFS `mmputacl` to > store the ACL again. This seems like a horrible way to go, likely > prone to mistakes, tough to validate, nightmare to maintain. > I have said it before and will say it again, mmputacl is an abomination that needs to be put down with extreme prejudice. 
I still think that longer term it would be better to modify FreeBSD's setfacl/getfacl (say renamed to mmsetfacl and mmgetfacl) to do the job, on the basis that they handle both POSIX and NFSv4 ACL's in a single command. Though strictly speaking you only need an mmsetfacl. Perhaps a RFE? JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From jtucker at pixitmedia.com Wed Sep 26 18:43:35 2018 From: jtucker at pixitmedia.com (Jez Tucker) Date: Wed, 26 Sep 2018 18:43:35 +0100 Subject: [gpfsug-discuss] replicating ACLs across GPFS's? In-Reply-To: References: <14417376-CF84-45F8-9461-DBE1A86777D8@bham.ac.uk> <3c036815d65d4ac0a061f3f1bec11159@jumptrading.com> <1537982021.17046.48.camel@strath.ac.uk> Message-ID: Hey Carl, ? Are the 4 RFEs I've sent you Q2/18 under this new system or do I need to resubmit them? Jez On 26/09/18 18:40, Simon Thompson wrote: > Don't forget we have the upcoming pitch you RFE online meeting. > > RFEs have not been flooding in and registrations for the pitch meeting are rather thin on the ground... > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jonathan Buzzard [jonathan.buzzard at strath.ac.uk] > Sent: 26 September 2018 18:13 > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] replicating ACLs across GPFS's? > > On Tue, 2018-09-25 at 17:22 +0000, Bryan Banister wrote: >> Thanks Simon, >> >> I tried out the older patched version of rsync to see if that would >> work, but still not able to preserve ACLs from an non-GPFS source. >> There was another thread about this on the user group some time ago >> as well (2013!), but doesn?t look like any real solution was found >> (Copy ACLs from outside sources). >> >> I?ve also tried tar | tar, but not luck with that either. >> >> GPFS doesn?t support the nfs4_getacl, nfs4_setfacl, nfs4_editfacl >> suite of commands, but maybe that coulnfs4_acl_for_path.d be added?? >> > Well no they work completely differently. However I did write about > this last month. You can do this by modifying just nfs4_acl_for_path.c > and nfs4_set_acl.c so they read/write the GPFS ACL struct and convert > between the GPFS representation and the internal data structure used by > the nfs4-acl-tools to hold NFSv4 ACL's. I have it working for > nfs4_getacl. Though this in of itself gets nothing over mmgetacl, other > than proving the concept valid. I don't have a test GPFS cluster these > days so I need to tread very lightly. > > However I had some questions that I was hoping someone from IBM might > answer but didn't and have been busy since. Namely > > 1. What's the purpose of a special flag to indicate that it is smbd > setting the ACL? Does this tie in with the undocumented "mmchfs -k > samba" feature? > > 2. There is a whole bunch of stuff in the documentation about v4.1 > ACL's. How does one trigger that. All I seem to be able to do is > get POSIX and v4 ACL's. Do you get v4.1 ACL's if you set the file > system to "Samba" ACL's? > >> I could maybe hack something up that would basically crawl the >> ?outside source? namespace, using the nfs4_getacl operation get the >> NFSv4 ACLs, parse that output, then attempt to use GPFS `mmputacl` to >> store the ACL again. 
This seems like a horrible way to go, likely >> prone to mistakes, tough to validate, nightmare to maintain. >> > I have said it before and will say it again, mmputacl is an > abomination that needs to be put down with extreme prejudice. > > I still think that longer term it would be better to modify FreeBSD's > setfacl/getfacl (say renamed to mmsetfacl and mmgetfacl) to do the job, > on the basis that they handle both POSIX and NFSv4 ACL's in a > single command. Though strictly speaking you only need an mmsetfacl. > > Perhaps a RFE? > > JAB. > > -- > Jonathan A. Buzzard Tel: +44141-5483420 > HPC System Administrator, ARCHIE-WeSt. > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- *Jez Tucker* Head of Research and Development, Pixit Media 07764193820 | jtucker at pixitmedia.com www.pixitmedia.com | Tw:@pixitmedia.com -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Wed Sep 26 18:50:00 2018 From: bbanister at jumptrading.com (Bryan Banister) Date: Wed, 26 Sep 2018 17:50:00 +0000 Subject: [gpfsug-discuss] replicating ACLs across GPFS's? In-Reply-To: References: <14417376-CF84-45F8-9461-DBE1A86777D8@bham.ac.uk> <3c036815d65d4ac0a061f3f1bec11159@jumptrading.com>, <1537982021.17046.48.camel@strath.ac.uk> Message-ID: <2e40bed94c9744638aabbc9e4dbfb525@jumptrading.com> I was thinking the same thing Simon. Johnathan, if you're interested in working together on this RFE, then I'm happy to help! Just hit me up off list. Thanks for your response! -Bryan -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Simon Thompson Sent: Wednesday, September 26, 2018 12:40 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] replicating ACLs across GPFS's? [EXTERNAL EMAIL] Don't forget we have the upcoming pitch you RFE online meeting. RFEs have not been flooding in and registrations for the pitch meeting are rather thin on the ground... Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jonathan Buzzard [jonathan.buzzard at strath.ac.uk] Sent: 26 September 2018 18:13 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] replicating ACLs across GPFS's? On Tue, 2018-09-25 at 17:22 +0000, Bryan Banister wrote: > Thanks Simon, > > I tried out the older patched version of rsync to see if that would > work, but still not able to preserve ACLs from an non-GPFS source. 
> There was another thread about this on the user group some time ago as > well (2013!), but doesn?t look like any real solution was found (Copy > ACLs from outside sources). > > I?ve also tried tar | tar, but not luck with that either. > > GPFS doesn?t support the nfs4_getacl, nfs4_setfacl, nfs4_editfacl > suite of commands, but maybe that coulnfs4_acl_for_path.d be added?? > Well no they work completely differently. However I did write about this last month. You can do this by modifying just nfs4_acl_for_path.c and nfs4_set_acl.c so they read/write the GPFS ACL struct and convert between the GPFS representation and the internal data structure used by the nfs4-acl-tools to hold NFSv4 ACL's. I have it working for nfs4_getacl. Though this in of itself gets nothing over mmgetacl, other than proving the concept valid. I don't have a test GPFS cluster these days so I need to tread very lightly. However I had some questions that I was hoping someone from IBM might answer but didn't and have been busy since. Namely 1. What's the purpose of a special flag to indicate that it is smbd setting the ACL? Does this tie in with the undocumented "mmchfs -k samba" feature? 2. There is a whole bunch of stuff in the documentation about v4.1 ACL's. How does one trigger that. All I seem to be able to do is get POSIX and v4 ACL's. Do you get v4.1 ACL's if you set the file system to "Samba" ACL's? > I could maybe hack something up that would basically crawl the > ?outside source? namespace, using the nfs4_getacl operation get the > NFSv4 ACLs, parse that output, then attempt to use GPFS `mmputacl` to > store the ACL again. This seems like a horrible way to go, likely > prone to mistakes, tough to validate, nightmare to maintain. > I have said it before and will say it again, mmputacl is an abomination that needs to be put down with extreme prejudice. I still think that longer term it would be better to modify FreeBSD's setfacl/getfacl (say renamed to mmsetfacl and mmgetfacl) to do the job, on the basis that they handle both POSIX and NFSv4 ACL's in a single command. Though strictly speaking you only need an mmsetfacl. Perhaps a RFE? JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7Cbbanister%40jumptrading.com%7C79b6b05f19774e69f7c508d623d72bd8%7C11f2af738873424085a3063ce66fc61c%7C1%7C0%7C636735804397065948&sdata=UXl8e7i4Lw9aT023MqI5ys3hG3t8Trk1rMaq1toluxM%3D&reserved=0 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7Cbbanister%40jumptrading.com%7C79b6b05f19774e69f7c508d623d72bd8%7C11f2af738873424085a3063ce66fc61c%7C1%7C0%7C636735804397065948&sdata=UXl8e7i4Lw9aT023MqI5ys3hG3t8Trk1rMaq1toluxM%3D&reserved=0 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential, or privileged information and/or personal data. 
From olaf.weiser at de.ibm.com  Thu Sep 27 13:55:35 2018
From: olaf.weiser at de.ibm.com (Olaf Weiser)
Date: Thu, 27 Sep 2018 14:55:35 +0200
Subject: [gpfsug-discuss] IBM ESS - certified now for SAP
Message-ID: 

An HTML attachment was scrubbed...
URL: 

From Robert.Oesterlin at nuance.com  Thu Sep 27 15:22:23 2018
From: Robert.Oesterlin at nuance.com (Oesterlin, Robert)
Date: Thu, 27 Sep 2018 14:22:23 +0000
Subject: [gpfsug-discuss] Spectrum Scale User Group Meeting - NYC - New York Genome Center
Message-ID: <7E34B1A5-2412-4415-9095-C52EDDCE2A04@nuance.com>

For those of you in the NE US or NYC area, here is the agenda for the NYC meeting
coming up on October 24th. Special thanks to Richard Rupp at IBM for helping to
organize this event. If you can make it, please register at the Eventbrite link below.

Spectrum Scale User Group - NYC
October 24th, 2018
The New York Genome Center
101 Avenue of the Americas, New York, NY 10013
First Floor Auditorium

Register Here: https://www.eventbrite.com/e/2018-spectrum-scale-user-group-nyc-tickets-49786782607

08:45-09:00 Coffee & Registration
09:00-09:15 Welcome
09:15-09:45 What is new in IBM Spectrum Scale?
09:45-10:00 What is new in ESS?
10:00-10:20 How does CORAL help other workloads?
10:20-10:40 --- Break ---
10:40-11:00 Customer Talk - The New York Genome Center
11:00-11:20 Spinning up a Hadoop cluster on demand
11:20-11:40 Customer Talk - Mt. Sinai School of Medicine
11:40-12:10 Spectrum Scale Network Flow
12:10-13:00 --- Lunch ---
13:00-13:40 Special Announcement and Demonstration
13:40-14:00 Multi-cloud Transparent Cloud Tiering
14:00-14:20 Customer Talk - Princeton University
14:20-14:40 AI Reference Architecture
14:40-15:00 Updates on Container Support
15:00-15:20 Customer Talk - TBD
15:20-15:40 --- Break ---
15:40-16:10 IBM Spectrum Scale Tuning and Troubleshooting
16:10-16:40 Service Update
16:40-17:10 Open Forum
17:10-17:30 Wrap-Up
17:30-      Social Event

Bob Oesterlin
Sr Principal Storage Engineer, Nuance

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From Kevin.Buterbaugh at Vanderbilt.Edu  Thu Sep 27 16:04:05 2018
From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L)
Date: Thu, 27 Sep 2018 15:04:05 +0000
Subject: [gpfsug-discuss] What is this error message telling me?
Message-ID:

Hi All,

2018-09-27_09:48:50.923-0500: [E] The TCP connection to IP address 1.2.3.4 some client (socket 442) state is unexpected: ca_state=1 unacked=3 rto=27008000

Seeing errors like the above and trying to track down the root cause. I know that at last week's GPFS User Group meeting at ORNL this very error message was discussed, but I don't recall the details and the slides haven't been posted to the website yet. IIRC, the 'rto' is significant ...

I've Googled, but haven't gotten any hits, nor have I found anything in the GPFS 4.2.2 Problem Determination Guide.

Thanks in advance...

Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From jjdoherty at yahoo.com Thu Sep 27 17:00:03 2018
From: jjdoherty at yahoo.com (Jim Doherty)
Date: Thu, 27 Sep 2018 16:00:03 +0000 (UTC)
Subject: [gpfsug-discuss] What is this error message telling me?
In-Reply-To:
References:
Message-ID: <955514653.983560.1538064003171@mail.yahoo.com>

The data is also shown in an internaldump as part of the mmfsadm dump tscomm data; the RTO & RTT times are listed in microseconds. So the RTO here in my example is 18.5 seconds (see below). You can get the same information from the Linux networking command ss -i. The normal setting for RTO is 200 ms. Seeing retransmits and backoffs will drive up the RTO time. When I look at internaldumps from node expels it is not unusual to see 13 backoffs and retransmits and the RTO to have hit 120 seconds, at which point the tcp/ip connection times out.

 10.0.0.31.24/0
    state 1 established snd_wscale 10 rcv_wscale 10 rto 18558000 ato 40000
    retransmits 4 probes 0 backoff 4 options: TSTAMP SACK WSCALE
    rtt 2761650 rttvar 3238039 snd_ssthresh 4 snd_cwnd 5 unacked 0
    snd_mss 1992 rcv_mss 1992 pmtu 2044 advmss 1992 rcv_ssthresh 157708
    sacked 0 lost 0 retrans 0 fackets 0 reordering 3 ca_state 'open'

Jim

On Thursday, September 27, 2018, 11:14:43 AM EDT, Buterbaugh, Kevin L wrote:

Hi All,

2018-09-27_09:48:50.923-0500: [E] The TCP connection to IP address 1.2.3.4 some client (socket 442) state is unexpected: ca_state=1 unacked=3 rto=27008000

Seeing errors like the above and trying to track down the root cause. I know that at last week's GPFS User Group meeting at ORNL this very error message was discussed, but I don't recall the details and the slides haven't been posted to the website yet. IIRC, the 'rto' is significant ...

I've Googled, but haven't gotten any hits, nor have I found anything in the GPFS 4.2.2 Problem Determination Guide.

Thanks in advance...

Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From aaron.s.knister at nasa.gov Thu Sep 27 17:03:31 2018
From: aaron.s.knister at nasa.gov (Aaron Knister)
Date: Thu, 27 Sep 2018 12:03:31 -0400
Subject: [gpfsug-discuss] What is this error message telling me?
In-Reply-To:
References:
Message-ID:

Kevin,

Is the communication in this case by chance using IPoIB in connected mode?
-Aaron On 9/27/18 11:04 AM, Buterbaugh, Kevin L wrote: > Hi All, > > 2018-09-27_09:48:50.923-0500: [E] The TCP connection to IP address > 1.2.3.4 some client (socket 442) state is unexpected: > ca_state=1 unacked=3 rto=27008000 > > Seeing errors like the above and trying to track down the root cause. ?I > know that at last weeks? GPFS User Group meeting at ORNL this very error > message was discussed, but I don?t recall the details and the slides > haven?t been posted to the website yet. ?IIRC, the ?rto? is significant ? > > I?ve Googled, but haven?t gotten any hits, nor have I found anything in > the GPFS 4.2.2 Problem Determination Guide. > > Thanks in advance? > > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and Education > Kevin.Buterbaugh at vanderbilt.edu > ?- (615)875-9633 > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From jlewars at us.ibm.com Thu Sep 27 17:37:34 2018 From: jlewars at us.ibm.com (John Lewars) Date: Thu, 27 Sep 2018 12:37:34 -0400 Subject: [gpfsug-discuss] Fw: What is this error message telling me? In-Reply-To: References: Message-ID: Hi Kevin, The message below indicates that the mmfsd code had a pending message on a socket, and, when it looked at the low level socket statistics, GPFS found indications that the TCP connection was in a 'bad state'. GPFS determines a connection to be a 'bad state' if: 1) the CA_STATE for the socket is not in 0 (or open) state, which means the state must be disorder, recovery, or loss. See this paper for more details on CA_STATE: https://wiki.aalto.fi/download/attachments/69901948/TCP-CongestionControlFinal.pdf or 2) the RTO is greater than 10 seconds and there are unacknowledged messages pending on the socket (unacked > 0). In the example below we see that rto=27008000, which means that the non-fast path TCP retransmission timeout is about 27 seconds, and that probably means the connection has experienced significant packet loss. If there was no expel following this message, I would suspect there was some transient packet loss that was recovered from. There are plenty of places in which to find more details on RTO, but you might want to start with wikipedia ( https://en.wikipedia.org/wiki/Transmission_Control_Protocol) which states: In addition, senders employ a retransmission timeout (RTO) that is based on the estimated round-trip time (or RTT) between the sender and receiver, as well as the variance in this round trip time. The behavior of this timer is specified in RFC 6298. There are subtleties in the estimation of RTT. For example, senders must be careful when calculating RTT samples for retransmitted packets; typically they use Karn's Algorithm or TCP timestamps (see RFC 1323). These individual RTT samples are then averaged over time to create a Smoothed Round Trip Time (SRTT) using Jacobson's algorithm. This SRTT value is what is finally used as the round-trip time estimate. [. . .] Reliability is achieved by the sender detecting lost data and retransmitting it. TCP uses two primary techniques to identify loss. Retransmission timeout (abbreviated as RTO) and duplicate cumulative acknowledgements (DupAcks). 
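As Jim showed earlier in the thread, the kernel exposes the same counters the daemon is checking here. For a quick spot check on a suspect node, something like the following can be used (a sketch only; it assumes iproute2's ss is available and that the daemon is using the default GPFS port 1191):

# Show kernel TCP info for Spectrum Scale daemon connections.
# Note that ss reports rto/rtt in milliseconds, whereas the mmfs.log
# message and 'mmfsadm dump tscomm' report them in microseconds.
ss -tin '( sport = :1191 or dport = :1191 )'

A connection in trouble will typically show a large rto together with non-zero retrans/unacked counters.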
Note that older versions of the Spectrum Scale code had a third criteria in checking for 'bad state', which included checking if unacked was greater than 8, but that check would sometimes call-out a socket that was working fine, so this third check has been removed via the APAR IJ02566. All Spectrum Scale V5 code has this fix and the 4.2.X code stream picked up this fix in PTF 7 (4.2.3.7 ships APAR IJ02566). More details on debugging expels using these TCP connection messages are in the presentation you referred to, which I posted here: https://www.ibm.com/developerworks/community/wikis/home?lang=en_us#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/DEBUG%20Expels Regards, John Lewars Technical Computing Development, IBM Poughkeepsie ----- Forwarded by Lyle Gayne/Poughkeepsie/IBM on 09/27/2018 11:15 AM ----- Hi All, 2018-09-27_09:48:50.923-0500: [E] The TCP connection to IP address 1.2.3.4 some client (socket 442) state is unexpected: ca_state=1 unacked=3 rto=27008000 Seeing errors like the above and trying to track down the root cause. I know that at last weeks? GPFS User Group meeting at ORNL this very error message was discussed, but I don?t recall the details and the slides haven?t been posted to the website yet. IIRC, the ?rto? is significant ? I?ve Googled, but haven?t gotten any hits, nor have I found anything in the GPFS 4.2.2 Problem Determination Guide. Thanks in advance? ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From valleru at cbio.mskcc.org Thu Sep 27 20:31:00 2018 From: valleru at cbio.mskcc.org (valleru at cbio.mskcc.org) Date: Thu, 27 Sep 2018 15:31:00 -0400 Subject: [gpfsug-discuss] GPFS, Pagepool and Block size -> Perfomance reduces with larger block size In-Reply-To: References: <6bb509b7-b7c5-422d-8e27-599333b6b7c4@Spark> <013aeb31-ebd2-4cc7-97d1-06883d9569f7@Spark> Message-ID: <243c5d36-f25e-4ebb-b9f3-6fc47bc6d93c@Spark> Thank you Sven, Turning of prefetching did not improve the performance, but it did degrade a bit. I have made the prefetching default and took trace dump, for tracectl with trace=io. Let me know if you want me to paste/attach it here. May i know, how could i confirm if the below is true? > > > > > > > 1. this could be serialization around buffer locks. as larger your blocksize gets as larger is the amount of data one of this pagepool buffers will maintain, if there is a lot of concurrency on smaller amount of data more threads potentially compete for the same buffer lock to copy stuff in and out of a particular buffer, hence things go slower compared to the same amount of data spread across more buffers, each of smaller size. > > > > > > > Will the above trace help in understanding if it is a serialization issue? I had been discussing the same with GPFS support for past few months, and it seems to be that most of the time is being spent at?cxiUXfer. They could not understand on why it is taking spending so much of time in cxiUXfer. I was seeing the same from perf top, and pagefaults. Below is snippet from what the support had said : ???????????????????????????? I searched all of the gpfsRead from trace and sort them by spending-time. 
Except 2 reads which need fetch data from nsd server, the slowest read is in the thread 72170. It took 112470.362 us. trcrpt.2018-08-06_12.27.39.55538.lt15.trsum:?? 72165?????? 6.860911319 rdwr?????????????????? 141857.076 us + NSDIO trcrpt.2018-08-06_12.26.28.39794.lt15.trsum:?? 72170?????? 1.483947593 rdwr?????????????????? 112470.362 us + cxiUXfer trcrpt.2018-08-06_12.27.39.55538.lt15.trsum:?? 72165?????? 6.949042593 rdwr??????????????????? 88126.278 us + NSDIO trcrpt.2018-08-06_12.27.03.47706.lt15.trsum:?? 72156?????? 2.919334474 rdwr??????????????????? 81057.657 us + cxiUXfer trcrpt.2018-08-06_12.23.30.72745.lt15.trsum:?? 72154?????? 1.167484466 rdwr??????????????????? 76033.488 us + cxiUXfer trcrpt.2018-08-06_12.24.06.7508.lt15.trsum:?? 72187?????? 0.685237501 rdwr??????????????????? 70772.326 us + cxiUXfer trcrpt.2018-08-06_12.25.17.23989.lt15.trsum:?? 72193?????? 4.757996530 rdwr??????????????????? 70447.838 us + cxiUXfer I check each of the slow IO as above, and find they all spend much time in the function cxiUXfer. This function is used to copy data from kernel buffer to user buffer. I am not sure why it took so much time. This should be related to the pagefaults and pgfree you observed. Below is the trace data for thread 72170. ?????????????????? 1.371477231? 72170 TRACE_VNODE: gpfs_f_rdwr enter: fP 0xFFFF882541649400 f_flags 0x8000 flags 0x8001 op 0 iovec 0xFFFF881F2AFB3E70 count 1 offset 0x168F30D dentry 0xFFFF887C0CC298C0 private 0xFFFF883F607175C0 iP 0xFFFF8823AA3CBFC0 name '410513.svs' ????????????? .... ?????????????????? 1.371483547? 72170 TRACE_KSVFS: cachedReadFast exit: uio_resid 16777216 code 1 err 11 ????????????? .... ?????????????????? 1.371498780? 72170 TRACE_KSVFS: kSFSReadFast: oiP 0xFFFFC90060B46740 offset 0x168F30D dataBufP FFFFC9003645A5A8 nDesc 64 buf 200043C0000 valid words 64 dirty words 0 blkOff 0 ?????????????????? 1.371499035? 72170 TRACE_LOG: UpdateLogger::beginDataUpdate begin ul 0xFFFFC900333F1A40 holdCount 0 ioType 0x2 inProg 0x15 ?????????????????? 1.371500157? 72170 TRACE_LOG: UpdateLogger::beginDataUpdate ul 0xFFFFC900333F1A40 holdCount 0 ioType 0x2 inProg 0x16 err 0 ?????????????????? 1.371500606? 72170 TRACE_KSVFS: cxiUXfer: nDesc 64 1st dataPtr 0x200043C0000 plP 0xFFFF887F7B90D600 toIOBuf 0 offset 6877965 len 9899251 ?????????????????? 1.371500793? 72170 TRACE_KSVFS: cxiUXfer: ndesc 0 skip dataAddrP 0x200043C0000 currOffset 0 currLen 262144 bufOffset 6877965 ????????????? .... ?????????????????? 1.371505949? 72170 TRACE_KSVFS: cxiUXfer: ndesc 25 skip dataAddrP 0x2001AF80000 currOffset 6553600 currLen 262144 bufOffset 6877965 ?????????????????? 1.371506236? 72170 TRACE_KSVFS: cxiUXfer: nDesc 26 currOffset 6815744 tmpLen 262144 dataAddrP 0x2001AFCF30D currLen 199923 pageOffset 781 pageLen 3315 plP 0xFFFF887F7B90D600 ?????????????????? 1.373649823? 72170 TRACE_KSVFS: cxiUXfer: nDesc 27 currOffset 7077888 tmpLen 262144 dataAddrP 0x20027400000 currLen 262144 pageOffset 0 pageLen 4096 plP 0xFFFF887F7B90D600 ?????????????????? 1.375158799? 72170 TRACE_KSVFS: cxiUXfer: nDesc 28 currOffset 7340032 tmpLen 262144 dataAddrP 0x20027440000 currLen 262144 pageOffset 0 pageLen 4096 plP 0xFFFF887F7B90D600 ?????????????????? 1.376661566? 72170 TRACE_KSVFS: cxiUXfer: nDesc 29 currOffset 7602176 tmpLen 262144 dataAddrP 0x2002C180000 currLen 262144 pageOffset 0 pageLen 4096 plP 0xFFFF887F7B90D600 ?????????????????? 1.377892653? 
72170 TRACE_KSVFS: cxiUXfer: nDesc 30 currOffset 7864320 tmpLen 262144 dataAddrP 0x2002C1C0000 currLen 262144 pageOffset 0 pageLen 4096 plP 0xFFFF887F7B90D600 ????????????? .... ?????????????????? 1.471389843? 72170 TRACE_KSVFS: cxiUXfer: nDesc 62 currOffset 16252928 tmpLen 262144 dataAddrP 0x2001D2C0000 currLen 262144 pageOffset 0 pageLen 4096 plP 0xFFFF887F7B90D600 ?????????????????? 1.471845629? 72170 TRACE_KSVFS: cxiUXfer: nDesc 63 currOffset 16515072 tmpLen 262144 dataAddrP 0x2003EC80000 currLen 262144 pageOffset 0 pageLen 4096 plP 0xFFFF887F7B90D600 ?????????????????? 1.472417149? 72170 TRACE_KSVFS: cxiDetachIOBuffer: dataPtr 0x200043C0000 plP 0xFFFF887F7B90D600 ?????????????????? 1.472417775? 72170 TRACE_LOCK: unlock_vfs: type Data, key 0000000000000004:000000001B1F24BF:0000000000000001 lock_mode have ro token xw lock_state old [ ro:27 ] new [ ro:26 ] holdCount now 27 ?????????????????? 1.472418427? 72170 TRACE_LOCK: hash tab lookup vfs: found cP 0xFFFFC9005FC0CDE0 holdCount now 14 ?????????????????? 1.472418592? 72170 TRACE_LOCK: lock_vfs: type Data key 0000000000000004:000000001B1F24BF:0000000000000002 lock_mode want ro status valid token xw/xw lock_state [ ro:12 ] flags 0x0 holdCount 14 ?????????????????? 1.472419842? 72170 TRACE_KSVFS: kSFSReadFast: oiP 0xFFFFC90060B46740 offset 0x2000000 dataBufP FFFFC9003643C908 nDesc 64 buf 38033480000 valid words 64 dirty words 0 blkOff 0 ?????????????????? 1.472420029? 72170 TRACE_LOG: UpdateLogger::beginDataUpdate begin ul 0xFFFFC9005FC0CF98 holdCount 0 ioType 0x2 inProg 0xC ?????????????????? 1.472420187? 72170 TRACE_LOG: UpdateLogger::beginDataUpdate ul 0xFFFFC9005FC0CF98 holdCount 0 ioType 0x2 inProg 0xD err 0 ?????????????????? 1.472420652? 72170 TRACE_KSVFS: cxiUXfer: nDesc 64 1st dataPtr 0x38033480000 plP 0xFFFF887F7B934320 toIOBuf 0 offset 0 len 6877965 ?????????????????? 1.472420936? 72170 TRACE_KSVFS: cxiUXfer: nDesc 0 currOffset 0 tmpLen 262144 dataAddrP 0x38033480000 currLen 262144 pageOffset 0 pageLen 4096 plP 0xFFFF887F7B934320 ?????????????????? 1.472824790? 72170 TRACE_KSVFS: cxiUXfer: nDesc 1 currOffset 262144 tmpLen 262144 dataAddrP 0x380334C0000 currLen 262144 pageOffset 0 pageLen 4096 plP 0xFFFF887F7B934320 ?????????????????? 1.473243905? 72170 TRACE_KSVFS: cxiUXfer: nDesc 2 currOffset 524288 tmpLen 262144 dataAddrP 0x38024280000 currLen 262144 pageOffset 0 pageLen 4096 plP 0xFFFF887F7B934320 ????????????? .... ?????????????????? 1.482949347? 72170 TRACE_KSVFS: cxiUXfer: nDesc 24 currOffset 6291456 tmpLen 262144 dataAddrP 0x38025E80000 currLen 262144 pageOffset 0 pageLen 4096 plP 0xFFFF887F7B934320 ?????????????????? 1.483354265? 72170 TRACE_KSVFS: cxiUXfer: nDesc 25 currOffset 6553600 tmpLen 262144 dataAddrP 0x38025EC0000 currLen 262144 pageOffset 0 pageLen 4096 plP 0xFFFF887F7B934320 ?????????????????? 1.483766631? 72170 TRACE_KSVFS: cxiUXfer: nDesc 26 currOffset 6815744 tmpLen 262144 dataAddrP 0x38003B00000 currLen 262144 pageOffset 0 pageLen 4096 plP 0xFFFF887F7B934320 ?????????????????? 1.483943894? 72170 TRACE_KSVFS: cxiDetachIOBuffer: dataPtr 0x38033480000 plP 0xFFFF887F7B934320 ?????????????????? 1.483944339? 72170 TRACE_LOCK: unlock_vfs: type Data, key 0000000000000004:000000001B1F24BF:0000000000000002 lock_mode have ro token xw lock_state old [ ro:14 ] new [ ro:13 ] holdCount now 14 ?????????????????? 1.483944683? 72170 TRACE_BRL: brUnlockM: ofP 0xFFFFC90069346B68 inode 455025855 snap 0 handle 0xFFFFC9003637D020 range 0x168F30D-0x268F30C mode ro ?????????????????? 1.483944985? 
72170 TRACE_KSVFS: kSFSReadFast exit: uio_resid 0 err 0 ?????????????????? 1.483945264? 72170 TRACE_LOCK: unlock_vfs_m: type Inode, key 305F105B9701E60A:000000001B1F24BF:0000000000000000 lock_mode have ro status valid token rs lock_state old [ ro:25 ] new [ ro:24 ] ?????????????????? 1.483945423? 72170 TRACE_LOCK: unlock_vfs_m: cP 0xFFFFC90069346B68 holdCount 25 ?????????????????? 1.483945624? 72170 TRACE_VNODE: gpfsRead exit: fast err 0 ?????????????????? 1.483946831? 72170 TRACE_KSVFS: ReleSG: sli 38 sgP 0xFFFFC90035E52F78 NotQuiesced vfsOp 2 ?????????????????? 1.483946975? 72170 TRACE_KSVFS: ReleSG: sli 38 sgP 0xFFFFC90035E52F78 vfsOp 2 users 1-1 ?????????????????? 1.483947116? 72170 TRACE_KSVFS: ReleaseDaemonSegAndSG: sli 38 count 2 needCleanup 0 ?????????????????? 1.483947593? 72170 TRACE_VNODE: gpfs_f_rdwr exit: fP 0xFFFF882541649400 total_len 16777216 uio_resid 0 offset 0x268F30D rc 0 ??????????????????????????????????????????? Regards, Lohit On Sep 19, 2018, 3:11 PM -0400, Sven Oehme , wrote: > the document primarily explains all performance specific knobs. general advice would be to longer set anything beside workerthreads, pagepool and filecache on 5.X systems as most other settings are no longer relevant (thats a client side statement) . thats is true until you hit strange workloads , which is why all the knobs are still there :-) > > sven > > > > On Wed, Sep 19, 2018 at 11:17 AM wrote: > > > Thanks Sven. > > > I will disable it completely and see how it behaves. > > > > > > Is this the presentation? > > > http://files.gpfsug.org/presentations/2014/UG10_GPFS_Performance_Session_v10.pdf > > > > > > I guess i read it, but it did not strike me at this situation. I will try to read it again and see if i could make use of it. > > > > > > Regards, > > > Lohit > > > > > > On Sep 19, 2018, 2:12 PM -0400, Sven Oehme , wrote: > > > > seem like you never read my performance presentation from a few years ago ;-) > > > > > > > > you can control this on a per node basis , either for all i/o : > > > > > > > > ? ?prefetchAggressiveness = X > > > > > > > > or individual for reads or writes : > > > > > > > > ? ?prefetchAggressivenessRead = X > > > > ? ?prefetchAggressivenessWrite = X > > > > > > > > for a start i would turn it off completely via : > > > > > > > > mmchconfig prefetchAggressiveness=0 -I -N nodename > > > > > > > > that will turn it off only for that node and only until you restart the node. > > > > then see what happens > > > > > > > > sven > > > > > > > > > > > > > On Wed, Sep 19, 2018 at 11:07 AM wrote: > > > > > > Thank you Sven. > > > > > > > > > > > > I mostly think it could be 1. or some other issue. > > > > > > I don?t think it could be 2. , because i can replicate this issue no matter what is the size of the dataset. It happens for few files that could easily fit in the page pool too. > > > > > > > > > > > > I do see a lot more page faults for 16M compared to 1M, so it could be related to many threads trying to compete for the same buffer space. > > > > > > > > > > > > I will try to take the trace with trace=io option and see if can find something. > > > > > > > > > > > > How do i turn of prefetching? Can i turn it off for a single node/client? 
> > > > > > > > > > > > Regards, > > > > > > Lohit > > > > > > > > > > > > On Sep 18, 2018, 5:23 PM -0400, Sven Oehme , wrote: > > > > > > > Hi, > > > > > > > > > > > > > > taking a trace would tell for sure, but i suspect what you might be hitting one or even multiple issues which have similar negative performance impacts but different root causes. > > > > > > > > > > > > > > 1. this could be serialization around buffer locks. as larger your blocksize gets as larger is the amount of data one of this pagepool buffers will maintain, if there is a lot of concurrency on smaller amount of data more threads potentially compete for the same buffer lock to copy stuff in and out of a particular buffer, hence things go slower compared to the same amount of data spread across more buffers, each of smaller size. > > > > > > > > > > > > > > 2. your data set is small'ish, lets say a couple of time bigger than the pagepool and you random access it with multiple threads. what will happen is that because it doesn't fit into the cache it will be read from the backend. if multiple threads hit the same 16 mb block at once with multiple 4k random reads, it will read the whole 16mb block because it thinks it will benefit from it later on out of cache, but because it fully random the same happens with the next block and the next and so on and before you get back to this block it was pushed out of the cache because of lack of enough pagepool. > > > > > > > > > > > > > > i could think?of multiple other scenarios , which is why its so hard to accurately benchmark an application because you will design a benchmark to test an application, but it actually almost always behaves different then you think it does :-) > > > > > > > > > > > > > > so best is to run the real application and see under which configuration it works best. > > > > > > > > > > > > > > you could also take a trace with trace=io and then look at > > > > > > > > > > > > > > TRACE_VNOP: READ: > > > > > > > TRACE_VNOP: WRITE: > > > > > > > > > > > > > > and compare them to > > > > > > > > > > > > > > TRACE_IO: QIO: read > > > > > > > TRACE_IO: QIO: write > > > > > > > > > > > > > > and see if the numbers summed up for both are somewhat equal. if TRACE_VNOP is significant smaller than TRACE_IO you most likely do more i/o than you should and turning prefetching off might actually make things faster . > > > > > > > > > > > > > > keep in mind i am no longer working for IBM so all i say might be obsolete by now, i no longer have access to the one and only truth aka the source code ... but if i am wrong i am sure somebody will point this out soon ;-) > > > > > > > > > > > > > > sven > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Sep 18, 2018 at 10:31 AM wrote: > > > > > > > > > Hello All, > > > > > > > > > > > > > > > > > > This is a continuation to the previous discussion that i had with Sven. > > > > > > > > > However against what i had mentioned previously - i realize that this is ?not? related to mmap, and i see it when doing random freads. > > > > > > > > > > > > > > > > > > I see that block-size of the filesystem matters when reading from Page pool. > > > > > > > > > I see a major difference in performance when compared 1M to 16M, when doing lot of random small freads with all of the data in pagepool. > > > > > > > > > > > > > > > > > > Performance for 1M is a magnitude ?more? than the performance that i see for 16M. 
> > > > > > > > > > > > > > > > > > The GPFS that we have currently is : > > > > > > > > > Version :?5.0.1-0.5 > > > > > > > > > Filesystem version:?19.01 (5.0.1.0) > > > > > > > > > Block-size : 16M > > > > > > > > > > > > > > > > > > I had made the filesystem block-size to be 16M, thinking that i would get the most performance for both random/sequential reads from 16M than the smaller block-sizes. > > > > > > > > > With GPFS 5.0, i made use the 1024 sub-blocks instead of 32 and thus not loose lot of storage space even with 16M. > > > > > > > > > I had run few benchmarks and i did see that 16M was performing better ?when hitting storage/disks? with respect to bandwidth for random/sequential on small/large reads. > > > > > > > > > > > > > > > > > > However, with this particular workload - where it freads a chunk of data randomly from hundreds of files -> I see that the number of page-faults increase with block-size and actually reduce the performance. > > > > > > > > > 1M performs a lot better than 16M, and may be i will get better performance with less than 1M. > > > > > > > > > It gives the best performance when reading from local disk, with 4K block size filesystem. > > > > > > > > > > > > > > > > > > What i mean by performance when it comes to this workload - is not the bandwidth but the amount of time that it takes to do each iteration/read batch of data. > > > > > > > > > > > > > > > > > > I figure what is happening is: > > > > > > > > > fread is trying to read a full block size of 16M - which is good in a way, when it hits the hard disk. > > > > > > > > > But the application could be using just a small part of that 16M. Thus when randomly reading(freads) lot of data of 16M chunk size - it is page faulting a lot more and causing the performance to drop . > > > > > > > > > I could try to make the application do read instead of freads, but i fear that could be bad too since it might be hitting the disk with a very small block size and that is not good. > > > > > > > > > > > > > > > > > > With the way i see things now - > > > > > > > > > I believe it could be best if the application does random reads of 4k/1M from pagepool but some how does 16M from rotating disks. > > > > > > > > > > > > > > > > > > I don?t see any way of doing the above other than following a different approach where i create a filesystem with a smaller block size ( 1M or less than 1M ), on SSDs as a tier. > > > > > > > > > > > > > > > > > > May i please ask for advise, if what i am understanding/seeing is right and the best solution possible for the above scenario. > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > Lohit > > > > > > > > > > > > > > > > > > On Apr 11, 2018, 10:36 AM -0400, Lohit Valleru , wrote: > > > > > > > > > > Hey Sven, > > > > > > > > > > > > > > > > > > > > This is regarding mmap issues and GPFS. > > > > > > > > > > We had discussed previously of experimenting with GPFS 5. > > > > > > > > > > > > > > > > > > > > I now have upgraded all of compute nodes and NSD nodes to GPFS 5.0.0.2 > > > > > > > > > > > > > > > > > > > > I am yet to experiment with mmap performance, but before that - I am seeing weird hangs with GPFS 5 and I think it could be related to mmap. > > > > > > > > > > > > > > > > > > > > Have you seen GPFS ever hang on this syscall? > > > > > > > > > > [Tue Apr 10 04:20:13 2018] [] _ZN10gpfsNode_t8mmapLockEiiPKj+0xb5/0x140 [mmfs26] > > > > > > > > > > > > > > > > > > > > I see the above ,when kernel hangs and throws out a series of trace calls. 
> > > > > > > > > > > > > > > > > > > > I somehow think the above trace is related to processes hanging on GPFS forever. There are no errors in GPFS however. > > > > > > > > > > > > > > > > > > > > Also, I think the above happens only when the mmap threads go above a particular number. > > > > > > > > > > > > > > > > > > > > We had faced a similar issue in 4.2.3 and it was resolved in a patch to 4.2.3.2 . At that time , the issue happened when mmap threads go more than worker1threads. According to the ticket - it was a mmap race condition that GPFS was not handling well. > > > > > > > > > > > > > > > > > > > > I am not sure if this issue is a repeat and I am yet to isolate the incident and test with increasing number of mmap threads. > > > > > > > > > > > > > > > > > > > > I am not 100 percent sure if this is related to mmap yet but just wanted to ask you if you have seen anything like above. > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > > > Lohit > > > > > > > > > > > > > > > > > > > > On Feb 22, 2018, 3:59 PM -0500, Sven Oehme , wrote: > > > > > > > > > > > Hi Lohit, > > > > > > > > > > > > > > > > > > > > > > i am working with ray on a mmap performance improvement right now, which most likely has the same root cause as yours , see -->??http://gpfsug.org/pipermail/gpfsug-discuss/2018-January/004411.html > > > > > > > > > > > the thread above is silent after a couple of back and rorth, but ray and i have active communication in the background and will repost as soon as there is something new to share. > > > > > > > > > > > i am happy to look at this issue after we finish with ray's workload if there is something missing, but first let's finish his, get you try the same fix and see if there is something missing. > > > > > > > > > > > > > > > > > > > > > > btw. if people would share their use of MMAP , what applications they use (home grown, just use lmdb which uses mmap under the cover, etc) please let me know so i get a better picture on how wide the usage is with GPFS. i know a lot of the ML/DL workloads are using it, but i would like to know what else is out there i might not think about. feel free to drop me a personal note, i might not reply to it right away, but eventually. > > > > > > > > > > > > > > > > > > > > > > thx. sven > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Feb 22, 2018 at 12:33 PM wrote: > > > > > > > > > > > > > Hi all, > > > > > > > > > > > > > > > > > > > > > > > > > > I wanted to know, how does mmap interact with GPFS pagepool with respect to filesystem block-size? > > > > > > > > > > > > > Does the efficiency depend on the mmap read size and the block-size of the filesystem even if all the data is cached in pagepool? > > > > > > > > > > > > > > > > > > > > > > > > > > GPFS 4.2.3.2 and CentOS7. > > > > > > > > > > > > > > > > > > > > > > > > > > Here is what i observed: > > > > > > > > > > > > > > > > > > > > > > > > > > I was testing a user script that uses mmap to read from 100M to 500MB files. > > > > > > > > > > > > > > > > > > > > > > > > > > The above files are stored on 3 different filesystems. > > > > > > > > > > > > > > > > > > > > > > > > > > Compute nodes - 10G pagepool and 5G seqdiscardthreshold. > > > > > > > > > > > > > > > > > > > > > > > > > > 1. 4M block size GPFS filesystem, with separate metadata and data. Data on Near line and metadata on SSDs > > > > > > > > > > > > > 2. 
1M block size GPFS filesystem as a AFM cache cluster, "with all the required files fully cached" from the above GPFS cluster as home. Data and Metadata together on SSDs > > > > > > > > > > > > > 3. 16M block size GPFS filesystem, with separate metadata and data. Data on Near line and metadata on SSDs > > > > > > > > > > > > > > > > > > > > > > > > > > When i run the script first time for ?each" filesystem: > > > > > > > > > > > > > I see that GPFS reads from the files, and caches into the pagepool as it reads, from mmdiag -- iohist > > > > > > > > > > > > > > > > > > > > > > > > > > When i run the second time, i see that there are no IO requests from the compute node to GPFS NSD servers, which is expected since all the data from the 3 filesystems is cached. > > > > > > > > > > > > > > > > > > > > > > > > > > However - the time taken for the script to run for the files in the 3 different filesystems is different - although i know that they are just "mmapping"/reading from pagepool/cache and not from disk. > > > > > > > > > > > > > > > > > > > > > > > > > > Here is the difference in time, for IO just from pagepool: > > > > > > > > > > > > > > > > > > > > > > > > > > 20s 4M block size > > > > > > > > > > > > > 15s 1M block size > > > > > > > > > > > > > 40S 16M block size. > > > > > > > > > > > > > > > > > > > > > > > > > > Why do i see a difference when trying to mmap reads from different block-size filesystems, although i see that the IO requests are not hitting disks and just the pagepool? > > > > > > > > > > > > > > > > > > > > > > > > > > I am willing to share the strace output and mmdiag outputs if needed. > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > Lohit > > > > > > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > > > > > > > > > > gpfsug-discuss mailing list > > > > > > > > > > > > > gpfsug-discuss at spectrumscale.org > > > > > > > > > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > _______________________________________________ > > > > > > > > > > > gpfsug-discuss mailing list > > > > > > > > > > > gpfsug-discuss at spectrumscale.org > > > > > > > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > _______________________________________________ > > > > > > > > > > gpfsug-discuss mailing list > > > > > > > > > > gpfsug-discuss at spectrumscale.org > > > > > > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > _______________________________________________ > > > > > > > > > gpfsug-discuss mailing list > > > > > > > > > gpfsug-discuss at spectrumscale.org > > > > > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > _______________________________________________ > > > > > > > gpfsug-discuss mailing list > > > > > > > gpfsug-discuss at spectrumscale.org > > > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > > > > > > gpfsug-discuss mailing list > > > > > > gpfsug-discuss at spectrumscale.org > > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > > > > gpfsug-discuss mailing list > > > > gpfsug-discuss at spectrumscale.org > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss 
> _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From kkr at lbl.gov Thu Sep 27 21:35:29 2018 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Thu, 27 Sep 2018 13:35:29 -0700 Subject: [gpfsug-discuss] Request for Enhancements (RFE) Forum - Submission Deadline October 1 In-Reply-To: <0341213A-6CB7-434F-A575-9099C2D0D703@spectrumscale.org> References: <52220937-CE0A-4949-89A0-6EA41D5ECF93@lbl.gov> <263e53c18647421f8b3cd936da0075df@jumptrading.com> <0341213A-6CB7-434F-A575-9099C2D0D703@spectrumscale.org> Message-ID: Reminder, the October 1st deadline is approaching. We?re looking for at least a few RFEs (Requests For Enhancements) for this first forum, so if you?re interesting in promoting your RFE please reach out to one of us, or even here on the list. Thanks, Kristy > On Sep 7, 2018, at 3:00 AM, Simon Thompson (Spectrum Scale User Group Chair) wrote: > > GPFS/Spectrum Scale Users, > > Here?s a long-ish note about our plans to try and improve the RFE process. We?ve tried to include a tl;dr version if you just read the headers. You?ll find the details underneath ;-) and reading to the end is ideal. > > IMPROVING THE RFE PROCESS > As you?ve heard on the list, and at some of the in-person User Group events, we?ve been talking about ways we can improve the RFE process. We?d like to begin having an RFE forum, and have it be de-coupled from the in-person events because we know not everyone can travel. > > LIGHTNING PRESENTATIONS ON-LINE > In general terms, we?d have regular on-line events, where RFEs could be very briefly (5 minutes, lightning talk) presented by the requester. There would then be time for brief follow-on discussion and questions. The session would be recorded to deal with large time zone differences. > > The live meeting is planned for October 10th 2018, at 4PM BST (that should be 11am EST if we worked is out right!) > > FOLLOW UP POLL > A poll, independent of current RFE voting, would be conducted a couple days after the recording was available to gather votes and feedback on the RFEs submitted ?we may collect site name, to see how many votes are coming from a certain site. > > MAY NOT GET IT RIGHT THE FIRST TIME > We view this supplemental RFE process as organic, that is, we?ll learn as we go and make modifications. The overall goal here is to highlight the RFEs that matter the most to the largest number of UG members by providing a venue for people to speak about their RFEs and collect feedback from fellow community members. > > RFE PRESENTERS WANTED, SUBMISSION DEADLINE OCTOBER 1ST > We?d like to guide a small handful of RFE submitters through this process the first time around, so if you?re interested in being a presenter, let us know now. We?re planning on doing the online meeting and poll for the first time in mid-October, so the submission deadline for your RFE is October 1st. If it?s useful, when you?re drafting your RFE feel free to use the list as a sounding board for feedback. Often sites have similar needs and you may find someone to collaborate with on your RFE to make it useful to more sites, and thereby get more votes. 
Some guidelines are here: https://drive.google.com/file/d/1o8nN39DTU32qj_EFia5wRhnvfvNfr3cI/view?usp=sharing > > You can submit you RFE by email to: rfe at spectrumscaleug.org > > PARTICIPANTS (AKA YOU!!), VIEW AND VOTE > We are seeking very good participation in the RFE on-line events needed to make this an effective method of Spectrum Scale Community and IBM Developer collaboration. It is to your benefit to participate and help set priorities on Spectrum Scale enhancements!! We want to make this process light lifting for you as a participant. We will limit the duration of the meeting to 1 hour to minimize the use of your valuable time. > > Please register for the online meeting via Eventbrite (https://www.eventbrite.com/e/spectrum-scale-request-for-enhancements-voting-tickets-49979954389 ) ? we?ll send details of how to join the online meeting nearer the time. > > Thanks! > > Simon, Kristy, Bob, Bryan and Carl! > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Thu Sep 27 22:52:01 2018 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Thu, 27 Sep 2018 21:52:01 +0000 Subject: [gpfsug-discuss] What is this error message telling me? In-Reply-To: References: Message-ID: <28086630-FC5B-4214-829D-CF410C3F06D3@vanderbilt.edu> Hi Aaron, No ? just plain old ethernet. Thanks! Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 On Sep 27, 2018, at 11:03 AM, Aaron Knister > wrote: Kevin, Is the communication in this case by chance using IPoIB in connected mode? -Aaron On 9/27/18 11:04 AM, Buterbaugh, Kevin L wrote: Hi All, 2018-09-27_09:48:50.923-0500: [E] The TCP connection to IP address 1.2.3.4 some client (socket 442) state is unexpected: ca_state=1 unacked=3 rto=27008000 Seeing errors like the above and trying to track down the root cause. I know that at last weeks? GPFS User Group meeting at ORNL this very error message was discussed, but I don?t recall the details and the slides haven?t been posted to the website yet. IIRC, the ?rto? is significant ? I?ve Googled, but haven?t gotten any hits, nor have I found anything in the GPFS 4.2.2 Problem Determination Guide. Thanks in advance? ? 
Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C639e397dfb514469f48d08d62492c8a2%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636736610191929732&sdata=GE1IIRL77bjWiFaa2%2FpV68sPtXJNUrtGPrc68GsOrtg%3D&reserved=0 -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C639e397dfb514469f48d08d62492c8a2%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636736610191929732&sdata=GE1IIRL77bjWiFaa2%2FpV68sPtXJNUrtGPrc68GsOrtg%3D&reserved=0 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Thu Sep 27 22:53:23 2018 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Thu, 27 Sep 2018 21:53:23 +0000 Subject: [gpfsug-discuss] What is this error message telling me? In-Reply-To: References: Message-ID: <6BC51193-5F38-4749-81A9-F137FE331D5F@vanderbilt.edu> Hi John, Thanks for the explanation and the link to your presentation ? just what I was needing. Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 On Sep 27, 2018, at 11:37 AM, John Lewars > wrote: Hi Kevin, The message below indicates that the mmfsd code had a pending message on a socket, and, when it looked at the low level socket statistics, GPFS found indications that the TCP connection was in a 'bad state'. GPFS determines a connection to be a 'bad state' if: 1) the CA_STATE for the socket is not in 0 (or open) state, which means the state must be disorder, recovery, or loss. See this paper for more details on CA_STATE: https://wiki.aalto.fi/download/attachments/69901948/TCP-CongestionControlFinal.pdf or 2) the RTO is greater than 10 seconds and there are unacknowledged messages pending on the socket (unacked > 0). In the example below we see that rto=27008000, which means that the non-fast path TCP retransmission timeout is about 27 seconds, and that probably means the connection has experienced significant packet loss. If there was no expel following this message, I would suspect there was some transient packet loss that was recovered from. There are plenty of places in which to find more details on RTO, but you might want to start with wikipedia (https://en.wikipedia.org/wiki/Transmission_Control_Protocol) which states: In addition, senders employ a retransmission timeout(RTO) that is based on the estimated round-trip time (or RTT) between the sender and receiver, as well as the variance in this round trip time. The behavior of this timer is specified in RFC 6298. There are subtleties in the estimation of RTT. For example, senders must be careful when calculating RTT samples for retransmitted packets; typically they use Karn's Algorithm or TCP timestamps (see RFC 1323). 
These individual RTT samples are then averaged over time to create a Smoothed Round Trip Time (SRTT) using Jacobson's algorithm. This SRTT value is what is finally used as the round-trip time estimate. [. . .] Reliability is achieved by the sender detecting lost data and retransmitting it. TCP uses two primary techniques to identify loss. Retransmission timeout (abbreviated as RTO) and duplicate cumulative acknowledgements (DupAcks). Note that older versions of the Spectrum Scale code had a third criteria in checking for 'bad state', which included checking if unacked was greater than 8, but that check would sometimes call-out a socket that was working fine, so this third check has been removed via the APAR IJ02566. All Spectrum Scale V5 code has this fix and the 4.2.X code stream picked up this fix in PTF 7 (4.2.3.7 ships APAR IJ02566). More details on debugging expels using these TCP connection messages are in the presentation you referred to, which I posted here:https://www.ibm.com/developerworks/community/wikis/home?lang=en_us#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/DEBUG%20Expels Regards, John Lewars Technical Computing Development, IBM Poughkeepsie ----- Forwarded by Lyle Gayne/Poughkeepsie/IBM on 09/27/2018 11:15 AM ----- Hi All, 2018-09-27_09:48:50.923-0500: [E] The TCP connection to IP address 1.2.3.4 some client (socket 442) state is unexpected: ca_state=1 unacked=3 rto=27008000 Seeing errors like the above and trying to track down the root cause. I know that at last weeks? GPFS User Group meeting at ORNL this very error message was discussed, but I don?t recall the details and the slides haven?t been posted to the website yet. IIRC, the ?rto? is significant ? I?ve Googled, but haven?t gotten any hits, nor have I found anything in the GPFS 4.2.2 Problem Determination Guide. Thanks in advance? ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Fri Sep 28 13:52:38 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 28 Sep 2018 08:52:38 -0400 Subject: [gpfsug-discuss] Request for Enhancements (RFE) Forum - Submission Deadline October 1 In-Reply-To: References: <52220937-CE0A-4949-89A0-6EA41D5ECF93@lbl.gov> <263e53c18647421f8b3cd936da0075df@jumptrading.com> <0341213A-6CB7-434F-A575-9099C2D0D703@spectrumscale.org> Message-ID: <585b21e7-d437-380f-65d8-d24fa236ce3b@nasa.gov> Hi Kristy, At some point I thought I'd read there was a per-site limit of the number of RFEs that could be submitted but I can't find it skimming through email. I'd think submitting 10 would be unreasonable but would 2 or 3 be OK? -Aaron On 9/27/18 4:35 PM, Kristy Kallback-Rose wrote: > Reminder, the*October 1st* deadline is approaching. We?re looking for at > least a few RFEs (Requests For Enhancements) for this first forum, so if > you?re interesting in promoting your RFE please reach out to one of us, > or even here on the list. > > Thanks, > Kristy > >> On Sep 7, 2018, at 3:00 AM, Simon Thompson (Spectrum Scale User Group >> Chair) > wrote: >> >> GPFS/Spectrum Scale Users, >> Here?s a long-ish note about our plans to try and improve the RFE >> process. 
We?ve tried to include a tl;dr version if you just read the >> headers. You?ll find the details underneath ;-) and reading to the end >> is ideal. >> >> IMPROVING THE RFE PROCESS >> As you?ve heard on the list, and at some of the in-person User Group >> events, we?ve been talking about ways we can?improve the RFE process. >> We?d like to begin having an RFE forum, and have it be de-coupled from >> the in-person?events because we know not everyone can travel. >> LIGHTNING PRESENTATIONS ON-LINE >> In general terms, we?d have regular on-line events, where RFEs could >> be/very briefly/(5?minutes, lightning talk) presented by the >> requester.?There would then be time for brief follow-on discussion >> and?questions. The session would be recorded to deal with large time >> zone differences. >> The live meeting is planned for October 10^th 2018, at 4PM BST (that >> should be 11am EST if we worked is out right!) >> FOLLOW UP POLL >> A poll, independent of current?RFE voting, would be conducted a couple >> days after the recording was available to gather votes and feedback >> on?the RFEs submitted ?we may collect site name, to see how many votes >> are coming from a certain site. >> >> MAY NOT GET IT RIGHT THE FIRST TIME >> We view this supplemental RFE process as organic, that is, we?ll learn >> as we go and make modifications. The overall?goal here is to highlight >> the RFEs that matter the most to the largest number of UG members by >> providing a venue?for people to speak about their RFEs and collect >> feedback from fellow community members. >> >> *RFE PRESENTERS WANTED, SUBMISSION DEADLINE OCTOBER 1ST >> *We?d like to guide a small handful of RFE submitters through this >> process the first time around, so if you?re?interested in being a >> presenter, let us know now. We?re planning on doing the online meeting >> and poll for the first time in mid-October, so the submission deadline >> for your RFE is October 1st. If it?s useful, when you?re drafting your >> RFE feel free to use the list as a sounding board for feedback. Often >> sites?have similar needs and you may find someone to collaborate with >> on your RFE to make it useful to more sites, and?thereby get more >> votes. Some guidelines are here: >> https://drive.google.com/file/d/1o8nN39DTU32qj_EFia5wRhnvfvNfr3cI/view?usp=sharing >> You can submit you RFE by email to:rfe at spectrumscaleug.org >> >> >> *PARTICIPANTS (AKA YOU!!), VIEW AND VOTE >> *We are seeking very good participation in the RFE on-line events >> needed to make this an effective method of?Spectrum Scale Community >> and IBM Developer collaboration. *?It is to your benefit to >> participate and help set?priorities on Spectrum Scale enhancements!! >> *We want to make this process light lifting for you as a?participant. >> ?We will limit the duration of the meeting to 1 hour to minimize the >> use of your valuable time. >> >> Please register for the online meeting via Eventbrite >> (https://www.eventbrite.com/e/spectrum-scale-request-for-enhancements-voting-tickets-49979954389) >> ? we?ll send details of how to join the online meeting nearer the time. >> >> Thanks! >> >> Simon, Kristy, Bob, Bryan and Carl! 
>> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss atspectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From S.J.Thompson at bham.ac.uk Fri Sep 28 14:42:12 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Fri, 28 Sep 2018 13:42:12 +0000 Subject: [gpfsug-discuss] File system descriptor quorum Message-ID: <05F25CA0-B8D7-4BD2-A77E-961A6E1CA667@bham.ac.uk> I?ve been trying to get my head around how GPFS file-system descriptor quorum is placed. I read the docs and I can see: Based on the number of failure groups and disks, GPFS creates one to five replicas of the descriptor: * If there are at least five different failure groups, five replicas are created. * If there are at least three different disks, three replicas are created. * If there are only one or two disks, a replica is created on each disk. And I also know I can have a disk type of descOnly. Now what I'm not clear on is, if I have three disks with descOnly (and different failure groups), will GPFS prefer to place descriptors on those disks vs other "data" or "metadata" disks with shared failure groups with the descOnly failure groups. Basically I want to do: Site 1: Lots of storage, FG100 Quorum/manager node, FG100 (with descOnly) Site 2: Some storage, FG200 Quorum/manager node, FG200 (with descOnly) Site 3: Quorum/manager node, FG300 (with descOnly) (I'm assuming I have networking such that if a site is lost the other two sites can continue to talk). i.e. can I be sure that GPFS will place the descriptors on the quorum nodes with the descOnly disks and be sure about the placement? Or in site 1, might it place the descriptor on "lots of storage" in FG100. I did wonder about using more FGs, but I'm a little hesitant as the docs are unclear if there are more than 5 ... (and us accidentally ending up with too many replicas in say Site 1) Thanks Simon From S.J.Thompson at bham.ac.uk Fri Sep 28 14:44:34 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Fri, 28 Sep 2018 13:44:34 +0000 Subject: [gpfsug-discuss] Request for Enhancements (RFE) Forum - Submission Deadline October 1 In-Reply-To: <585b21e7-d437-380f-65d8-d24fa236ce3b@nasa.gov> References: <52220937-CE0A-4949-89A0-6EA41D5ECF93@lbl.gov> <263e53c18647421f8b3cd936da0075df@jumptrading.com> <0341213A-6CB7-434F-A575-9099C2D0D703@spectrumscale.org> <585b21e7-d437-380f-65d8-d24fa236ce3b@nasa.gov> Message-ID: <841FA5CA-5C6B-4626-8137-BA5994C3A651@bham.ac.uk> There is a limit on votes, not submissions. i.e. your site gets three votes, so you can't have three votes and someone else from Goddard also have three. We have to review the submissions, so as you say 10 we'd think unreasonable and skip, but a sensible number is OK. Simon ?On 28/09/2018, 13:52, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Aaron Knister" wrote: Hi Kristy, At some point I thought I'd read there was a per-site limit of the number of RFEs that could be submitted but I can't find it skimming through email. I'd think submitting 10 would be unreasonable but would 2 or 3 be OK? -Aaron On 9/27/18 4:35 PM, Kristy Kallback-Rose wrote: > Reminder, the*October 1st* deadline is approaching. 
We?re looking for at > least a few RFEs (Requests For Enhancements) for this first forum, so if > you?re interesting in promoting your RFE please reach out to one of us, > or even here on the list. > > Thanks, > Kristy > >> On Sep 7, 2018, at 3:00 AM, Simon Thompson (Spectrum Scale User Group >> Chair) > wrote: >> >> GPFS/Spectrum Scale Users, >> Here?s a long-ish note about our plans to try and improve the RFE >> process. We?ve tried to include a tl;dr version if you just read the >> headers. You?ll find the details underneath ;-) and reading to the end >> is ideal. >> >> IMPROVING THE RFE PROCESS >> As you?ve heard on the list, and at some of the in-person User Group >> events, we?ve been talking about ways we can improve the RFE process. >> We?d like to begin having an RFE forum, and have it be de-coupled from >> the in-person events because we know not everyone can travel. >> LIGHTNING PRESENTATIONS ON-LINE >> In general terms, we?d have regular on-line events, where RFEs could >> be/very briefly/(5 minutes, lightning talk) presented by the >> requester. There would then be time for brief follow-on discussion >> and questions. The session would be recorded to deal with large time >> zone differences. >> The live meeting is planned for October 10^th 2018, at 4PM BST (that >> should be 11am EST if we worked is out right!) >> FOLLOW UP POLL >> A poll, independent of current RFE voting, would be conducted a couple >> days after the recording was available to gather votes and feedback >> on the RFEs submitted ?we may collect site name, to see how many votes >> are coming from a certain site. >> >> MAY NOT GET IT RIGHT THE FIRST TIME >> We view this supplemental RFE process as organic, that is, we?ll learn >> as we go and make modifications. The overall goal here is to highlight >> the RFEs that matter the most to the largest number of UG members by >> providing a venue for people to speak about their RFEs and collect >> feedback from fellow community members. >> >> *RFE PRESENTERS WANTED, SUBMISSION DEADLINE OCTOBER 1ST >> *We?d like to guide a small handful of RFE submitters through this >> process the first time around, so if you?re interested in being a >> presenter, let us know now. We?re planning on doing the online meeting >> and poll for the first time in mid-October, so the submission deadline >> for your RFE is October 1st. If it?s useful, when you?re drafting your >> RFE feel free to use the list as a sounding board for feedback. Often >> sites have similar needs and you may find someone to collaborate with >> on your RFE to make it useful to more sites, and thereby get more >> votes. Some guidelines are here: >> https://drive.google.com/file/d/1o8nN39DTU32qj_EFia5wRhnvfvNfr3cI/view?usp=sharing >> You can submit you RFE by email to:rfe at spectrumscaleug.org >> >> >> *PARTICIPANTS (AKA YOU!!), VIEW AND VOTE >> *We are seeking very good participation in the RFE on-line events >> needed to make this an effective method of Spectrum Scale Community >> and IBM Developer collaboration. * It is to your benefit to >> participate and help set priorities on Spectrum Scale enhancements!! >> *We want to make this process light lifting for you as a participant. >> We will limit the duration of the meeting to 1 hour to minimize the >> use of your valuable time. >> >> Please register for the online meeting via Eventbrite >> (https://www.eventbrite.com/e/spectrum-scale-request-for-enhancements-voting-tickets-49979954389) >> ? we?ll send details of how to join the online meeting nearer the time. 
>>
>> Thanks!
>>
>> Simon, Kristy, Bob, Bryan and Carl!

--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776

From Robert.Oesterlin at nuance.com Fri Sep 28 18:37:53 2018
From: Robert.Oesterlin at nuance.com (Oesterlin, Robert)
Date: Fri, 28 Sep 2018 17:37:53 +0000
Subject: [gpfsug-discuss] Upcoming SSUG Meetups - IBM Technical University
Message-ID: <7AF6FC30-D45C-4D4D-9FFB-F5B808632F54@nuance.com>

If you are attending either of the upcoming IBM Technical Universities, be sure to register and stop by for the Scale User Group Meetups.

1. NA Systems Technical University - https://www-03.ibm.com/services/learning/ites.wss/zz-en?pageType=page&c=C954862Z05206G17
October 15 - 19 | Hollywood, FL
The Diplomat Beach Resort

Date: Wednesday 17th October 2018
Time: 17:30 - 19:30
Room: 204

Agenda:
17:30 - Welcome & Introductions
17:40 - IBM Spectrum Scale Enhancements and CORAL
18:30 - Spectrum Scale Use Cases
18:50 - Client Presentation - Why Spectrum Scale?
19:10 - AI with ESS: Better Inference, Throughput and Cost
19:30 - Questions & Close
19:40 - Cocktails & Networking

2. Europe Systems Technical University - https://www-03.ibm.com/services/learning/ites.wss/zz-en?pageType=page&c=Q424461P26867Y60
22-26 October 2018, Rome, Italy
Rome Marriott Park Hotel

Date: Wednesday 24th October 2018
Time: 17:00 - 19:30
Room: Bernini 4

Agenda:
17:00 - Welcome & Introductions
17:10 - IBM Spectrum Scale Enhancements and CORAL
18:00 - Spectrum Scale Use Cases
18:30 - Client Presentation - Why Spectrum Scale?
18:50 - AI with ESS: Better Inference, Throughput and Cost
19:20 - Questions & Close
19:30 - Cocktails & Networking

Bob Oesterlin
Sr Principal Storage Engineer, Nuance
507-269-0413

From valdis.kletnieks at vt.edu Fri Sep 28 23:18:16 2018
From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu)
Date: Fri, 28 Sep 2018 18:18:16 -0400
Subject: [gpfsug-discuss] mmapplypolicy - sql math on crack?
Message-ID: <34863.1538173096@turing-police.cc.vt.edu>

OK, so we're running ltfs/ee for archiving to tape, and currently we migrate based purely on "largest file first". I'm trying to cook up something that does "largest LRU first" (basically, large unaccessed files get moved earlier than large used files). So I need to do some testing to see what function works best. First cut was "file size times number of months idle". And the policy looks like this (yes, I know FILE_SIZE instead of KB_ALLOCATED when you have terabyte files is silly.. bear with me...):
--
define(user_exclude_list,(PATH_NAME LIKE '/gpfs/archive/.ltfsee/%'
  OR PATH_NAME LIKE '/gpfs/archive/.SpaceMan/%'
  OR PATH_NAME LIKE '/gpfs/archive/ces/%'
  OR PATH_NAME LIKE '/gpfs/archive/config/%'))

define(is_premigrated,(MISC_ATTRIBUTES LIKE '%M%' AND MISC_ATTRIBUTES NOT LIKE '%V%'))
define(is_migrated,(MISC_ATTRIBUTES LIKE '%V%'))
define(is_resident,(NOT MISC_ATTRIBUTES LIKE '%M%'))

define(months_old,((DAYS(CURRENT_TIMESTAMP)-DAYS(ACCESS_TIME))/30))
define(STALE,(months_old * FILE_SIZE))

define(attr,varchar($1) || ' ')
define(all_attrs,attr(USER_ID) ||
  attr(GROUP_ID) ||
  attr(FILE_SIZE) ||
  attr(STALE) ||
  attr(months_old) ||
  attr(DAYS(CURRENT_TIMESTAMP)-DAYS(ACCESS_TIME)) ||
  attr(ACCESS_TIME))

RULE 'SYSTEM_POOL_PLACEMENT_RULE' SET POOL 'system'

RULE EXTERNAL LIST 'testdrive_files' EXEC ''

RULE 'FILES_PRUNE' LIST 'testdrive_files'
  THRESHOLD(85,83)
  WEIGHT(FILE_SIZE)
  SHOW('candidate ' || all_attrs)
  WHERE is_premigrated AND (LENGTH(PATH_NAME) < 200) AND NOT user_exclude_list
--

And something odd happens (tossing out files where 'months_old' is 0 and thus STALE is as well):

grep -v ' 0 0 ' ../list.testdrive_files | head -10 | cut -f1-4 -d/ | sed 's/$/(...)/'

528469735 1844029648 0 candidate 21008 675 132918210560 -897871872 23 714 2016-10-14 07:00:52.000000 -- /gpfs/archive/vbi(...)
528469499 452828309 0 candidate 21008 675 128930785280 1880627200 23 714 2016-10-14 05:51:37.000000 -- /gpfs/archive/vbi(...)
528994319 1729658954 0 candidate 21008 675 122826721280 -1073891328 23 714 2016-10-14 08:03:26.000000 -- /gpfs/archive/vbi(...)
521263267 1704365147 0 candidate 1691616 1691616 111187014003 111187014003 1 44 2018-08-15 19:15:18.534360 -- /gpfs/archive/arc(...)
529340511 975566740 0 candidate 21008 675 92028549120 -762247168 23 715 2016-10-13 20:02:50.000000 -- /gpfs/archive/vbi(...)
528289152 1660557219 0 candidate 21008 675 91083468800 -1024258048 23 714 2016-10-14 05:32:01.000000 -- /gpfs/archive/vbi(...)
-- this pair particularly interesting
513739717 1569434383 0 candidate 1691616 1691616 83291991689 -919741166 2 63 2018-07-27 20:34:49.532723 -- /gpfs/archive/arc(...)
513739667 229891659 0 candidate 1691616 1691616 82634076418 2059395588 2 63 2018-07-27 20:27:23.532723 -- /gpfs/archive/arc(...)
--
8007095 9763066 0 candidate 501 502 79531799574 -1823771078 119 3595 2008-11-24 17:00:25.000000 -- /gpfs/archive/edisc(...)
527263956 829892493 0 candidate 1921090 1921090 73720064000 73720064000 1 47 2018-08-12 02:18:07.534040 -- /gpfs/archive/arc(...)

It *looks* like "if you multiply by 0, you get zero, multiply by 1 (as in the last line), you get full precision out to at least 2**38, but multiply by anything else you get the value mod 2**32 and treated as a signed integer"....

Is that the intended/documented behavior of "(value * value)" in a policy statement?
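A quick arithmetic check of that hypothesis against the listing itself: for the first line, 23 * 132918210560 = 3057118842880, and 3057118842880 mod 2**32 = 3397095424, which read as a signed 32-bit integer is -897871872, exactly the STALE value shown. The second line works out the same way (23 * 128930785280 mod 2**32 = 1880627200), so the output is consistent with the multiplication being carried out in 32-bit signed integer arithmetic.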
From makaplan at us.ibm.com Sat Sep 29 16:23:08 2018
From: makaplan at us.ibm.com (Marc A Kaplan)
Date: Sat, 29 Sep 2018 11:23:08 -0400
Subject: [gpfsug-discuss] mmapplypolicy - sql math on crack? - Please try a workaround
In-Reply-To: <34863.1538173096@turing-police.cc.vt.edu>
References: <34863.1538173096@turing-police.cc.vt.edu>
Message-ID:

This may be a bug and/or a peculiarity of the SQL type system. A proper investigation and full explanation will take more time than I have right now.

In the meanwhile, please try forcing the computation/arithmetic to use floating point by changing your "30" to "30.0", and let us know if that helps you move along in your task at hand.
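To make the suggested experiment concrete, a sketch of the only change being proposed to the policy earlier in the thread (everything else stays as posted):

define(months_old,((DAYS(CURRENT_TIMESTAMP)-DAYS(ACCESS_TIME))/30.0))
define(STALE,(months_old * FILE_SIZE))

The intent, per the suggestion above, is that dividing by 30.0 pushes the months_old/STALE computation into floating point rather than 32-bit integer arithmetic; note it also makes months_old a fractional value rather than a whole number of months, which may or may not matter for the weighting experiment.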
It *looks* like "if you multiply by 0, you get zero, multiply by 1 (as in the last line), you get full precision out to at least 2**38, but multiply by anything else you get the value mod 2**32 and treated as a signed integer".... Is that the intended/documented behavior of "(value * value)" in a policy statement? [attachment "attwost2.dat" deleted by Marc A Kaplan/Watson/IBM] _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From valdis.kletnieks at vt.edu Sat Sep 29 21:52:22 2018 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Sat, 29 Sep 2018 16:52:22 -0400 Subject: [gpfsug-discuss] mmapplypolicy - sql math on crack? - Please try a workaround In-Reply-To: References: <34863.1538173096@turing-police.cc.vt.edu> Message-ID: <191368.1538254342@turing-police.cc.vt.edu> On Sat, 29 Sep 2018 11:23:08 -0400, "Marc A Kaplan" said: > This may be a bug and/or a peculiarity of the SQL type system. A proper > investigation and full explanation will take more time than I have right > now. > > In the meanwhile please try forcing the computation/arithmetic to use > floating point by changing your "30" to "30.0" > and let us know if that helps you move along in your task at hand. Oh, once I figured out what was going on, I changed from FILE_SIZE "bytes" to KB_ALLOCATED / 1024 "megabytes" and knocked enough powers of two off so things don't go wonky till that one researcher has a 79 terabyte file he doesn't look at for many decades. :) I'll probably take my original email and stuff it into a PMR later today just so the weirdness doesn't get lost....