[gpfsug-discuss] Tuning Spectrum Scale AFM for stability?
Andi Christiansen
andi at christiansen.xxx
Tue Apr 28 13:25:37 BST 2020
Hi Venkat,
The AFM fileset becomes totally unresponsive from all nodes within the cluster, and the only way to resolve it is to do a "mmshutdown", wait 2 minutes, then run "mmshutdown" again since it cannot really complete the first time, and then a "mmstartup". Then all is back to normal: AFM is stopped and can be started again for another week or so.
mmafmctl <filesystem> stop -j <fileset> will just hang endlessly.
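For reference, the full workaround looks roughly like this on the gateway node (fs02/AFMFILESET are our names; the final start is the normal mmafmctl start for the fileset):

mmshutdown                          # first attempt usually hangs on the stuck fileset
sleep 120                           # wait ~2 minutes
mmshutdown                          # second attempt completes
mmstartup
mmafmctl fs02 start -j AFMFILESET   # resume replication on the fileset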
I will try to set that value and see if it does anything for us :)
Thanks!
Best Regards
Andi Christiansen
> On April 28, 2020 1:37 PM Venkateswara R Puvvada <vpuvvada at in.ibm.com> wrote:
>
>
> Hi,
>
> What do you mean by a lock down of the AFM fileset? Are the messages in a requeued state, and AFM won't replicate any data? I would recommend opening a ticket and collecting the logs and an internaldump from the gateway node while the replication is stuck.
>
> You can also try increasing the value of the afmAsyncOpWaitTimeout option and see if this solves the issue.
>
> mmchconfig afmAsyncOpWaitTimeout=3600 -i
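>
> You can verify the active value afterwards with, for example:
>
> mmlsconfig afmAsyncOpWaitTimeout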
>
> ~Venkat (vpuvvada at in.ibm.com)
>
>
>
> From: Andi Christiansen <andi at christiansen.xxx>
> To: "gpfsug-discuss at spectrumscale.org" <gpfsug-discuss at spectrumscale.org>
> Date: 04/28/2020 12:04 PM
> Subject: [EXTERNAL] [gpfsug-discuss] Tuning Spectrum Scale AFM for stability?
> Sent by: gpfsug-discuss-bounces at spectrumscale.org
>
> ---------------------------------------------
>
>
>
> Hi All,
>
> Can anyone share some thoughts on how to tune AFM for stability? At the moment we have OK performance between our sites (5-8 Gbit/s with 34 ms latency), but we encounter a lock down of the cache fileset from week to week, which was day to day before we tuned the settings below. Is there any way to tune AFM further that I haven't found?
>
>
> Cache Site only:
> TCP Settings:
> sunrpc.tcp_slot_table_entries = 128
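>
> (To persist this across reboots it can go into a sysctl drop-in; the file name below is just an example, and the sunrpc module must be loaded for the key to exist:)
>
> echo 'sunrpc.tcp_slot_table_entries = 128' > /etc/sysctl.d/90-afm-nfs.conf
> sysctl -p /etc/sysctl.d/90-afm-nfs.conf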
>
>
> Home and Cache:
> AFM / GPFS Settings:
> maxBufferDescs=163840
> afmHardMemThreshold=25G
> afmMaxWriteMergeLen=30G
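>
> (These correspond to mmchconfig settings; a sketch of the commands, assuming -i is accepted for an immediate change on these attributes:)
>
> mmchconfig maxBufferDescs=163840 -i
> mmchconfig afmHardMemThreshold=25G -i
> mmchconfig afmMaxWriteMergeLen=30G -i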
>
>
> Cache fileset:
> Attributes for fileset AFMFILESET:
> ================================
> Status Linked
> Path /mnt/fs02/AFMFILESET
> Id 1
> Root inode 524291
> Parent Id 0
> Created Tue Apr 14 15:57:43 2020
> Comment
> Inode space 1
> Maximum number of inodes 10000384
> Allocated inodes 10000384
> Permission change flag chmodAndSetacl
> afm-associated Yes
> Target nfs://DK_VPN/mnt/fs01/AFMFILESET
> Mode single-writer
> File Lookup Refresh Interval 30 (default)
> File Open Refresh Interval 30 (default)
> Dir Lookup Refresh Interval 60 (default)
> Dir Open Refresh Interval 60 (default)
> Async Delay 15 (default)
> Last pSnapId 0
> Display Home Snapshots no
> Number of Read Threads per Gateway 64
> Parallel Read Chunk Size 128
> Parallel Read Threshold 1024
> Number of Gateway Flush Threads 48
> Prefetch Threshold 0 (default)
> Eviction Enabled yes (default)
> Parallel Write Threshold 1024
> Parallel Write Chunk Size 128
> Number of Write Threads per Gateway 16
> IO Flags 0 (default)
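>
> (The non-default values above are per-fileset AFM tunables; a sketch of how they would be set with mmchfileset, assuming the usual afm* parameter names correspond to the attributes shown:)
>
> mmchfileset fs02 AFMFILESET -p afmNumReadThreads=64
> mmchfileset fs02 AFMFILESET -p afmNumWriteThreads=16
> mmchfileset fs02 AFMFILESET -p afmNumFlushThreads=48
> mmchfileset fs02 AFMFILESET -p afmParallelReadThreshold=1024
> mmchfileset fs02 AFMFILESET -p afmParallelWriteThreshold=1024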
>
>
> mmfsadm dump afm:
> AFM Gateway:
> RpcQLen: 0 maxPoolSize: 4294967295 QOF: 0 MaxOF: 131072
> readThLimit 128 minIOBuf 1048576 maxIOBuf 1073741824 msgMaxWriteSize 2147483648
> readBypassThresh 67108864
> QLen: 0 QMem: 0 SoftQMem: 10737418240 HardQMem 26843545600
> Ping thread: Started
> Fileset: AFMFILESET 1 (fs02)
> mode: single-writer queue: Normal MDS: <c0n1> QMem 0 CTL 577
> home: DK_VPN homeServer: 10.110.5.11 proto: nfs port: 2049 lastCmd: 16
> handler: Mounted Dirty refCount: 1
> queueTransfer: state: Idle senderVerified: 0 receiverVerified: 1 terminate: 0 psnapWait: 0
> remoteAttrs: AsyncLookups 0 tsfindinode: success 0 failed 0 totalTime 0.0 avgTime 0.000000 maxTime 0.0
> queue: delay 15 QLen 0+0 flushThds 0 maxFlushThds 48 numExec 8772518 qfs 0 iwo 0 err 78
> handlerCreateTime : 2020-04-27_11:14:57.415+0200 numCreateSnaps : 0 InflightAsyncLookups 0
> lastReplayTime : 2020-04-28_07:22:32.415+0200 lastSyncTime : 2020-04-27_15:09:57.415+0200
> i/o: readBuf: 33554432 writeBuf: 2097152 sparseReadThresh: 134217728 pReadThreads 64
> i/o: pReadChunkSize 33554432 pReadThresh: 1073741824 pWriteThresh: 1073741824
> i/o: prefetchThresh 0 (Prefetch)
> Mnt status: 0:0 1:0 2:0 3:0
> Export Map: 10.110.5.10/<c0n0> 10.110.5.11/<c0n1> 10.110.5.12/<c0n2> 10.110.5.13/<c0n9>
> Priority Queue: Empty (state: Active)
> Normal Queue: Empty (state: Active)
>
>
> Cluster Config Cache:
> maxFilesToCache 131072
> maxStatCache 524288
> afmDIO 2
> afmIOFlags 4096
> maxReceiverThreads 32
> afmNumReadThreads 64
> afmNumWriteThreads 8
> afmHardMemThreshold 26843545600
> maxBufferDescs 163840
> afmMaxWriteMergeLen 32212254720
> workerThreads 1024
>
>
> The entries in the GPFS log state "AFM: Home is taking longer to respond...", but it is only AFM and the cache AFM fileset that enter a locked state. We have the same NFS exports from home mounted on the same gateway nodes to check when a file is transferred, and they are all OK while the AFM lock is happening. A simple GPFS restart of the AFM master node is enough to make AFM restart and continue for another week.
>
>
> The home target is exported through CES NFS from 4 CES nodes, and a map is created at the cache site to utilize the parallel writes feature.
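>
> A minimal sketch of such a mapping, assuming mmafmconfig's export-map syntax (the gateway node names gw1-gw4 are placeholders; the export server IPs and the DK_VPN map name are from the dump and fileset target above):
>
> mmafmconfig add DK_VPN --export-map 10.110.5.10/gw1,10.110.5.11/gw2,10.110.5.12/gw3,10.110.5.13/gw4
>
> The fileset target then references the map name instead of a single server, i.e. nfs://DK_VPN/mnt/fs01/AFMFILESET.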
>
>
> If anyone has some ideas/knowledge on how to tune this further for more stability, I would be happy if you could share your thoughts! :-)
>
>
> Many Thanks in Advance!
> Andi Christiansen
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
>
>
>