[gpfsug-discuss] Tuning Spectrum Scale AFM for stability?
Andi Christiansen
andi at christiansen.xxx
Tue Apr 28 13:25:37 BST 2020
Hi Venkat,
The AFM fileset becomes totally unresponsive from all nodes within the cluster, and the only way to resolve it is to do a "mmshutdown", wait 2 minutes, then run "mmshutdown" again since it cannot really complete the first time, and then a "mmstartup". Then all is back to normal: AFM is stopped and can be started again for another week or so.
mmafmctl <filesystem> stop -j <fileset> will just hang endlessly.
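For reference, the full workaround looks roughly like this on the gateway node (fs02/AFMFILESET are our names; the final start is the normal mmafmctl start for the fileset):

mmshutdown                          # first attempt usually hangs on the stuck fileset
sleep 120                           # wait ~2 minutes
mmshutdown                          # second attempt completes
mmstartup
mmafmctl fs02 start -j AFMFILESET   # resume replication on the fileset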
I will try to set that value and see if it does anything for us :)
Thanks!
Best Regards
Andi Christiansen
> On April 28, 2020 1:37 PM Venkateswara R Puvvada <vpuvvada at in.ibm.com> wrote:
>
>
> Hi,
>
> What do you mean by a lock down of the AFM fileset? Are the messages in a requeued state, and AFM won't replicate any data? I would recommend opening a ticket and collecting the logs and an internaldump from the gateway node while the replication is stuck.
>
> You can also try increasing the value of the afmAsyncOpWaitTimeout option and see if this solves the issue.
>
> mmchconfig afmAsyncOpWaitTimeout=3600 -i
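>
> You can verify the active value afterwards with, for example:
>
> mmlsconfig afmAsyncOpWaitTimeout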
>
> ~Venkat (vpuvvada at in.ibm.com)
>
>
>
> From: Andi Christiansen <andi at christiansen.xxx>
> To: "gpfsug-discuss at spectrumscale.org" <gpfsug-discuss at spectrumscale.org>
> Date: 04/28/2020 12:04 PM
> Subject: [EXTERNAL] [gpfsug-discuss] Tuning Spectrum Scale AFM for stability?
> Sent by: gpfsug-discuss-bounces at spectrumscale.org
>
> ---------------------------------------------
>
>
>
> Hi All,
>
> Can anyone share some thoughts on how to tune AFM for stability? At the moment we have OK performance between our sites (5-8 Gbit/s with 34 ms latency), but we encounter a lock down of the cache fileset from week to week, which was day to day before we tuned the settings below. Is there any way to tune AFM further that I haven't found?
>
>
> Cache Site only:
> TCP Settings:
> sunrpc.tcp_slot_table_entries = 128
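>
> (To persist this across reboots it can go into a sysctl drop-in; the file name below is just an example, and the sunrpc module must be loaded for the key to exist:)
>
> echo 'sunrpc.tcp_slot_table_entries = 128' > /etc/sysctl.d/90-afm-nfs.conf
> sysctl -p /etc/sysctl.d/90-afm-nfs.conf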
>
>
> Home and Cache:
> AFM / GPFS Settings:
> maxBufferDescs=163840
> afmHardMemThreshold=25G
> afmMaxWriteMergeLen=30G
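>
> (These correspond to mmchconfig settings; a sketch of the commands, assuming -i is accepted for an immediate change on these attributes:)
>
> mmchconfig maxBufferDescs=163840 -i
> mmchconfig afmHardMemThreshold=25G -i
> mmchconfig afmMaxWriteMergeLen=30G -i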
>
>
> Cache fileset:
> Attributes for fileset AFMFILESET:
> ================================
> Status Linked
> Path /mnt/fs02/AFMFILESET
> Id 1
> Root inode 524291
> Parent Id 0
> Created Tue Apr 14 15:57:43 2020
> Comment
> Inode space 1
> Maximum number of inodes 10000384
> Allocated inodes 10000384
> Permission change flag chmodAndSetacl
> afm-associated Yes
> Target nfs://DK_VPN/mnt/fs01/AFMFILESET
> Mode single-writer
> File Lookup Refresh Interval 30 (default)
> File Open Refresh Interval 30 (default)
> Dir Lookup Refresh Interval 60 (default)
> Dir Open Refresh Interval 60 (default)
> Async Delay 15 (default)
> Last pSnapId 0
> Display Home Snapshots no
> Number of Read Threads per Gateway 64
> Parallel Read Chunk Size 128
> Parallel Read Threshold 1024
> Number of Gateway Flush Threads 48
> Prefetch Threshold 0 (default)
> Eviction Enabled yes (default)
> Parallel Write Threshold 1024
> Parallel Write Chunk Size 128
> Number of Write Threads per Gateway 16
> IO Flags 0 (default)
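>
> (The non-default values above are per-fileset AFM tunables; a sketch of how they would be set with mmchfileset, assuming the usual afm* parameter names correspond to the attributes shown:)
>
> mmchfileset fs02 AFMFILESET -p afmNumReadThreads=64
> mmchfileset fs02 AFMFILESET -p afmNumWriteThreads=16
> mmchfileset fs02 AFMFILESET -p afmNumFlushThreads=48
> mmchfileset fs02 AFMFILESET -p afmParallelReadThreshold=1024
> mmchfileset fs02 AFMFILESET -p afmParallelWriteThreshold=1024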
>
>
> mmfsadm dump afm:
> AFM Gateway:
> RpcQLen: 0 maxPoolSize: 4294967295 QOF: 0 MaxOF: 131072
> readThLimit 128 minIOBuf 1048576 maxIOBuf 1073741824 msgMaxWriteSize 2147483648
> readBypassThresh 67108864
> QLen: 0 QMem: 0 SoftQMem: 10737418240 HardQMem 26843545600
> Ping thread: Started
> Fileset: AFMFILESET 1 (fs02)
> mode: single-writer queue: Normal MDS: <c0n1> QMem 0 CTL 577
> home: DK_VPN homeServer: 10.110.5.11 proto: nfs port: 2049 lastCmd: 16
> handler: Mounted Dirty refCount: 1
> queueTransfer: state: Idle senderVerified: 0 receiverVerified: 1 terminate: 0 psnapWait: 0
> remoteAttrs: AsyncLookups 0 tsfindinode: success 0 failed 0 totalTime 0.0 avgTime 0.000000 maxTime 0.0
> queue: delay 15 QLen 0+0 flushThds 0 maxFlushThds 48 numExec 8772518 qfs 0 iwo 0 err 78
> handlerCreateTime : 2020-04-27_11:14:57.415+0200 numCreateSnaps : 0 InflightAsyncLookups 0
> lastReplayTime : 2020-04-28_07:22:32.415+0200 lastSyncTime : 2020-04-27_15:09:57.415+0200
> i/o: readBuf: 33554432 writeBuf: 2097152 sparseReadThresh: 134217728 pReadThreads 64
> i/o: pReadChunkSize 33554432 pReadThresh: 1073741824 pWriteThresh: 1073741824
> i/o: prefetchThresh 0 (Prefetch)
> Mnt status: 0:0 1:0 2:0 3:0
> Export Map: 10.110.5.10/<c0n0> 10.110.5.11/<c0n1> 10.110.5.12/<c0n2> 10.110.5.13/<c0n9>
> Priority Queue: Empty (state: Active)
> Normal Queue: Empty (state: Active)
>
>
> Cluster Config Cache:
> maxFilesToCache 131072
> maxStatCache 524288
> afmDIO 2
> afmIOFlags 4096
> maxReceiverThreads 32
> afmNumReadThreads 64
> afmNumWriteThreads 8
> afmHardMemThreshold 26843545600
> maxBufferDescs 163840
> afmMaxWriteMergeLen 32212254720
> workerThreads 1024
>
>
> The entries in the GPFS log state "AFM: Home is taking longer to respond...", but it is only AFM and the cache AFM fileset that enter a locked state. We have the same NFS exports from home mounted on the same gateway nodes to check when a file is transferred, and they are all OK while the AFM lock is happening. A simple GPFS restart of the AFM master node is enough to make AFM restart and continue for another week.
>
>
> The home target is exported through CES NFS from 4 CES nodes, and a map is created at the cache site to utilize the parallel writes feature.
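>
> A minimal sketch of such a mapping, assuming mmafmconfig's export-map syntax (the gateway node names gw1-gw4 are placeholders; the export server IPs and the DK_VPN map name are from the dump and fileset target above):
>
> mmafmconfig add DK_VPN --export-map 10.110.5.10/gw1,10.110.5.11/gw2,10.110.5.12/gw3,10.110.5.13/gw4
>
> The fileset target then references the map name instead of a single server, i.e. nfs://DK_VPN/mnt/fs01/AFMFILESET.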
>
>
> If anyone has some ideas/knowledge on how to tune this further for more stability, I would be happy if you could share your thoughts! :-)
>
>
> Many Thanks in Advance!
> Andi Christiansen
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
>
>
>