[gpfsug-discuss] RDMA write error IBV_WC_RETRY_EXC_ERR
Iban Cabrillo
cabrillo at ifca.unican.es
Fri Jul 9 12:19:07 BST 2021
Dear,
Since a couple of hours we are seen lots off IB error at GPFS logs, on every IB node (gpfs version is 5.0.4-3 ):
2021-07-09_13:11:40.600+0200: [E] VERBS RDMA closed connection to 10.10.152.73 (node157) on mlx5_0 port 1 fabnum 0 index 251 cookie 648 RDMA write error IBV_WC_RETRY_EXC_ERR
2021-07-09_13:11:40.600+0200: [E] VERBS RDMA closed connection to 10.10.152.18 (node102) on mlx5_0 port 1 fabnum 0 index 227 cookie 687 RDMA write error IBV_WC_RETRY_EXC_ERR
2021-07-09_13:11:40.600+0200: [E] VERBS RDMA closed connection to 10.10.152.17 (node101) on mlx5_0 port 1 fabnum 0 index 298 cookie 693 RDMA write error IBV_WC_RETRY_EXC_ERR
2021-07-09_13:11:40.600+0200: [E] VERBS RDMA closed connection to 10.10.151.6 (node6) on mlx5_0 port 1 fabnum 0 index 18 cookie 696 RDMA write error IBV_WC_RETRY_EXC_ERR
2021-07-09_13:11:40.601+0200: [E] VERBS RDMA closed connection to 10.10.152.46 (node130) on mlx5_0 port 1 fabnum 0 index 254 cookie 680 RDMA write error IBV_WC_RETRY_EXC_ERR
2021-07-09_13:11:40.601+0200: [E] VERBS RDMA closed connection to 10.10.151.81 (node81) on mlx5_0 port 1 fabnum 0 index 289 cookie 679 RDMA read error IBV_WC_RETRY_EXC_ERR
and ofcourse long waiters:
=== mmdiag: waiters ===
Waiting 34.8493 sec since 13:11:35, ignored, thread 2935 VerbsReconnectThread: delaying for 25.150686000 more seconds, reason: delaying for next reconnect attempt
Waiting 34.6249 sec since 13:11:35, ignored, thread 10198 VerbsReconnectThread: delaying for 25.375072000 more seconds, reason: delaying for next reconnect attempt
Waiting 27.0957 sec since 13:11:43, ignored, thread 10052 VerbsReconnectThread: delaying for 32.904264000 more seconds, reason: delaying for next reconnect attempt
Waiting 14.8909 sec since 13:11:55, monitored, thread 23135 NSDThread: for RDMA write completion fast on node 10.10.151.65 <c0n258>
Waiting 14.8891 sec since 13:11:55, monitored, thread 23109 NSDThread: for RDMA write completion fast on node 10.10.152.32 <c0n247>
Waiting 14.8865 sec since 13:11:55, monitored, thread 23302 NSDThread: for RDMA write completion fast on node 10.10.150.1 <c0n29>
[common]
verbsRdma enable
verbsPorts mlx4_0/1/0
[gpfs02,gpfs04,gpfs05,gpfs06,gpfs07,gpfs08]
verbsPorts mlx5_0/1/0
[gpfs01]
verbsPorts mlx5_1/1/0
[gpfs03]
verbsPorts mlx5_0/1/0 mlx5_1/1/0
[common]
verbsRdma enable
verbsPorts mlx4_0/1/0
[gpfs02,gpfs04,gpfs05,gpfs06,gpfs07,gpfs08,wngpu001,wngpu002,wngpu003,wngpu004,wngpu005]
verbsPorts mlx5_0/1/0
[gpfs01]
verbsPorts mlx5_1/1/0
[gpfs03]
verbsPorts mlx5_0/1/0 mlx5_1/1/0
Any advise is welcomed
regards, I
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://gpfsug.org/pipermail/gpfsug-discuss_gpfsug.org/attachments/20210709/85468531/attachment.htm>
More information about the gpfsug-discuss
mailing list