[gpfsug-discuss] Server lost NSD mappings
Sven Oehme
oehmes at us.ibm.com
Wed Oct 29 20:02:53 GMT 2014
Hi,
I was asking for the content, not the result :-)
Can you run: cat /var/mmfs/etc/nsddevices
The second output at least confirms that there is no correct label on the
disk, as it returns EFI.
On a GNR system you get a bit more information, but at the very least you
should see the NSD descriptor string, like I get on my system:
[root at gss02n1 ~]# dd if=/dev/sdaa bs=1k count=32 | strings
T7$V
e2d2s08
NSD descriptor for /dev/sdde created by GPFS Thu Oct 9 16:48:27 2014
32+0 records in
32+0 records out
32768 bytes (33 kB) copied, 0.0186765 s, 1.8 MB/s
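If it helps, a quick way to check all of the multipath LUNs at once is a
small loop like the one below (a sketch only, not an official tool; it
assumes the /dev/mapper aliases shown in your multipath -l output):
--
for d in /dev/mapper/dcs3800u31*_lun*; do
    echo "== $d =="
    # an intact NSD shows an "NSD descriptor ..." string;
    # "EFI PART" means a GPT label has been written over it
    dd if="$d" bs=1k count=32 2>/dev/null | strings | grep -E 'NSD descriptor|EFI PART'
done
--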
While I would still like to see the nsddevices script, I assume your NSD
descriptor has been wiped; without a lot of manual labor and at least a
recent GPFS dump, this will be very hard, if not impossible, to recreate.
------------------------------------------
Sven Oehme
Scalable Storage Research
email: oehmes at us.ibm.com
Phone: +1 (408) 824-8904
IBM Almaden Research Lab
------------------------------------------
From: Jared David Baker <Jared.Baker at uwyo.edu>
To: gpfsug main discussion list <gpfsug-discuss at gpfsug.org>
Date: 10/29/2014 12:46 PM
Subject: Re: [gpfsug-discuss] Server lost NSD mappings
Sent by: gpfsug-discuss-bounces at gpfsug.org
Sven, output below:
--
[root at mmmnsd5 ~]# /var/mmfs/etc/nsddevices
mapper/dcs3800u31a_lun0 dmm
mapper/dcs3800u31a_lun10 dmm
mapper/dcs3800u31a_lun2 dmm
mapper/dcs3800u31a_lun4 dmm
mapper/dcs3800u31a_lun6 dmm
mapper/dcs3800u31a_lun8 dmm
mapper/dcs3800u31b_lun1 dmm
mapper/dcs3800u31b_lun11 dmm
mapper/dcs3800u31b_lun3 dmm
mapper/dcs3800u31b_lun5 dmm
mapper/dcs3800u31b_lun7 dmm
mapper/dcs3800u31b_lun9 dmm
[root at mmmnsd5 ~]#
--
--
[root at mmmnsd5 /]# dd if=/dev/dm-0 bs=1k count=32 | strings
32+0 records in
32+0 records out
32768 bytes (33 kB) copied, 0.000739083 s, 44.3 MB/s
EFI PART
system
[root at mmmnsd5 /]#
--
Thanks, Jared
From: gpfsug-discuss-bounces at gpfsug.org [mailto:gpfsug-discuss-bounces at gpfsug.org] On Behalf Of Sven Oehme
Sent: Wednesday, October 29, 2014 1:41 PM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] Server lost NSD mappings
can you please post the content of your nsddevices script ?
also please run
dd if=/dev/dm-0 bs=1k count=32 |strings
and post the output
thx. Sven
------------------------------------------
Sven Oehme
Scalable Storage Research
email: oehmes at us.ibm.com
Phone: +1 (408) 824-8904
IBM Almaden Research Lab
------------------------------------------
From: Jared David Baker <Jared.Baker at uwyo.edu>
To: gpfsug main discussion list <gpfsug-discuss at gpfsug.org>
Date: 10/29/2014 12:27 PM
Subject: Re: [gpfsug-discuss] Server lost NSD mappings
Sent by: gpfsug-discuss-bounces at gpfsug.org
Thanks Ed,
I can see the multipath devices inside the OS after the reboot. The storage is
all SAS attached. Two servers can see the multipath LUNs for failover and
export the GPFS filesystem to the compute cluster.
--
[root at mmmnsd5 ~]# multipath -l
dcs3800u31a_lun8 (360080e500029600c000001e953cf8291) dm-4 IBM,1813 FAStT
size=29T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 0:0:0:8 sdi 8:128 active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
`- 0:0:1:8 sdu 65:64 active undef running
dcs3800u31b_lun9 (360080e5000295c68000001c253cf8221) dm-9 IBM,1813 FAStT
size=29T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 0:0:1:9 sdv 65:80 active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
`- 0:0:0:9 sdj 8:144 active undef running
dcs3800u31a_lun6 (360080e500029600c000001e653cf8210) dm-3 IBM,1813 FAStT
size=29T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 0:0:0:6 sdg 8:96 active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
`- 0:0:1:6 sds 65:32 active undef running
mpathm (3600605b007ca57d01b1b8a7a1a107bdd) dm-12 IBM,ServeRAID M1115
size=558G features='0' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=active
`- 1:2:0:0 sdy 65:128 active undef running
dcs3800u31b_lun7 (360080e5000295c68000001bd53cf81a9) dm-8 IBM,1813 FAStT
size=29T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 0:0:1:7 sdt 65:48 active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
`- 0:0:0:7 sdh 8:112 active undef running
dcs3800u31a_lun10 (360080e500029600c000001ec53cf8301) dm-5 IBM,1813 FAStT
size=29T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 0:0:0:10 sdk 8:160 active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
`- 0:0:1:10 sdw 65:96 active undef running
dcs3800u31a_lun4 (360080e500029600c000001e353cf8189) dm-1 IBM,1813 FAStT
size=29T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 0:0:0:4 sde 8:64 active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
`- 0:0:1:4 sdq 65:0 active undef running
dcs3800u31b_lun5 (360080e5000295c68000001b853cf8125) dm-10 IBM,1813 FAStT
size=29T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 0:0:1:5 sdr 65:16 active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
`- 0:0:0:5 sdf 8:80 active undef running
dcs3800u31a_lun2 (360080e500029600c000001e053cf80f9) dm-2 IBM,1813 FAStT
size=29T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 0:0:0:2 sdc 8:32 active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
`- 0:0:1:2 sdo 8:224 active undef running
dcs3800u31b_lun11 (360080e5000295c68000001c753cf828e) dm-11 IBM,1813 FAStT
size=29T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 0:0:1:11 sdx 65:112 active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
`- 0:0:0:11 sdl 8:176 active undef running
dcs3800u31b_lun3 (360080e5000295c68000001b353cf8097) dm-6 IBM,1813 FAStT
size=29T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 0:0:1:3 sdp 8:240 active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
`- 0:0:0:3 sdd 8:48 active undef running
dcs3800u31a_lun0 (360080e500029600c000001da53cf7ec1) dm-0 IBM,1813 FAStT
size=29T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 0:0:0:0 sda 8:0 active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
`- 0:0:1:0 sdm 8:192 active undef running
dcs3800u31b_lun1 (360080e5000295c68000001ac53cf7e8d) dm-7 IBM,1813 FAStT
size=29T features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 0:0:1:1 sdn 8:208 active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
`- 0:0:0:1 sdb 8:16 active undef running
[root at mmmnsd5 ~]#
--
--
[root at mmmnsd5 ~]# cat /proc/partitions
major minor #blocks name
8 48 31251951616 sdd
8 32 31251951616 sdc
8 80 31251951616 sdf
8 16 31251951616 sdb
8 128 31251951616 sdi
8 112 31251951616 sdh
8 96 31251951616 sdg
8 192 31251951616 sdm
8 240 31251951616 sdp
8 208 31251951616 sdn
8 144 31251951616 sdj
8 64 31251951616 sde
8 224 31251951616 sdo
8 160 31251951616 sdk
8 176 31251951616 sdl
65 0 31251951616 sdq
65 48 31251951616 sdt
65 16 31251951616 sdr
65 128 584960000 sdy
65 80 31251951616 sdv
65 96 31251951616 sdw
65 64 31251951616 sdu
65 112 31251951616 sdx
65 32 31251951616 sds
8 0 31251951616 sda
253 0 31251951616 dm-0
253 1 31251951616 dm-1
253 2 31251951616 dm-2
253 3 31251951616 dm-3
253 4 31251951616 dm-4
253 5 31251951616 dm-5
253 6 31251951616 dm-6
253 7 31251951616 dm-7
253 8 31251951616 dm-8
253 9 31251951616 dm-9
253 10 31251951616 dm-10
253 11 31251951616 dm-11
253 12 584960000 dm-12
253 13 524288 dm-13
253 14 16777216 dm-14
253 15 567657472 dm-15
[root at mmmnsd5 ~]#
--
The NSDs had no failure group defined on creation.
Regards,
Jared
-----Original Message-----
From: gpfsug-discuss-bounces at gpfsug.org [mailto:gpfsug-discuss-bounces at gpfsug.org] On Behalf Of Ed Wahl
Sent: Wednesday, October 29, 2014 1:08 PM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] Server lost NSD mappings
Can you see the block devices from inside the OS after the reboot? I
don't see where you mention this. How is the storage attached to the
server? A DCS3700/3800 can be FC, SAS, or IB attached; which is yours? Do
the nodes share the storage? Are all NSDs in the same failure group? A
failed SRP_DAEMON lookup to IB storage from a badly updated IB card quickly
came to mind, but I would hope you'd notice the lack of block devices.
cat /proc/partitions ?
multipath -l ?
Our GPFS changes device mapper multipath names all the time (dm-127 one
day, dm-something else another), so that is no problem. But whacking the
volume label is a pain.
When hardware dies, if you have NSD servers sharing the same LUNs, you can
just transfer /var/mmfs/gen/mmsdrfs from another node and Bob's your uncle.
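For example, something along these lines (a sketch only; the node name
comes from Jared's mmlsnsd output, and the mmsdrrestore options should be
checked against your GPFS level):
--
# copy the cluster configuration file from a healthy NSD server, e.g. the sister node
scp mminsd6:/var/mmfs/gen/mmsdrfs /var/mmfs/gen/mmsdrfs
# or let GPFS pull a good copy for you from that node
/usr/lpp/mmfs/bin/mmsdrrestore -p mminsd6
--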
Ed Wahl
OSC
________________________________________
From: gpfsug-discuss-bounces at gpfsug.org
[gpfsug-discuss-bounces at gpfsug.org] on behalf of Jared David Baker
[Jared.Baker at uwyo.edu]
Sent: Wednesday, October 29, 2014 11:31 AM
To: gpfsug-discuss at gpfsug.org
Subject: [gpfsug-discuss] Server lost NSD mappings
Hello all,
I'm hoping that somebody can shed some light on a problem I experienced
yesterday. I've been working with GPFS as an admin for a couple of months
now, but I've come across a problem that I'm unable to see the answer to.
Hopefully the solution is not listed somewhere blatantly on the web, but I
spent a fair amount of time looking last night. Here is the situation:
yesterday I needed to update firmware on a Mellanox FDR14 HCA, reboot one
of our GPFS servers, and then repeat for the sister node (IBM x3550 and
DCS3850, serving as HPSS for our main campus cluster). However, upon
reboot, the server seemed to lose the path mappings to the multipath
devices for the NSDs. Output below:
--
[root at mmmnsd5 ~]# mmlsnsd -m -f gscratch
Disk name NSD volume ID Device Node name Remarks
---------------------------------------------------------------------------------------
dcs3800u31a_lun0 0A62001B54235577 - mminsd5.infini (not found) server node
dcs3800u31a_lun0 0A62001B54235577 - mminsd6.infini (not found) server node
dcs3800u31a_lun10 0A62001C542355AA - mminsd6.infini (not found) server node
dcs3800u31a_lun10 0A62001C542355AA - mminsd5.infini (not found) server node
dcs3800u31a_lun2 0A62001C54235581 - mminsd6.infini (not found) server node
dcs3800u31a_lun2 0A62001C54235581 - mminsd5.infini (not found) server node
dcs3800u31a_lun4 0A62001B5423558B - mminsd5.infini (not found) server node
dcs3800u31a_lun4 0A62001B5423558B - mminsd6.infini (not found) server node
dcs3800u31a_lun6 0A62001C54235595 - mminsd6.infini (not found) server node
dcs3800u31a_lun6 0A62001C54235595 - mminsd5.infini (not found) server node
dcs3800u31a_lun8 0A62001B5423559F - mminsd5.infini (not found) server node
dcs3800u31a_lun8 0A62001B5423559F - mminsd6.infini (not found) server node
dcs3800u31b_lun1 0A62001B5423557C - mminsd5.infini (not found) server node
dcs3800u31b_lun1 0A62001B5423557C - mminsd6.infini (not found) server node
dcs3800u31b_lun11 0A62001C542355AF - mminsd6.infini (not found) server node
dcs3800u31b_lun11 0A62001C542355AF - mminsd5.infini (not found) server node
dcs3800u31b_lun3 0A62001C54235586 - mminsd6.infini (not found) server node
dcs3800u31b_lun3 0A62001C54235586 - mminsd5.infini (not found) server node
dcs3800u31b_lun5 0A62001B54235590 - mminsd5.infini (not found) server node
dcs3800u31b_lun5 0A62001B54235590 - mminsd6.infini (not found) server node
dcs3800u31b_lun7 0A62001C5423559A - mminsd6.infini (not found) server node
dcs3800u31b_lun7 0A62001C5423559A - mminsd5.infini (not found) server node
dcs3800u31b_lun9 0A62001B542355A4 - mminsd5.infini (not found) server node
dcs3800u31b_lun9 0A62001B542355A4 - mminsd6.infini (not found) server node
[root at mmmnsd5 ~]#
--
Also, the system was working fantastically before the reboot, but now I'm
unable to mount the GPFS filesystem. The disk names look like they are
there and mapped to the NSD volume IDs, but there is no Device. I've
created the /var/mmfs/etc/nsddevices script, and with the user exit
returning 0 it produces the following output:
--
[root at mmmnsd5 ~]# /var/mmfs/etc/nsddevices
mapper/dcs3800u31a_lun0 dmm
mapper/dcs3800u31a_lun10 dmm
mapper/dcs3800u31a_lun2 dmm
mapper/dcs3800u31a_lun4 dmm
mapper/dcs3800u31a_lun6 dmm
mapper/dcs3800u31a_lun8 dmm
mapper/dcs3800u31b_lun1 dmm
mapper/dcs3800u31b_lun11 dmm
mapper/dcs3800u31b_lun3 dmm
mapper/dcs3800u31b_lun5 dmm
mapper/dcs3800u31b_lun7 dmm
mapper/dcs3800u31b_lun9 dmm
[root at mmmnsd5 ~]#
--
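(For comparison, a user exit that produces output like the above is
usually only a few lines; a rough sketch modeled on the nsddevices sample
that GPFS ships, and the real script may well differ:)
--
#!/bin/ksh
# report each multipath alias to GPFS as a dmm (device-mapper multipath) device
for dev in /dev/mapper/dcs3800u31*_lun*; do
    echo "mapper/${dev##*/} dmm"
done
# return 0 so GPFS continues with its built-in device discovery as well
return 0
--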
That output looks correct to me based on the documentation. So I went
digging in the GPFS log file and found this relevant information:
--
Tue Oct 28 23:44:48.405 2014: I/O to NSD disk, dcs3800u31a_lun0, fails. No such NSD locally found.
Tue Oct 28 23:44:48.481 2014: I/O to NSD disk, dcs3800u31b_lun1, fails. No such NSD locally found.
Tue Oct 28 23:44:48.555 2014: I/O to NSD disk, dcs3800u31a_lun2, fails. No such NSD locally found.
Tue Oct 28 23:44:48.629 2014: I/O to NSD disk, dcs3800u31b_lun3, fails. No such NSD locally found.
Tue Oct 28 23:44:48.703 2014: I/O to NSD disk, dcs3800u31a_lun4, fails. No such NSD locally found.
Tue Oct 28 23:44:48.775 2014: I/O to NSD disk, dcs3800u31b_lun5, fails. No such NSD locally found.
Tue Oct 28 23:44:48.844 2014: I/O to NSD disk, dcs3800u31a_lun6, fails. No such NSD locally found.
Tue Oct 28 23:44:48.919 2014: I/O to NSD disk, dcs3800u31b_lun7, fails. No such NSD locally found.
Tue Oct 28 23:44:48.989 2014: I/O to NSD disk, dcs3800u31a_lun8, fails. No such NSD locally found.
Tue Oct 28 23:44:49.060 2014: I/O to NSD disk, dcs3800u31b_lun9, fails. No such NSD locally found.
Tue Oct 28 23:44:49.128 2014: I/O to NSD disk, dcs3800u31a_lun10, fails. No such NSD locally found.
Tue Oct 28 23:44:49.199 2014: I/O to NSD disk, dcs3800u31b_lun11, fails. No such NSD locally found.
--
Okay, so the NSDs cannot be found locally, so I attempted to rediscover
them by running mmnsddiscover:
--
[root at mmmnsd5 ~]# mmnsddiscover
mmnsddiscover: Attempting to rediscover the disks. This may take a while ...
mmnsddiscover: Finished.
[root at mmmnsd5 ~]#
--
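(As a side note, mmnsddiscover can also be pointed at all NSD servers at
once; a sketch, assuming the -a and -N options are available at this GPFS
level:)
--
/usr/lpp/mmfs/bin/mmnsddiscover -a -N all
--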
I was hoping that would do it, but upon restarting GPFS there was still no
success. Verifying with mmlsnsd -X -f gscratch:
--
[root at mmmnsd5 ~]# mmlsnsd -X -f gscratch
Disk name NSD volume ID Device Devtype Node name Remarks
---------------------------------------------------------------------------------------------------
dcs3800u31a_lun0 0A62001B54235577 - - mminsd5.infini (not found) server node
dcs3800u31a_lun0 0A62001B54235577 - - mminsd6.infini (not found) server node
dcs3800u31a_lun10 0A62001C542355AA - - mminsd6.infini (not found) server node
dcs3800u31a_lun10 0A62001C542355AA - - mminsd5.infini (not found) server node
dcs3800u31a_lun2 0A62001C54235581 - - mminsd6.infini (not found) server node
dcs3800u31a_lun2 0A62001C54235581 - - mminsd5.infini (not found) server node
dcs3800u31a_lun4 0A62001B5423558B - - mminsd5.infini (not found) server node
dcs3800u31a_lun4 0A62001B5423558B - - mminsd6.infini (not found) server node
dcs3800u31a_lun6 0A62001C54235595 - - mminsd6.infini (not found) server node
dcs3800u31a_lun6 0A62001C54235595 - - mminsd5.infini (not found) server node
dcs3800u31a_lun8 0A62001B5423559F - - mminsd5.infini (not found) server node
dcs3800u31a_lun8 0A62001B5423559F - - mminsd6.infini (not found) server node
dcs3800u31b_lun1 0A62001B5423557C - - mminsd5.infini (not found) server node
dcs3800u31b_lun1 0A62001B5423557C - - mminsd6.infini (not found) server node
dcs3800u31b_lun11 0A62001C542355AF - - mminsd6.infini (not found) server node
dcs3800u31b_lun11 0A62001C542355AF - - mminsd5.infini (not found) server node
dcs3800u31b_lun3 0A62001C54235586 - - mminsd6.infini (not found) server node
dcs3800u31b_lun3 0A62001C54235586 - - mminsd5.infini (not found) server node
dcs3800u31b_lun5 0A62001B54235590 - - mminsd5.infini (not found) server node
dcs3800u31b_lun5 0A62001B54235590 - - mminsd6.infini (not found) server node
dcs3800u31b_lun7 0A62001C5423559A - - mminsd6.infini (not found) server node
dcs3800u31b_lun7 0A62001C5423559A - - mminsd5.infini (not found) server node
dcs3800u31b_lun9 0A62001B542355A4 - - mminsd5.infini (not found) server node
dcs3800u31b_lun9 0A62001B542355A4 - - mminsd6.infini (not found) server node
[root at mmmnsd5 ~]#
--
I'm wondering if somebody has seen this type of issue before. Will
recreating my NSDs destroy the filesystem? I believe all the data is
intact, and there is no crucial data on this filesystem yet, so I could
simply recreate it, but I would like to learn how to solve a problem like
this. Thanks for any help and information.
Regards,
Jared
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss