[gpfsug-discuss] Bad disk but not failed in DSS-G
Timm Stamer
timm.stamer at uni-oldenburg.de
Fri Jun 21 06:50:04 BST 2024
Hi JAB,
You have to run "... change --simulate-dead" or "... --simulate-failing"
to get the disk out of the system. Afterwards you can start the
replacement procedure.

mmvdisk pdisk change --recovery-group dssg2 --pdisk e1d2s25 --simulate-dead
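
To catch a disk like this earlier, one option is to scan the "-L" output
for pdisks that still report state "ok" while their error counters are
non-zero. The following is only a minimal Python sketch under that
assumption: the function names, the zero threshold and the SAMPLE text are
illustrative and not part of mmvdisk, and the parser assumes the
"key = value" stanza layout shown in the quoted output below. It only
prints the command from above; it does not run it.

```python
import shlex

# Counter fields as they appear in the quoted `mmvdisk pdisk list ... -L` output.
ERROR_FIELDS = ("IOErrors", "IOTimeouts", "mediaErrors")

# Abridged copy of the stanza from the original post (illustration only).
SAMPLE = '''\
pdisk:
   replacementPriority = 1000
   name = "e1d2s25"
   recoveryGroup = "dssg2"
   declusteredArray = "DA1"
   state = "ok"
   IOErrors = 444
   IOTimeouts = 8958
   mediaErrors = 15
'''


def _to_int(value):
    """Convert a counter value to int, treating missing or odd values as 0."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


def parse_pdisk_stanzas(text):
    """Split `-L` style output into one dict per 'pdisk:' stanza."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "pdisk:":
            current = {}
            stanzas.append(current)
            continue
        if current is None or "=" not in line:
            continue  # skip anything outside a stanza or without key = value
        key, _, value = line.partition("=")
        current[key.strip()] = value.strip().strip('"')
    return stanzas


def suspect_pdisks(stanzas, threshold=0):
    """Return (recovery group, pdisk, counters) for 'ok' pdisks with errors."""
    suspects = []
    for stanza in stanzas:
        counts = {field: _to_int(stanza.get(field)) for field in ERROR_FIELDS}
        if stanza.get("state") == "ok" and any(c > threshold for c in counts.values()):
            suspects.append((stanza.get("recoveryGroup"), stanza.get("name"), counts))
    return suspects


def simulate_dead_command(recovery_group, pdisk):
    """Build (but do not run) the command given in the reply above."""
    return ["mmvdisk", "pdisk", "change", "--recovery-group", recovery_group,
            "--pdisk", pdisk, "--simulate-dead"]


if __name__ == "__main__":
    for rg, name, counts in suspect_pdisks(parse_pdisk_stanzas(SAMPLE)):
        print(name, counts)
        print(shlex.join(simulate_dead_command(rg, name)))
```

Whether a non-zero counter alone justifies pulling a disk is a judgment
call; a higher threshold may suit a system where occasional timeouts are
normal.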
Kind regards
Timm Stamer
On Thursday, 20.06.2024 at 21:14 +0100, Jonathan Buzzard wrote:
>
> This came to light because I was checking the mmbackup logs and found
> that we had not been getting any successful backups for several days
> and were seeing lots of errors like this
>
> Wed Jun 19 21:45:28 2024 mmbackup:Error encountered in policy scan: [E] Error on gpfs_iopen([/gpfs/users/xxxyyyyy/.swr],68050746): Stale file handle
> Wed Jun 19 21:45:28 2024 mmbackup:Error encountered in policy scan: [E] Summary of errors:: _dirscan failures:3, _serious unclassified errors:3.
>
> After some digging around wondering what was going on I came across
> these being logged on one of the DSS-G nodes
>
> [Wed Jun 12 22:22:05 2024] blk_update_request: I/O error, dev sdbv, sector 9144672512 op 0x1:(WRITE) flags 0x700 phys_seg 17 prio class 0
>
> Yikes, looks like I have a failed disk. However, if I do
>
> [root@gpfs2 ~]# mmvdisk pdisk list --recovery-group all --not-ok
> mmvdisk: All pdisks are ok.
>
> Clearly that's a load of rubbish.
>
> After a lot more prodding
>
> [root@gpfs2 ~]# mmvdisk pdisk list --recovery-group dssg2 --pdisk e1d2s25 -L
> pdisk:
> replacementPriority = 1000
> name = "e1d2s25"
> device = "//gpfs1/dev/sdft(notEnabled),//gpfs1/dev/sdfu(notEnabled),//gpfs2/dev/sdfb,//gpfs2/dev/sdbv"
> recoveryGroup = "dssg2"
> declusteredArray = "DA1"
> state = "ok"
> IOErrors = 444
> IOTimeouts = 8958
> mediaErrors = 15
>
>
> What on earth gives? Why has the disk not been failed? It's not great
> that a clearly bad disk is allowed to stick around in the file system
> and cause problems IMHO.
>
> When I try and prepare the disk for removal I get
>
> [root@gpfs2 ~]# mmvdisk pdisk replace --prepare --rg dssg2 --pdisk e1d2s25
> mmvdisk: Pdisk e1d2s25 of recovery group dssg2 is not currently scheduled for replacement.
> mmvdisk:
> mmvdisk:
> mmvdisk: Command failed. Examine previous error messages to determine cause.
>
> Do I have to use the --force option? I would like to get this disk
> out of the file system ASAP.
>
>
>
> JAB.
>
> --
> Jonathan A. Buzzard Tel: +44141-5483420
> HPC System Administrator, ARCHIE-WeSt.
> University of Strathclyde, John Anderson Building, Glasgow. G4 0NG