[gpfsug-discuss] No space left on device, but plenty of quota space for inodes and blocks
Maloney, John Daniel
malone12 at illinois.edu
Thu Jun 6 22:01:20 BST 2024
Are you seeing the issue across the whole file system or only in certain areas? That sounds like inode exhaustion to me, especially since you've already shown it isn't block exhaustion.
What does a "df -i /cluster" show you? Or, if this only happens in a certain area, you can cd into that directory and run a "df -i ." there.
You may need to allocate more inodes to an independent inode fileset somewhere. Especially on a release as old as 4.2.3, you won't have automatic inode expansion for filesets.
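If it does turn out to be a fileset inode limit, the check and the fix both use standard GPFS commands. A minimal sketch, assuming the file system device is "cluster" as in your mmlsdisk output; the used-inode count and the 20% headroom figure below are hypothetical placeholders, and FILESET stands in for the affected fileset name:

```shell
# Hypothetical sizing: grow the fileset's inode limit to used + 20% headroom.
used=154807668        # used inodes in the affected fileset (from mmlsfileset -i)
headroom_pct=20       # assumed growth margin, not a GPFS recommendation

new_limit=$(( used + used * headroom_pct / 100 ))
echo "new inode limit: $new_limit"

# On the cluster itself (not runnable here):
#   /usr/lpp/mmfs/bin/mmlsfileset cluster -L        # per-fileset MaxInodes / AllocInodes
#   /usr/lpp/mmfs/bin/mmlsfileset cluster -i        # per-fileset used inodes (can be slow)
#   /usr/lpp/mmfs/bin/mmchfileset cluster FILESET --inode-limit "$new_limit"
```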
Best,
J.D. Maloney
Lead HPC Storage Engineer | Storage Enabling Technologies Group
National Center for Supercomputing Applications (NCSA)
From: gpfsug-discuss <gpfsug-discuss-bounces at gpfsug.org> on behalf of Rob Kudyba <rk3199 at columbia.edu>
Date: Thursday, June 6, 2024 at 3:50 PM
To: gpfsug-discuss at gpfsug.org <gpfsug-discuss at gpfsug.org>
Subject: [gpfsug-discuss] No space left on device, but plenty of quota space for inodes and blocks
We are running GPFS 4.2.3 on a DDN GridScaler, and users are getting "No space left on device" errors when trying to write to a file. In /var/adm/ras/mmfs.log the only recent errors are these:
2024-06-06_15:51:22.311-0400: mmcommon getContactNodes cluster failed. Return code -1.
2024-06-06_15:51:22.311-0400: The previous error was detected on node x.x.x.x (headnode).
2024-06-06_15:53:25.088-0400: mmcommon getContactNodes cluster failed. Return code -1.
2024-06-06_15:53:25.088-0400: The previous error was detected on node x.x.x.x (headnode).
According to https://www.ibm.com/docs/en/storage-scale/5.1.9?topic=messages-6027-615:
Check the preceding messages, and consult the earlier chapters of this document. A frequent cause for such errors is lack of space in /var.
We have plenty of space left.
/usr/lpp/mmfs/bin/mmlsdisk cluster
disk driver sector failure holds holds storage
name type size group metadata data status availability pool
------------ -------- ------ ----------- -------- ----- ------------- ------------ ------------
S01_MDT200_1 nsd 4096 200 Yes No ready up system
S01_MDT201_1 nsd 4096 201 Yes No ready up system
S01_DAT0001_1 nsd 4096 100 No Yes ready up data1
S01_DAT0002_1 nsd 4096 101 No Yes ready up data1
S01_DAT0003_1 nsd 4096 100 No Yes ready up data1
S01_DAT0004_1 nsd 4096 101 No Yes ready up data1
S01_DAT0005_1 nsd 4096 100 No Yes ready up data1
S01_DAT0006_1 nsd 4096 101 No Yes ready up data1
S01_DAT0007_1 nsd 4096 100 No Yes ready up data1
/usr/lpp/mmfs/bin/mmdf headnode
disk disk size failure holds holds free KB free KB
name in KB group metadata data in full blocks in fragments
--------------- ------------- -------- -------- ----- -------------------- -------------------
Disks in storage pool: system (Maximum disk size allowed is 14 TB)
S01_MDT200_1 1862270976 200 Yes No 969134848 ( 52%) 2948720 ( 0%)
S01_MDT201_1 1862270976 201 Yes No 969126144 ( 52%) 2957424 ( 0%)
------------- -------------------- -------------------
(pool total) 3724541952 1938260992 ( 52%) 5906144 ( 0%)
Disks in storage pool: data1 (Maximum disk size allowed is 578 TB)
S01_DAT0007_1 77510737920 100 No Yes 21080752128 ( 27%) 897723392 ( 1%)
S01_DAT0005_1 77510737920 100 No Yes 14507212800 ( 19%) 949412160 ( 1%)
S01_DAT0001_1 77510737920 100 No Yes 14503620608 ( 19%) 951327680 ( 1%)
S01_DAT0003_1 77510737920 100 No Yes 14509205504 ( 19%) 949340544 ( 1%)
S01_DAT0002_1 77510737920 101 No Yes 14504585216 ( 19%) 948377536 ( 1%)
S01_DAT0004_1 77510737920 101 No Yes 14503647232 ( 19%) 952892480 ( 1%)
S01_DAT0006_1 77510737920 101 No Yes 14504486912 ( 19%) 949072512 ( 1%)
------------- -------------------- -------------------
(pool total) 542575165440 108113510400 ( 20%) 6598146304 ( 1%)
============= ==================== ===================
(data) 542575165440 108113510400 ( 20%) 6598146304 ( 1%)
(metadata) 3724541952 1938260992 ( 52%) 5906144 ( 0%)
============= ==================== ===================
(total) 546299707392 110051771392 ( 22%) 6604052448 ( 1%)
Inode Information
-----------------
Total number of used inodes in all Inode spaces: 154807668
Total number of free inodes in all Inode spaces: 12964492
Total number of allocated inodes in all Inode spaces: 167772160
Total of Maximum number of inodes in all Inode spaces: 276971520
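(Back-of-the-envelope from the totals above: used inodes sit at roughly 92% of the allocated count, even though they are only about 55% of the hard maximum — and an individual independent fileset can exhaust its own allocation while global free inodes remain. Integer percentages, numbers copied from the mmdf output:)

```shell
used=154807668        # total used inodes (from mmdf above)
allocated=167772160   # total allocated inodes
maximum=276971520     # total maximum inodes

echo "used/allocated: $(( 100 * used / allocated ))%"   # ~92% of allocated
echo "used/maximum:   $(( 100 * used / maximum ))%"     # ~55% of the hard max
```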
On the head node:
df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda4 430G 216G 215G 51% /
devtmpfs 47G 0 47G 0% /dev
tmpfs 47G 0 47G 0% /dev/shm
tmpfs 47G 4.1G 43G 9% /run
tmpfs 47G 0 47G 0% /sys/fs/cgroup
/dev/sda1 504M 114M 365M 24% /boot
/dev/sda2 100M 9.9M 90M 10% /boot/efi
x.x.x.:/nfs-share 430G 326G 105G 76% /nfs-share
cluster 506T 405T 101T 81% /cluster
tmpfs 9.3G 0 9.3G 0% /run/user/443748
tmpfs 9.3G 0 9.3G 0% /run/user/547288
tmpfs 9.3G 0 9.3G 0% /run/user/551336
tmpfs 9.3G 0 9.3G 0% /run/user/547289
The login nodes have plenty of space in /var:
/dev/sda3 50G 8.7G 42G 18% /var
What else should we check? We are only at 81% on the GPFS-mounted file system, which should leave plenty of room to write without these errors. Are there any services you would recommend restarting?