Hi
Cephadm Reef 18.2.1
Started draining five 18-20 TB HDD OSDs (DB/WAL on NVMe) on one host.
Even with osd_max_backfills at 1, the OSDs get slow ops from time to
time, which seems odd, as we recently did a huge reshuffle[1]
involving the same host without seeing these slow ops.
I guess one difference is that back then the disks were only getting
writes as they were added, whereas now they are only serving reads as
they are drained.
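For reference, the drain was started roughly like this (the OSD IDs
below are placeholders, and the backfill limit was set cluster-wide):
"
# limit concurrent backfills per OSD
ceph config set osd osd_max_backfills 1
# drain the five OSDs on that host (cephadm drains them before removal)
ceph orch osd rm 101 102 103 104 105
# follow the drain progress
ceph orch osd rm status
"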
The slow ops eventually go away, but I'm seeing stuck nfsd threads
from RBD exports lingering forever. I have to reboot the NFS server
to get it going again; restarting nfs-server just hangs as well.
Here's a stack trace from dmesg:
"
[Sat Apr 6 17:44:52 2024] INFO: task nfsd:52502 blocked for more than 1245 seconds.
[Sat Apr 6 17:44:52 2024] Not tainted 5.14.0-362.8.1.test2.el9_3.x86_64 #1
[Sat Apr 6 17:44:52 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sat Apr 6 17:44:52 2024] task:nfsd state:D stack:0 pid:52502 ppid:2 flags:0x00004000
[Sat Apr 6 17:44:52 2024] Call Trace:
[Sat Apr 6 17:44:52 2024] <TASK>
[Sat Apr 6 17:44:52 2024] __schedule+0x20a/0x550
[Sat Apr 6 17:44:52 2024] schedule+0x2d/0x70
[Sat Apr 6 17:44:52 2024] schedule_timeout+0x11f/0x160
[Sat Apr 6 17:44:52 2024] ? xfs_trans_read_buf_map+0x133/0x300 [xfs]
[Sat Apr 6 17:44:52 2024] ? xfs_btree_read_buf_block.constprop.0+0x9a/0xd0 [xfs]
[Sat Apr 6 17:44:52 2024] __down_common+0x11f/0x200
[Sat Apr 6 17:44:52 2024] ? xfs_btree_read_buf_block.constprop.0+0x30/0xd0 [xfs]
[Sat Apr 6 17:44:52 2024] down+0x43/0x60
[Sat Apr 6 17:44:52 2024] xfs_buf_lock+0x2d/0xe0 [xfs]
[Sat Apr 6 17:44:52 2024] xfs_buf_find_lock+0x45/0xf0 [xfs]
[Sat Apr 6 17:44:52 2024] xfs_buf_lookup.constprop.0+0xe4/0x170 [xfs]
[Sat Apr 6 17:44:52 2024] xfs_buf_get_map+0xc1/0x3a0 [xfs]
[Sat Apr 6 17:44:52 2024] xfs_buf_read_map+0x54/0x290 [xfs]
[Sat Apr 6 17:44:52 2024] ? xfs_imap_to_bp+0x4e/0x70 [xfs]
[Sat Apr 6 17:44:52 2024] ? xfs_imap_lookup+0x173/0x1d0 [xfs]
[Sat Apr 6 17:44:52 2024] xfs_trans_read_buf_map+0x133/0x300 [xfs]
[Sat Apr 6 17:44:52 2024] ? xfs_imap_to_bp+0x4e/0x70 [xfs]
[Sat Apr 6 17:44:52 2024] xfs_imap_to_bp+0x4e/0x70 [xfs]
[Sat Apr 6 17:44:52 2024] xfs_iget_cache_miss+0xa2/0x370 [xfs]
[Sat Apr 6 17:44:52 2024] xfs_iget+0x19f/0x270 [xfs]
[Sat Apr 6 17:44:52 2024] ? __pfx_nfsd_acceptable+0x10/0x10 [nfsd]
[Sat Apr 6 17:44:52 2024] xfs_nfs_get_inode.isra.0+0x5e/0xa0 [xfs]
[Sat Apr 6 17:44:52 2024] xfs_fs_fh_to_dentry+0x48/0xb0 [xfs]
[Sat Apr 6 17:44:52 2024] exportfs_decode_fh_raw+0x60/0x2e0
[Sat Apr 6 17:44:52 2024] ? exp_find_key+0x99/0x1e0 [nfsd]
[Sat Apr 6 17:44:52 2024] ? rcu_nocb_try_bypass+0x4d/0x440
[Sat Apr 6 17:44:52 2024] ? __kmalloc+0x19b/0x370
[Sat Apr 6 17:44:52 2024] ? __pfx_put_cred_rcu+0x10/0x10
[Sat Apr 6 17:44:52 2024] ? call_rcu+0x114/0x310
[Sat Apr 6 17:44:52 2024] nfsd_set_fh_dentry+0x2b9/0x470 [nfsd]
[Sat Apr 6 17:44:52 2024] fh_verify+0x1b3/0x2f0 [nfsd]
[Sat Apr 6 17:44:52 2024] nfsd4_putfh+0x3e/0x70 [nfsd]
[Sat Apr 6 17:44:52 2024] nfsd4_proc_compound+0x44e/0x700 [nfsd]
[Sat Apr 6 17:44:52 2024] nfsd_dispatch+0x53/0x170 [nfsd]
[Sat Apr 6 17:44:52 2024] svc_process_common+0x357/0x640 [sunrpc]
[Sat Apr 6 17:44:52 2024] ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
[Sat Apr 6 17:44:52 2024] ? __pfx_nfsd+0x10/0x10 [nfsd]
[Sat Apr 6 17:44:52 2024] svc_process+0x12d/0x180 [sunrpc]
[Sat Apr 6 17:44:52 2024] nfsd+0xd5/0x190 [nfsd]
[Sat Apr 6 17:44:52 2024] kthread+0xe0/0x100
[Sat Apr 6 17:44:52 2024] ? __pfx_kthread+0x10/0x10
[Sat Apr 6 17:44:52 2024] ret_from_fork+0x2c/0x50
[Sat Apr 6 17:44:52 2024] </TASK>
"
Stack:
"
[root@cogsworth ~]# cat /proc/52502/stack
[<0>] xfs_buf_lock+0x2d/0xe0 [xfs]
[<0>] xfs_buf_find_lock+0x45/0xf0 [xfs]
[<0>] xfs_buf_lookup.constprop.0+0xe4/0x170 [xfs]
[<0>] xfs_buf_get_map+0xc1/0x3a0 [xfs]
[<0>] xfs_buf_read_map+0x54/0x290 [xfs]
[<0>] xfs_trans_read_buf_map+0x133/0x300 [xfs]
[<0>] xfs_imap_to_bp+0x4e/0x70 [xfs]
[<0>] xfs_iget_cache_miss+0xa2/0x370 [xfs]
[<0>] xfs_iget+0x19f/0x270 [xfs]
[<0>] xfs_nfs_get_inode.isra.0+0x5e/0xa0 [xfs]
[<0>] xfs_fs_fh_to_dentry+0x48/0xb0 [xfs]
[<0>] exportfs_decode_fh_raw+0x60/0x2e0
[<0>] nfsd_set_fh_dentry+0x2b9/0x470 [nfsd]
[<0>] fh_verify+0x1b3/0x2f0 [nfsd]
[<0>] nfsd4_putfh+0x3e/0x70 [nfsd]
[<0>] nfsd4_proc_compound+0x44e/0x700 [nfsd]
[<0>] nfsd_dispatch+0x53/0x170 [nfsd]
[<0>] svc_process_common+0x357/0x640 [sunrpc]
[<0>] svc_process+0x12d/0x180 [sunrpc]
[<0>] nfsd+0xd5/0x190 [nfsd]
[<0>] kthread+0xe0/0x100
[<0>] ret_from_fork+0x2c/0x50
"
The nfsd threads do not recover even with nobackfill set and the
cluster essentially idle:
"
[root@lazy ~]# ceph -s
  cluster:
    id:     XXXXXXXXXXXXXXXXXXXXXXXXX
    health: HEALTH_ERR
            nobackfill,noscrub,nodeep-scrub flag(s) set
            1 scrub errors
            Possible data damage: 1 pg inconsistent
            631 pgs not deep-scrubbed in time

  services:
    mon: 5 daemons, quorum lazy,jolly,happy,dopey,sleepy (age 2d)
    mgr: jolly.tpgixt(active, since 3d), standbys: dopey.lxajvk, lazy.xuhetq
    mds: 1/1 daemons up, 2 standby
    osd: 537 osds: 537 up (since 8h), 537 in (since 10d); 917 remapped pgs
         flags nobackfill,noscrub,nodeep-scrub

  data:
    volumes: 1/1 healthy
    pools:   15 pools, 10849 pgs
    objects: 548.99M objects, 1.1 PiB
    usage:   1.9 PiB used, 2.3 PiB / 4.2 PiB avail
    pgs:     97810419/3182590113 objects misplaced (3.073%)
             9931 active+clean
             893  active+remapped+backfill_wait
             24   active+remapped+backfilling
             1    active+clean+inconsistent

  io:
    client:   3.5 KiB/s rd, 2.0 MiB/s wr, 5 op/s rd, 115 op/s wr
"
Any ideas on how to get the nfsd threads to recover? There must be
stuck Ceph I/O somewhere that never times out, or something like that?
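My next step was going to be looking for in-flight requests that never
complete, roughly like this (assuming the kernel RBD client on the NFS
server exposes the usual libceph debugfs files; the OSD id is a
placeholder):
"
# on the NFS server: in-flight requests of the kernel RBD client
cat /sys/kernel/debug/ceph/*/osdc
# on the OSD hosts: ops the OSDs themselves still consider in flight
ceph daemon osd.<id> dump_ops_in_flight
ceph daemon osd.<id> dump_historic_slow_ops
"
Is that the right place to look?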