On 06-04-2024 18:10, Torkil Svensgaard wrote:
Hi
Cephadm Reef 18.2.1
Started draining 5 18-20 TB HDD OSDs (DB/WAL on NVMe) on one host. Even
with osd_max_backfills set to 1 the OSDs get slow ops from time to time,
which seems odd, as we recently did a huge reshuffle[1] involving the
same host without seeing these slow ops.
I guess one difference is that back then the disks were only receiving
writes, as they were being added, while now they are only serving reads,
as they are being drained.
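For reference, the drain and throttle commands in play were along these
lines (a sketch with placeholder OSD ids, assuming the cephadm drain
workflow; note that on Reef the default mclock scheduler only honours
osd_max_backfills if the override flag is set):
"
ceph orch osd rm 101 102 103 104 105   # drain the five OSDs (placeholder ids)
ceph orch osd rm status                # follow the drain progress

# mclock (default in Reef) ignores osd_max_backfills unless overridden:
ceph config set osd osd_mclock_override_recovery_settings true
ceph config set osd osd_max_backfills 1
"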
The slow ops eventually go away, but I'm seeing stuck nfsd threads from
RBD exports lingering forever. I have to reboot the NFS server to get it
going again; restarting nfs-server also just hangs.
Here's a stack trace from dmesg:
"
[Sat Apr 6 17:44:52 2024] INFO: task nfsd:52502 blocked for more than 1245 seconds.
[Sat Apr 6 17:44:52 2024] Not tainted 5.14.0-362.8.1.test2.el9_3.x86_64 #1
[Sat Apr 6 17:44:52 2024] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sat Apr 6 17:44:52 2024] task:nfsd state:D stack:0 pid:52502 ppid:2 flags:0x00004000
[Sat Apr 6 17:44:52 2024] Call Trace:
[Sat Apr 6 17:44:52 2024] <TASK>
[Sat Apr 6 17:44:52 2024] __schedule+0x20a/0x550
[Sat Apr 6 17:44:52 2024] schedule+0x2d/0x70
[Sat Apr 6 17:44:52 2024] schedule_timeout+0x11f/0x160
[Sat Apr 6 17:44:52 2024] ? xfs_trans_read_buf_map+0x133/0x300 [xfs]
[Sat Apr 6 17:44:52 2024] ? xfs_btree_read_buf_block.constprop.0+0x9a/0xd0 [xfs]
[Sat Apr 6 17:44:52 2024] __down_common+0x11f/0x200
[Sat Apr 6 17:44:52 2024] ? xfs_btree_read_buf_block.constprop.0+0x30/0xd0 [xfs]
[Sat Apr 6 17:44:52 2024] down+0x43/0x60
[Sat Apr 6 17:44:52 2024] xfs_buf_lock+0x2d/0xe0 [xfs]
[Sat Apr 6 17:44:52 2024] xfs_buf_find_lock+0x45/0xf0 [xfs]
[Sat Apr 6 17:44:52 2024] xfs_buf_lookup.constprop.0+0xe4/0x170 [xfs]
[Sat Apr 6 17:44:52 2024] xfs_buf_get_map+0xc1/0x3a0 [xfs]
[Sat Apr 6 17:44:52 2024] xfs_buf_read_map+0x54/0x290 [xfs]
[Sat Apr 6 17:44:52 2024] ? xfs_imap_to_bp+0x4e/0x70 [xfs]
[Sat Apr 6 17:44:52 2024] ? xfs_imap_lookup+0x173/0x1d0 [xfs]
[Sat Apr 6 17:44:52 2024] xfs_trans_read_buf_map+0x133/0x300 [xfs]
[Sat Apr 6 17:44:52 2024] ? xfs_imap_to_bp+0x4e/0x70 [xfs]
[Sat Apr 6 17:44:52 2024] xfs_imap_to_bp+0x4e/0x70 [xfs]
[Sat Apr 6 17:44:52 2024] xfs_iget_cache_miss+0xa2/0x370 [xfs]
[Sat Apr 6 17:44:52 2024] xfs_iget+0x19f/0x270 [xfs]
[Sat Apr 6 17:44:52 2024] ? __pfx_nfsd_acceptable+0x10/0x10 [nfsd]
[Sat Apr 6 17:44:52 2024] xfs_nfs_get_inode.isra.0+0x5e/0xa0 [xfs]
[Sat Apr 6 17:44:52 2024] xfs_fs_fh_to_dentry+0x48/0xb0 [xfs]
[Sat Apr 6 17:44:52 2024] exportfs_decode_fh_raw+0x60/0x2e0
[Sat Apr 6 17:44:52 2024] ? exp_find_key+0x99/0x1e0 [nfsd]
[Sat Apr 6 17:44:52 2024] ? rcu_nocb_try_bypass+0x4d/0x440
[Sat Apr 6 17:44:52 2024] ? __kmalloc+0x19b/0x370
[Sat Apr 6 17:44:52 2024] ? __pfx_put_cred_rcu+0x10/0x10
[Sat Apr 6 17:44:52 2024] ? call_rcu+0x114/0x310
[Sat Apr 6 17:44:52 2024] nfsd_set_fh_dentry+0x2b9/0x470 [nfsd]
[Sat Apr 6 17:44:52 2024] fh_verify+0x1b3/0x2f0 [nfsd]
[Sat Apr 6 17:44:52 2024] nfsd4_putfh+0x3e/0x70 [nfsd]
[Sat Apr 6 17:44:52 2024] nfsd4_proc_compound+0x44e/0x700 [nfsd]
[Sat Apr 6 17:44:52 2024] nfsd_dispatch+0x53/0x170 [nfsd]
[Sat Apr 6 17:44:52 2024] svc_process_common+0x357/0x640 [sunrpc]
[Sat Apr 6 17:44:52 2024] ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
[Sat Apr 6 17:44:52 2024] ? __pfx_nfsd+0x10/0x10 [nfsd]
[Sat Apr 6 17:44:52 2024] svc_process+0x12d/0x180 [sunrpc]
[Sat Apr 6 17:44:52 2024] nfsd+0xd5/0x190 [nfsd]
[Sat Apr 6 17:44:52 2024] kthread+0xe0/0x100
[Sat Apr 6 17:44:52 2024] ? __pfx_kthread+0x10/0x10
[Sat Apr 6 17:44:52 2024] ret_from_fork+0x2c/0x50
[Sat Apr 6 17:44:52 2024] </TASK>
"
Stack:
"
[root@cogsworth ~]# cat /proc/52502/stack
[<0>] xfs_buf_lock+0x2d/0xe0 [xfs]
[<0>] xfs_buf_find_lock+0x45/0xf0 [xfs]
[<0>] xfs_buf_lookup.constprop.0+0xe4/0x170 [xfs]
[<0>] xfs_buf_get_map+0xc1/0x3a0 [xfs]
[<0>] xfs_buf_read_map+0x54/0x290 [xfs]
[<0>] xfs_trans_read_buf_map+0x133/0x300 [xfs]
[<0>] xfs_imap_to_bp+0x4e/0x70 [xfs]
[<0>] xfs_iget_cache_miss+0xa2/0x370 [xfs]
[<0>] xfs_iget+0x19f/0x270 [xfs]
[<0>] xfs_nfs_get_inode.isra.0+0x5e/0xa0 [xfs]
[<0>] xfs_fs_fh_to_dentry+0x48/0xb0 [xfs]
[<0>] exportfs_decode_fh_raw+0x60/0x2e0
[<0>] nfsd_set_fh_dentry+0x2b9/0x470 [nfsd]
[<0>] fh_verify+0x1b3/0x2f0 [nfsd]
[<0>] nfsd4_putfh+0x3e/0x70 [nfsd]
[<0>] nfsd4_proc_compound+0x44e/0x700 [nfsd]
[<0>] nfsd_dispatch+0x53/0x170 [nfsd]
[<0>] svc_process_common+0x357/0x640 [sunrpc]
[<0>] svc_process+0x12d/0x180 [sunrpc]
[<0>] nfsd+0xd5/0x190 [nfsd]
[<0>] kthread+0xe0/0x100
[<0>] ret_from_fork+0x2c/0x50
"
The nfsd threads do not recover even with nobackfill set, so the cluster
is essentially idle:
"
[root@lazy ~]# ceph -s
  cluster:
    id:     XXXXXXXXXXXXXXXXXXXXXXXXX
    health: HEALTH_ERR
            nobackfill,noscrub,nodeep-scrub flag(s) set
            1 scrub errors
            Possible data damage: 1 pg inconsistent
            631 pgs not deep-scrubbed in time

  services:
    mon: 5 daemons, quorum lazy,jolly,happy,dopey,sleepy (age 2d)
    mgr: jolly.tpgixt(active, since 3d), standbys: dopey.lxajvk, lazy.xuhetq
    mds: 1/1 daemons up, 2 standby
    osd: 537 osds: 537 up (since 8h), 537 in (since 10d); 917 remapped pgs
         flags nobackfill,noscrub,nodeep-scrub

  data:
    volumes: 1/1 healthy
    pools:   15 pools, 10849 pgs
    objects: 548.99M objects, 1.1 PiB
    usage:   1.9 PiB used, 2.3 PiB / 4.2 PiB avail
    pgs:     97810419/3182590113 objects misplaced (3.073%)
             9931 active+clean
             893  active+remapped+backfill_wait
             24   active+remapped+backfilling
             1    active+clean+inconsistent

  io:
    client: 3.5 KiB/s rd, 2.0 MiB/s wr, 5 op/s rd, 115 op/s wr
"
Any ideas on how to get the nfsd threads to recover? There must be stuck
Ceph I/O somewhere which never times out, or something like that?
I tried restarting all OSDs and that cleared the blocked processes on
the NFS server. That is a rather crude and cumbersome way to go about
it, with some impact on production as well.
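If the RBD images are mapped with the kernel client on the NFS server, the
client's in-flight OSD requests should be visible in debugfs, so next time
it might be possible to restart only the offending OSDs (a sketch; the osd
id is a placeholder):
"
# On the NFS server: requests the kernel RBD client still has in flight;
# one of the columns is the target osd id (needs root and debugfs mounted):
cat /sys/kernel/debug/ceph/*/osdc

# Then restart only the OSDs that show up, e.g.:
ceph orch daemon restart osd.123
"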
Is there some way to determine which OSDs are causing the problem when
this happens? And if there is, shouldn't this information be published
by "ceph -s"?
Best regards,
Torkil
[1] https://www.spinics.net/lists/ceph-users/msg81549.html
--
Torkil Svensgaard
Systems Administrator
Danish Research Centre for Magnetic Resonance DRCMR, Section 714
Copenhagen University Hospital Amager and Hvidovre
Kettegaard Allé 30, 2650 Hvidovre, Denmark