Re: [ceph-users] cephfs kernel client + snapshots slowness

Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxxx> · Mon, 13 Dec 2021 14:03:11 -0500

Hi Sebastian,

Thanks for the pointer to the tracker - I've added my report details to 
that as well.

After playing with mds_max_caps_per_client, I am now fairly convinced 
that the loop over caps in fs/ceph/snap.c in the kernel is over all (or 
most of) the caps that the kernel as at the time. Unfortunately I don't 
know the details of snaprealms well enough to understand why this would 
be or how one would go about not doing such an expensive loop so often.

Andras

On 12/10/21 12:39, Sebastian Knust wrote:
Hi,

I also see this behaviour and can more or less reproduce it running 
rsync or Bareos Backup tasks (anything stat-intense should do) on a 
specific directory. Unmounting and then remounting the filesystem 
fixes it, until it is caused again by a stat-intense task.

For me, I only saw two immediate solutions:
1) Use fuse client, which does not seem to exhibit this issue but is 
in general approx. one order of magnitude slower for metadata ops then 
kernel client
2) Do not use snapshots.

I grudgingly chose the second option, since 1) is too slow for my 
nightly backups.

A ticket has been opened at https://tracker.ceph.com/issues/44100, 
showing a few people running into these issues. Maybe your insights 
regarding the cap limit and the possible mitigation of either lowering 
the limit or raising it (above the critical number for your specific 
use case, whichever that might be) can help others as a workaround and 
might point the developers to the cause in the kernel client.

I myself will experiment with raising mds_max_caps_per_client, as I 
only have a handful of clients, all of which are heavily used. So 
thanks for recommending to look into that config value.

Cheers
Sebastian

On 10.12.21 17:33, Andras Pataki wrote:
Hi,

We've recently started using cephfs snapshots and are running into 
some issues with the kernel client.  It seems like traversing the 
file system and stat'ing files have become extremely slow.  Some (not 
all) stat operations went from microseconds to hundreds of 
milliseconds in duration.  The details are from a node that runs a 
file system scan:

The lstat calls of the file system scan take 200ms+, here is the 
strace -T -tt output:

07:05:28.909309 
lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/tablespoon.hpp", 
{st_mode=S_IFREG|0444, st_size=1145, ...}) = 0 <0.272695>
07:05:29.182086 
lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/teaspoon.hpp", 
{st_mode=S_IFREG|0444, st_size=1133, ...}) = 0 <0.268555>
07:05:29.450680 
lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/ton.hpp", 
{st_mode=S_IFREG|0444, st_size=1107, ...}) = 0 <0.270040>
07:05:29.720758 
lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/yard.hpp", 
{st_mode=S_IFREG|0444, st_size=1051, ...}) = 0 <0.268032>

The file system scanning process is constantly stuck in 'D' state 
with /proc/<pid>/stack:

[<0>] ceph_mdsc_wait_request+0x88/0x150 [ceph]
[<0>] ceph_mdsc_do_request+0x82/0x90 [ceph]
[<0>] ceph_d_revalidate+0x207/0x300 [ceph]
[<0>] lookup_fast+0x179/0x210
[<0>] walk_component+0x44/0x320
[<0>] path_lookupat+0x7b/0x220
[<0>] filename_lookup+0xa5/0x170
[<0>] vfs_statx+0x6e/0xd0
[<0>] __do_sys_newlstat+0x39/0x70
[<0>] do_syscall_64+0x4a/0xe0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

One of the kernel threads is using close to 100% of a single core 
constantly:
     PID USER      PR  NI    VIRT    RES SHR S  %CPU  %MEM TIME+ 
COMMAND 1350556 root      20   0       0      0      0 R 100.0   0.0 
4:52.85 [kworker/18:0+ceph-msgr]

Dumping the kernel stacks (echo l > /proc/sysrq-trigger) shows 
typically waiting in ceph_queue_cap_snap or queue_realm_cap_snaps. A 
couple of examples:

Dec 10 07:36:56 popeye-mgr-0-10 kernel: CPU: 18 PID: 1350556 Comm: 
kworker/18:0 Not tainted 5.4.114.1.fi #1
Dec 10 07:36:56 popeye-mgr-0-10 kernel: Hardware name: Dell Inc. 
PowerEdge R640/0W23H8, BIOS 2.10.2 02/24/2021
Dec 10 07:36:56 popeye-mgr-0-10 kernel: Workqueue: ceph-msgr 
ceph_con_workfn [libceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RIP: 
0010:_raw_spin_lock+0xb/0x20
Dec 10 07:36:56 popeye-mgr-0-10 kernel: Code: 31 c0 ba ff 00 00 00 f0 
0f b1 17 75 01 c3 e9 ec 8f 98 ff 66 90 66 2e 0f 1f 84 00 00 00 00 00 
31 c0 ba 01 00 00 00 f0 0f b1 17 <75> 01 c3 89 c6 e9 0b 7e 98 ff 90 
66 2e 0f 1f 84 00 00 00 00 00 fa
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RSP: 0018:ffffc9001b607c50 
EFLAGS: 00000246
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RAX: 0000000000000000 RBX: 
ffff88de44ede7e8 RCX: 0000000000000000
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RDX: 0000000000000001 RSI: 
0000000000000000 RDI: ffff88de44ede7f8
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RBP: ffff88de0a43ca00 R08: 
ffff88dec0c63120 R09: ffff88de0a43ca00
Dec 10 07:36:56 popeye-mgr-0-10 kernel: R10: ffffffffa07e50e9 R11: 
ffff88de36aab878 R12: ffff88de1f3cab00
Dec 10 07:36:56 popeye-mgr-0-10 kernel: R13: ffff88de44edeb18 R14: 
ffff88de1f3cabc0 R15: ffff88de44edeb18
Dec 10 07:36:56 popeye-mgr-0-10 kernel: FS: 0000000000000000(0000) 
GS:ffff88dec0c40000(0000) knlGS:0000000000000000
Dec 10 07:36:56 popeye-mgr-0-10 kernel: CS:  0010 DS: 0000 ES: 0000 
CR0: 0000000080050033
Dec 10 07:36:56 popeye-mgr-0-10 kernel: CR2: 000055555584c300 CR3: 
0000005e92331003 CR4: 00000000007606e0
Dec 10 07:36:56 popeye-mgr-0-10 kernel: DR0: 0000000000000000 DR1: 
0000000000000000 DR2: 0000000000000000
Dec 10 07:36:56 popeye-mgr-0-10 kernel: DR3: 0000000000000000 DR6: 
00000000fffe0ff0 DR7: 0000000000000400
Dec 10 07:36:56 popeye-mgr-0-10 kernel: PKRU: 55555554
Dec 10 07:36:56 popeye-mgr-0-10 kernel: Call Trace:
Dec 10 07:36:56 popeye-mgr-0-10 kernel: 
ceph_queue_cap_snap+0x3e/0x3e0 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: 
queue_realm_cap_snaps+0x71/0xf0 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: 
ceph_update_snap_trace+0x382/0x400 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: handle_reply+0x209/0x770 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: dispatch+0xb2/0x1e0 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: process_message+0x7b/0x130 
[libceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: try_read+0x340/0x5e0 [libceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ceph_con_workfn+0x79/0x210 
[libceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: process_one_work+0x1a6/0x300
Dec 10 07:36:56 popeye-mgr-0-10 kernel: worker_thread+0x41/0x310
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ? rescuer_thread+0x370/0x370
Dec 10 07:36:56 popeye-mgr-0-10 kernel: kthread+0x10b/0x130
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ? 
kthread_create_worker_on_cpu+0x60/0x60
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ret_from_fork+0x1f/0x30

Dec 10 07:40:17 popeye-mgr-0-10 kernel: CPU: 18 PID: 1350556 Comm: 
kworker/18:0 Not tainted 5.4.114.1.fi #1
Dec 10 07:40:17 popeye-mgr-0-10 kernel: Hardware name: Dell Inc. 
PowerEdge R640/0W23H8, BIOS 2.10.2 02/24/2021
Dec 10 07:40:17 popeye-mgr-0-10 kernel: Workqueue: ceph-msgr 
ceph_con_workfn [libceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RIP: 
0010:_raw_spin_lock+0xb/0x20
Dec 10 07:40:17 popeye-mgr-0-10 kernel: Code: 31 c0 ba ff 00 00 00 f0 
0f b1 17 75 01 c3 e9 ec 8f 98 ff 66 90 66 2e 0f 1f 84 00 00 00 00 00 
31 c0 ba 01 00 00 00 f0 0f b1 17 <75> 01 c3 89 c6 e9 0b 7e 98 ff 90 
66 2e 0f 1f 84 00 00 00 00 00 fa
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RSP: 0018:ffffc9001b607ca0 
EFLAGS: 00000246
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RAX: 0000000000000000 RBX: 
ffff88de62112ef0 RCX: 0000027f51c57994
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RDX: 0000000000000001 RSI: 
0000000000000000 RDI: ffff88de455b4cd0
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RBP: ffff88de62112bc0 R08: 
0000000000000000 R09: ffff88dbf2742600
Dec 10 07:40:17 popeye-mgr-0-10 kernel: R10: ffffffffa07e50e9 R11: 
ffff88de62112d88 R12: ffff88de455b4c00
Dec 10 07:40:17 popeye-mgr-0-10 kernel: R13: ffff88de62112ef0 R14: 
ffff88de455b4cc0 R15: ffff88de455b4cd0
Dec 10 07:40:17 popeye-mgr-0-10 kernel: FS: 0000000000000000(0000) 
GS:ffff88dec0c40000(0000) knlGS:0000000000000000
Dec 10 07:40:17 popeye-mgr-0-10 kernel: CS:  0010 DS: 0000 ES: 0000 
CR0: 0000000080050033
Dec 10 07:40:17 popeye-mgr-0-10 kernel: CR2: 000055555584c300 CR3: 
0000005e92331003 CR4: 00000000007606e0
Dec 10 07:40:17 popeye-mgr-0-10 kernel: DR0: 0000000000000000 DR1: 
0000000000000000 DR2: 0000000000000000
Dec 10 07:40:17 popeye-mgr-0-10 kernel: DR3: 0000000000000000 DR6: 
00000000fffe0ff0 DR7: 0000000000000400
Dec 10 07:40:17 popeye-mgr-0-10 kernel: PKRU: 55555554
Dec 10 07:40:17 popeye-mgr-0-10 kernel: Call Trace:
Dec 10 07:40:17 popeye-mgr-0-10 kernel: 
queue_realm_cap_snaps+0x79/0xf0 [ceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: 
ceph_update_snap_trace+0x382/0x400 [ceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: handle_reply+0x209/0x770 [ceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: dispatch+0xb2/0x1e0 [ceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: process_message+0x7b/0x130 
[libceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: try_read+0x340/0x5e0 [libceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: ceph_con_workfn+0x79/0x210 
[libceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: process_one_work+0x1a6/0x300
Dec 10 07:40:17 popeye-mgr-0-10 kernel: worker_thread+0x41/0x310
Dec 10 07:40:17 popeye-mgr-0-10 kernel: ? rescuer_thread+0x370/0x370
Dec 10 07:40:17 popeye-mgr-0-10 kernel: kthread+0x10b/0x130
Dec 10 07:40:17 popeye-mgr-0-10 kernel: ? 
kthread_create_worker_on_cpu+0x60/0x60
Dec 10 07:40:17 popeye-mgr-0-10 kernel: ret_from_fork+0x1f/0x30

Turning on kernel debugging for fs/ceph/snap.c, there seems this loop 
in queue_realm_cap_snaps is really long for some inodes:

         list_for_each_entry(ci, &realm->inodes_with_caps, 
i_snap_realm_item) {
         ...
         }

[3199649.882973] ceph:  queue_realm_cap_snaps 00000000d313f1e4 
1001af8c8e7 inodes
[3199649.882974] ceph:  queue_cap_snap 00000000171f11a7 nothing 
dirty|writing
[3199649.882975] ceph:  queue_cap_snap 00000000938b9cd2 nothing 
dirty|writing
[3199649.882976] ceph:  queue_cap_snap 00000000615cf4dd nothing 
dirty|writing
[3199649.882977] ceph:  queue_cap_snap 000000000027e295 nothing 
dirty|writing
[3199649.882979] ceph:  queue_cap_snap 00000000ba18b2f8 nothing 
dirty|writing
[3199649.882980] ceph:  queue_cap_snap 000000007c9c80de nothing 
dirty|writing
[3199649.882981] ceph:  queue_cap_snap 00000000629b4b0e nothing 
dirty|writing
[3199649.882982] ceph:  queue_cap_snap 00000000ab330b37 nothing 
dirty|writing
[3199649.882983] ceph:  queue_cap_snap 00000000c7dbc320 nothing 
dirty|writing
[3199649.882985] ceph:  queue_cap_snap 0000000070a0598f nothing 
dirty|writing
[3199649.882986] ceph:  queue_cap_snap 00000000915b9e2e nothing 
dirty|writing
... (and a lot lot more of these) ...

At this point the client has about a million caps (running up against 
the default cap limit) - so potentially this loop is over all the 
caps (?), which could mean tens/hundreds of milliseconds? Indeed, 
reducing mds_max_caps_per_client by an order of magnitude does 
improve the lstat times by about an order of magnitude (which is 
still pretty slow - but supports this hypothesis).

The ceph cluster is Nautilus 14.2.20.  There are a total of 7 
snapshots in cephfs, all taken at the root of the cephfs tree (a 
rolling set of 7 previous daily snapshots).

I've tested this with a few kernels: two LTS ones, and one more 
recent stable one: 5.4.114, 5.10.73 and 5.14.16 with the same 
result.  Any ideas/suggestions?

Andras

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx