cephfs kernel client + snapshots slowness

Andras Pataki <apataki@xxxxxxxxxxxxxxxxxxxxx> · Fri, 10 Dec 2021 11:33:24 -0500

Hi,

We've recently started using cephfs snapshots and are running into some 
issues with the kernel client.  It seems like traversing the file system 
and stat'ing files have become extremely slow.  Some (not all) stat 
operations went from microseconds to hundreds of milliseconds in 
duration.  The details are from a node that runs a file system scan:

The lstat calls of the file system scan take 200ms+, here is the strace 
-T -tt output:

07:05:28.909309 
lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/tablespoon.hpp", 
{st_mode=S_IFREG|0444, st_size=1145, ...}) = 0 <0.272695>
07:05:29.182086 
lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/teaspoon.hpp", 
{st_mode=S_IFREG|0444, st_size=1133, ...}) = 0 <0.268555>
07:05:29.450680 
lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/ton.hpp", 
{st_mode=S_IFREG|0444, st_size=1107, ...}) = 0 <0.270040>
07:05:29.720758 
lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/yard.hpp", 
{st_mode=S_IFREG|0444, st_size=1051, ...}) = 0 <0.268032>

The file system scanning process is constantly stuck in 'D' state with 
/proc/<pid>/stack:

[<0>] ceph_mdsc_wait_request+0x88/0x150 [ceph]
[<0>] ceph_mdsc_do_request+0x82/0x90 [ceph]
[<0>] ceph_d_revalidate+0x207/0x300 [ceph]
[<0>] lookup_fast+0x179/0x210
[<0>] walk_component+0x44/0x320
[<0>] path_lookupat+0x7b/0x220
[<0>] filename_lookup+0xa5/0x170
[<0>] vfs_statx+0x6e/0xd0
[<0>] __do_sys_newlstat+0x39/0x70
[<0>] do_syscall_64+0x4a/0xe0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

One of the kernel threads is using close to 100% of a single core 
constantly:
    PID USER      PR  NI    VIRT    RES SHR S  %CPU  %MEM     TIME+ 
COMMAND 1350556 root      20   0       0      0      0 R 100.0   0.0 
4:52.85 [kworker/18:0+ceph-msgr]

Dumping the kernel stacks (echo l > /proc/sysrq-trigger) shows typically 
waiting in ceph_queue_cap_snap or queue_realm_cap_snaps. A couple of 
examples:

Dec 10 07:36:56 popeye-mgr-0-10 kernel: CPU: 18 PID: 1350556 Comm: 
kworker/18:0 Not tainted 5.4.114.1.fi #1
Dec 10 07:36:56 popeye-mgr-0-10 kernel: Hardware name: Dell Inc. 
PowerEdge R640/0W23H8, BIOS 2.10.2 02/24/2021
Dec 10 07:36:56 popeye-mgr-0-10 kernel: Workqueue: ceph-msgr 
ceph_con_workfn [libceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RIP: 0010:_raw_spin_lock+0xb/0x20
Dec 10 07:36:56 popeye-mgr-0-10 kernel: Code: 31 c0 ba ff 00 00 00 f0 0f 
b1 17 75 01 c3 e9 ec 8f 98 ff 66 90 66 2e 0f 1f 84 00 00 00 00 00 31 c0 
ba 01 00 00 00 f0 0f b1 17 <75> 01 c3 89 c6 e9 0b 7e 98 ff 90 66 2e 0f 
1f 84 00 00 00 00 00 fa
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RSP: 0018:ffffc9001b607c50 
EFLAGS: 00000246
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RAX: 0000000000000000 RBX: 
ffff88de44ede7e8 RCX: 0000000000000000
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RDX: 0000000000000001 RSI: 
0000000000000000 RDI: ffff88de44ede7f8
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RBP: ffff88de0a43ca00 R08: 
ffff88dec0c63120 R09: ffff88de0a43ca00
Dec 10 07:36:56 popeye-mgr-0-10 kernel: R10: ffffffffa07e50e9 R11: 
ffff88de36aab878 R12: ffff88de1f3cab00
Dec 10 07:36:56 popeye-mgr-0-10 kernel: R13: ffff88de44edeb18 R14: 
ffff88de1f3cabc0 R15: ffff88de44edeb18
Dec 10 07:36:56 popeye-mgr-0-10 kernel: FS:  0000000000000000(0000) 
GS:ffff88dec0c40000(0000) knlGS:0000000000000000
Dec 10 07:36:56 popeye-mgr-0-10 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
0000000080050033
Dec 10 07:36:56 popeye-mgr-0-10 kernel: CR2: 000055555584c300 CR3: 
0000005e92331003 CR4: 00000000007606e0
Dec 10 07:36:56 popeye-mgr-0-10 kernel: DR0: 0000000000000000 DR1: 
0000000000000000 DR2: 0000000000000000
Dec 10 07:36:56 popeye-mgr-0-10 kernel: DR3: 0000000000000000 DR6: 
00000000fffe0ff0 DR7: 0000000000000400
Dec 10 07:36:56 popeye-mgr-0-10 kernel: PKRU: 55555554
Dec 10 07:36:56 popeye-mgr-0-10 kernel: Call Trace:
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ceph_queue_cap_snap+0x3e/0x3e0 
[ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: queue_realm_cap_snaps+0x71/0xf0 
[ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: 
ceph_update_snap_trace+0x382/0x400 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: handle_reply+0x209/0x770 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: dispatch+0xb2/0x1e0 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: process_message+0x7b/0x130 [libceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: try_read+0x340/0x5e0 [libceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ceph_con_workfn+0x79/0x210 [libceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: process_one_work+0x1a6/0x300
Dec 10 07:36:56 popeye-mgr-0-10 kernel: worker_thread+0x41/0x310
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ? rescuer_thread+0x370/0x370
Dec 10 07:36:56 popeye-mgr-0-10 kernel: kthread+0x10b/0x130
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ? 
kthread_create_worker_on_cpu+0x60/0x60
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ret_from_fork+0x1f/0x30

Dec 10 07:40:17 popeye-mgr-0-10 kernel: CPU: 18 PID: 1350556 Comm: 
kworker/18:0 Not tainted 5.4.114.1.fi #1
Dec 10 07:40:17 popeye-mgr-0-10 kernel: Hardware name: Dell Inc. 
PowerEdge R640/0W23H8, BIOS 2.10.2 02/24/2021
Dec 10 07:40:17 popeye-mgr-0-10 kernel: Workqueue: ceph-msgr 
ceph_con_workfn [libceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RIP: 0010:_raw_spin_lock+0xb/0x20
Dec 10 07:40:17 popeye-mgr-0-10 kernel: Code: 31 c0 ba ff 00 00 00 f0 0f 
b1 17 75 01 c3 e9 ec 8f 98 ff 66 90 66 2e 0f 1f 84 00 00 00 00 00 31 c0 
ba 01 00 00 00 f0 0f b1 17 <75> 01 c3 89 c6 e9 0b 7e 98 ff 90 66 2e 0f 
1f 84 00 00 00 00 00 fa
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RSP: 0018:ffffc9001b607ca0 
EFLAGS: 00000246
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RAX: 0000000000000000 RBX: 
ffff88de62112ef0 RCX: 0000027f51c57994
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RDX: 0000000000000001 RSI: 
0000000000000000 RDI: ffff88de455b4cd0
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RBP: ffff88de62112bc0 R08: 
0000000000000000 R09: ffff88dbf2742600
Dec 10 07:40:17 popeye-mgr-0-10 kernel: R10: ffffffffa07e50e9 R11: 
ffff88de62112d88 R12: ffff88de455b4c00
Dec 10 07:40:17 popeye-mgr-0-10 kernel: R13: ffff88de62112ef0 R14: 
ffff88de455b4cc0 R15: ffff88de455b4cd0
Dec 10 07:40:17 popeye-mgr-0-10 kernel: FS:  0000000000000000(0000) 
GS:ffff88dec0c40000(0000) knlGS:0000000000000000
Dec 10 07:40:17 popeye-mgr-0-10 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 
0000000080050033
Dec 10 07:40:17 popeye-mgr-0-10 kernel: CR2: 000055555584c300 CR3: 
0000005e92331003 CR4: 00000000007606e0
Dec 10 07:40:17 popeye-mgr-0-10 kernel: DR0: 0000000000000000 DR1: 
0000000000000000 DR2: 0000000000000000
Dec 10 07:40:17 popeye-mgr-0-10 kernel: DR3: 0000000000000000 DR6: 
00000000fffe0ff0 DR7: 0000000000000400
Dec 10 07:40:17 popeye-mgr-0-10 kernel: PKRU: 55555554
Dec 10 07:40:17 popeye-mgr-0-10 kernel: Call Trace:
Dec 10 07:40:17 popeye-mgr-0-10 kernel: queue_realm_cap_snaps+0x79/0xf0 
[ceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: 
ceph_update_snap_trace+0x382/0x400 [ceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: handle_reply+0x209/0x770 [ceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: dispatch+0xb2/0x1e0 [ceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: process_message+0x7b/0x130 [libceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: try_read+0x340/0x5e0 [libceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: ceph_con_workfn+0x79/0x210 [libceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: process_one_work+0x1a6/0x300
Dec 10 07:40:17 popeye-mgr-0-10 kernel: worker_thread+0x41/0x310
Dec 10 07:40:17 popeye-mgr-0-10 kernel: ? rescuer_thread+0x370/0x370
Dec 10 07:40:17 popeye-mgr-0-10 kernel: kthread+0x10b/0x130
Dec 10 07:40:17 popeye-mgr-0-10 kernel: ? 
kthread_create_worker_on_cpu+0x60/0x60
Dec 10 07:40:17 popeye-mgr-0-10 kernel: ret_from_fork+0x1f/0x30

Turning on kernel debugging for fs/ceph/snap.c, there seems this loop in 
queue_realm_cap_snaps is really long for some inodes:

        list_for_each_entry(ci, &realm->inodes_with_caps, 
i_snap_realm_item) {
        ...
        }

[3199649.882973] ceph:  queue_realm_cap_snaps 00000000d313f1e4 
1001af8c8e7 inodes
[3199649.882974] ceph:  queue_cap_snap 00000000171f11a7 nothing 
dirty|writing
[3199649.882975] ceph:  queue_cap_snap 00000000938b9cd2 nothing 
dirty|writing
[3199649.882976] ceph:  queue_cap_snap 00000000615cf4dd nothing 
dirty|writing
[3199649.882977] ceph:  queue_cap_snap 000000000027e295 nothing 
dirty|writing
[3199649.882979] ceph:  queue_cap_snap 00000000ba18b2f8 nothing 
dirty|writing
[3199649.882980] ceph:  queue_cap_snap 000000007c9c80de nothing 
dirty|writing
[3199649.882981] ceph:  queue_cap_snap 00000000629b4b0e nothing 
dirty|writing
[3199649.882982] ceph:  queue_cap_snap 00000000ab330b37 nothing 
dirty|writing
[3199649.882983] ceph:  queue_cap_snap 00000000c7dbc320 nothing 
dirty|writing
[3199649.882985] ceph:  queue_cap_snap 0000000070a0598f nothing 
dirty|writing
[3199649.882986] ceph:  queue_cap_snap 00000000915b9e2e nothing 
dirty|writing
... (and a lot lot more of these) ...

At this point the client has about a million caps (running up against 
the default cap limit) - so potentially this loop is over all the caps 
(?), which could mean tens/hundreds of milliseconds? Indeed, reducing 
mds_max_caps_per_client by an order of magnitude does improve the lstat 
times by about an order of magnitude (which is still pretty slow - but 
supports this hypothesis).

The ceph cluster is Nautilus 14.2.20.  There are a total of 7 snapshots 
in cephfs, all taken at the root of the cephfs tree (a rolling set of 7 
previous daily snapshots).

I've tested this with a few kernels: two LTS ones, and one more recent 
stable one: 5.4.114, 5.10.73 and 5.14.16 with the same result.  Any 
ideas/suggestions?

Andras

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx