Hi,
We've recently started using cephfs snapshots and are running into some
issues with the kernel client. It seems like traversing the file system
and stat'ing files have become extremely slow. Some (not all) stat
operations went from microseconds to hundreds of milliseconds in
duration. The details are from a node that runs a file system scan:
The lstat calls of the file system scan take 200ms+, here is the strace
-T -tt output:
07:05:28.909309
lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/tablespoon.hpp",
{st_mode=S_IFREG|0444, st_size=1145, ...}) = 0 <0.272695>
07:05:29.182086
lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/teaspoon.hpp",
{st_mode=S_IFREG|0444, st_size=1133, ...}) = 0 <0.268555>
07:05:29.450680
lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/ton.hpp",
{st_mode=S_IFREG|0444, st_size=1107, ...}) = 0 <0.270040>
07:05:29.720758
lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/yard.hpp",
{st_mode=S_IFREG|0444, st_size=1051, ...}) = 0 <0.268032>
The file system scanning process is constantly stuck in 'D' state with
/proc/<pid>/stack:
[<0>] ceph_mdsc_wait_request+0x88/0x150 [ceph]
[<0>] ceph_mdsc_do_request+0x82/0x90 [ceph]
[<0>] ceph_d_revalidate+0x207/0x300 [ceph]
[<0>] lookup_fast+0x179/0x210
[<0>] walk_component+0x44/0x320
[<0>] path_lookupat+0x7b/0x220
[<0>] filename_lookup+0xa5/0x170
[<0>] vfs_statx+0x6e/0xd0
[<0>] __do_sys_newlstat+0x39/0x70
[<0>] do_syscall_64+0x4a/0xe0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
One of the kernel threads is using close to 100% of a single core
constantly:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+
COMMAND 1350556 root 20 0 0 0 0 R 100.0 0.0
4:52.85 [kworker/18:0+ceph-msgr]
Dumping the kernel stacks (echo l > /proc/sysrq-trigger) shows typically
waiting in ceph_queue_cap_snap or queue_realm_cap_snaps. A couple of
examples:
Dec 10 07:36:56 popeye-mgr-0-10 kernel: CPU: 18 PID: 1350556 Comm:
kworker/18:0 Not tainted 5.4.114.1.fi #1
Dec 10 07:36:56 popeye-mgr-0-10 kernel: Hardware name: Dell Inc.
PowerEdge R640/0W23H8, BIOS 2.10.2 02/24/2021
Dec 10 07:36:56 popeye-mgr-0-10 kernel: Workqueue: ceph-msgr
ceph_con_workfn [libceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RIP: 0010:_raw_spin_lock+0xb/0x20
Dec 10 07:36:56 popeye-mgr-0-10 kernel: Code: 31 c0 ba ff 00 00 00 f0 0f
b1 17 75 01 c3 e9 ec 8f 98 ff 66 90 66 2e 0f 1f 84 00 00 00 00 00 31 c0
ba 01 00 00 00 f0 0f b1 17 <75> 01 c3 89 c6 e9 0b 7e 98 ff 90 66 2e 0f
1f 84 00 00 00 00 00 fa
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RSP: 0018:ffffc9001b607c50
EFLAGS: 00000246
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RAX: 0000000000000000 RBX:
ffff88de44ede7e8 RCX: 0000000000000000
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RDX: 0000000000000001 RSI:
0000000000000000 RDI: ffff88de44ede7f8
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RBP: ffff88de0a43ca00 R08:
ffff88dec0c63120 R09: ffff88de0a43ca00
Dec 10 07:36:56 popeye-mgr-0-10 kernel: R10: ffffffffa07e50e9 R11:
ffff88de36aab878 R12: ffff88de1f3cab00
Dec 10 07:36:56 popeye-mgr-0-10 kernel: R13: ffff88de44edeb18 R14:
ffff88de1f3cabc0 R15: ffff88de44edeb18
Dec 10 07:36:56 popeye-mgr-0-10 kernel: FS: 0000000000000000(0000)
GS:ffff88dec0c40000(0000) knlGS:0000000000000000
Dec 10 07:36:56 popeye-mgr-0-10 kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Dec 10 07:36:56 popeye-mgr-0-10 kernel: CR2: 000055555584c300 CR3:
0000005e92331003 CR4: 00000000007606e0
Dec 10 07:36:56 popeye-mgr-0-10 kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Dec 10 07:36:56 popeye-mgr-0-10 kernel: DR3: 0000000000000000 DR6:
00000000fffe0ff0 DR7: 0000000000000400
Dec 10 07:36:56 popeye-mgr-0-10 kernel: PKRU: 55555554
Dec 10 07:36:56 popeye-mgr-0-10 kernel: Call Trace:
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ceph_queue_cap_snap+0x3e/0x3e0
[ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: queue_realm_cap_snaps+0x71/0xf0
[ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel:
ceph_update_snap_trace+0x382/0x400 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: handle_reply+0x209/0x770 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: dispatch+0xb2/0x1e0 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: process_message+0x7b/0x130 [libceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: try_read+0x340/0x5e0 [libceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ceph_con_workfn+0x79/0x210 [libceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: process_one_work+0x1a6/0x300
Dec 10 07:36:56 popeye-mgr-0-10 kernel: worker_thread+0x41/0x310
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ? rescuer_thread+0x370/0x370
Dec 10 07:36:56 popeye-mgr-0-10 kernel: kthread+0x10b/0x130
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ?
kthread_create_worker_on_cpu+0x60/0x60
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ret_from_fork+0x1f/0x30
Dec 10 07:40:17 popeye-mgr-0-10 kernel: CPU: 18 PID: 1350556 Comm:
kworker/18:0 Not tainted 5.4.114.1.fi #1
Dec 10 07:40:17 popeye-mgr-0-10 kernel: Hardware name: Dell Inc.
PowerEdge R640/0W23H8, BIOS 2.10.2 02/24/2021
Dec 10 07:40:17 popeye-mgr-0-10 kernel: Workqueue: ceph-msgr
ceph_con_workfn [libceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RIP: 0010:_raw_spin_lock+0xb/0x20
Dec 10 07:40:17 popeye-mgr-0-10 kernel: Code: 31 c0 ba ff 00 00 00 f0 0f
b1 17 75 01 c3 e9 ec 8f 98 ff 66 90 66 2e 0f 1f 84 00 00 00 00 00 31 c0
ba 01 00 00 00 f0 0f b1 17 <75> 01 c3 89 c6 e9 0b 7e 98 ff 90 66 2e 0f
1f 84 00 00 00 00 00 fa
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RSP: 0018:ffffc9001b607ca0
EFLAGS: 00000246
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RAX: 0000000000000000 RBX:
ffff88de62112ef0 RCX: 0000027f51c57994
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RDX: 0000000000000001 RSI:
0000000000000000 RDI: ffff88de455b4cd0
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RBP: ffff88de62112bc0 R08:
0000000000000000 R09: ffff88dbf2742600
Dec 10 07:40:17 popeye-mgr-0-10 kernel: R10: ffffffffa07e50e9 R11:
ffff88de62112d88 R12: ffff88de455b4c00
Dec 10 07:40:17 popeye-mgr-0-10 kernel: R13: ffff88de62112ef0 R14:
ffff88de455b4cc0 R15: ffff88de455b4cd0
Dec 10 07:40:17 popeye-mgr-0-10 kernel: FS: 0000000000000000(0000)
GS:ffff88dec0c40000(0000) knlGS:0000000000000000
Dec 10 07:40:17 popeye-mgr-0-10 kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
Dec 10 07:40:17 popeye-mgr-0-10 kernel: CR2: 000055555584c300 CR3:
0000005e92331003 CR4: 00000000007606e0
Dec 10 07:40:17 popeye-mgr-0-10 kernel: DR0: 0000000000000000 DR1:
0000000000000000 DR2: 0000000000000000
Dec 10 07:40:17 popeye-mgr-0-10 kernel: DR3: 0000000000000000 DR6:
00000000fffe0ff0 DR7: 0000000000000400
Dec 10 07:40:17 popeye-mgr-0-10 kernel: PKRU: 55555554
Dec 10 07:40:17 popeye-mgr-0-10 kernel: Call Trace:
Dec 10 07:40:17 popeye-mgr-0-10 kernel: queue_realm_cap_snaps+0x79/0xf0
[ceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel:
ceph_update_snap_trace+0x382/0x400 [ceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: handle_reply+0x209/0x770 [ceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: dispatch+0xb2/0x1e0 [ceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: process_message+0x7b/0x130 [libceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: try_read+0x340/0x5e0 [libceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: ceph_con_workfn+0x79/0x210 [libceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: process_one_work+0x1a6/0x300
Dec 10 07:40:17 popeye-mgr-0-10 kernel: worker_thread+0x41/0x310
Dec 10 07:40:17 popeye-mgr-0-10 kernel: ? rescuer_thread+0x370/0x370
Dec 10 07:40:17 popeye-mgr-0-10 kernel: kthread+0x10b/0x130
Dec 10 07:40:17 popeye-mgr-0-10 kernel: ?
kthread_create_worker_on_cpu+0x60/0x60
Dec 10 07:40:17 popeye-mgr-0-10 kernel: ret_from_fork+0x1f/0x30
Turning on kernel debugging for fs/ceph/snap.c, there seems this loop in
queue_realm_cap_snaps is really long for some inodes:
list_for_each_entry(ci, &realm->inodes_with_caps,
i_snap_realm_item) {
...
}
[3199649.882973] ceph: queue_realm_cap_snaps 00000000d313f1e4
1001af8c8e7 inodes
[3199649.882974] ceph: queue_cap_snap 00000000171f11a7 nothing
dirty|writing
[3199649.882975] ceph: queue_cap_snap 00000000938b9cd2 nothing
dirty|writing
[3199649.882976] ceph: queue_cap_snap 00000000615cf4dd nothing
dirty|writing
[3199649.882977] ceph: queue_cap_snap 000000000027e295 nothing
dirty|writing
[3199649.882979] ceph: queue_cap_snap 00000000ba18b2f8 nothing
dirty|writing
[3199649.882980] ceph: queue_cap_snap 000000007c9c80de nothing
dirty|writing
[3199649.882981] ceph: queue_cap_snap 00000000629b4b0e nothing
dirty|writing
[3199649.882982] ceph: queue_cap_snap 00000000ab330b37 nothing
dirty|writing
[3199649.882983] ceph: queue_cap_snap 00000000c7dbc320 nothing
dirty|writing
[3199649.882985] ceph: queue_cap_snap 0000000070a0598f nothing
dirty|writing
[3199649.882986] ceph: queue_cap_snap 00000000915b9e2e nothing
dirty|writing
... (and a lot lot more of these) ...
At this point the client has about a million caps (running up against
the default cap limit) - so potentially this loop is over all the caps
(?), which could mean tens/hundreds of milliseconds? Indeed, reducing
mds_max_caps_per_client by an order of magnitude does improve the lstat
times by about an order of magnitude (which is still pretty slow - but
supports this hypothesis).
The ceph cluster is Nautilus 14.2.20. There are a total of 7 snapshots
in cephfs, all taken at the root of the cephfs tree (a rolling set of 7
previous daily snapshots).
I've tested this with a few kernels: two LTS ones, and one more recent
stable one: 5.4.114, 5.10.73 and 5.14.16 with the same result. Any
ideas/suggestions?
Andras
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx