cephfs kernel client + snapshots slowness

Hi,

We've recently started using cephfs snapshots and are running into some issues with the kernel client.  It seems like traversing the file system and stat'ing files have become extremely slow.  Some (not all) stat operations went from microseconds to hundreds of milliseconds in duration.  The details are from a node that runs a file system scan:

The lstat calls of the file system scan take 200ms+, here is the strace -T -tt output:

07:05:28.909309 lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/tablespoon.hpp", {st_mode=S_IFREG|0444, st_size=1145, ...}) = 0 <0.272695>
07:05:29.182086 lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/teaspoon.hpp", {st_mode=S_IFREG|0444, st_size=1133, ...}) = 0 <0.268555>
07:05:29.450680 lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/ton.hpp", {st_mode=S_IFREG|0444, st_size=1107, ...}) = 0 <0.270040>
07:05:29.720758 lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/yard.hpp", {st_mode=S_IFREG|0444, st_size=1051, ...}) = 0 <0.268032>
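For anyone wanting to reproduce the measurement without strace, the same per-call latencies can be collected by timing os.lstat directly; a minimal sketch (the tree being walked is whatever your scan covers, not a specific path from above):

```python
import os
import time

def time_lstats(paths):
    """Time each os.lstat call and return (path, seconds) pairs."""
    results = []
    for p in paths:
        t0 = time.perf_counter()
        os.lstat(p)
        results.append((p, time.perf_counter() - t0))
    return results

def slow_entries(root, threshold=0.1):
    """Walk root and yield (seconds, path) for lstat calls slower than threshold."""
    for dirpath, dirnames, filenames in os.walk(root):
        names = [os.path.join(dirpath, n) for n in dirnames + filenames]
        for p, dt in time_lstats(names):
            if dt > threshold:
                yield dt, p
```

On a healthy mount nothing should exceed the default 100 ms threshold; on the affected client the slow calls show up immediately.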

The file system scanning process is constantly stuck in 'D' state with /proc/<pid>/stack:

[<0>] ceph_mdsc_wait_request+0x88/0x150 [ceph]
[<0>] ceph_mdsc_do_request+0x82/0x90 [ceph]
[<0>] ceph_d_revalidate+0x207/0x300 [ceph]
[<0>] lookup_fast+0x179/0x210
[<0>] walk_component+0x44/0x320
[<0>] path_lookupat+0x7b/0x220
[<0>] filename_lookup+0xa5/0x170
[<0>] vfs_statx+0x6e/0xd0
[<0>] __do_sys_newlstat+0x39/0x70
[<0>] do_syscall_64+0x4a/0xe0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
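To find processes blocked like this without knowing the PID up front, one can scan /proc for tasks in 'D' state; a small sketch (Linux-only, and reading /proc/&lt;pid&gt;/stack itself still requires root):

```python
import glob

def d_state_tasks():
    """Return (pid, comm) for tasks currently in uninterruptible sleep ('D')."""
    tasks = []
    for stat_path in glob.glob("/proc/[0-9]*/stat"):
        try:
            data = open(stat_path).read()
        except OSError:
            continue  # task exited while we were scanning
        # comm is parenthesised and may contain spaces, so parse around ')'
        state = data[data.rindex(")") + 2]
        if state == "D":
            pid = int(stat_path.split("/")[2])
            comm = data[data.index("(") + 1:data.rindex(")")]
            tasks.append((pid, comm))
    return tasks

for pid, comm in d_state_tasks():
    print(pid, comm)  # then: cat /proc/<pid>/stack (as root)
```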

One of the kernel threads is constantly using close to 100% of a single core:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1350556 root      20   0       0      0      0 R 100.0   0.0   4:52.85 [kworker/18:0+ceph-msgr]

Dumping the kernel stacks (echo l > /proc/sysrq-trigger) typically shows this worker in ceph_queue_cap_snap or queue_realm_cap_snaps. A couple of examples:

Dec 10 07:36:56 popeye-mgr-0-10 kernel: CPU: 18 PID: 1350556 Comm: kworker/18:0 Not tainted 5.4.114.1.fi #1
Dec 10 07:36:56 popeye-mgr-0-10 kernel: Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 2.10.2 02/24/2021
Dec 10 07:36:56 popeye-mgr-0-10 kernel: Workqueue: ceph-msgr ceph_con_workfn [libceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RIP: 0010:_raw_spin_lock+0xb/0x20
Dec 10 07:36:56 popeye-mgr-0-10 kernel: Code: 31 c0 ba ff 00 00 00 f0 0f b1 17 75 01 c3 e9 ec 8f 98 ff 66 90 66 2e 0f 1f 84 00 00 00 00 00 31 c0 ba 01 00 00 00 f0 0f b1 17 <75> 01 c3 89 c6 e9 0b 7e 98 ff 90 66 2e 0f 1f 84 00 00 00 00 00 fa
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RSP: 0018:ffffc9001b607c50 EFLAGS: 00000246
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RAX: 0000000000000000 RBX: ffff88de44ede7e8 RCX: 0000000000000000
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88de44ede7f8
Dec 10 07:36:56 popeye-mgr-0-10 kernel: RBP: ffff88de0a43ca00 R08: ffff88dec0c63120 R09: ffff88de0a43ca00
Dec 10 07:36:56 popeye-mgr-0-10 kernel: R10: ffffffffa07e50e9 R11: ffff88de36aab878 R12: ffff88de1f3cab00
Dec 10 07:36:56 popeye-mgr-0-10 kernel: R13: ffff88de44edeb18 R14: ffff88de1f3cabc0 R15: ffff88de44edeb18
Dec 10 07:36:56 popeye-mgr-0-10 kernel: FS:  0000000000000000(0000) GS:ffff88dec0c40000(0000) knlGS:0000000000000000
Dec 10 07:36:56 popeye-mgr-0-10 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 10 07:36:56 popeye-mgr-0-10 kernel: CR2: 000055555584c300 CR3: 0000005e92331003 CR4: 00000000007606e0
Dec 10 07:36:56 popeye-mgr-0-10 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 10 07:36:56 popeye-mgr-0-10 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dec 10 07:36:56 popeye-mgr-0-10 kernel: PKRU: 55555554
Dec 10 07:36:56 popeye-mgr-0-10 kernel: Call Trace:
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ceph_queue_cap_snap+0x3e/0x3e0 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: queue_realm_cap_snaps+0x71/0xf0 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ceph_update_snap_trace+0x382/0x400 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: handle_reply+0x209/0x770 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: dispatch+0xb2/0x1e0 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: process_message+0x7b/0x130 [libceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: try_read+0x340/0x5e0 [libceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ceph_con_workfn+0x79/0x210 [libceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: process_one_work+0x1a6/0x300
Dec 10 07:36:56 popeye-mgr-0-10 kernel: worker_thread+0x41/0x310
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ? rescuer_thread+0x370/0x370
Dec 10 07:36:56 popeye-mgr-0-10 kernel: kthread+0x10b/0x130
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ? kthread_create_worker_on_cpu+0x60/0x60
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ret_from_fork+0x1f/0x30

Dec 10 07:40:17 popeye-mgr-0-10 kernel: CPU: 18 PID: 1350556 Comm: kworker/18:0 Not tainted 5.4.114.1.fi #1
Dec 10 07:40:17 popeye-mgr-0-10 kernel: Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 2.10.2 02/24/2021
Dec 10 07:40:17 popeye-mgr-0-10 kernel: Workqueue: ceph-msgr ceph_con_workfn [libceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RIP: 0010:_raw_spin_lock+0xb/0x20
Dec 10 07:40:17 popeye-mgr-0-10 kernel: Code: 31 c0 ba ff 00 00 00 f0 0f b1 17 75 01 c3 e9 ec 8f 98 ff 66 90 66 2e 0f 1f 84 00 00 00 00 00 31 c0 ba 01 00 00 00 f0 0f b1 17 <75> 01 c3 89 c6 e9 0b 7e 98 ff 90 66 2e 0f 1f 84 00 00 00 00 00 fa
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RSP: 0018:ffffc9001b607ca0 EFLAGS: 00000246
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RAX: 0000000000000000 RBX: ffff88de62112ef0 RCX: 0000027f51c57994
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88de455b4cd0
Dec 10 07:40:17 popeye-mgr-0-10 kernel: RBP: ffff88de62112bc0 R08: 0000000000000000 R09: ffff88dbf2742600
Dec 10 07:40:17 popeye-mgr-0-10 kernel: R10: ffffffffa07e50e9 R11: ffff88de62112d88 R12: ffff88de455b4c00
Dec 10 07:40:17 popeye-mgr-0-10 kernel: R13: ffff88de62112ef0 R14: ffff88de455b4cc0 R15: ffff88de455b4cd0
Dec 10 07:40:17 popeye-mgr-0-10 kernel: FS:  0000000000000000(0000) GS:ffff88dec0c40000(0000) knlGS:0000000000000000
Dec 10 07:40:17 popeye-mgr-0-10 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 10 07:40:17 popeye-mgr-0-10 kernel: CR2: 000055555584c300 CR3: 0000005e92331003 CR4: 00000000007606e0
Dec 10 07:40:17 popeye-mgr-0-10 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec 10 07:40:17 popeye-mgr-0-10 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dec 10 07:40:17 popeye-mgr-0-10 kernel: PKRU: 55555554
Dec 10 07:40:17 popeye-mgr-0-10 kernel: Call Trace:
Dec 10 07:40:17 popeye-mgr-0-10 kernel: queue_realm_cap_snaps+0x79/0xf0 [ceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: ceph_update_snap_trace+0x382/0x400 [ceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: handle_reply+0x209/0x770 [ceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: dispatch+0xb2/0x1e0 [ceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: process_message+0x7b/0x130 [libceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: try_read+0x340/0x5e0 [libceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: ceph_con_workfn+0x79/0x210 [libceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: process_one_work+0x1a6/0x300
Dec 10 07:40:17 popeye-mgr-0-10 kernel: worker_thread+0x41/0x310
Dec 10 07:40:17 popeye-mgr-0-10 kernel: ? rescuer_thread+0x370/0x370
Dec 10 07:40:17 popeye-mgr-0-10 kernel: kthread+0x10b/0x130
Dec 10 07:40:17 popeye-mgr-0-10 kernel: ? kthread_create_worker_on_cpu+0x60/0x60
Dec 10 07:40:17 popeye-mgr-0-10 kernel: ret_from_fork+0x1f/0x30

Turning on kernel debugging for fs/ceph/snap.c, it appears that this loop in queue_realm_cap_snaps, which iterates over the realm's inodes, takes a really long time for some realms:

        /* one iteration per inode holding caps in this snap realm */
        list_for_each_entry(ci, &realm->inodes_with_caps, i_snap_realm_item) {
        ...
        }
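A back-of-the-envelope model of why that loop hurts (the per-iteration cost below is an assumed figure for illustration, not a measurement):

```python
caps = 1_000_000        # client cap count, near the default cap limit
cost_per_iter = 200e-9  # assumed ~200 ns per inode (list walk, lock, flag checks)

per_snap_trace = caps * cost_per_iter
print(f"{per_snap_trace * 1e3:.0f} ms per snap trace update")  # prints "200 ms ..."
```

With these assumptions each snap trace update costs on the order of 200 ms of kworker time, and every MDS request that has to wait behind it (such as the d_revalidate lookups above) sees that latency.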

[3199649.882973] ceph:  queue_realm_cap_snaps 00000000d313f1e4 1001af8c8e7 inodes
[3199649.882974] ceph:  queue_cap_snap 00000000171f11a7 nothing dirty|writing
[3199649.882975] ceph:  queue_cap_snap 00000000938b9cd2 nothing dirty|writing
[3199649.882976] ceph:  queue_cap_snap 00000000615cf4dd nothing dirty|writing
[3199649.882977] ceph:  queue_cap_snap 000000000027e295 nothing dirty|writing
[3199649.882979] ceph:  queue_cap_snap 00000000ba18b2f8 nothing dirty|writing
[3199649.882980] ceph:  queue_cap_snap 000000007c9c80de nothing dirty|writing
[3199649.882981] ceph:  queue_cap_snap 00000000629b4b0e nothing dirty|writing
[3199649.882982] ceph:  queue_cap_snap 00000000ab330b37 nothing dirty|writing
[3199649.882983] ceph:  queue_cap_snap 00000000c7dbc320 nothing dirty|writing
[3199649.882985] ceph:  queue_cap_snap 0000000070a0598f nothing dirty|writing
[3199649.882986] ceph:  queue_cap_snap 00000000915b9e2e nothing dirty|writing
... (and a lot lot more of these) ...

At this point the client has about a million caps (running up against the default cap limit), so this loop potentially iterates over all of those caps, which could account for tens to hundreds of milliseconds per snap trace. Indeed, reducing mds_max_caps_per_client by an order of magnitude improves the lstat times by about an order of magnitude (still pretty slow, but it supports this hypothesis).
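For reference, the mitigation described above amounts to lowering the MDS cap limit at runtime; a sketch of the commands (the target value is just the order-of-magnitude reduction mentioned, not a tuning recommendation):

```
# inspect the current per-client cap limit (default is about one million)
ceph config get mds mds_max_caps_per_client

# lower it by an order of magnitude; the MDS will start recalling caps
ceph config set mds mds_max_caps_per_client 100000
```

Fewer caps means shorter inodes_with_caps lists, at the price of more cap recalls and cold-cache lookups.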

The ceph cluster is Nautilus 14.2.20.  There are a total of 7 snapshots in cephfs, all taken at the root of the cephfs tree (a rolling set of 7 previous daily snapshots).

I've tested this with a few kernels, two LTS ones and one more recent stable one: 5.4.114, 5.10.73 and 5.14.16, all with the same result.  Any ideas/suggestions?

Andras




