Re: cephfs kernel client + snapshots slowness

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Sebastian,

Thanks for the pointer to the tracker - I've added my report details to that as well.

After playing with mds_max_caps_per_client, I am now fairly convinced that the loop over caps in fs/ceph/snap.c in the kernel is over all (or most of) the caps that the kernel as at the time. Unfortunately I don't know the details of snaprealms well enough to understand why this would be or how one would go about not doing such an expensive loop so often.

Andras


On 12/10/21 12:39, Sebastian Knust wrote:
Hi,

I also see this behaviour and can more or less reproduce it running rsync or Bareos Backup tasks (anything stat-intense should do) on a specific directory. Unmounting and then remounting the filesystem fixes it, until it is caused again by a stat-intense task.

For me, I only saw two immediate solutions:
1) Use fuse client, which does not seem to exhibit this issue but is in general approx. one order of magnitude slower for metadata ops then kernel client
2) Do not use snapshots.

I grudgingly chose the second option, since 1) is too slow for my nightly backups.

A ticket has been opened at https://tracker.ceph.com/issues/44100, showing a few people running into these issues. Maybe your insights regarding the cap limit and the possible mitigation of either lowering the limit or raising it (above the critical number for your specific use case, whichever that might be) can help others as a workaround and might point the developers to the cause in the kernel client.

I myself will experiment with raising mds_max_caps_per_client, as I only have a handful of clients, all of which are heavily used. So thanks for recommending to look into that config value.

Cheers
Sebastian


On 10.12.21 17:33, Andras Pataki wrote:
Hi,

We've recently started using cephfs snapshots and are running into some issues with the kernel client.  It seems like traversing the file system and stat'ing files have become extremely slow.  Some (not all) stat operations went from microseconds to hundreds of milliseconds in duration.  The details are from a node that runs a file system scan:

The lstat calls of the file system scan take 200ms+, here is the strace -T -tt output:

07:05:28.909309 lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/tablespoon.hpp", {st_mode=S_IFREG|0444, st_size=1145, ...}) = 0 <0.272695> 07:05:29.182086 lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/teaspoon.hpp", {st_mode=S_IFREG|0444, st_size=1133, ...}) = 0 <0.268555> 07:05:29.450680 lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/ton.hpp", {st_mode=S_IFREG|0444, st_size=1107, ...}) = 0 <0.270040> 07:05:29.720758 lstat("/mnt/sdceph/users/apataki/home/3/cm-shared/sw/nix/store/n25qcsmcmraqylfmdnh5ns7fpw1dmfr8-python2.7-pystan-2.19.1.1/lib/python2.7/site-packages/pystan/stan/lib/stan_math/lib/boost_1.69.0/boost/units/base_units/us/yard.hpp", {st_mode=S_IFREG|0444, st_size=1051, ...}) = 0 <0.268032>

The file system scanning process is constantly stuck in 'D' state with /proc/<pid>/stack:

[<0>] ceph_mdsc_wait_request+0x88/0x150 [ceph]
[<0>] ceph_mdsc_do_request+0x82/0x90 [ceph]
[<0>] ceph_d_revalidate+0x207/0x300 [ceph]
[<0>] lookup_fast+0x179/0x210
[<0>] walk_component+0x44/0x320
[<0>] path_lookupat+0x7b/0x220
[<0>] filename_lookup+0xa5/0x170
[<0>] vfs_statx+0x6e/0xd0
[<0>] __do_sys_newlstat+0x39/0x70
[<0>] do_syscall_64+0x4a/0xe0
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

One of the kernel threads is using close to 100% of a single core constantly:      PID USER      PR  NI    VIRT    RES SHR S  %CPU  %MEM TIME+ COMMAND 1350556 root      20   0       0      0      0 R 100.0   0.0 4:52.85 [kworker/18:0+ceph-msgr]

Dumping the kernel stacks (echo l > /proc/sysrq-trigger) shows typically waiting in ceph_queue_cap_snap or queue_realm_cap_snaps. A couple of examples:

Dec 10 07:36:56 popeye-mgr-0-10 kernel: CPU: 18 PID: 1350556 Comm: kworker/18:0 Not tainted 5.4.114.1.fi #1 Dec 10 07:36:56 popeye-mgr-0-10 kernel: Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 2.10.2 02/24/2021 Dec 10 07:36:56 popeye-mgr-0-10 kernel: Workqueue: ceph-msgr ceph_con_workfn [libceph] Dec 10 07:36:56 popeye-mgr-0-10 kernel: RIP: 0010:_raw_spin_lock+0xb/0x20 Dec 10 07:36:56 popeye-mgr-0-10 kernel: Code: 31 c0 ba ff 00 00 00 f0 0f b1 17 75 01 c3 e9 ec 8f 98 ff 66 90 66 2e 0f 1f 84 00 00 00 00 00 31 c0 ba 01 00 00 00 f0 0f b1 17 <75> 01 c3 89 c6 e9 0b 7e 98 ff 90 66 2e 0f 1f 84 00 00 00 00 00 fa Dec 10 07:36:56 popeye-mgr-0-10 kernel: RSP: 0018:ffffc9001b607c50 EFLAGS: 00000246 Dec 10 07:36:56 popeye-mgr-0-10 kernel: RAX: 0000000000000000 RBX: ffff88de44ede7e8 RCX: 0000000000000000 Dec 10 07:36:56 popeye-mgr-0-10 kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88de44ede7f8 Dec 10 07:36:56 popeye-mgr-0-10 kernel: RBP: ffff88de0a43ca00 R08: ffff88dec0c63120 R09: ffff88de0a43ca00 Dec 10 07:36:56 popeye-mgr-0-10 kernel: R10: ffffffffa07e50e9 R11: ffff88de36aab878 R12: ffff88de1f3cab00 Dec 10 07:36:56 popeye-mgr-0-10 kernel: R13: ffff88de44edeb18 R14: ffff88de1f3cabc0 R15: ffff88de44edeb18 Dec 10 07:36:56 popeye-mgr-0-10 kernel: FS: 0000000000000000(0000) GS:ffff88dec0c40000(0000) knlGS:0000000000000000 Dec 10 07:36:56 popeye-mgr-0-10 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Dec 10 07:36:56 popeye-mgr-0-10 kernel: CR2: 000055555584c300 CR3: 0000005e92331003 CR4: 00000000007606e0 Dec 10 07:36:56 popeye-mgr-0-10 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Dec 10 07:36:56 popeye-mgr-0-10 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dec 10 07:36:56 popeye-mgr-0-10 kernel: PKRU: 55555554
Dec 10 07:36:56 popeye-mgr-0-10 kernel: Call Trace:
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ceph_queue_cap_snap+0x3e/0x3e0 [ceph] Dec 10 07:36:56 popeye-mgr-0-10 kernel: queue_realm_cap_snaps+0x71/0xf0 [ceph] Dec 10 07:36:56 popeye-mgr-0-10 kernel: ceph_update_snap_trace+0x382/0x400 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: handle_reply+0x209/0x770 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: dispatch+0xb2/0x1e0 [ceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: process_message+0x7b/0x130 [libceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: try_read+0x340/0x5e0 [libceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ceph_con_workfn+0x79/0x210 [libceph]
Dec 10 07:36:56 popeye-mgr-0-10 kernel: process_one_work+0x1a6/0x300
Dec 10 07:36:56 popeye-mgr-0-10 kernel: worker_thread+0x41/0x310
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ? rescuer_thread+0x370/0x370
Dec 10 07:36:56 popeye-mgr-0-10 kernel: kthread+0x10b/0x130
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ? kthread_create_worker_on_cpu+0x60/0x60
Dec 10 07:36:56 popeye-mgr-0-10 kernel: ret_from_fork+0x1f/0x30

Dec 10 07:40:17 popeye-mgr-0-10 kernel: CPU: 18 PID: 1350556 Comm: kworker/18:0 Not tainted 5.4.114.1.fi #1 Dec 10 07:40:17 popeye-mgr-0-10 kernel: Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 2.10.2 02/24/2021 Dec 10 07:40:17 popeye-mgr-0-10 kernel: Workqueue: ceph-msgr ceph_con_workfn [libceph] Dec 10 07:40:17 popeye-mgr-0-10 kernel: RIP: 0010:_raw_spin_lock+0xb/0x20 Dec 10 07:40:17 popeye-mgr-0-10 kernel: Code: 31 c0 ba ff 00 00 00 f0 0f b1 17 75 01 c3 e9 ec 8f 98 ff 66 90 66 2e 0f 1f 84 00 00 00 00 00 31 c0 ba 01 00 00 00 f0 0f b1 17 <75> 01 c3 89 c6 e9 0b 7e 98 ff 90 66 2e 0f 1f 84 00 00 00 00 00 fa Dec 10 07:40:17 popeye-mgr-0-10 kernel: RSP: 0018:ffffc9001b607ca0 EFLAGS: 00000246 Dec 10 07:40:17 popeye-mgr-0-10 kernel: RAX: 0000000000000000 RBX: ffff88de62112ef0 RCX: 0000027f51c57994 Dec 10 07:40:17 popeye-mgr-0-10 kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88de455b4cd0 Dec 10 07:40:17 popeye-mgr-0-10 kernel: RBP: ffff88de62112bc0 R08: 0000000000000000 R09: ffff88dbf2742600 Dec 10 07:40:17 popeye-mgr-0-10 kernel: R10: ffffffffa07e50e9 R11: ffff88de62112d88 R12: ffff88de455b4c00 Dec 10 07:40:17 popeye-mgr-0-10 kernel: R13: ffff88de62112ef0 R14: ffff88de455b4cc0 R15: ffff88de455b4cd0 Dec 10 07:40:17 popeye-mgr-0-10 kernel: FS: 0000000000000000(0000) GS:ffff88dec0c40000(0000) knlGS:0000000000000000 Dec 10 07:40:17 popeye-mgr-0-10 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Dec 10 07:40:17 popeye-mgr-0-10 kernel: CR2: 000055555584c300 CR3: 0000005e92331003 CR4: 00000000007606e0 Dec 10 07:40:17 popeye-mgr-0-10 kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Dec 10 07:40:17 popeye-mgr-0-10 kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Dec 10 07:40:17 popeye-mgr-0-10 kernel: PKRU: 55555554
Dec 10 07:40:17 popeye-mgr-0-10 kernel: Call Trace:
Dec 10 07:40:17 popeye-mgr-0-10 kernel: queue_realm_cap_snaps+0x79/0xf0 [ceph] Dec 10 07:40:17 popeye-mgr-0-10 kernel: ceph_update_snap_trace+0x382/0x400 [ceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: handle_reply+0x209/0x770 [ceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: dispatch+0xb2/0x1e0 [ceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: process_message+0x7b/0x130 [libceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: try_read+0x340/0x5e0 [libceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: ceph_con_workfn+0x79/0x210 [libceph]
Dec 10 07:40:17 popeye-mgr-0-10 kernel: process_one_work+0x1a6/0x300
Dec 10 07:40:17 popeye-mgr-0-10 kernel: worker_thread+0x41/0x310
Dec 10 07:40:17 popeye-mgr-0-10 kernel: ? rescuer_thread+0x370/0x370
Dec 10 07:40:17 popeye-mgr-0-10 kernel: kthread+0x10b/0x130
Dec 10 07:40:17 popeye-mgr-0-10 kernel: ? kthread_create_worker_on_cpu+0x60/0x60
Dec 10 07:40:17 popeye-mgr-0-10 kernel: ret_from_fork+0x1f/0x30

Turning on kernel debugging for fs/ceph/snap.c, there seems this loop in queue_realm_cap_snaps is really long for some inodes:

         list_for_each_entry(ci, &realm->inodes_with_caps, i_snap_realm_item) {
         ...
         }

[3199649.882973] ceph:  queue_realm_cap_snaps 00000000d313f1e4 1001af8c8e7 inodes [3199649.882974] ceph:  queue_cap_snap 00000000171f11a7 nothing dirty|writing [3199649.882975] ceph:  queue_cap_snap 00000000938b9cd2 nothing dirty|writing [3199649.882976] ceph:  queue_cap_snap 00000000615cf4dd nothing dirty|writing [3199649.882977] ceph:  queue_cap_snap 000000000027e295 nothing dirty|writing [3199649.882979] ceph:  queue_cap_snap 00000000ba18b2f8 nothing dirty|writing [3199649.882980] ceph:  queue_cap_snap 000000007c9c80de nothing dirty|writing [3199649.882981] ceph:  queue_cap_snap 00000000629b4b0e nothing dirty|writing [3199649.882982] ceph:  queue_cap_snap 00000000ab330b37 nothing dirty|writing [3199649.882983] ceph:  queue_cap_snap 00000000c7dbc320 nothing dirty|writing [3199649.882985] ceph:  queue_cap_snap 0000000070a0598f nothing dirty|writing [3199649.882986] ceph:  queue_cap_snap 00000000915b9e2e nothing dirty|writing
... (and a lot lot more of these) ...

At this point the client has about a million caps (running up against the default cap limit) - so potentially this loop is over all the caps (?), which could mean tens/hundreds of milliseconds? Indeed, reducing mds_max_caps_per_client by an order of magnitude does improve the lstat times by about an order of magnitude (which is still pretty slow - but supports this hypothesis).

The ceph cluster is Nautilus 14.2.20.  There are a total of 7 snapshots in cephfs, all taken at the root of the cephfs tree (a rolling set of 7 previous daily snapshots).

I've tested this with a few kernels: two LTS ones, and one more recent stable one: 5.4.114, 5.10.73 and 5.14.16 with the same result.  Any ideas/suggestions?

Andras

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx




[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Ceph Dev]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux