Re: High cephfs MDS latency and CPU load

Currently this issue creates a painful problem for us - we've gone from a fairly smoothly working cephfs setup to one that comes to a screeching halt for hours, several times a day, when people start removing trees of files.  I've filed a tracker for the problem with as much detail as I know currently: https://tracker.ceph.com/issues/53192.  It seems there is an issue inside the MDS with removing files that existed before snapshots were taken - even after the snapshots are completely removed.  Unfortunately I don't understand enough about snapshots and how they are handled within the MDS to come up with any conclusions here.
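
For anyone who wants to watch for the same symptom: the spikes show up clearly if you sample the MDS latency counters periodically instead of eyeballing 'perf dump' by hand.  A rough sketch along these lines (the mds daemon name is a placeholder, and the exact counter names may differ between releases) prints the per-interval average unlink latency:

    #!/usr/bin/env python3
    # Rough sketch: poll the MDS admin socket and print the average unlink
    # latency over each interval.  Run on the MDS host; "mds.a" and the
    # counter names below are placeholders - adjust for your cluster/release.
    import json, subprocess, time

    MDS_NAME = "mds.a"
    SECTION = "mds_server"
    COUNTER = "req_unlink_latency"

    def sample():
        out = subprocess.check_output(["ceph", "daemon", MDS_NAME, "perf", "dump"])
        c = json.loads(out)[SECTION][COUNTER]
        return c["avgcount"], c["sum"]   # op count, cumulative latency (seconds)

    prev_n, prev_s = sample()
    while True:
        time.sleep(10)
        n, s = sample()
        if n > prev_n:
            print("avg unlink latency over last 10s: %.4f s"
                  % ((s - prev_s) / (n - prev_n)))
        prev_n, prev_s = n, s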

Andras


On 11/5/21 09:26, Andras Pataki wrote:
Also, just to add a little more detail (the problem is back again today on our cluster).  The cause seems to be a single 'rm -rf' of a directory on a node.  If I stop that process, the latencies immediately drop; if I restart it, they immediately go back up by a factor of 50 to 100 on the MDS.  Stracing the process shows that some unlinks are really slow (0.3s and up), while most are reasonably fast (around 1ms):

    09:23:54.854668 unlinkat(7, "ebdyc.p", 0) = 0 <0.001106>
    *09:23:54.855864 unlinkat(7, "Dyy.p", 0) = 0 <0.352603>*
    09:23:55.208547 unlinkat(7, "u.p", 0)   = 0 <0.002474>
    09:23:55.211098 unlinkat(7, "Dxy.p", 0) = 0 <0.001317>
    09:23:55.212491 unlinkat(7, "v.p", 0)   = 0 <0.001437>
    09:23:55.214005 unlinkat(7, "Dxx.p", 0) = 0 <0.001506>
    09:23:55.215622 unlinkat(6, "timestep_35050", AT_REMOVEDIR) = 0 <0.001874>
    *09:23:55.221117 unlinkat(7, "ebdyc.p", 0) = 0 <0.335875>*
    09:23:55.557076 unlinkat(7, "Dyy.p", 0) = 0 <0.001657>
    09:23:55.558809 unlinkat(7, "u.p", 0)   = 0 <0.001450>
    09:23:55.560349 unlinkat(7, "Dxy.p", 0) = 0 <0.001507>
    09:23:55.561940 unlinkat(7, "v.p", 0)   = 0 <0.001349>
    09:23:55.563365 unlinkat(7, "Dxx.p", 0) = 0 <0.001182>
    09:23:55.564668 unlinkat(6, "timestep_45250", AT_REMOVEDIR) = 0 <0.002633>
    09:23:55.885885 unlinkat(7, "ebdyc.p", 0) = 0 <0.005142>
    09:23:55.891111 unlinkat(7, "Dyy.p", 0) = 0 <0.001656>
    09:23:55.892844 unlinkat(7, "u.p", 0)   = 0 <0.001685>
    09:23:55.894589 unlinkat(7, "Dxy.p", 0) = 0 <0.001407>
    09:23:55.896056 unlinkat(7, "v.p", 0)   = 0 <0.001520>
    09:23:55.897669 unlinkat(7, "Dxx.p", 0) = 0 <0.001601>

The client here is ceph-fuse 15.2.11 if that matters.
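
In case it helps anyone reproduce the measurement without strace, a rough sketch along these lines (the threshold is an arbitrary placeholder, not what we actually ran, and it assumes no symlinked directories in the tree) removes a tree bottom-up and flags any slow unlink/rmdir:

    #!/usr/bin/env python3
    # Rough sketch: remove a directory tree and report any unlink/rmdir that
    # takes longer than a threshold - roughly what the strace above shows.
    import os, sys, time

    SLOW = 0.1   # seconds; arbitrary threshold for "slow"

    def remove_tree(path):
        # walk bottom-up so files are removed before their parent directories
        for root, dirs, files in os.walk(path, topdown=False):
            for name in files:
                p = os.path.join(root, name)
                t0 = time.monotonic()
                os.unlink(p)
                dt = time.monotonic() - t0
                if dt > SLOW:
                    print("slow unlink %.3fs  %s" % (dt, p))
            for name in dirs:
                p = os.path.join(root, name)
                t0 = time.monotonic()
                os.rmdir(p)
                dt = time.monotonic() - t0
                if dt > SLOW:
                    print("slow rmdir  %.3fs  %s" % (dt, p))
        os.rmdir(path)

    if __name__ == "__main__":
        remove_tree(sys.argv[1])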

Andras


On 11/5/21 05:23, Andras Pataki wrote:
Hi Patrick,

Thanks for looking at the issue - over the past few days this has unfortunately given us quite a headache.  A couple of questions, if you don't mind - both for my own education and to help us avoid hitting this issue until we can upgrade.

Do you think the issue is related to snapshots at all?  I started creating snapshots on this cluster this week, and this problem started showing up this week as well - and the stack trace has "snapshotty" things in it like snap realm splitting/opening.  Or is this perhaps just a coincidence?

When is this code path triggered (Server::_unlink_local_finish -> CInode::early_pop_projected_snaprealm -> CInode::pop_projected_snaprealm -> CInode::open_snaprealm -> SnapRealm::split_at)?

If this is indeed related to snapshots - somehow the problem reared its head more than a day after I removed all snapshots from the system.  Does snapshot-related state stay around in the MDS well after the snapshots no longer exist?

There seems to be another issue with kernel clients performing really poorly after a snapshot has been created (even after it is removed afterwards).  I see a kworker thread like the following constantly chewing CPU when running metadata operations (scanning the file system, rsync, or similar):

    84851 root      20   0       0      0      0 R 100.0   0.0 24:33.07 [kworker/2:0+ceph-msgr]

    Nov  5 02:19:14 popeye-mgr-0-10 kernel: Call Trace:
    Nov  5 02:19:14 popeye-mgr-0-10 kernel: rebuild_snap_realms+0x1e/0x70 [ceph]
    Nov  5 02:19:14 popeye-mgr-0-10 kernel: rebuild_snap_realms+0x38/0x70 [ceph]
    Nov  5 02:19:14 popeye-mgr-0-10 kernel: ceph_update_snap_trace+0x213/0x400 [ceph]
    Nov  5 02:19:14 popeye-mgr-0-10 kernel: handle_reply+0x209/0x770 [ceph]
    Nov  5 02:19:14 popeye-mgr-0-10 kernel: dispatch+0xb2/0x1e0 [ceph]
    Nov  5 02:19:14 popeye-mgr-0-10 kernel: process_message+0x7b/0x130 [libceph]
    Nov  5 02:19:14 popeye-mgr-0-10 kernel: try_read+0x340/0x5e0 [libceph]
    Nov  5 02:19:14 popeye-mgr-0-10 kernel: ceph_con_workfn+0x79/0x210 [libceph]
    Nov  5 02:19:14 popeye-mgr-0-10 kernel: process_one_work+0x1a6/0x300
    Nov  5 02:19:14 popeye-mgr-0-10 kernel: worker_thread+0x41/0x310

I see another message about this on the mailing list from last night ("One cephFS snapshot kills performance").  I've tested various kernels: two LTS ones (5.4.114 and 5.10.73) and the latest stable (5.14.16) - they all exhibit this behavior.
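
In case it helps with reproducing: the stack of the busy kworker can be sampled directly via /proc/<pid>/stack (as root, on a kernel with stack tracing enabled) - a minimal sketch, with the PID being just the example from the top output above:

    #!/usr/bin/env python3
    # Minimal sketch: print the kernel stack of the busy kworker a few times
    # to see where it is spending its time.  Needs root.
    import sys, time

    PID = sys.argv[1] if len(sys.argv) > 1 else "84851"   # kworker PID from top

    for i in range(10):
        with open("/proc/%s/stack" % PID) as f:
            print("--- sample %d ---" % i)
            print(f.read())
        time.sleep(1)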

If you could point me to some places to learn about how snapshots/snaprealms work (especially in conjunction with the MDS), that'd be really helpful as well.

Thanks!

Andras


On 11/4/21 11:40, Patrick Donnelly wrote:
Hi Andras,

On Wed, Nov 3, 2021 at 10:18 AM Andras Pataki
<apataki@xxxxxxxxxxxxxxxxxxxxx>  wrote:
Hi cephers,

Recently we've started using cephfs snapshots more - and seem to be
running into a rather annoying performance issue with the MDS.  The
cluster in question is on Nautilus 14.2.20.

Typically, the MDS processes a few thousand requests per second with all
operations showing latencies in the few millisecond range (in mds perf
dump) and the system seems quite responsive (directory listings, general
metadata operations feel quick).  Every so often, the MDS transitions
into a period of super high latency: 0.1 to 2 seconds per operation (as
measured by increases in the latency counters in mds perf dump).  During
these high latency periods, the request rate is about the same (low
1000s requests/s) - but one thread of the MDS called 'fn_anonymous' is
100% busy.  Pointing the debugger to it and getting a stack trace at
random times always shows a similar picture:

Thanks for the report and useful stack trace. This is probably
corrected by the new use of a "fair" mutex in the MDS:

https://tracker.ceph.com/issues/52441

The fix will be in 16.2.7.



_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



