Hi Patrick,
Thanks for looking into the issue - over the past few days it has caused
us quite a few headaches, unfortunately. A couple of questions, if you
don't mind - both for my education and to help us avoid hitting this
issue until we can upgrade.
Do you think the issue is related to snapshots at all? I started
creating snapshots on this cluster this week, and this problem started
showing up this week as well - and the stack trace has "snapshotty"
things in it like snap realm splitting/opening. Or is this perhaps just
a coincidence?
When is this code path triggered (Server::_unlink_local_finish ->
CInode::early_pop_projected_snaprealm -> CInode::pop_projected_snaprealm
-> CInode::open_snaprealm -> SnapRealm::split_at)?
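(To make sure I'm reading that chain correctly: below is a toy sketch of
my mental model of what split_at does when a new realm gets opened on an
inode - entirely my own simplified C++ types and names, not the actual
MDS code, so please correct me if I'm off base.)

// Toy model of my mental picture of SnapRealm::split_at() -- simplified
// types/names of my own, NOT the actual MDS code. My understanding: when
// a new realm is opened on an inode, inodes tracked by the parent realm
// that live underneath that inode get moved into the new realm.

#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct ToyRealm {
    std::string root_path;                 // inode the realm is rooted at
    std::vector<std::string> inode_paths;  // inodes this realm currently governs
    std::vector<std::shared_ptr<ToyRealm>> children;
};

// Does 'path' live underneath 'root'?
static bool is_under(const std::string& path, const std::string& root) {
    return path.size() > root.size() &&
           path.compare(0, root.size(), root) == 0 &&
           path[root.size()] == '/';
}

// Toy version of "split the parent realm at the new realm's inode":
// anything the parent tracks below the new realm's root migrates over.
static void split_at(ToyRealm& parent, std::shared_ptr<ToyRealm> child) {
    std::vector<std::string> keep;
    for (auto& p : parent.inode_paths) {
        if (is_under(p, child->root_path))
            child->inode_paths.push_back(p);   // now governed by the new realm
        else
            keep.push_back(p);                 // stays with the parent realm
    }
    parent.inode_paths.swap(keep);
    parent.children.push_back(std::move(child));
}

int main() {
    ToyRealm root{"/", {"/a/file1", "/a/b/file2", "/c/file3"}, {}};
    auto sub = std::make_shared<ToyRealm>(ToyRealm{"/a", {}, {}});

    split_at(root, sub);  // e.g. a snapshot on /a opens a realm rooted there

    std::cout << "root keeps " << root.inode_paths.size()
              << " inode(s), /a realm now tracks "
              << sub->inode_paths.size() << " inode(s)\n";
    return 0;
}

In other words, my assumption is that inodes the parent realm was
tracking below the new realm's root get handed over to the new realm -
is that roughly right?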
If this is indeed related to snapshots, it is odd that the problem
reared its head more than a day after I removed all snapshots from the
system. Does snapshot-related state stay around in the MDS well after
the snapshots themselves no longer exist?
There also seems to be an issue with kernel clients performing really
poorly after a snapshot has been created (even after it is later
removed). I see a kworker thread like the following constantly chewing
CPU when running metadata-heavy operations (scanning the file system,
rsync and the like):
84851 root 20 0 0 0 0 R 100.0 0.0 24:33.07 [kworker/2:0+ceph-msgr]
Nov 5 02:19:14 popeye-mgr-0-10 kernel: Call Trace:
Nov 5 02:19:14 popeye-mgr-0-10 kernel: rebuild_snap_realms+0x1e/0x70 [ceph]
Nov 5 02:19:14 popeye-mgr-0-10 kernel: rebuild_snap_realms+0x38/0x70 [ceph]
Nov 5 02:19:14 popeye-mgr-0-10 kernel: ceph_update_snap_trace+0x213/0x400 [ceph]
Nov 5 02:19:14 popeye-mgr-0-10 kernel: handle_reply+0x209/0x770 [ceph]
Nov 5 02:19:14 popeye-mgr-0-10 kernel: dispatch+0xb2/0x1e0 [ceph]
Nov 5 02:19:14 popeye-mgr-0-10 kernel: process_message+0x7b/0x130 [libceph]
Nov 5 02:19:14 popeye-mgr-0-10 kernel: try_read+0x340/0x5e0 [libceph]
Nov 5 02:19:14 popeye-mgr-0-10 kernel: ceph_con_workfn+0x79/0x210 [libceph]
Nov 5 02:19:14 popeye-mgr-0-10 kernel: process_one_work+0x1a6/0x300
Nov 5 02:19:14 popeye-mgr-0-10 kernel: worker_thread+0x41/0x310
I see another message about this on the mailing list from last night
("One cephFS snapshot kills performance"). I've tested various kernels
against this: two LTS ones (5.4.114 and 5.10.73) and the latest stable
(5.14.16). They all exhibit this behavior.
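From the trace, my (possibly naive) reading is that every MDS reply
carrying a snap trace causes the client to recursively rebuild the snap
contexts for the whole realm hierarchy, which would explain a pegged
kworker under metadata-heavy workloads. Here's a toy sketch of that kind
of per-reply recursive rebuild - my own simplified C++ structures purely
to illustrate the scaling, not the actual fs/ceph code:

// Toy illustration of why I think one snapshot can keep the kworker busy:
// my (possibly wrong) reading of the trace is that each reply carrying a
// snap trace triggers a recursive rebuild over the whole realm hierarchy.
// Simplified structures of my own, not the actual fs/ceph code.

#include <cstdio>
#include <vector>

struct ToyRealm {
    long seq = 0;                   // bumped when snapshot state changes
    std::vector<ToyRealm> children;
    std::vector<long> snap_context; // cached, rebuilt from realm state
};

static long rebuilds = 0;

// Recursively rebuild cached snap contexts for a realm and everything below it.
static void rebuild_snap_realms(ToyRealm& realm) {
    realm.snap_context.assign(1, realm.seq);  // stand-in for the real rebuild
    ++rebuilds;
    for (auto& child : realm.children)
        rebuild_snap_realms(child);
}

// Stand-in for handling one MDS reply whose snap trace touched the root realm.
static void update_snap_trace(ToyRealm& root) {
    ++root.seq;
    rebuild_snap_realms(root);  // whole hierarchy walked for this reply
}

int main() {
    // A root realm with many descendant realms, as after snapshotting a big tree.
    ToyRealm root;
    for (int i = 0; i < 1000; ++i) {
        ToyRealm child;
        for (int j = 0; j < 100; ++j)
            child.children.emplace_back();
        root.children.push_back(std::move(child));
    }

    // A metadata-heavy workload (rsync, tree scan) means many replies...
    for (int reply = 0; reply < 100; ++reply)
        update_snap_trace(root);

    std::printf("realm rebuilds performed: %ld\n", rebuilds);  // ~10 million
    return 0;
}

If that per-reply walk is roughly what happens, it would match what
we're seeing - but again, this is just my guess from the stack trace.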
If you could point me to some places to learn about how
snapshots/snaprealms work (especially in conjunction with the MDS), that
would be really helpful as well.
Thanks!
Andras
On 11/4/21 11:40, Patrick Donnelly wrote:
Hi Andras,
On Wed, Nov 3, 2021 at 10:18 AM Andras Pataki
<apataki@xxxxxxxxxxxxxxxxxxxxx> wrote:
Hi cephers,
Recently we've started using cephfs snapshots more - and seem to be
running into a rather annoying performance issue with the MDS. The
cluster in question is on Nautilus 14.2.20.
Typically, the MDS processes a few thousand requests per second with all
operations showing latencies in the few millisecond range (in mds perf
dump) and the system seems quite responsive (directory listings, general
metadata operations feel quick). Every so often, the MDS transitions
into a period of super high latency: 0.1 to 2 seconds per operation (as
measured by increases in the latency counters in mds perf dump). During
these high latency periods, the request rate is about the same (low
1000s of requests/s) - but one thread of the MDS called 'fn_anonymous' is
100% busy. Pointing the debugger to it and getting a stack trace at
random times always shows a similar picture:
Thanks for the report and useful stack trace. This is probably
corrected by the new use of a "fair" mutex in the MDS:
https://tracker.ceph.com/issues/52441
The fix will be in 16.2.7.