Re: 6.5 CephFS client - ceph_cap_reclaim_work [ceph] / ceph_con_workfn [libceph] hogged CPU

Xiubo Li <xiubli@xxxxxxxxxx> · Wed, 8 Nov 2023 14:45:25 +0800

On 11/8/23 14:41, Stefan Kooman wrote:
On 08-11-2023 01:33, Xiubo Li wrote:

On 9/18/23 20:42, Stefan Kooman wrote:
On 13-09-2023 16:49, Stefan Kooman wrote:
On 13-09-2023 14:58, Ilya Dryomov wrote:
On Wed, Sep 13, 2023 at 9:20 AM Stefan Kooman <stefan@xxxxxx> wrote:

Hi,

Since the 6.5 kernel addressed the issue with regards to 
regression in
the readahead handling code... we went ahead and installed this 
kernel
for a couple of mail / web clusters (Ubuntu 6.5.1-060501-generic
#202309020842 SMP PREEMPT_DYNAMIC Sat Sep  2 08:48:34 UTC 2023 
x86_64
x86_64 x86_64 GNU/Linux). Since then we occasionally see the 
following
being logged by the kernel:

[Sun Sep 10 07:19:00 2023] workqueue: delayed_work [ceph] hogged 
CPU for
   >10000us 4 times, consider switching to WQ_UNBOUND
[Sun Sep 10 08:41:24 2023] workqueue: ceph_con_workfn [libceph] 
hogged
CPU for >10000us 4 times, consider switching to WQ_UNBOUND
[Sun Sep 10 11:05:55 2023] workqueue: delayed_work [ceph] hogged 
CPU for
   >10000us 8 times, consider switching to WQ_UNBOUND
[Sun Sep 10 12:54:38 2023] workqueue: ceph_con_workfn [libceph] 
hogged
CPU for >10000us 8 times, consider switching to WQ_UNBOUND
[Sun Sep 10 19:06:37 2023] workqueue: ceph_con_workfn [libceph] 
hogged
CPU for >10000us 16 times, consider switching to WQ_UNBOUND
[Mon Sep 11 10:53:33 2023] workqueue: ceph_con_workfn [libceph] 
hogged
CPU for >10000us 32 times, consider switching to WQ_UNBOUND
[Tue Sep 12 10:14:03 2023] workqueue: ceph_con_workfn [libceph] 
hogged
CPU for >10000us 64 times, consider switching to WQ_UNBOUND
[Tue Sep 12 11:14:33 2023] workqueue: ceph_cap_reclaim_work [ceph]
hogged CPU for >10000us 4 times, consider switching to WQ_UNBOUND

We wonder if this is a new phenomenon, or that it's rather logged 
in the
new kernel and it was not before.

Hi Stefan,

This is something that wasn't logged in older kernels. The kernel
workqueue infrastructure is considering Ceph work items CPU intensive
and reports that in dmesg.  This is new in 6.5 kernel, the threshold
can be tweaked with workqueue.cpu_intensive_thresh_us parameter.

Thanks. I was just looking into it (WQ_UNBOUND), alloc_workqueue(), 
etc. The patch by Tejun Heo on workqueue also mentions this:

* Concurrency-managed per-cpu work items that hog CPUs and delay the
    execution of other work items are now automatically detected 
and excluded from concurrency management. Reporting on such work 
items can also be enabled through a config option.

This does imply that the Ceph work items are "excluded from 
concurrency management", is that correct? And if so, what does that 
mean in practice? Might this make the process of returning / 
claiming caps to the MDS slower?

In 6.6-rc1 more workqueue work is done and more fine tuning seems 
possible. If there are any recommendations from a cephfs kernel 
client perspective on what a good policy would be, we would love to 
hear about that.

For now we will just disable the detection 
(cpu_intensive_thresh_us=0) and see how it goes.

Well, the positive thing to mention is that we don't see this 
happening anymore. However, in the same time window as last week, an 
event happened. This time it was the MDS that got OOM killed. At the 
time it was killed it was consuming 249.9 GiB (03:51:33). Just 
before that it consumed ~ 200 GiB. It acquired ~ 50 GiB of ram in 35 
seconds (according to metrics). What MDS / client behaviour can 
trigger such an amount of increased memory usage?

There are several cases, for example when the pools are near full it 
could pile the journal logs in MDCache or due to the issues in 
https://github.com/ceph/ceph/pull/44180.

Did you see any warning from ceph status or any suspect logs from the 
MDS ?

We did for sure not hit a near full condition. We did observe a large 
amount of bandwidth associated with this kclient. The MDS is sending ~ 
650 Mb/s of traffic to this one client alone (so more than 80% of all 
MDS traffic goes to one client).

However, no ceph warnings or MDS logging prior to this event. The 
event starts, MDS continues to work ... until it's killed ~ half a 
minute later.

MDS log settings:

    "debug_mds": "0/0",
    "debug_mds_balancer": "1/5",
    "debug_mds_locker": "1/5",
    "debug_mds_log": "1/5",
    "debug_mds_log_expire": "1/5",
    "debug_mds_migrator": "1/5",

Note that we do not see any anomaly regarding CephFS (meta)data read / 
write traffic. So no excessive writes to a CephFS metadata pool for 
example.

Note, that after we downgraded the kclients again (5.12.19) we did not 
observe this behavior. So we might be able to reproduce, but we rather 
don't want to ;-).

Okay, BTW how man active MDSs you are running ?

Thanks

- Xiubo

Gr. Stefan

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx