Re: 6.5 CephFS client - ceph_cap_reclaim_work [ceph] / ceph_con_workfn [libceph] hogged CPU

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 14-09-2023 03:27, Xiubo Li wrote:

< ----- snip ------>

Hi Stefan,

Yeah, as I remembered before I have seen something like this only once in the cephfs qa tests together with other issues, but I just thought it wasn't the root cause so I didn't spent time on it.

Just went through the kernel ceph code, such as for 'ceph_cap_reclaim_work' workqueue, it will take the spin lock and if there are too many dentries and directories need to be interated it may hog the cpu for a long time. Anyway this can be improved.

These clients have between 400K - 550K CAPS, so quite a bit. It would be great if this can be improved upon. Not sure if the client is a bottleneck for releasing caps in this case. IIRC the MDS internally throttles the amount of caps it can process (per session). Anyway it would be good if the client is not the bottleneck.


However, we have hit a few OOM situations since we switched to the new
kernel because of ceph_cap_reclaim_work events (OOM is because Apache
threads keep piling up as it cannot access CephFS). We then also see MDS
slow ops reported. This might be related to a backup job that is running
on a backup server. We did not observe this behavior on 5.12.19 kernel.

We haven't touch the reclaim related code for a long time. I just suspect the backup slowed down the perf.

BTW, do you have the MDS logs about the slow requests ? What are they ?

Sure, here they are:

2023-09-10T03:55:26.057+0200 7f4136001700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 71.454548 secs 2023-09-10T03:55:26.057+0200 7f4136001700 0 log_channel(cluster) log [WRN] : slow request 32.863580 seconds old, received at 2023-09-10T03:54:53.197778+0200: client_request(client.5352998631:4256989 create #0x1000047707e/temp-write-test-64fd21ed302df3-60967340 2023-09-1 0T03:54:53.190783+0200 caller_uid=3774, caller_gid=3774{3774,}) currently failed to wrlock, waiting 2023-09-10T03:55:31.057+0200 7f4136001700 0 log_channel(cluster) log [WRN] : 2 slow requests, 1 included below; oldest blocked for > 76.454620 secs 2023-09-10T03:55:31.057+0200 7f4136001700 0 log_channel(cluster) log [WRN] : slow request 33.468639 seconds old, received at 2023-09-10T03:54:57.592791+0200: client_request(client.534532764 7:6092614 mkdir #0x1031e9171d1/2 2023-09-10T03:54:40.421282+0200 caller_uid=3747, caller_gid=3747{3747,}) currently failed to rdlock, waiting 2023-09-10T03:55:36.057+0200 7f4136001700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 81.454678 secs 2023-09-10T03:55:46.058+0200 7f4136001700 0 log_channel(cluster) log [WRN] : 6 slow requests, 5 included below; oldest blocked for > 91.454864 secs 2023-09-10T03:55:46.058+0200 7f4136001700 0 log_channel(cluster) log [WRN] : slow request 43.481559 seconds old, received at 2023-09-10T03:55:02.580113+0200: client_request(client.5352998631:4257081 create #0x100003d98e2/temp-write-test-64fd21f68d9407-10132291 2023-09-10T03:55:02.575054+0200 caller_uid=3588, caller_gid=3588{3588,}) currently failed to wrlock, waiting 2023-09-10T03:55:46.058+0200 7f4136001700 0 log_channel(cluster) log [WRN] : slow request 39.773875 seconds old, received at 2023-09-10T03:55:06.287798+0200: client_request(client.5345383113:3667543 lookup #0x100cdc56321/config.php 2023-09-10T03:55:06.284700+0200 caller_uid=3736, caller_gid=3736{3736,}) currently failed to rdlock, waiting 2023-09-10T03:55:46.058+0200 7f4136001700 0 log_channel(cluster) log [WRN] : slow request 46.176989 seconds old, received at 2023-09-10T03:54:59.884683+0200: client_request(client.5349764075:3342783 getattr As #0x102c1ca48fc 2023-09-10T03:54:59.880459+0200 caller_uid=4093, caller_gid=4093{4093,}) currently failed to rdlock, waiting 2023-09-10T03:55:46.058+0200 7f4136001700 0 log_channel(cluster) log [WRN] : slow request 35.393109 seconds old, received at 2023-09-10T03:55:10.668564+0200: client_request(client.5345383113:3667575 lookup #0x100022c947b/some-file.php 2023-09-10T03:55:10.664810+0200 caller_uid=3747, caller_gid=3747{3747,}) currently failed to rdlock, waiting 2023-09-10T03:55:46.058+0200 7f4136001700 0 log_channel(cluster) log [WRN] : slow request 91.454863 seconds old, received at 2023-09-10T03:54:14.606810+0200: client_request(client.5337202717:50413925 lookup #0x1031c61953d/some-javascript-file.js 2023-09-10T03:54:14.603107+0200 caller_uid=0, caller_gid=0{0,}) currently failed to rdlock, waiting 2023-09-10T03:56:51.055+0200 7f4136001700 0 log_channel(cluster) log [WRN] : 2 slow requests, 2 included below; oldest blocked for > 156.455763 secs 2023-09-10T03:56:51.055+0200 7f4136001700 0 log_channel(cluster) log [WRN] : slow request 34.015403 seconds old, received at 2023-09-10T03:56:17.047169+0200: client_request(client.5349764075:3343607 create #0x100aa849bdb/temp-write-test-64fd22410b6b45-35598524 2023-09-10T03:56:17.046403+0200 caller_uid=3983, caller_gid=3983{3983,}) currently failed to wrlock, waiting 2023-09-10T03:56:51.055+0200 7f4136001700 0 log_channel(cluster) log [WRN] : slow request 55.242778 seconds old, received at 2023-09-10T03:55:55.819794+0200: client_request(client.5345383113:3667692 getattr AsLsXsFs #0x100022c9480 2023-09-10T03:55:55.813948+0200 caller_uid=3747, caller_gid=3747{3747,}) currently dispatched 2023-09-10T03:56:56.055+0200 7f4136001700 0 log_channel(cluster) log [WRN] : 2 slow requests, 1 included below; oldest blocked for > 161.455822 secs 2023-09-10T03:56:56.055+0200 7f4136001700 0 log_channel(cluster) log [WRN] : slow request 60.242837 seconds old, received at 2023-09-10T03:55:55.819794+0200: client_request(client.5345383113:3667692 getattr AsLsXsFs #0x100022c9480 2023-09-10T03:55:55.813948+0200 caller_uid=3747, caller_gid=3747{3747,}) currently dispatched 2023-09-10T03:56:56.135+0200 7f4136001700 0 log_channel(cluster) log [WRN] : client.5345327647 isn't responding to mclientcaps(revoke), ino 0x1031e0642cf pending pAsLsXsxFsxcrwb issued pAsxLsXsxFsxcrwb, sent 60.318560 seconds ago 2023-09-10T03:56:56.135+0200 7f4136001700 0 log_channel(cluster) log [WRN] : client.5345327647 isn't responding to mclientcaps(revoke), ino 0x1027e766cd6 pending pAsLsXsxFsxcrwb issued pAsxLsXsxFsxcrwb, sent 60.318543 seconds ago 2023-09-10T03:56:56.135+0200 7f4136001700 0 log_channel(cluster) log [WRN] : client.5345327647 isn't responding to mclientcaps(revoke), ino 0x100022c9480 pending pAsLsXs issued pAsLsXsFs, sent 60.317304 seconds ago 2023-09-10T03:56:56.135+0200 7f4136001700 0 log_channel(cluster) log [WRN] : client.5345327647 isn't responding to mclientcaps(revoke), ino 0x1027e1fc2e5 pending pAsLsXs issued pAsLsXsFsc, sent 60.311698 seconds ago 2023-09-10T03:57:01.056+0200 7f4136001700 0 log_channel(cluster) log [WRN] : 2 slow requests, 0 included below; oldest blocked for > 166.455888 secs 2023-09-10T03:57:01.112+0200 7f4136001700 0 log_channel(cluster) log [WRN] : evicting unresponsive client the-ubuntu-jammy-webserver-with-6.5-kernel (5345327647), after 85.5223 seconds 2023-09-10T03:57:01.112+0200 7f4136001700 0 log_channel(cluster) log [INF] : Evicting (and blocklisting) client session 5345327647 (v1:[some:ip:v6:addr::ess]:0/2080458178) 2023-09-10T03:57:01.116+0200 7f4136001700 0 log_channel(cluster) log [WRN] : client.5345327647 isn't responding to mclientcaps(revoke), ino 0x1031c610d5a pending pAsLsXsxFsxcrwb issued pAsxLsXsxFsxcrwb, sent 62.595705 seconds ago 2023-09-10T06:32:02.438+0200 7f4136001700 0 log_channel(cluster) log [WRN] : client.3470094190 isn't responding to mclientcaps(revoke), ino 0x1031e1bb30f pending pAsLsXsFr issued pAsLsXsFscr, sent 62.186458 seconds ago

The "temp-write-test-string" write requests are apparently from Wordpress sites that perform "let's test if I can write" requests.

Thanks,

Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



[Index of Archives]     [Information on CEPH]     [Linux Filesystem Development]     [Ceph Development]     [Ceph Large]     [Ceph Dev]     [Linux USB Development]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [xfs]


  Powered by Linux