Daily failed capability releases, slow ops, fully stuck IO

"Kuhring, Mathias" <mathias.kuhring@xxxxxxxxxxxxxx> · Fri, 24 Feb 2023 10:29:44 +0000

Dear Ceph community,

since about two or three weeks, we have CephFS clients regularly failing 
to respond to capability releases accompanied OSD slow ops. By now, this 
happens daily every time clients get more active (e.g. during nightly 
backups).

We mostly observe it with a handful of highly active clients, so 
correlating with IO volume. But we have over 250 clients which mount the 
CephFS and plan to get them all more active soon. What's worrying me 
further, it doesn't seem to effect only the clients which fail to 
respond to the capability release. But also other clients get just stuck 
accessing data on the CephFS.

So far I've been tracking down the corresponding OSDs via the client 
(`cat /sys/kernel/debug/ceph/*/osdc`) and restarted them one by one. But 
since this is now a regular/systemic issue, this is obviously no 
sustainable solution. This would be usually a handful of OSDs per client 
and I couldn't observe any particular pattern of involved OSDs, yet.

Our cluster still runs on CentOS 7 with kernel 
3.10.0-1160.42.2.el7.x86_64 using cephadm with ceph version 17.2.1 
(ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable).

Most active clients are currently on kernel versions such as:
4.18.0-348.el8.0.2.x86_64, 4.18.0-348.2.1.el8_5.x86_64, 
4.18.0-348.7.1.el8_5.x86_64

I picked up some logging ideas from an older issue with similar symptoms:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/CKTIM6LF274RVHSSCSDNCQR35PYSTLEK/
But this has been already fixed in the kernel client and I don't have 
similar things in the logs.

But also I'm not sure if the things I digging up in the logs are 
actually useful. And if I'm actually looking in the right places.

So, I enabled "debug_ms 1" for the OSDs as suggested in the other thread.
But this filled up our host disks pretty fast, leading to e.g. monitors 
crashing.
I disabled the debug messages again and trimmed logs to free up space.
But I made copies of two OSD logs files which were involved in a 
capability release / slow requests issue.
They are quite big now (~3GB) and even if I remove things like ping stuff,
I have more than 1 million lines just for the morning until the disk 
space was full (around 7 hours).
So now I'm wondering how to filter/look for the right things here.

When I grep for "error", I get a few of these messages:
{"log":"debug 2023-02-22T06:18:08.113+0000 7f15c5fff700  1 -- 
[v2:192.168.1.13:6881/4149819408,v1:192.168.1.13:6884/4149819408] 
\u003c== osd.161 v2:192.168.1.31:6835/1012436344 182573 ==== 
pg_update_log_missing(3.1a6s2 epoch 646235/644895 rep_tid 1014320 
entries 646235'7672108 (0'0) error 
3:65836dde:::10016e9b7c8.00000000:head by mds.0.1221974:8515830 0.000000 
-2 ObjectCleanRegions clean_offsets: [0~18446744073709551615], 
clean_omap: 1, new_object: 0 trim_to 646178'7662340 roll_forward_to 
646192'7672106) v3 ==== 261+0+0 (crc 0 0 0) 0x562d55e52380 con 
0x562d8a2de400\n","stream":"stderr","time":"2023-02-22T06:18:08.115002765Z"}

And if I grep for "failed", I get a couple of those:
{"log":"debug 2023-02-22T06:15:25.242+0000 7f58bbf7c700  1 -- 
[v2:172.16.62.11:6829/3509070161,v1:172.16.62.11:6832/3509070161] 
\u003e\u003e 172.16.62.10:0/3127362489 conn(0x55ba06bf3c00 
msgr2=0x55b9ce07e580 crc :-1 s=STATE_CONNECTION_ESTABLISHED 
l=1).read_until read 
failed\n","stream":"stderr","time":"2023-02-22T06:15:25.243808392Z"}
{"log":"debug 2023-02-22T06:15:25.242+0000 7f58bbf7c700  1 --2- 
[v2:172.16.62.11:6829/3509070161,v1:172.16.62.11:6832/3509070161] 
\u003e\u003e 172.16.62.10:0/3127362489 conn(0x55ba06bf3c00 
0x55b9ce07e580 crc :-1 s=READY pgs=2096664 cs=0 l=1 rev1=1 crypto rx=0 
tx=0 comp rx=0 tx=0).handle_read_frame_preamble_main read frame preamble 
failed r=-1 ((1) Operation not 
permitted)\n","stream":"stderr","time":"2023-02-22T06:15:25.243813528Z"}

Not sure, if they are related to the issue.

In the kernel logs of the client (dmesg, journalctl or /var/log/messages),
there seem to be no errors or any stack traces in the relevant time periods.
The only thing I can see is me restarting the relevant OSDs:
[Mi Feb 22 07:29:59 2023] libceph: osd90 down
[Mi Feb 22 07:30:34 2023] libceph: osd90 up
[Mi Feb 22 07:31:55 2023] libceph: osd93 down
[Mi Feb 22 08:37:50 2023] libceph: osd93 up

I noticed a socket closed for another client, but I assume that's more 
related to monitors failing due to full disks:
[Mi Feb 22 05:59:52 2023] libceph: mon2 (1)172.16.62.12:6789 socket 
closed (con state OPEN)
[Mi Feb 22 05:59:52 2023] libceph: mon2 (1)172.16.62.12:6789 session 
lost, hunting for new mon
[Mi Feb 22 05:59:52 2023] libceph: mon3 (1)172.16.62.13:6789 session 
established

I would appreciate if anybody has a suggestion where I should look next.
Thank you for your help

Best Wishes,
Mathias
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx