Dear Ceph community,

For about two to three weeks now we have had CephFS clients regularly failing to respond to capability releases, accompanied by OSD slow ops. By now this happens daily, every time clients become more active (e.g. during nightly backups). We mostly observe it with a handful of highly active clients, so it correlates with IO volume. However, we have over 250 clients that mount the CephFS and we plan to make all of them more active soon. What worries me further is that it doesn't seem to affect only the clients that fail to respond to the capability release; other clients also simply get stuck accessing data on the CephFS.

So far I've been tracking down the corresponding OSDs via the client (`cat /sys/kernel/debug/ceph/*/osdc`) and restarting them one by one. But since this is now a regular/systemic issue, that is obviously not a sustainable solution. It is usually a handful of OSDs per client, and I haven't been able to spot any particular pattern in the OSDs involved yet.

Our cluster still runs on CentOS 7 with kernel 3.10.0-1160.42.2.el7.x86_64, using cephadm with ceph version 17.2.1 (ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable). The most active clients are currently on kernel versions such as:

4.18.0-348.el8.0.2.x86_64
4.18.0-348.2.1.el8_5.x86_64
4.18.0-348.7.1.el8_5.x86_64

I picked up some logging ideas from an older issue with similar symptoms:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/CKTIM6LF274RVHSSCSDNCQR35PYSTLEK/
But that one has already been fixed in the kernel client, and I don't see similar messages in our logs. I'm also not sure whether the things I'm digging up in the logs are actually useful, or whether I'm even looking in the right places.

So I enabled "debug_ms 1" for the OSDs as suggested in the other thread. But this filled up our host disks pretty fast, leading e.g. to monitors crashing. I disabled the debug messages again and trimmed the logs to free up space, but I kept copies of two OSD log files that were involved in a capability release / slow requests issue. They are quite big now (~3 GB), and even after removing things like the ping traffic I still have more than a million lines just for the morning until the disk space was full (around 7 hours). So now I'm wondering how to filter/look for the right things here.
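In case it helps to see what I have tried so far: this is roughly how I've been slicing the copied logs. It's only a sketch; "osd.90.log" stands for one of the copied container log files, 172.16.62.10 is just an example client address from the excerpts below, and the grep patterns are guesses at what might be relevant:

  # Pull the raw message text out of the container JSON lines (the "log" field),
  # drop the osd_ping heartbeat noise, and keep only traffic involving one client.
  jq -r '.log' osd.90.log \
    | grep -v 'osd_ping' \
    | grep '172.16.62.10:0/' \
    > osd.90.client.log

  # Then look for operations that might be stuck or failing in what remains.
  grep -n -e 'osd_op(' -e 'slow request' -e 'failed' -e 'error' osd.90.client.log | less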
When I grep for "error", I get a few messages like this:

{"log":"debug 2023-02-22T06:18:08.113+0000 7f15c5fff700 1 -- [v2:192.168.1.13:6881/4149819408,v1:192.168.1.13:6884/4149819408] \u003c== osd.161 v2:192.168.1.31:6835/1012436344 182573 ==== pg_update_log_missing(3.1a6s2 epoch 646235/644895 rep_tid 1014320 entries 646235'7672108 (0'0) error 3:65836dde:::10016e9b7c8.00000000:head by mds.0.1221974:8515830 0.000000 -2 ObjectCleanRegions clean_offsets: [0~18446744073709551615], clean_omap: 1, new_object: 0 trim_to 646178'7662340 roll_forward_to 646192'7672106) v3 ==== 261+0+0 (crc 0 0 0) 0x562d55e52380 con 0x562d8a2de400\n","stream":"stderr","time":"2023-02-22T06:18:08.115002765Z"}

And if I grep for "failed", I get a couple of these:

{"log":"debug 2023-02-22T06:15:25.242+0000 7f58bbf7c700 1 -- [v2:172.16.62.11:6829/3509070161,v1:172.16.62.11:6832/3509070161] \u003e\u003e 172.16.62.10:0/3127362489 conn(0x55ba06bf3c00 msgr2=0x55b9ce07e580 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed\n","stream":"stderr","time":"2023-02-22T06:15:25.243808392Z"}

{"log":"debug 2023-02-22T06:15:25.242+0000 7f58bbf7c700 1 --2- [v2:172.16.62.11:6829/3509070161,v1:172.16.62.11:6832/3509070161] \u003e\u003e 172.16.62.10:0/3127362489 conn(0x55ba06bf3c00 0x55b9ce07e580 crc :-1 s=READY pgs=2096664 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)\n","stream":"stderr","time":"2023-02-22T06:15:25.243813528Z"}

I'm not sure whether these are related to the issue.

In the kernel logs of the client (dmesg, journalctl or /var/log/messages) there seem to be no errors or stack traces in the relevant time periods. The only thing I can see is me restarting the relevant OSDs:

[Mi Feb 22 07:29:59 2023] libceph: osd90 down
[Mi Feb 22 07:30:34 2023] libceph: osd90 up
[Mi Feb 22 07:31:55 2023] libceph: osd93 down
[Mi Feb 22 08:37:50 2023] libceph: osd93 up

I noticed a "socket closed" for another client, but I assume that is more related to the monitors failing due to full disks:

[Mi Feb 22 05:59:52 2023] libceph: mon2 (1)172.16.62.12:6789 socket closed (con state OPEN)
[Mi Feb 22 05:59:52 2023] libceph: mon2 (1)172.16.62.12:6789 session lost, hunting for new mon
[Mi Feb 22 05:59:52 2023] libceph: mon3 (1)172.16.62.13:6789 session established

I would appreciate any suggestions on where to look next.

Thank you for your help.

Best wishes,
Mathias

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx