Hey Ilya, I'm not sure if the things I find in the logs are actually anything related or useful. But I'm not really sure, if I'm looking in the right places. I enabled "debug_ms 1" for the OSDs as suggested above. But this filled up our host disks pretty fast, leading to e.g. monitors crashing. I disabled the debug messages again and trimmed logs to free up space. But I made copies of two OSD logs files which were involved in another capability release / slow requests issue. They are quite big now (~3GB) and even if I remove things like ping stuff, I have more than 1 million lines just for the morning until the disk space was full (around 7 hours). So now I'm wondering how to filter/look for the right things here. When I grep for "error", I get a few of these messages: {"log":"debug 2023-02-22T06:18:08.113+0000 7f15c5fff700 1 -- [v2:192.168.1.13:6881/4149819408,v1:192.168.1.13:6884/4149819408] \u003c== osd.161 v2:192.168.1.31:6835/1012436344 182573 ==== pg_update_log_missing(3.1a6s2 epoch 646235/644895 rep_tid 1014320 entries 646235'7672108 (0'0) error 3:65836dde:::10016e9b7c8.00000000:head by mds.0.1221974:8515830 0.000000 -2 ObjectCleanRegions clean_offsets: [0~18446744073709551615], clean_omap: 1, new_object: 0 trim_to 646178'7662340 roll_forward_to 646192'7672106) v3 ==== 261+0+0 (crc 0 0 0) 0x562d55e52380 con 0x562d8a2de400\n","stream":"stderr","time":"2023-02-22T06:18:08.115002765Z"} And if I grep for "failed", I get a couple of those: {"log":"debug 2023-02-22T06:15:25.242+0000 7f58bbf7c700 1 -- [v2:172.16.62.11:6829/3509070161,v1:172.16.62.11:6832/3509070161] \u003e\u003e 172.16.62.10:0/3127362489 conn(0x55ba06bf3c00 msgr2=0x55b9ce07e580 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed\n","stream":"stderr","time":"2023-02-22T06:15:25.243808392Z"} {"log":"debug 2023-02-22T06:15:25.242+0000 7f58bbf7c700 1 --2- [v2:172.16.62.11:6829/3509070161,v1:172.16.62.11:6832/3509070161] \u003e\u003e 172.16.62.10:0/3127362489 conn(0x55ba06bf3c00 0x55b9ce07e580 crc :-1 s=READY pgs=2096664 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)\n","stream":"stderr","time":"2023-02-22T06:15:25.243813528Z"} Not sure, if they are related to the issue. In the kernel logs of the client (dmesg, journalctl or /var/log/messages), there seem to be no errors or any stack traces in the relevant time periods. The only thing I can see is our restart of the relevant OSDs: [Mi Feb 22 07:29:59 2023] libceph: osd90 down [Mi Feb 22 07:30:34 2023] libceph: osd90 up [Mi Feb 22 07:31:55 2023] libceph: osd93 down [Mi Feb 22 08:37:50 2023] libceph: osd93 up I noticed a socket closed for another client, but I assume that's more related to monitors failing due to full disks: [Mi Feb 22 05:59:52 2023] libceph: mon2 (1)172.16.62.12:6789 socket closed (con state OPEN) [Mi Feb 22 05:59:52 2023] libceph: mon2 (1)172.16.62.12:6789 session lost, hunting for new mon [Mi Feb 22 05:59:52 2023] libceph: mon3 (1)172.16.62.13:6789 session established Best, Mathias On 2/21/2023 11:42 AM, Ilya Dryomov wrote: On Tue, Feb 21, 2023 at 1:01 AM Xiubo Li <xiubli@xxxxxxxxxx><mailto:xiubli@xxxxxxxxxx> wrote: On 20/02/2023 22:28, Kuhring, Mathias wrote: Hey Dan, hey Ilya I know this issue is two years old already, but we are having similar issues. Do you know, if the fixes got ever backported to RHEL kernels? It's already backported to RHEL 8 long time ago since kernel-4.18.0-154.el8. Not looking for el7 but rather el8 fixes. Wondering if the patches were backported and we shouldn't actually see these issues. Or if you could maybe resolve them with a kernel upgrade. Most active clients are currently on kernel versions such as: 4.18.0-348.el8.0.2.x86_64, 4.18.0-348.2.1.el8_5.x86_64, 4.18.0-348.7.1.el8_5.x86_64 While the cluster runs with kernel 3.10.0-1160.42.2.el7.x86_64 and cephadm with ceph version 17.2.1 (ec95624474b1871a821a912b8c3af68f8f8e7aa1) quincy (stable). It seems not backported to el7 yet. "Yet" might be misleading here -- I don't think there is/was ever a plan to backport these fixes to RHEL 7. Not sure, if the cluster kernel is actually relevant here for OSD <> kernel client connection. If you are seeing page allocation failures only on the kernel client nodes, then it's not relevant. Unless the stack trace is the same as in the original tracker [1] or Dan's paste [2] (note ceph_osdmap_decode() -> osdmap_set_max_osd() -> krealloc() sequence), you are hitting a different issue. Pasting the entire splat(s) from the kernel log would be a good start. [1] https://tracker.ceph.com/issues/40481 [2] https://pastebin.com/neyah54k Thanks, Ilya -- Mathias Kuhring Dr. rer. nat. Bioinformatician HPC & Core Unit Bioinformatics Berlin Institute of Health at Charité (BIH) E-Mail: mathias.kuhring@xxxxxxxxxxxxxx<mailto:mathias.kuhring@xxxxxxxxxxxxxx> Mobile: +49 172 3475576 _______________________________________________ ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an email to ceph-users-leave@xxxxxxx