On Thu, Feb 23, 2023 at 3:31 PM Kuhring, Mathias <mathias.kuhring@xxxxxxxxxxxxxx> wrote:
>
> Hey Ilya,
>
> I'm not sure whether the things I find in the logs are actually related
> or useful, and I'm not sure I'm looking in the right places.
>
> I enabled "debug_ms 1" for the OSDs as suggested above. But this filled
> up our host disks pretty fast, leading to e.g. monitors crashing. I
> disabled the debug messages again and trimmed the logs to free up space,
> but I kept copies of two OSD log files which were involved in another
> capability release / slow requests issue. They are quite big now (~3 GB),
> and even after removing things like the ping traffic I have more than a
> million lines just for the morning until the disk space was full (around
> 7 hours). So now I'm wondering how to filter for the right things here.
>
> When I grep for "error", I get a few of these messages:
>
> {"log":"debug 2023-02-22T06:18:08.113+0000 7f15c5fff700  1 -- [v2:192.168.1.13:6881/4149819408,v1:192.168.1.13:6884/4149819408] \u003c== osd.161 v2:192.168.1.31:6835/1012436344 182573 ==== pg_update_log_missing(3.1a6s2 epoch 646235/644895 rep_tid 1014320 entries 646235'7672108 (0'0) error 3:65836dde:::10016e9b7c8.00000000:head by mds.0.1221974:8515830 0.000000 -2 ObjectCleanRegions clean_offsets: [0~18446744073709551615], clean_omap: 1, new_object: 0 trim_to 646178'7662340 roll_forward_to 646192'7672106) v3 ==== 261+0+0 (crc 0 0 0) 0x562d55e52380 con 0x562d8a2de400\n","stream":"stderr","time":"2023-02-22T06:18:08.115002765Z"}
>
> And if I grep for "failed", I get a couple of those:
>
> {"log":"debug 2023-02-22T06:15:25.242+0000 7f58bbf7c700  1 -- [v2:172.16.62.11:6829/3509070161,v1:172.16.62.11:6832/3509070161] \u003e\u003e 172.16.62.10:0/3127362489 conn(0x55ba06bf3c00 msgr2=0x55b9ce07e580 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed\n","stream":"stderr","time":"2023-02-22T06:15:25.243808392Z"}
> {"log":"debug 2023-02-22T06:15:25.242+0000 7f58bbf7c700  1 --2- [v2:172.16.62.11:6829/3509070161,v1:172.16.62.11:6832/3509070161] \u003e\u003e 172.16.62.10:0/3127362489 conn(0x55ba06bf3c00 0x55b9ce07e580 crc :-1 s=READY pgs=2096664 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)\n","stream":"stderr","time":"2023-02-22T06:15:25.243813528Z"}
>
> I'm not sure whether they are related to the issue.
>
> In the kernel logs of the client (dmesg, journalctl or /var/log/messages),
> there seem to be no errors or stack traces in the relevant time periods.

Hi Mathias,

Then it's very unlikely to be a kernel client issue, meaning that you
don't need to worry about your kernel versions.

> The only thing I can see is our restart of the relevant OSDs:
>
> [Mi Feb 22 07:29:59 2023] libceph: osd90 down
> [Mi Feb 22 07:30:34 2023] libceph: osd90 up
> [Mi Feb 22 07:31:55 2023] libceph: osd93 down
> [Mi Feb 22 08:37:50 2023] libceph: osd93 up
>
> I also noticed a closed socket for another client, but I assume that's
> more related to the monitors failing due to full disks:
>
> [Mi Feb 22 05:59:52 2023] libceph: mon2 (1)172.16.62.12:6789 socket closed (con state OPEN)
> [Mi Feb 22 05:59:52 2023] libceph: mon2 (1)172.16.62.12:6789 session lost, hunting for new mon
> [Mi Feb 22 05:59:52 2023] libceph: mon3 (1)172.16.62.13:6789 session established

Yeah, these are expected when a monitor or an OSD goes down.

Thanks,

                Ilya
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
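
[Editor's note: for readers wondering, as Mathias was, how to cut multi-gigabyte debug_ms logs down to the interesting lines, here is a minimal sketch. The sample data, filename, and grep patterns are illustrative assumptions, not from the thread; the JSON container-log format matches the excerpts above.]

```shell
# Create a tiny sample in the same JSON container-log format as the excerpts:
cat > osd.sample.log <<'EOF'
{"log":"debug 2023-02-22T06:15:25.242+0000 ... read_until read failed\n","stream":"stderr"}
{"log":"debug 2023-02-22T06:18:08.113+0000 ... error 3:65836dde:::head ...\n","stream":"stderr"}
{"log":"debug 2023-02-22T06:20:00.000+0000 ... osd_ping heartbeat ...\n","stream":"stderr"}
EOF

# Drop heartbeat noise, restrict to the time window of interest,
# then keep only lines mentioning errors or failures:
grep -v 'osd_ping' osd.sample.log \
  | grep '"log":"debug 2023-02-22T06:' \
  | grep -E -i 'error|failed' \
  > osd.filtered.log

wc -l < osd.filtered.log
```

With the three-line sample above, the ping line is dropped and the two error/failed lines survive, so the count printed is 2; on a real 3 GB log the same pipeline typically reduces the volume by orders of magnitude before closer inspection.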