On Thu, Feb 23, 2023 at 3:31 PM Kuhring, Mathias <mathias.kuhring@xxxxxxxxxxxxxx> wrote:
>
> Hey Ilya,
>
> I'm not sure whether the things I find in the logs are actually related
> or useful, and I'm not sure I'm looking in the right places.
>
> I enabled "debug_ms 1" for the OSDs as suggested above. But this filled
> up our host disks pretty fast, leading to e.g. monitors crashing. I
> disabled the debug messages again and trimmed the logs to free up space,
> but I kept copies of two OSD log files which were involved in another
> capability release / slow requests issue. They are quite big now (~3 GB),
> and even after removing things like the ping traffic I have more than a
> million lines just for the morning until the disk space was full (around
> 7 hours). So now I'm wondering how to filter for the right things here.
>
> When I grep for "error", I get a few of these messages:
>
> {"log":"debug 2023-02-22T06:18:08.113+0000 7f15c5fff700  1 -- [v2:192.168.1.13:6881/4149819408,v1:192.168.1.13:6884/4149819408] \u003c== osd.161 v2:192.168.1.31:6835/1012436344 182573 ==== pg_update_log_missing(3.1a6s2 epoch 646235/644895 rep_tid 1014320 entries 646235'7672108 (0'0) error 3:65836dde:::10016e9b7c8.00000000:head by mds.0.1221974:8515830 0.000000 -2 ObjectCleanRegions clean_offsets: [0~18446744073709551615], clean_omap: 1, new_object: 0 trim_to 646178'7662340 roll_forward_to 646192'7672106) v3 ==== 261+0+0 (crc 0 0 0) 0x562d55e52380 con 0x562d8a2de400\n","stream":"stderr","time":"2023-02-22T06:18:08.115002765Z"}
>
> And if I grep for "failed", I get a couple of those:
>
> {"log":"debug 2023-02-22T06:15:25.242+0000 7f58bbf7c700  1 -- [v2:172.16.62.11:6829/3509070161,v1:172.16.62.11:6832/3509070161] \u003e\u003e 172.16.62.10:0/3127362489 conn(0x55ba06bf3c00 msgr2=0x55b9ce07e580 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed\n","stream":"stderr","time":"2023-02-22T06:15:25.243808392Z"}
> {"log":"debug 2023-02-22T06:15:25.242+0000 7f58bbf7c700  1 --2- [v2:172.16.62.11:6829/3509070161,v1:172.16.62.11:6832/3509070161] \u003e\u003e 172.16.62.10:0/3127362489 conn(0x55ba06bf3c00 0x55b9ce07e580 crc :-1 s=READY pgs=2096664 cs=0 l=1 rev1=1 crypto rx=0 tx=0 comp rx=0 tx=0).handle_read_frame_preamble_main read frame preamble failed r=-1 ((1) Operation not permitted)\n","stream":"stderr","time":"2023-02-22T06:15:25.243813528Z"}
>
> I'm not sure whether they are related to the issue.
>
> In the kernel logs of the client (dmesg, journalctl or /var/log/messages),
> there seem to be no errors or stack traces in the relevant time periods.

Hi Mathias,

Then it's very unlikely to be a kernel client issue, meaning that you
don't need to worry about your kernel versions.

> The only thing I can see is our restart of the relevant OSDs:
>
> [Mi Feb 22 07:29:59 2023] libceph: osd90 down
> [Mi Feb 22 07:30:34 2023] libceph: osd90 up
> [Mi Feb 22 07:31:55 2023] libceph: osd93 down
> [Mi Feb 22 08:37:50 2023] libceph: osd93 up
>
> I also noticed a closed socket for another client, but I assume that's
> more related to the monitors failing due to full disks:
>
> [Mi Feb 22 05:59:52 2023] libceph: mon2 (1)172.16.62.12:6789 socket closed (con state OPEN)
> [Mi Feb 22 05:59:52 2023] libceph: mon2 (1)172.16.62.12:6789 session lost, hunting for new mon
> [Mi Feb 22 05:59:52 2023] libceph: mon3 (1)172.16.62.13:6789 session established

Yeah, these are expected when a monitor or an OSD goes down.

Thanks,

                Ilya
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
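
[Editor's note: for readers wondering, as Mathias was, how to cut multi-gigabyte debug_ms logs down to the interesting lines, here is a minimal sketch. The sample data, filename, and grep patterns are illustrative assumptions, not from the thread; the JSON container-log format matches the excerpts above.]

```shell
# Create a tiny sample in the same JSON container-log format as the excerpts:
cat > osd.sample.log <<'EOF'
{"log":"debug 2023-02-22T06:15:25.242+0000 ... read_until read failed\n","stream":"stderr"}
{"log":"debug 2023-02-22T06:18:08.113+0000 ... error 3:65836dde:::head ...\n","stream":"stderr"}
{"log":"debug 2023-02-22T06:20:00.000+0000 ... osd_ping heartbeat ...\n","stream":"stderr"}
EOF

# Drop heartbeat noise, restrict to the time window of interest,
# then keep only lines mentioning errors or failures:
grep -v 'osd_ping' osd.sample.log \
  | grep '"log":"debug 2023-02-22T06:' \
  | grep -E -i 'error|failed' \
  > osd.filtered.log

wc -l < osd.filtered.log
```

With the three-line sample above, the ping line is dropped and the two error/failed lines survive, so the count printed is 2; on a real 3 GB log the same pipeline typically reduces the volume by orders of magnitude before closer inspection.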