@Marius: no swap at all. I would rather buy more memory than use swap :)

On Sun, 4 Dec 2022 at 20:10, Marius Leustean <marius.leus@xxxxxxxxx> wrote:

> Hi Boris,
>
> Do you have swap enabled on any of the OSD hosts? That may slow down
> RocksDB drastically.
>
> On Sun, Dec 4, 2022 at 8:59 PM Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>
> wrote:
>
>> Hi Boris,
>>
>> These waits seem to be all over the place. Usually, in the main ceph.log,
>> you see "implicated OSD" messages - I would try to find some commonality
>> with either a host, a switch, or something like that. It can be bad
>> ports/NICs, LACP problems, sometimes even bad cables. I try to isolate
>> the area that is problematic, sometimes by rebooting OSD hosts one at a
>> time, or by rebooting switches (if stacked/MLAG) one at a time. Something
>> has to be there that makes the problem go away.
>> --
>> Alex Gorbachev
>> https://alextelescope.blogspot.com
>>
>>
>> On Sun, Dec 4, 2022 at 6:08 AM Boris Behrens <bb@xxxxxxxxx> wrote:
>>
>> > Hi Alex,
>> > I am searching for a log line that points me in the right direction.
>> > From what I have seen, I could not find a specific host, OSD, or PG
>> > that was leading to this problem. But maybe I am looking at the wrong
>> > logs.
>> >
>> > I have around 150k lines that look like this:
>> > ceph.log.timeframe:2022-12-02T18:19:59.877920+0100 osd.122 (osd.122)
>> > 5195 : cluster [WRN] 14 slow requests (by type [ 'delayed' : 2
>> > 'waiting for sub ops' : 12 ] most affected pool [ 'rbd' : 14 ])
>> > ceph.log.timeframe:2022-12-02T18:19:59.905505+0100 osd.118 (osd.118)
>> > 21011 : cluster [WRN] 256 slow requests (by type [ 'delayed' : 243
>> > 'waiting for sub ops' : 13 ] most affected pool [ 'rbd' : 256 ])
>> > ceph.log.timeframe:2022-12-02T18:19:59.928599+0100 osd.120 (osd.120)
>> > 19800 : cluster [WRN] 71 slow requests (by type [ 'delayed' : 15
>> > 'waiting for sub ops' : 56 ] most affected pool [ 'rbd' : 71 ])
>> > ceph.log.timeframe:2022-12-02T18:19:59.968535+0100 osd.54 (osd.54)
>> > 6960 : cluster [WRN] 38 slow requests (by type [ 'delayed' : 21
>> > 'waiting for sub ops' : 17 ] most affected pool [ 'rbd' : 38 ])
>> > ceph.log.timeframe:2022-12-02T18:19:59.973174+0100 osd.97 (osd.97)
>> > 16792 : cluster [WRN] 19 slow requests (by type [ 'delayed' : 11
>> > 'waiting for sub ops' : 8 ] most affected pool [ 'rbd' : 19 ])
>> > ceph.log.timeframe:2022-12-02T18:19:59.978565+0100 osd.42 (osd.42)
>> > 5724 : cluster [WRN] 12 slow requests (by type [ 'delayed' : 5
>> > 'waiting for sub ops' : 7 ] most affected pool [ 'rbd' : 12 ])
>> > ceph.log.timeframe:2022-12-02T18:19:59.980684+0100 osd.98 (osd.98)
>> > 18471 : cluster [WRN] 35 slow requests (by type [ 'delayed' : 3
>> > 'waiting for sub ops' : 32 ] most affected pool [ 'rbd' : 35 ])
>> > ceph.log.timeframe:2022-12-02T18:19:59.992514+0100 osd.77 (osd.77)
>> > 11319 : cluster [WRN] 256 slow requests (by type [ 'delayed' : 232
>> > 'waiting for sub ops' : 24 ] most affected pool [ 'rbd' : 256 ])
>> >
>> > and around 50k lines that look like this:
>> > ceph-osd.99.log.timeframe:2022-12-02T18:19:59.605+0100 7ff8f96ba700 -1
>> > osd.99 945870 get_health_metrics reporting 9 slow ops, oldest is
>> > osd_op(client.171194478.0:4862294 8.cf5
>> > 8:af34e5b1:::rbd_header.47d6a06b8b4567:head [watch ping cookie
>> > 18446462598732840961 gen 26] snapc 0=[]
>> > ondisk+write+known_if_redirected e945870)
>> > ceph-osd.92.log.timeframe:2022-12-02T18:14:57.415+0100 7f9e8e4fd700 -1
>> > osd.92 945870 get_health_metrics reporting 6 slow ops, oldest is
>> > osd_op(client.177840485.0:141305 8.159f
>> > 8:f9adda1f:::rbd_data.82f60d356b4e4a.000000000000a1c2:head [write
>> > 1900544~147456 in=147456b] snapc 0=[]
>> > ondisk+write+known_if_redirected e945868)
>> >
>> > Cheers
>> > Boris
>> >
>> > On Sun, 4 Dec 2022 at 03:15, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>
>> > wrote:
>> >
>> >> Boris, I have seen one problematic OSD cause this issue on all OSDs
>> >> with which its PGs peered. The solution was to take out the slow OSD;
>> >> immediately, all slow ops stopped. I found it by observing common
>> >> OSDs in the reported slow ops. Not saying this is your issue, but it
>> >> may be a possibility. Good luck!
>> >>
>> >> --
>> >> Alex Gorbachev
>> >> https://alextelescope.blogspot.com
>> >
>> > --
>> > The self-help group "UTF-8 problems" will, as an exception, meet in
>> > the large hall this time.

--
The self-help group "UTF-8 problems" will, as an exception, meet in the
large hall this time.
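
A minimal sketch of the aggregation Alex suggests, assuming the grepped
ceph.log lines shown above are saved to a file such as ceph.log.timeframe
(the file name, the regexes, and the top-20 cutoff are illustrative
assumptions, not anything taken from this thread):

#!/usr/bin/env python3
"""Sum "slow requests" cluster-log warnings per reporting OSD.

Rough sketch: feed it the grepped ceph.log lines from the thread above and
it totals slow requests per OSD, so an OSD (or the host behind it) that
dominates the counts stands out.
"""
import re
import sys
from collections import Counter

# Matches lines like:
#   ... osd.122 (osd.122) 5195 : cluster [WRN] 14 slow requests (by type
#   [ 'delayed' : 2 'waiting for sub ops' : 12 ] most affected pool [ 'rbd' : 14 ])
LINE_RE = re.compile(
    r"(?P<osd>osd\.\d+)\s+\(osd\.\d+\)\s+\d+\s*:\s*cluster \[WRN\]\s+"
    r"(?P<total>\d+) slow requests"
)
SUBOPS_RE = re.compile(r"'waiting for sub ops'\s*:\s*(\d+)")


def main(path: str) -> None:
    totals = Counter()   # all slow requests, per reporting OSD
    subops = Counter()   # the "waiting for sub ops" share, per reporting OSD
    with open(path) as fh:
        for line in fh:
            m = LINE_RE.search(line)
            if not m:
                continue
            osd = m.group("osd")
            totals[osd] += int(m.group("total"))
            s = SUBOPS_RE.search(line)
            if s:
                subops[osd] += int(s.group(1))

    print(f"{'OSD':<10}{'slow reqs':>12}{'waiting for sub ops':>22}")
    for osd, count in totals.most_common(20):
        print(f"{osd:<10}{count:>12}{subops[osd]:>22}")


if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "ceph.log.timeframe")

If one OSD, or a handful of OSDs sharing a host, clearly dominates the
"waiting for sub ops" column, that is the kind of common offender Alex
describes in his earlier mail.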
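
To follow up on the host/switch commonality Alex mentions, the top OSD ids
from that output can then be grouped by their CRUSH host. Another rough
sketch: it shells out to "ceph osd find <id> -f json", and the JSON field
names used below (crush_location, host) vary between Ceph releases, so they
are assumptions to check against your cluster's actual output:

#!/usr/bin/env python3
"""Group a list of OSD ids by the host they live on."""
import json
import subprocess
import sys
from collections import defaultdict


def osd_host(osd_id: int) -> str:
    """Ask the cluster where an OSD lives; adjust field names per release."""
    out = subprocess.run(
        ["ceph", "osd", "find", str(osd_id), "-f", "json"],
        capture_output=True, check=True, text=True,
    ).stdout
    info = json.loads(out)
    return info.get("crush_location", {}).get("host") or info.get("host", "unknown")


def main(osd_ids: list[int]) -> None:
    by_host = defaultdict(list)
    for osd_id in osd_ids:
        by_host[osd_host(osd_id)].append(osd_id)
    # Hosts owning the most affected OSDs first.
    for host, osds in sorted(by_host.items(), key=lambda kv: -len(kv[1])):
        print(f"{host:<25} {len(osds):>3} OSDs: {sorted(osds)}")


if __name__ == "__main__":
    # e.g.: python3 osd_hosts.py 122 118 120 54 97 42 98 77
    main([int(a) for a in sys.argv[1:]])

If the affected OSDs cluster on one host (or on hosts behind one switch),
that points at the kind of NIC/LACP/cable issue discussed above; if they are
spread evenly, a single slow peer OSD, as in Alex's earlier case, is the
more likely culprit.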