@Marius: no swap at all. I would rather buy more memory than use swap :)

On Sun, 4 Dec 2022 at 20:10, Marius Leustean <marius.leus@xxxxxxxxx> wrote:

> Hi Boris,
>
> Do you have swap enabled on any of the OSD hosts? That may slow down
> RocksDB drastically.
>
> On Sun, Dec 4, 2022 at 8:59 PM Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>
> wrote:
>
>> Hi Boris,
>>
>> These waits seem to be all over the place. Usually, in the main ceph.log,
>> you see "implicated OSD" messages - I would try to find some commonality
>> with either a host, a switch, or something like that. It can be bad
>> ports/NICs, LACP problems, sometimes even bad cables. I try to isolate
>> the area that is problematic, sometimes by rebooting OSD hosts one at a
>> time, or by rebooting switches (if stacked/MLAG) one at a time. Something
>> has to be there that makes the problem go away.
>> --
>> Alex Gorbachev
>> https://alextelescope.blogspot.com
>>
>>
>> On Sun, Dec 4, 2022 at 6:08 AM Boris Behrens <bb@xxxxxxxxx> wrote:
>>
>> > Hi Alex,
>> > I am searching for a log line that points me in the right direction.
>> > From what I have seen, I could not find a specific host, OSD, or PG
>> > that was leading to this problem. But maybe I am looking at the wrong
>> > logs.
>> >
>> > I have around 150k lines that look like this:
>> > ceph.log.timeframe:2022-12-02T18:19:59.877920+0100 osd.122 (osd.122)
>> > 5195 : cluster [WRN] 14 slow requests (by type [ 'delayed' : 2
>> > 'waiting for sub ops' : 12 ] most affected pool [ 'rbd' : 14 ])
>> > ceph.log.timeframe:2022-12-02T18:19:59.905505+0100 osd.118 (osd.118)
>> > 21011 : cluster [WRN] 256 slow requests (by type [ 'delayed' : 243
>> > 'waiting for sub ops' : 13 ] most affected pool [ 'rbd' : 256 ])
>> > ceph.log.timeframe:2022-12-02T18:19:59.928599+0100 osd.120 (osd.120)
>> > 19800 : cluster [WRN] 71 slow requests (by type [ 'delayed' : 15
>> > 'waiting for sub ops' : 56 ] most affected pool [ 'rbd' : 71 ])
>> > ceph.log.timeframe:2022-12-02T18:19:59.968535+0100 osd.54 (osd.54)
>> > 6960 : cluster [WRN] 38 slow requests (by type [ 'delayed' : 21
>> > 'waiting for sub ops' : 17 ] most affected pool [ 'rbd' : 38 ])
>> > ceph.log.timeframe:2022-12-02T18:19:59.973174+0100 osd.97 (osd.97)
>> > 16792 : cluster [WRN] 19 slow requests (by type [ 'delayed' : 11
>> > 'waiting for sub ops' : 8 ] most affected pool [ 'rbd' : 19 ])
>> > ceph.log.timeframe:2022-12-02T18:19:59.978565+0100 osd.42 (osd.42)
>> > 5724 : cluster [WRN] 12 slow requests (by type [ 'delayed' : 5
>> > 'waiting for sub ops' : 7 ] most affected pool [ 'rbd' : 12 ])
>> > ceph.log.timeframe:2022-12-02T18:19:59.980684+0100 osd.98 (osd.98)
>> > 18471 : cluster [WRN] 35 slow requests (by type [ 'delayed' : 3
>> > 'waiting for sub ops' : 32 ] most affected pool [ 'rbd' : 35 ])
>> > ceph.log.timeframe:2022-12-02T18:19:59.992514+0100 osd.77 (osd.77)
>> > 11319 : cluster [WRN] 256 slow requests (by type [ 'delayed' : 232
>> > 'waiting for sub ops' : 24 ] most affected pool [ 'rbd' : 256 ])
>> >
>> > and around 50k lines that look like this:
>> > ceph-osd.99.log.timeframe:2022-12-02T18:19:59.605+0100 7ff8f96ba700 -1
>> > osd.99 945870 get_health_metrics reporting 9 slow ops, oldest is
>> > osd_op(client.171194478.0:4862294 8.cf5
>> > 8:af34e5b1:::rbd_header.47d6a06b8b4567:head [watch ping cookie
>> > 18446462598732840961 gen 26] snapc 0=[]
>> > ondisk+write+known_if_redirected e945870)
>> > ceph-osd.92.log.timeframe:2022-12-02T18:14:57.415+0100 7f9e8e4fd700 -1
>> > osd.92 945870 get_health_metrics reporting 6 slow ops, oldest is
>> > osd_op(client.177840485.0:141305 8.159f
>> > 8:f9adda1f:::rbd_data.82f60d356b4e4a.000000000000a1c2:head [write
>> > 1900544~147456 in=147456b] snapc 0=[]
>> > ondisk+write+known_if_redirected e945868)
>> >
>> > Cheers
>> > Boris
>> >
>> > On Sun, 4 Dec 2022 at 03:15, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>
>> > wrote:
>> >
>> >> Boris, I have seen one problematic OSD cause this issue on all OSDs
>> >> with which its PGs peered. The solution was to take out the slow OSD;
>> >> immediately, all slow ops stopped. I found it by observing common
>> >> OSDs in the reported slow ops. Not saying this is your issue, but it
>> >> may be a possibility. Good luck!
>> >>
>> >> --
>> >> Alex Gorbachev
>> >> https://alextelescope.blogspot.com
>> >
>> > --
>> > The self-help group "UTF-8 problems" will, as an exception, meet in
>> > the large hall this time.

--
The self-help group "UTF-8 problems" will, as an exception, meet in the
large hall this time.
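
A minimal sketch of the aggregation Alex suggests, assuming the grepped
ceph.log lines shown above are saved to a file such as ceph.log.timeframe
(the file name, the regexes, and the top-20 cutoff are illustrative
assumptions, not anything taken from this thread):

#!/usr/bin/env python3
"""Sum "slow requests" cluster-log warnings per reporting OSD.

Rough sketch: feed it the grepped ceph.log lines from the thread above and
it totals slow requests per OSD, so an OSD (or the host behind it) that
dominates the counts stands out.
"""
import re
import sys
from collections import Counter

# Matches lines like:
#   ... osd.122 (osd.122) 5195 : cluster [WRN] 14 slow requests (by type
#   [ 'delayed' : 2 'waiting for sub ops' : 12 ] most affected pool [ 'rbd' : 14 ])
LINE_RE = re.compile(
    r"(?P<osd>osd\.\d+)\s+\(osd\.\d+\)\s+\d+\s*:\s*cluster \[WRN\]\s+"
    r"(?P<total>\d+) slow requests"
)
SUBOPS_RE = re.compile(r"'waiting for sub ops'\s*:\s*(\d+)")


def main(path: str) -> None:
    totals = Counter()   # all slow requests, per reporting OSD
    subops = Counter()   # the "waiting for sub ops" share, per reporting OSD
    with open(path) as fh:
        for line in fh:
            m = LINE_RE.search(line)
            if not m:
                continue
            osd = m.group("osd")
            totals[osd] += int(m.group("total"))
            s = SUBOPS_RE.search(line)
            if s:
                subops[osd] += int(s.group(1))

    print(f"{'OSD':<10}{'slow reqs':>12}{'waiting for sub ops':>22}")
    for osd, count in totals.most_common(20):
        print(f"{osd:<10}{count:>12}{subops[osd]:>22}")


if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "ceph.log.timeframe")

If one OSD, or a handful of OSDs sharing a host, clearly dominates the
"waiting for sub ops" column, that is the kind of common offender Alex
describes in his earlier mail.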
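
To follow up on the host/switch commonality Alex mentions, the top OSD ids
from that output can then be grouped by their CRUSH host. Another rough
sketch: it shells out to "ceph osd find <id> -f json", and the JSON field
names used below (crush_location, host) vary between Ceph releases, so they
are assumptions to check against your cluster's actual output:

#!/usr/bin/env python3
"""Group a list of OSD ids by the host they live on."""
import json
import subprocess
import sys
from collections import defaultdict


def osd_host(osd_id: int) -> str:
    """Ask the cluster where an OSD lives; adjust field names per release."""
    out = subprocess.run(
        ["ceph", "osd", "find", str(osd_id), "-f", "json"],
        capture_output=True, check=True, text=True,
    ).stdout
    info = json.loads(out)
    return info.get("crush_location", {}).get("host") or info.get("host", "unknown")


def main(osd_ids: list[int]) -> None:
    by_host = defaultdict(list)
    for osd_id in osd_ids:
        by_host[osd_host(osd_id)].append(osd_id)
    # Hosts owning the most affected OSDs first.
    for host, osds in sorted(by_host.items(), key=lambda kv: -len(kv[1])):
        print(f"{host:<25} {len(osds):>3} OSDs: {sorted(osds)}")


if __name__ == "__main__":
    # e.g.: python3 osd_hosts.py 122 118 120 54 97 42 98 77
    main([int(a) for a in sys.argv[1:]])

If the affected OSDs cluster on one host (or on hosts behind one switch),
that points at the kind of NIC/LACP/cable issue discussed above; if they are
spread evenly, a single slow peer OSD, as in Alex's earlier case, is the
more likely culprit.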