Thanks a lot, Josh. That really looks like my problem. The cluster does not look healthy at all. Oof.

~# ceph tell osd.* perf dump | grep 'osd_pglog\|^osd\.[0-9]'
osd.0: {
        "osd_pglog_bytes": 459617868,
        "osd_pglog_items": 2955043,
osd.1: {
        "osd_pglog_bytes": 598414548,
        "osd_pglog_items": 4315956,
osd.2: {
        "osd_pglog_bytes": 357056504,
        "osd_pglog_items": 1942486,
osd.3: {
        "osd_pglog_bytes": 436198324,
        "osd_pglog_items": 2863501,
osd.4: {
        "osd_pglog_bytes": 373516972,
        "osd_pglog_items": 2127588,
osd.5: {
        "osd_pglog_bytes": 335471560,
        "osd_pglog_items": 1822608,
osd.6: {
        "osd_pglog_bytes": 391814808,
        "osd_pglog_items": 2394209,
osd.7: {
        "osd_pglog_bytes": 541849048,
        "osd_pglog_items": 3880437,
...

On Tue, Feb 21, 2023 at 18:21, Josh Baergen <jbaergen@xxxxxxxxxxxxxxxx> wrote:

> Hi Boris,
>
> This sounds a bit like https://tracker.ceph.com/issues/53729.
> https://tracker.ceph.com/issues/53729#note-65 might help you diagnose
> whether this is the case.
>
> Josh
>
> On Tue, Feb 21, 2023 at 9:29 AM Boris Behrens <bb@xxxxxxxxx> wrote:
> >
> > Hi,
> > today I wanted to increase the PGs from 2k to 4k, and random OSDs went
> > offline in the cluster.
> > After some investigation we saw that the OSDs were getting OOM-killed
> > (I've seen a host go from 90GB of used memory to 190GB before the OOM
> > kills happened).
> >
> > We have around 24 SSD OSDs per host and 128GB/190GB/265GB of memory in
> > these hosts. All of them experienced OOM kills.
> > All hosts run Octopus on Ubuntu 20.04.
> >
> > At every step of the increase, new OSDs crashed with OOM. (We have now
> > set pg_num/pgp_num to 2516 to stop the process.)
> > The OSD logs do not show why this might happen.
> > Some OSDs also segfault.
> >
> > I have now started to stop all OSDs on a host and run a
> > "ceph-bluestore-tool repair" and a "ceph-kvstore-tool bluestore-kv
> > compact" on each OSD. This takes around 30 minutes per 8TB OSD. When I
> > start the OSDs again, I instantly get a lot of slow ops from all the
> > other OSDs while they come up (the 8TB OSDs take around 10 minutes in
> > "load_pgs").
> >
> > I am unsure what I can do to restore normal cluster performance. Any
> > ideas, suggestions, or maybe even known bugs? Perhaps a hint on what to
> > search for in the logs.
> >
> > Cheers
> >  Boris

--
The self-help group "UTF-8 Problems" will exceptionally meet in the large hall this time.
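
A minimal sketch for turning the per-OSD perf-dump figures above into cluster-wide totals (this assumes the Octopus field names shown in the thread; the grep/awk pipeline is illustrative, not from the thread):

    # Sum osd_pglog_bytes/osd_pglog_items across all OSDs to gauge total pglog memory
    ceph tell osd.* perf dump 2>/dev/null \
      | grep -o '"osd_pglog_bytes": [0-9]*\|"osd_pglog_items": [0-9]*' \
      | awk '/pglog_bytes/ { bytes += $2 } /pglog_items/ { items += $2 }
             END { printf "total pglog: %.1f GiB across %d items\n", bytes / 2^30, items }'

For the eight OSDs shown above this already sums to roughly 3.2 GiB of pglog memory, which is why a PG split can push such hosts over the edge.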
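The per-host repair and compaction pass Boris describes could look roughly like the following (a sketch, assuming a non-containerized systemd deployment with the default /var/lib/ceph/osd/ceph-<id> data paths; both tools require the OSD to be offline):

    # Repair and compact every OSD on this host, one at a time
    for osd in /var/lib/ceph/osd/ceph-*; do
        id="${osd##*-}"                      # extract the numeric OSD id from the path
        systemctl stop "ceph-osd@${id}"      # both tools need the OSD stopped
        ceph-bluestore-tool repair --path "$osd"
        ceph-kvstore-tool bluestore-kv "$osd" compact
        systemctl start "ceph-osd@${id}"
    done

Processing the OSDs one at a time, rather than restarting a whole host's worth at once, may also soften the burst of slow ops Boris sees while the freshly compacted OSDs work through "load_pgs".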