After reading a lot about this I still don't understand how it happened and what I can do to fix it.

The following only trims the pglog, but not the duplicates:
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-41 --op trim-pg-log --pgid 8.664

I also tried recreating the OSDs (sync out, crush rm, wipe disk, create new OSD, sync in), but the osd_pglog_items value keeps growing after everything is synced back in (my 8TB disks are at around 10 million items one day after I synced them back in). It has not reached the old value of around 50 million yet, but it is still growing.

Is there anything I can do on an Octopus cluster, or is upgrading the only way? And why does this happen?

On Tue, 21 Feb 2023 at 18:31, Boris Behrens <bb@xxxxxxxxx> wrote:

> Thanks a lot Josh. That really seems like my problem.
> That does not look healthy in the cluster. oof.
>
> ~# ceph tell osd.* perf dump | grep 'osd_pglog\|^osd\.[0-9]'
> osd.0: {
>     "osd_pglog_bytes": 459617868,
>     "osd_pglog_items": 2955043,
> osd.1: {
>     "osd_pglog_bytes": 598414548,
>     "osd_pglog_items": 4315956,
> osd.2: {
>     "osd_pglog_bytes": 357056504,
>     "osd_pglog_items": 1942486,
> osd.3: {
>     "osd_pglog_bytes": 436198324,
>     "osd_pglog_items": 2863501,
> osd.4: {
>     "osd_pglog_bytes": 373516972,
>     "osd_pglog_items": 2127588,
> osd.5: {
>     "osd_pglog_bytes": 335471560,
>     "osd_pglog_items": 1822608,
> osd.6: {
>     "osd_pglog_bytes": 391814808,
>     "osd_pglog_items": 2394209,
> osd.7: {
>     "osd_pglog_bytes": 541849048,
>     "osd_pglog_items": 3880437,
> ...
>
> On Tue, 21 Feb 2023 at 18:21, Josh Baergen <jbaergen@xxxxxxxxxxxxxxxx> wrote:
>
>> Hi Boris,
>>
>> This sounds a bit like https://tracker.ceph.com/issues/53729.
>> https://tracker.ceph.com/issues/53729#note-65 might help you diagnose
>> whether this is the case.
>>
>> Josh
>>
>> On Tue, Feb 21, 2023 at 9:29 AM Boris Behrens <bb@xxxxxxxxx> wrote:
>> >
>> > Hi,
>> > today I wanted to increase the PGs from 2k -> 4k, and random OSDs went
>> > offline in the cluster. After some investigation we saw that the OSDs
>> > got OOM killed (I've seen a host go from 90GB of used memory to 190GB
>> > before the OOM kills happened).
>> >
>> > We have around 24 SSD OSDs per host and 128GB/190GB/265GB memory in
>> > these hosts. All of them experienced OOM kills.
>> > All hosts are Octopus / Ubuntu 20.04.
>> >
>> > At every step new OSDs crashed with OOM (we have now set
>> > pg_num/pgp_num to 2516 to stop the process).
>> > The OSD logs do not show why this might be happening.
>> > Some OSDs also segfault.
>> >
>> > I have now started to stop all OSDs on a host and run a
>> > "ceph-bluestore-tool repair" and a "ceph-kvstore-tool bluestore-kv
>> > compact" on every OSD. This takes around 30 minutes per 8TB OSD.
>> > When I start the OSDs again I instantly get a lot of slow ops from
>> > all the other OSDs as they come up (the 8TB OSDs take around 10
>> > minutes in "load_pgs").
>> >
>> > I am unsure what I can do to restore normal cluster performance.
>> > Any ideas, suggestions, or maybe even known bugs? Or a string I
>> > could search for in the logs?
>> >
>> > Cheers
>> > Boris
>>
>
> --
> The self-help group "UTF-8-Probleme" will meet this time, as an exception,
> in the groüen hall.

--
The self-help group "UTF-8-Probleme" will meet this time, as an exception,
in the groüen hall.
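For reference: the diagnostic Josh points to in https://tracker.ceph.com/issues/53729#note-65 essentially amounts to dumping a PG's log with ceph-objectstore-tool and counting its "dups" entries. A rough sketch is below, reusing the same data path as the trim-pg-log command above; it assumes jq is installed, must be run while the OSD is stopped, and the JSON field names (pg_log_t.log / pg_log_t.dups) are an assumption that may differ between releases.

  # Run only while the OSD is down.
  OSD_PATH=/var/lib/ceph/osd/ceph-41
  for pg in $(ceph-objectstore-tool --data-path "$OSD_PATH" --op list-pgs); do
      # Dump the PG log as JSON and print how many regular log entries
      # and how many dup entries this PG carries.
      ceph-objectstore-tool --data-path "$OSD_PATH" --op log --pgid "$pg" \
        | jq -r --arg pg "$pg" \
            '"\($pg) log=\(.pg_log_t.log | length) dups=\(.pg_log_t.dups | length)"'
  done

If individual PGs report dup counts in the millions while the regular log stays in the low thousands, that matches the tracker issue: the osd_pglog_items counter from "ceph tell osd.* perf dump" is then dominated by dups rather than by the log itself.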
_______________________________________________ ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an email to ceph-users-leave@xxxxxxx