Hi Boris,

This sounds a bit like https://tracker.ceph.com/issues/53729.
https://tracker.ceph.com/issues/53729#note-65 might help you diagnose
whether this is the case.

Josh

On Tue, Feb 21, 2023 at 9:29 AM Boris Behrens <bb@xxxxxxxxx> wrote:
>
> Hi,
> today I wanted to increase the PGs from 2k -> 4k, and random OSDs went
> offline in the cluster.
> After some investigation we saw that the OSDs got OOM-killed (I've seen
> a host go from 90GB of used memory to 190GB before the OOM kills
> happened).
>
> We have around 24 SSD OSDs per host and 128GB/190GB/265GB of memory in
> these hosts. All of them experienced OOM kills.
> All hosts are octopus / ubuntu 20.04.
>
> And at every step new OSDs crashed with OOM. (We have now set
> pg_num/pgp_num to 2516 to stop the process.)
> The OSD logs do not show anything about why this might happen.
> Some OSDs also segfault.
>
> I have now started to stop all OSDs on a host and run a
> "ceph-bluestore-tool repair" and a "ceph-kvstore-tool bluestore-kv
> compact" on all OSDs. This takes around 30 minutes for the 8TB OSDs.
> When I start the OSDs, I instantly get a lot of slow ops from all the
> other OSDs as the OSDs come up (the 8TB OSDs take around 10 minutes in
> "load_pgs").
>
> I am unsure what I can do to restore normal cluster performance. Any
> ideas or suggestions, or maybe even known bugs?
> Maybe a hint about what I can search for in the logs.
>
> Cheers
> Boris
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
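[Editor's note] The tracker issue Josh points to is a pattern where the
pglog mempool balloons and the OSD is OOM-killed. One way to check for
it is to pull the OSD's mempool stats from its admin socket with
`ceph daemon osd.N dump_mempools` and look at the `osd_pglog` pool. A
minimal sketch follows; the JSON layout (`mempool.by_pool.osd_pglog`
with `items`/`bytes`), the OSD id, and the 4 GiB threshold are
assumptions for illustration, not values from the thread:

```python
import json
import subprocess


def pglog_bytes(mempools: dict) -> int:
    """Extract bytes used by the osd_pglog mempool from a
    dump_mempools-style dict (layout assumed as described above)."""
    return mempools["mempool"]["by_pool"]["osd_pglog"]["bytes"]


def pglog_oversized(osd_id: int, limit_bytes: int = 4 << 30) -> bool:
    """Run `ceph daemon osd.N dump_mempools` on the OSD's host and
    report whether its pglog mempool exceeds limit_bytes.
    The 4 GiB default limit is an arbitrary illustration."""
    out = subprocess.check_output(
        ["ceph", "daemon", f"osd.{osd_id}", "dump_mempools"]
    )
    return pglog_bytes(json.loads(out)) > limit_bytes


# Illustrative structure only -- the numbers below are made up to show
# the shape of the data, not taken from the cluster in the thread:
sample = {
    "mempool": {
        "by_pool": {"osd_pglog": {"items": 12_000_000, "bytes": 6 << 30}}
    }
}
```

If osd_pglog dominates the total, the per-PG dup-entry trimming
discussed in the tracker notes is worth reading before restarting more
OSDs.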
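[Editor's note] A separate thing worth sanity-checking on hosts this
dense: with 24 OSDs per host, the default `osd_memory_target` of 4 GiB
already accounts for ~96 GB before any pglog growth, which leaves
little headroom on a 128 GB host. The sketch below splits host memory
evenly across OSDs while reserving a fraction for the OS; the 20%
reserve is an assumption, and note that `osd_memory_target` is a
best-effort target, not a hard cap, so spikes like the ones in this
thread can still exceed it:

```python
def per_osd_target(host_mem_bytes: int, num_osds: int,
                   reserve_fraction: float = 0.2) -> int:
    """Evenly divide host memory across OSDs, keeping a reserve
    (default 20%, an illustrative choice) for the OS and for spikes,
    since osd_memory_target is best-effort rather than a hard limit."""
    usable = int(host_mem_bytes * (1 - reserve_fraction))
    return usable // num_osds


GiB = 1 << 30
# A 128 GB host with 24 OSDs works out to roughly 4.3 GiB per OSD:
target = per_osd_target(128 * GiB, 24)
```

The result could then be applied cluster-wide with
`ceph config set osd osd_memory_target <bytes>`, or per host if the
hosts have different memory sizes.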