Hi Boris,

This sounds a bit like https://tracker.ceph.com/issues/53729.
https://tracker.ceph.com/issues/53729#note-65 might help you diagnose
whether this is the case.

Josh

On Tue, Feb 21, 2023 at 9:29 AM Boris Behrens <bb@xxxxxxxxx> wrote:
>
> Hi,
> today I wanted to increase the PGs from 2k -> 4k, and random OSDs went
> offline in the cluster.
> After some investigation we saw that the OSDs got OOM-killed (I've seen
> a host go from 90GB of used memory to 190GB before the OOM kills
> happened).
>
> We have around 24 SSD OSDs per host and 128GB/190GB/265GB of memory in
> these hosts. All of them experienced OOM kills.
> All hosts are octopus / ubuntu 20.04.
>
> And at every step new OSDs crashed with OOM. (We have now set
> pg_num/pgp_num to 2516 to stop the process.)
> The OSD logs do not show anything about why this might happen.
> Some OSDs also segfault.
>
> I have now started to stop all OSDs on a host and run a
> "ceph-bluestore-tool repair" and a "ceph-kvstore-tool bluestore-kv
> compact" on all OSDs. This takes around 30 minutes for the 8TB OSDs.
> When I start the OSDs, I instantly get a lot of slow ops from all the
> other OSDs as the OSDs come up (the 8TB OSDs take around 10 minutes in
> "load_pgs").
>
> I am unsure what I can do to restore normal cluster performance. Any
> ideas or suggestions, or maybe even known bugs?
> Maybe a hint about what I can search for in the logs.
>
> Cheers
> Boris
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
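[Editor's note] The tracker issue Josh points to is a pattern where the
pglog mempool balloons and the OSD is OOM-killed. One way to check for
it is to pull the OSD's mempool stats from its admin socket with
`ceph daemon osd.N dump_mempools` and look at the `osd_pglog` pool. A
minimal sketch follows; the JSON layout (`mempool.by_pool.osd_pglog`
with `items`/`bytes`), the OSD id, and the 4 GiB threshold are
assumptions for illustration, not values from the thread:

```python
import json
import subprocess


def pglog_bytes(mempools: dict) -> int:
    """Extract bytes used by the osd_pglog mempool from a
    dump_mempools-style dict (layout assumed as described above)."""
    return mempools["mempool"]["by_pool"]["osd_pglog"]["bytes"]


def pglog_oversized(osd_id: int, limit_bytes: int = 4 << 30) -> bool:
    """Run `ceph daemon osd.N dump_mempools` on the OSD's host and
    report whether its pglog mempool exceeds limit_bytes.
    The 4 GiB default limit is an arbitrary illustration."""
    out = subprocess.check_output(
        ["ceph", "daemon", f"osd.{osd_id}", "dump_mempools"]
    )
    return pglog_bytes(json.loads(out)) > limit_bytes


# Illustrative structure only -- the numbers below are made up to show
# the shape of the data, not taken from the cluster in the thread:
sample = {
    "mempool": {
        "by_pool": {"osd_pglog": {"items": 12_000_000, "bytes": 6 << 30}}
    }
}
```

If osd_pglog dominates the total, the per-PG dup-entry trimming
discussed in the tracker notes is worth reading before restarting more
OSDs.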
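[Editor's note] A separate thing worth sanity-checking on hosts this
dense: with 24 OSDs per host, the default `osd_memory_target` of 4 GiB
already accounts for ~96 GB before any pglog growth, which leaves
little headroom on a 128 GB host. The sketch below splits host memory
evenly across OSDs while reserving a fraction for the OS; the 20%
reserve is an assumption, and note that `osd_memory_target` is a
best-effort target, not a hard cap, so spikes like the ones in this
thread can still exceed it:

```python
def per_osd_target(host_mem_bytes: int, num_osds: int,
                   reserve_fraction: float = 0.2) -> int:
    """Evenly divide host memory across OSDs, keeping a reserve
    (default 20%, an illustrative choice) for the OS and for spikes,
    since osd_memory_target is best-effort rather than a hard limit."""
    usable = int(host_mem_bytes * (1 - reserve_fraction))
    return usable // num_osds


GiB = 1 << 30
# A 128 GB host with 24 OSDs works out to roughly 4.3 GiB per OSD:
target = per_osd_target(128 * GiB, 24)
```

The result could then be applied cluster-wide with
`ceph config set osd osd_memory_target <bytes>`, or per host if the
hosts have different memory sizes.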