On 2020-10-21 10:08, Mark Nelson wrote:
On 10/21/20 7:54 AM, Ing. Luis Felipe Domínguez Vega wrote:
On 2020-10-21 08:43, Mark Nelson wrote:
Theoretically we shouldn't be spiking memory as much these days during recovery, but the code is complicated and it's tough to reproduce these kinds of issues in-house. If you happen to catch it in the act, do you see the pglog mempool stats also spiking up?
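A minimal way to watch those stats on a running OSD, assuming its admin socket is reachable on the host (osd.N below is a placeholder for the actual OSD id):

  # Dump per-mempool usage (bytes and item counts) from the OSD's admin socket;
  # watch the osd_pglog and buffer_anon entries while the growth is happening.
  ceph daemon osd.N dump_mempools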
Mark
On 10/21/20 2:34 AM, Dan van der Ster wrote:
Hi,
This might be the pglog issue which has been coming up a few times
on the list.
If the OSD cannot boot without going OOM, you might have success by
trimming the pglog, e.g. search this list for "ceph-objectstore-tool
--op trim-pg-log" for some recipes. The thread "OSDs taking too much
memory, for pglog" in particular might help.
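A rough sketch of that recipe, assuming a systemd-managed OSD and using placeholder values for the OSD id and pgid (check the referenced threads for the exact options before running it):

  # Stop the OSD before operating on its store
  systemctl stop ceph-osd@N
  # Trim the pg log of one PG on that OSD (repeat per oversized PG)
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-N \
      --op trim-pg-log --pgid <pgid>
  # Bring the OSD back up afterwards
  systemctl start ceph-osd@N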
Cheers, Dan
On Tue, Oct 20, 2020 at 11:57 PM Ing. Luis Felipe Domínguez Vega
<luis.dominguez@xxxxxxxxx> wrote:
Hi, today my infrastructure provider had a blackout. Afterwards Ceph tried to recover but is stuck in an inconsistent state, because many OSDs cannot recover on their own: the kernel kills them with OOM. Even one OSD that had been OK has now gone down, killed by OOM.
Even on a server with 32 GB of RAM the OSD uses all of it and never recovers. I think this could be a memory leak. Ceph version: Octopus 15.2.3.
In: https://pastebin.pl/view/59089adc
You can see that buffer_anon reaches 32 GB, but why? My whole cluster is down because of this.
This https://pastebin.pl/view/59089adc shows the OSD shortly before it gets killed by OOM.
Ok, that is very interesting! The OSD memory autotuning code shrank the caches to almost nothing to try to compensate for the huge growth in buffer_anon (and, to a lesser extent, osd_pglog) usage, but obviously it couldn't do anything with that much memory being used. Any chance you could create a tracker ticket and paste in the memory pool info along with the Ceph version, etc.?
https://tracker.ceph.com/
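The information asked for above can be gathered with something along these lines (osd.N is a placeholder for the affected OSD):

  # Ceph versions running across the cluster, for the ticket
  ceph versions
  # Allocator heap statistics from the affected OSD, to complement
  # the mempool numbers in the pastebin
  ceph tell osd.N heap stats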
Mark
Thanks, https://tracker.ceph.com/issues/47929
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx