Re: Out of Memory after Upgrading to Nautilus

Frank Schilder <frans@xxxxxx> · Wed, 5 May 2021 17:45:47 +0000

Hi Christoph,

how fast do your OSDs run up to 4G with a 1G setting?

You might be hit by one or both of two problems I'm facing on a Mimic cluster. The first is that the OSD daemons have a memory leak and will slowly run over their limit. I run about 32 OSDs per host with 196G RAM and, with a memory target of 2G, need to reboot every 3-4 weeks.

The other issue is that the OSD daemons ignore certain values from the config database in a really bizarre way, and you might need an OSD restart despite the docs saying the setting is changeable at run time. I didn't have time to send a ticket to the tracker.

What I found is that even though "ceph config show" and "ceph daemon osd.nnn config show" show that a memory limit of 2G should be active, this is only true if the value either comes from a per-OSD setting, or the global default. The values for hdds or device classes are ignored. This is somewhat difficult to observe unless you can restart OSDs with all sorts of combinations of values set and I didn't have time yet to submit a report with the data I collected.
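One way to watch for this from the outside is to compare each OSD's resident memory with the target you expect to be active. The following is a minimal Python sketch, not something taken from this thread: it parses the VmRSS line from the text of /proc/<pid>/status (standard Linux procfs) and reports how far a process is above a given target. The function names are made up for illustration.

```python
import re


def rss_bytes(status_text):
    """Return VmRSS in bytes from the text of /proc/<pid>/status, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        return None
    # procfs labels the unit "kB" but the value is in units of 1024 bytes
    return int(match.group(1)) * 1024


def over_target(status_text, target_bytes):
    """Return bytes above the target (0 if at or below, None if RSS is unknown)."""
    rss = rss_bytes(status_text)
    if rss is None:
        return None
    return max(0, rss - target_bytes)
```

To use it, read /proc/<pid>/status for each ceph-osd process and pass the text together with the target in bytes (for example 2 * 1024**3 for 2G). An OSD that stays well above the target long after startup would be a candidate for the behaviour described above.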

You should also check that the old-style cache setting is disabled. Coming from luminous, you might have non-default values in the ceph.conf, disabling the memory_target setting and falling back to old cache handling.
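A quick way to look for such leftovers is to scan ceph.conf for cache-related options. This is a rough Python sketch under assumptions: it flags every option whose name starts with bluestore_cache (for example bluestore_cache_size), and it assumes option names may be written with spaces, dashes or underscores. Which of these options actually disables the memory target is not stated in this thread, so treat any hit only as something to review.

```python
def find_cache_overrides(conf_text):
    """Return (section, key, value) for every bluestore_cache* line in ceph.conf text."""
    section = None
    hits = []
    for raw in conf_text.splitlines():
        line = raw.strip()
        # skip blank lines and comment lines
        if not line or line[0] in "#;":
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()
            continue
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        # normalise "bluestore cache size" / "bluestore-cache-size" to underscores
        key = "_".join(key.strip().replace("-", " ").split()).lower()
        if key.startswith("bluestore_cache"):
            hits.append((section, key, value.strip()))
    return hits
```

Calling find_cache_overrides(open("/etc/ceph/ceph.conf").read()) on each OSD host returns an empty list if no such option is set in the file.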

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Mark Nelson <mnelson@xxxxxxxxxx>
Sent: 05 May 2021 17:15:50
To: ceph-users@xxxxxxx
Subject:  Re: Out of Memory after Upgrading to Nautilus

Hi Christoph,

1GB per OSD is tough!  The osd memory target only shrinks the size of
the caches but can't control things like osd map size, pg log length,
rocksdb wal buffers, etc.  It's a "best effort" algorithm to try to fit
the OSD mapped memory into that target but on its own it doesn't really
do well below 2GB/OSD (and even that can be tough when only adjusting
the caches).  That's one of the reasons the default is 4GB.  To fit in
1GB you'll probably also need to reduce some of the previously mentioned
things but there will be consequences (slower recovery, higher write
amplification in rocksdb, etc).  By default a bluestore OSD typically
won't fit into a 1GB memory target and we don't regularly test
configurations with that little memory per OSD.
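To put numbers on this for the hosts described in the original post (36 OSDs, 128 GB RAM), here is a small back-of-the-envelope calculation in Python. It is only a sketch: it treats 128 GB as 128 GiB, ignores memory needed by the OS and other processes, and assumes every OSD sits exactly at its target, which a best-effort target does not guarantee.

```python
GIB = 1024 ** 3


def headroom(host_ram_bytes, num_osds, per_osd_bytes):
    """RAM left on the host if every OSD uses exactly per_osd_bytes."""
    return host_ram_bytes - num_osds * per_osd_bytes


for per_osd_gib in (1, 2, 4):
    left = headroom(128 * GIB, 36, per_osd_gib * GIB)
    print(f"{per_osd_gib} GiB per OSD: {left // GIB} GiB left")
```

At 4 GiB per OSD the result is negative (36 x 4 = 144 GiB on a 128 GiB host), which matches the OOM kills reported. At 2 GiB per OSD the OSDs would need 72 GiB, leaving 56 GiB for everything else.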

You might want to look at the memory pool performance counters, the
priority cache performance counters, and the tcmalloc heap stats to help
figure out where the memory is actually being used.
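As a starting point for the memory pool counters, the sketch below ranks pools by size. It assumes the JSON comes from "ceph daemon osd.N dump_mempools" and has the layout {"mempool": {"by_pool": {name: {"items": ..., "bytes": ...}}}}; that layout is an assumption and may differ between releases, so check it against your own output first. The function name is made up for illustration.

```python
import json


def top_mempools(dump_json, limit=5):
    """Return the largest memory pools as (name, bytes), biggest first."""
    data = json.loads(dump_json)
    pools = data["mempool"]["by_pool"]  # assumed layout, see note above
    ranked = sorted(pools.items(), key=lambda kv: kv[1]["bytes"], reverse=True)
    return [(name, info["bytes"]) for name, info in ranked[:limit]]
```

The mempool totals only cover memory that the OSD tracks itself. Comparing their sum with the process RSS and with the tcmalloc heap stats (for example from "ceph tell osd.N heap stats") gives a hint whether the growth is in tracked structures or elsewhere.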

Mark

On 5/5/21 9:30 AM, Christoph Adomeit wrote:
> I manage a historical cluster of several ceph nodes, each with 128 GB RAM and 36 OSDs of 8 TB each.
>
> The cluster is just for archive purposes and performance is not so important.
>
> The cluster was running fine for a long time using ceph luminous.
>
> Last week I updated it to Debian 10 and Ceph Nautilus.
>
> Now I can see that the memory usage of each osd grows slowly to 4 GB each and once the system has
> no memory left it will oom-kill processes
>
> I have already configured osd_memory_target = 1073741824 .
> This helps for some hours but then memory usage will grow from 1 GB to 4 GB per OSD.
>
> Any ideas what I can do to further limit osd memory usage ?
>
> It would be good to keep the hardware running some more time without upgrading RAM on all
> OSD machines.
>
> Any Ideas ?
>
> Thanks
>    Christoph
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>