On Wed, Oct 7, 2020 at 3:29 PM Wido den Hollander <wido@xxxxxxxx> wrote:
>
> On 07/10/2020 14:08, Dan van der Ster wrote:
> > Hi all,
> >
> > This morning some osds in our S3 cluster started going OOM; after
> > restarting them I noticed that the osd_pglog is using >1.5GB per osd.
> > (This is on an osd with osd_memory_target = 2GB, hosting 112 PGs, all
> > PGs active+clean.)
> >
> > After reading through this list and trying a few things, I'd like to
> > share the following observations for your feedback:
> >
> > 1. The pg log contains 3000 entries by default (on nautilus). These
> > 3000 entries can legitimately consume gigabytes of RAM for some
> > use-cases. (I haven't determined exactly which ops triggered this
> > today.)
> > 2. The pg log length is decided by the primary osd -- setting
> > osd_max_pg_log_entries/osd_min_pg_log_entries on one single OSD does
> > not have a big effect (because most of the PGs are primaried somewhere
> > else). You need to set it on all the osds for it to apply to all PGs.
> > 3. We eventually set osd_max_pg_log_entries = 500 everywhere. This
> > decreased the osd_pglog mempool from more than 1.5GB on our largest
> > osds to less than 500MB.
> > 4. The osd_pglog mempool is not accounted for in the osd_memory_target
> > (in nautilus).
> > 5. I have opened a feature request to limit the pg_log length by
> > memory size (https://tracker.ceph.com/issues/47775). This way we could
> > allocate a fraction of memory to the pg log and it would shorten the
> > pg log length (budget) accordingly.
> > 6. Would it be feasible to add an osd option to 'trim pg log at boot'?
> > This way we could avoid the cumbersome ceph-objectstore-tool
> > trim-pg-log in cases of disaster (osds going OOM at boot).
> >
> > For those who have had pglog memory usage incidents -- does this match
> > your experience?
>
> Not really. I have an active case where reducing the pglog length works
> for a short period, after which memory consumption grows again.
>
> These OSDs however show data being used in buffer_anon, which is
> probably something different.

Well, in fact at the very beginning of this incident we had excessive
buffer_anon -- and since I only restarted the osds a couple of hours ago,
buffer_anon might indeed still be growing:

# ceph daemon osd.245 dump_mempools | jq .mempool.by_pool.buffer_anon
{
  "items": 36762,
  "bytes": 436869187
}

Do you have any clues yet about what is triggering that? How do you work
around it? Is there a tracker for this?

-- dan

> Regarding the trim on boot, that sounds feasible. I already added a
> 'compact on boot' setting, but trimming all PGs on boot should be
> doable. It loads all the PGs and at that point they can be trimmed.
>
> Wido
>
> > Thanks!
> >
> > Dan

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
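To illustrate point 2 above -- that the limit only takes effect if it is
set on all osds, not on a single one -- a minimal sketch using the
central config store on nautilus; the value 500 and osd.245 are just the
examples from this thread, adjust for your cluster:

# Apply the shorter pg log limits to every OSD via the central config
# store, rather than one daemon at a time.
ceph config set osd osd_max_pg_log_entries 500
ceph config set osd osd_min_pg_log_entries 500

# Spot-check that a given OSD picked up the new value.
ceph daemon osd.245 config get osd_max_pg_log_entries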
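To see how much the osd_pglog mempool actually consumes on each OSD, the
dump_mempools/jq query above can be looped over the admin sockets of a
host; a rough sketch, assuming the default /var/run/ceph socket paths and
that jq is installed:

# Print the osd_pglog mempool size (in bytes) for every OSD on this host.
for sock in /var/run/ceph/ceph-osd.*.asok; do
    id=${sock##*/}; id=${id#ceph-osd.}; id=${id%.asok}
    bytes=$(ceph daemon "$sock" dump_mempools | jq '.mempool.by_pool.osd_pglog.bytes')
    echo "osd.${id} osd_pglog_bytes=${bytes}"
done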
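And for the 'cumbersome' offline trim mentioned in point 6, the rough
shape of the procedure is below; the osd id, data path and pgid are
placeholders, the OSD must be stopped first, and it has to be repeated
per PG -- check the docs for your release before running this against a
production osd:

# Stop the OSD, trim the pg log of one PG offline, then start it again.
systemctl stop ceph-osd@245
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-245 \
    --pgid 7.1a --op trim-pg-log
systemctl start ceph-osd@245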