Re: another osd_pglog memory usage incident

On Fri, Oct 9, 2020 at 1:42 PM Harald Staub <harald.staub@xxxxxxxxx> wrote:
>
> On 07.10.20 21:00, Wido den Hollander wrote:
> >
> >
> > On 07/10/2020 16:00, Dan van der Ster wrote:
> >> On Wed, Oct 7, 2020 at 3:29 PM Wido den Hollander <wido@xxxxxxxx> wrote:
> >>>
> >>>
> >>>
> >>> On 07/10/2020 14:08, Dan van der Ster wrote:
> >>>> Hi all,
> >>>>
> >>>> This morning some osds in our S3 cluster started going OOM, after
> >>>> restarting them I noticed that the osd_pglog is using >1.5GB per osd.
> >>>> (This is on an osd with osd_memory_target = 2GB, hosting 112PGs, all
> >>>> PGs are active+clean).
> [...]
>
> Hi all,
>
> As Wido said, our case may be a bit different.
>
> This is still on 14.2.8. The trouble started with lots of small objects.
> There were 2 Veeam buckets with more than 400M objects each, on a pool
> with EC 8+3, which means there were about 10 billion object shards.
> DB space on SSD was tiny (the OSDs were originally built for filestore,
> so there was only space for 25GB, of which only about 3GB is really
> usable, as we know now).
>
> Then OSD memory started to grow, mostly in buffer_anon. Decreasing
> osd_max_pg_log_entries helped (with buffer_anon!). We added RAM, only to
> have more OOMs a few days later. Then we realized that the DB slow bytes
> had started to grow without bounds.
>
> We were able to delete the objects (this took several weeks), and there
> were no OOMs during this time. But afterwards buffer_anon started
> growing again.
>
> At one point I observed free memory improving while a customer was
> writing heavily. So we started writing constantly ourselves (small
> objects to dummy buckets). This helps with buffer_anon and also with
> the DB growth.
>
> It seems that at least 14.2.8 does not trim buffer_anon periodically,
> but only when writing:
> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/EVPELHOL4KLRJ4CKOOD2JECBMUKE4EKB/
>
> One possible explanation (just an idea) for the large amount of
> buffer_anon: DB slow bytes got spread over lots and lots of small
> allocations on the HDD.
>
> We rebuilt all OSDs with bigger DBs (31GB), and we now limit the amount
> of slow bytes with manual compactions.
>
> With the huge number of small objects gone, the cluster was still
> unhealthy. Then we realized that the RGW garbage collector was not
> keeping up with the load. A possible reason for the GC backlog:
> customers using features like versioning more heavily than before.
>
> There are high refcounts in GC, and there were times with lots of HEAD
> requests from some customers.
>
> GC load is mostly read load. Combined with only low write activity, this
> may be problematic.
>
> We tuned up the GC and the backlog is now going down, slowly (again,
> this takes weeks).

Thanks Wido and Harald for the info. On our side the issue was not as
severe -- it started with huge buffer_anon across the OSDs, but then
after restarting an affected OSD a similar amount of memory would be
accounted to the osd_pglog mempool.
We have mitigated by keeping only 500 pglog entries for now, and at
the moment things look quite stable, without any mempools leaking.
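
(For reference, the cap is just the pg log options, something like:

    ceph config set osd osd_max_pg_log_entries 500
    ceph config set osd osd_min_pg_log_entries 500

-- 500 is simply the value we picked, not a recommendation, and
depending on version you may need an OSD restart or some new writes
before the logs actually trim.)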
I also noticed a possible relationship with scrubbing -- one week ago
we increased osd_max_scrubs to 5 to clear out a scrubbing backlog; I
wonder if the increased read/write ratio somehow led to the exploding
buffer_anon. Do things stabilize on your side if you temporarily
disable scrubbing?
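
(To test that quickly, the cluster-wide flags should do, e.g.:

    ceph osd set noscrub
    ceph osd set nodeep-scrub

and "ceph osd unset ..." to re-enable once you've watched the mempools
for a day or so.)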

Otherwise, we've just instrumented the OSDs on this cluster so we can
track all the mempools in Grafana. If we learn anything we'll share
it here.
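
(In case it's useful to others: the raw numbers can be pulled from the
admin socket, along these lines, assuming the usual dump_mempools JSON
layout:

    ceph daemon osd.0 dump_mempools | \
        jq '.mempool.by_pool | {osd_pglog, buffer_anon}'

The per-pool bytes/items counters are the interesting bits to graph.)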

Cheers, Dan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


