Re: another osd_pglog memory usage incident

On 07.10.20 21:00, Wido den Hollander wrote:


On 07/10/2020 16:00, Dan van der Ster wrote:
On Wed, Oct 7, 2020 at 3:29 PM Wido den Hollander <wido@xxxxxxxx> wrote:



On 07/10/2020 14:08, Dan van der Ster wrote:
Hi all,

This morning some osds in our S3 cluster started going OOM, after
restarting them I noticed that the osd_pglog is using >1.5GB per osd.
(This is on an osd with osd_memory_target = 2GB, hosting 112PGs, all
PGs are active+clean).
[...]

Hi all,

As Wido said, our case may be a bit different.

This is still on 14.2.8. Trouble started with lots of small objects: there were 2 Veeam buckets with more than 400M objects each, on a pool with EC 8+3, which means about 10 billion object shards. DB space on SSD was tiny (the hosts were originally built for filestore, so there was only space for 25GB; with the default RocksDB level sizes only about 3GB of that is really usable, as we know now).

Then OSD memory started to grow, mostly buffer_anon. Decreasing osd_max_pg_log_entries helped (with buffer_anon!). We added RAM, only to have more OOMs a few days later. And then we realized that the DB slow bytes had started to grow without bound.
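For reference, these are roughly the commands involved; osd.12 and the value 1000 are just placeholders, not a recommendation:

  # per-OSD mempool breakdown (buffer_anon, osd_pglog, bluestore caches, ...)
  # run on the host where the OSD lives
  ceph daemon osd.12 dump_mempools

  # cap the pg log length for all OSDs (example value only;
  # osd_min_pg_log_entries is the related lower bound and is often lowered too)
  ceph config set osd osd_max_pg_log_entries 1000
  ceph config set osd osd_min_pg_log_entries 1000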

We could delete the objects (that took several weeks), and there were no OOMs during this time. But afterwards buffer_anon started growing again.

At one point I observed free memory improving while a customer was writing heavily. So we started to write constantly ourselves (small objects to dummy buckets). This helps with buffer_anon and also with the DB growth.
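As an illustration only (s3cmd, the bucket name, the 4KB object size and the interval are made up, not our actual tooling), a loop like this is all that is needed to keep a trickle of small writes going:

  # constant small writes to a dummy bucket
  while true; do
    head -c 4096 /dev/urandom > /tmp/dummy-obj
    s3cmd put /tmp/dummy-obj s3://dummy-writes/obj-$(date +%s%N)
    sleep 5
  done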

It seems that at least 14.2.8 does not trim buffer_anon periodically, but only when writing:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/EVPELHOL4KLRJ4CKOOD2JECBMUKE4EKB/

One possible explanation (just an idea) for the large amount of buffer_anon: DB slow bytes got spread over lots and lots of small allocations on the HDD.

We rebuilt all OSDs with bigger DBs (31GB), and we now limit the amount of slow bytes with manual compactions.
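Per OSD this boils down to something like the following (osd.12 is just an example, in practice we iterate over all OSDs; the daemon command has to run on the OSD's host):

  # how much BlueFS data sits on the fast DB device vs. spilled over to the HDD
  ceph daemon osd.12 perf dump | jq '.bluefs | {db_used_bytes, slow_used_bytes}'

  # manual RocksDB compaction for one OSD
  ceph tell osd.12 compact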

With the large number of small objects gone, the cluster was still unhealthy. Then we realized that the RGW garbage collector did not keep up with the load. A possible reason for the GC backlog: customers using features like versioning more heavily than before.

There are high refcounts in GC, and there were times with lots of HEAD requests from some customers.

GC load is mostly read load. Combined with only low write activity, this may be problematic.

We tuned the GC up and the backlog is going down now, slowly (again, this takes weeks).
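The knobs involved are the usual rgw_gc_* options. The values below are only examples, not our exact settings, and a ceph.conf change like this needs a radosgw restart to take effect:

  # inspect the GC backlog (output can be very large)
  radosgw-admin gc list --include-all

  # [client.rgw.<name>] section in ceph.conf -- example values only
  rgw_gc_max_concurrent_io = 20
  rgw_gc_max_trim_chunk = 64
  rgw_gc_processor_period = 900
  rgw_gc_obj_min_wait = 1800

  # optionally drain the backlog by hand
  radosgw-admin gc process --include-all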

Cheers
Harry
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


