Re: another osd_pglog memory usage incident

On 07.10.20 21:00, Wido den Hollander wrote:


On 07/10/2020 16:00, Dan van der Ster wrote:
On Wed, Oct 7, 2020 at 3:29 PM Wido den Hollander <wido@xxxxxxxx> wrote:



On 07/10/2020 14:08, Dan van der Ster wrote:
Hi all,

This morning some osds in our S3 cluster started going OOM, after
restarting them I noticed that the osd_pglog is using >1.5GB per osd.
(This is on an osd with osd_memory_target = 2GB, hosting 112PGs, all
PGs are active+clean).
[...]

Hi all,

As Wido said, our case may be a bit different.

This is still on 14.2.8. Trouble started with lots of small objects: there were 2 Veeam buckets with more than 400M objects each, on a pool with EC 8+3, which means about 10 billion object shards. DB space on SSD was tiny (the hosts were originally built for filestore, so there was only space for 25GB; with the default RocksDB level sizes only about 3GB of that is really usable, as we know now).

Then OSD memory started to grow, mostly buffer_anon. Decreasing osd_max_pg_log_entries helped (with buffer_anon!). We added RAM, only to have more OOMs a few days later. And then we realized that the DB slow bytes had started to grow without bound.
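For reference, these are roughly the commands involved; osd.12 and the value 1000 are just placeholders, not a recommendation:

  # per-OSD mempool breakdown (buffer_anon, osd_pglog, bluestore caches, ...)
  # run on the host where the OSD lives
  ceph daemon osd.12 dump_mempools

  # cap the pg log length for all OSDs (example value only;
  # osd_min_pg_log_entries is the related lower bound and is often lowered too)
  ceph config set osd osd_max_pg_log_entries 1000
  ceph config set osd osd_min_pg_log_entries 1000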

We could delete the objects (that took several weeks), and there were no OOMs during this time. But afterwards buffer_anon started growing again.

At one point I observed free memory improving while a customer was writing heavily. So we started to write constantly ourselves (small objects to dummy buckets). This helps with buffer_anon and also with the DB growth.
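As an illustration only (s3cmd, the bucket name, the 4KB object size and the interval are made up, not our actual tooling), a loop like this is all that is needed to keep a trickle of small writes going:

  # constant small writes to a dummy bucket
  while true; do
    head -c 4096 /dev/urandom > /tmp/dummy-obj
    s3cmd put /tmp/dummy-obj s3://dummy-writes/obj-$(date +%s%N)
    sleep 5
  done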

It seems that at least 14.2.8 does not trim buffer_anon periodically, but only when writing:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/EVPELHOL4KLRJ4CKOOD2JECBMUKE4EKB/

One possible explanation (just an idea) for the large amount of buffer_anon: DB slow bytes got spread over lots and lots of small allocations on the HDD.

We rebuilt all OSDs with bigger DBs (31GB), and we now limit the amount of slow bytes with manual compactions.
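Per OSD this boils down to something like the following (osd.12 is just an example, in practice we iterate over all OSDs; the daemon command has to run on the OSD's host):

  # how much BlueFS data sits on the fast DB device vs. spilled over to the HDD
  ceph daemon osd.12 perf dump | jq '.bluefs | {db_used_bytes, slow_used_bytes}'

  # manual RocksDB compaction for one OSD
  ceph tell osd.12 compact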

With the large number of small objects gone, the cluster was still unhealthy. Then we realized that the RGW garbage collector did not keep up with the load. A possible reason for the GC backlog: customers using features like versioning more heavily than before.

There are high refcounts in GC, and there were times with lots of HEAD requests from some customers.

GC load is mostly read load. Combined with only low write activity, this may be problematic.

We tuned the GC up and the backlog is going down now, slowly (again, this takes weeks).
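The knobs involved are the usual rgw_gc_* options. The values below are only examples, not our exact settings, and a ceph.conf change like this needs a radosgw restart to take effect:

  # inspect the GC backlog (output can be very large)
  radosgw-admin gc list --include-all

  # [client.rgw.<name>] section in ceph.conf -- example values only
  rgw_gc_max_concurrent_io = 20
  rgw_gc_max_trim_chunk = 64
  rgw_gc_processor_period = 900
  rgw_gc_obj_min_wait = 1800

  # optionally drain the backlog by hand
  radosgw-admin gc process --include-all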

Cheers
Harry
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


