On 07/10/2020 16:00, Dan van der Ster wrote:
On Wed, Oct 7, 2020 at 3:29 PM Wido den Hollander <wido@xxxxxxxx> wrote:
On 07/10/2020 14:08, Dan van der Ster wrote:
Hi all,
This morning some osds in our S3 cluster started going OOM; after
restarting them I noticed that the osd_pglog mempool is using >1.5GB per
osd. (This is on an osd with osd_memory_target = 2GB, hosting 112 PGs,
all of them active+clean.)
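(For reference, the per-osd pg log memory can be read from the mempool
stats -- a quick sketch, with osd.NNN as a placeholder:
# ceph daemon osd.NNN dump_mempools | jq .mempool.by_pool.osd_pglog
The "bytes" field there is the pg log usage for that osd.)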
After reading through this list and trying a few things, I'd like to
share the following observations for your feedback:
1. The pg log contains 3000 entries by default (on nautilus). These
3000 entries can legitimately consume gigabytes of RAM for some
use cases. (I haven't determined exactly which ops triggered this
today).
2. The pg log length is decided by the primary osd -- setting
osd_max_pg_log_entries/osd_min_pg_log_entries on one single OSD does
not have a big effect (because most of the PGs are primaried somewhere
else). You need to set it on all the osds for it to be applied to all
PGs.
3. We eventually set osd_max_pg_log_entries = 500 everywhere. This
decreased the osd_pglog mempool from more than 1.5GB on our largest
osds to less than 500MB. (Example commands are sketched after point 6
below.)
4. The osd_pglog mempool is not accounted for in the osd_memory_target
(in nautilus).
5. I have opened a feature request to limit the pg_log length by
memory size (https://tracker.ceph.com/issues/47775). This way we could
allocate a fraction of memory to the pg log and it would shorten the
pglog length (budget) accordingly.
6. Would it be feasible to add an osd option to 'trim pg log at
boot'? This way we could avoid the cumbersome ceph-objectstore-tool
trim-pg-log in cases of disaster (osds going OOM at boot).
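To make points 2/3 and 6 concrete, the relevant commands look roughly
like the following (just a sketch -- osd ids, paths and the pgid are
placeholders, and the config-database commands assume mimic or newer).
Apply the shorter log length cluster-wide so every primary picks it up:
# ceph config set osd osd_max_pg_log_entries 500
# ceph config set osd osd_min_pg_log_entries 500
And the cumbersome offline trim from point 6, run with the osd stopped:
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NNN --pgid <pgid> --op trim-pg-log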
For those who have had pglog memory usage incidents -- does this match
your experience?
Not really. I have an active case where reducing the pglog length works
for a short period, after which memory consumption grows again.
These OSDs, however, show memory being consumed in buffer_anon, which is
probably something different.
Well, in fact at the very beginning of this incident we had excessive
buffer_anon -- and since I only restarted the osds a couple of hours
ago, buffer_anon might indeed still be growing:
# ceph daemon osd.245 dump_mempools | jq .mempool.by_pool.buffer_anon
{
"items": 36762,
"bytes": 436869187
}
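One simple way to watch whether it keeps climbing is to poll that
counter (just a sketch, any interval will do):
# while sleep 60; do ceph daemon osd.245 dump_mempools | jq .mempool.by_pool.buffer_anon.bytes; done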
Do you have any clue yet what is triggering that? How do you work
around it?
In this case writing to RGW seems to keep it workable. If we stop
writing to RADOS, the OSDs' memory explodes and they OOM.
We do not have a clue or solution yet.
In this case we also see a lot of BlueFS spillover and RocksDB growing
almost unbounded; a lot of compactions are required to keep it working.
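For context: the spillover shows up as BLUEFS_SPILLOVER in 'ceph health
detail', and the compactions can be kicked off per osd over the admin
socket -- roughly like this, assuming your version has the compact asok
command (osd.NNN is a placeholder):
# ceph health detail | grep -i spillover
# ceph daemon osd.NNN compact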
Is there a tracker for this?
No, not yet. We do have a couple of messages on the ML about this.
Wido
-- dan
Regarding the trim on boot, that sounds feasible. I already added a
'compact on boot' setting, and trimming all PGs on boot should be
doable as well: the OSD loads all the PGs at startup, and at that point
they can be trimmed.
Wido
Thanks!
Dan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx