On Wed, Oct 7, 2020 at 3:29 PM Wido den Hollander <wido@xxxxxxxx> wrote:
>
> On 07/10/2020 14:08, Dan van der Ster wrote:
> > Hi all,
> >
> > This morning some osds in our S3 cluster started going OOM; after
> > restarting them I noticed that the osd_pglog is using >1.5GB per osd.
> > (This is on an osd with osd_memory_target = 2GB, hosting 112 PGs, all
> > PGs active+clean.)
> >
> > After reading through this list and trying a few things, I'd like to
> > share the following observations for your feedback:
> >
> > 1. The pg log contains 3000 entries by default (on nautilus). These
> > 3000 entries can legitimately consume gigabytes of RAM for some
> > use-cases. (I haven't determined exactly which ops triggered this
> > today.)
> > 2. The pg log length is decided by the primary osd -- setting
> > osd_max_pg_log_entries/osd_min_pg_log_entries on one single OSD does
> > not have a big effect (because most of the PGs are primaried somewhere
> > else). You need to set it on all the osds for it to apply to all PGs.
> > 3. We eventually set osd_max_pg_log_entries = 500 everywhere. This
> > decreased the osd_pglog mempool from more than 1.5GB on our largest
> > osds to less than 500MB.
> > 4. The osd_pglog mempool is not accounted for in the osd_memory_target
> > (in nautilus).
> > 5. I have opened a feature request to limit the pg_log length by
> > memory size (https://tracker.ceph.com/issues/47775). This way we could
> > allocate a fraction of memory to the pg log and it would shorten the
> > pg log length (budget) accordingly.
> > 6. Would it be feasible to add an osd option to 'trim pg log at boot'?
> > This way we could avoid the cumbersome ceph-objectstore-tool
> > trim-pg-log in cases of disaster (osds going OOM at boot).
> >
> > For those who have had pglog memory usage incidents -- does this match
> > your experience?
>
> Not really. I have an active case where reducing the pglog length works
> for a short period, after which memory consumption grows again.
>
> These OSDs however show data being used in buffer_anon, which is
> probably something different.

Well, in fact at the very beginning of this incident we had excessive
buffer_anon -- and since I only restarted the osds a couple of hours ago,
buffer_anon might indeed still be growing:

# ceph daemon osd.245 dump_mempools | jq .mempool.by_pool.buffer_anon
{
  "items": 36762,
  "bytes": 436869187
}

Do you have any clues yet about what is triggering that? How do you work
around it? Is there a tracker for this?

-- dan

> Regarding the trim on boot, that sounds feasible. I already added a
> 'compact on boot' setting, but trimming all PGs on boot should be
> doable. It loads all the PGs and at that point they can be trimmed.
>
> Wido
>
> > Thanks!
> >
> > Dan

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
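To illustrate point 2 above -- that the limit only takes effect if it is
set on all osds, not on a single one -- a minimal sketch using the
central config store on nautilus; the value 500 and osd.245 are just the
examples from this thread, adjust for your cluster:

# Apply the shorter pg log limits to every OSD via the central config
# store, rather than one daemon at a time.
ceph config set osd osd_max_pg_log_entries 500
ceph config set osd osd_min_pg_log_entries 500

# Spot-check that a given OSD picked up the new value.
ceph daemon osd.245 config get osd_max_pg_log_entries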
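To see how much the osd_pglog mempool actually consumes on each OSD, the
dump_mempools/jq query above can be looped over the admin sockets of a
host; a rough sketch, assuming the default /var/run/ceph socket paths and
that jq is installed:

# Print the osd_pglog mempool size (in bytes) for every OSD on this host.
for sock in /var/run/ceph/ceph-osd.*.asok; do
    id=${sock##*/}; id=${id#ceph-osd.}; id=${id%.asok}
    bytes=$(ceph daemon "$sock" dump_mempools | jq '.mempool.by_pool.osd_pglog.bytes')
    echo "osd.${id} osd_pglog_bytes=${bytes}"
done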
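And for the 'cumbersome' offline trim mentioned in point 6, the rough
shape of the procedure is below; the osd id, data path and pgid are
placeholders, the OSD must be stopped first, and it has to be repeated
per PG -- check the docs for your release before running this against a
production osd:

# Stop the OSD, trim the pg log of one PG offline, then start it again.
systemctl stop ceph-osd@245
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-245 \
    --pgid 7.1a --op trim-pg-log
systemctl start ceph-osd@245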