On Sat, 3 Aug 2019 07:01:55 -0700 Tejun Heo <tj@xxxxxxxxxx> wrote: > There's an inherent mismatch between memcg and writeback. The former > trackes ownership per-page while the latter per-inode. This was a > deliberate design decision because honoring per-page ownership in the > writeback path is complicated, may lead to higher CPU and IO overheads > and deemed unnecessary given that write-sharing an inode across > different cgroups isn't a common use-case. > > Combined with inode majority-writer ownership switching, this works > well enough in most cases but there are some pathological cases. For > example, let's say there are two cgroups A and B which keep writing to > different but confined parts of the same inode. B owns the inode and > A's memory is limited far below B's. A's dirty ratio can rise enough > to trigger balance_dirty_pages() sleeps but B's can be low enough to > avoid triggering background writeback. A will be slowed down without > a way to make writeback of the dirty pages happen. > > This patch implements foreign dirty recording and foreign mechanism so > that when a memcg encounters a condition as above it can trigger > flushes on bdi_writebacks which can clean its pages. Please see the > comment on top of mem_cgroup_track_foreign_dirty_slowpath() for > details. > > ... > > +void mem_cgroup_track_foreign_dirty_slowpath(struct page *page, > + struct bdi_writeback *wb) > +{ > + struct mem_cgroup *memcg = page->mem_cgroup; > + struct memcg_cgwb_frn *frn; > + u64 now = jiffies_64; > + u64 oldest_at = now; > + int oldest = -1; > + int i; > + > + /* > + * Pick the slot to use. If there is already a slot for @wb, keep > + * using it. If not replace the oldest one which isn't being > + * written out. > + */ > + for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) { > + frn = &memcg->cgwb_frn[i]; > + if (frn->bdi_id == wb->bdi->id && > + frn->memcg_id == wb->memcg_css->id) > + break; > + if (frn->at < oldest_at && atomic_read(&frn->done.cnt) == 1) { > + oldest = i; > + oldest_at = frn->at; > + } > + } > + > + if (i < MEMCG_CGWB_FRN_CNT) { > + unsigned long update_intv = > + min_t(unsigned long, HZ, > + msecs_to_jiffies(dirty_expire_interval * 10) / 8); An explanation of what's going on here would be helpful. Why "* 1.25" and not, umm "* 1.24"? > + /* > + * Re-using an existing one. Let's update timestamp lazily > + * to avoid making the cacheline hot. > + */ > + if (frn->at < now - update_intv) > + frn->at = now; > + } else if (oldest >= 0) { > + /* replace the oldest free one */ > + frn = &memcg->cgwb_frn[oldest]; > + frn->bdi_id = wb->bdi->id; > + frn->memcg_id = wb->memcg_css->id; > + frn->at = now; > + } > +} > + > +/* > + * Issue foreign writeback flushes for recorded foreign dirtying events > + * which haven't expired yet and aren't already being written out. > + */ > +void mem_cgroup_flush_foreign(struct bdi_writeback *wb) > +{ > + struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css); > + unsigned long intv = msecs_to_jiffies(dirty_expire_interval * 10); Ditto. > + u64 now = jiffies_64; > + int i; > + > + for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++) { > + struct memcg_cgwb_frn *frn = &memcg->cgwb_frn[i]; > + > + if (frn->at > now - intv && atomic_read(&frn->done.cnt) == 1) { > + frn->at = 0; > + cgroup_writeback_by_id(frn->bdi_id, frn->memcg_id, > + LONG_MAX, WB_REASON_FOREIGN_FLUSH, > + &frn->done); > + } > + } > +} > +