On Thu, Nov 17, 2022 at 09:21:10PM +0530, Aneesh Kumar K V wrote:
> On 11/17/22 9:12 PM, Aneesh Kumar K V wrote:
> > On 11/17/22 8:42 PM, Johannes Weiner wrote:
> >> Hi Aneesh,
> >>
> >> On Thu, Nov 17, 2022 at 12:24:13PM +0530, Aneesh Kumar K.V wrote:
> >>> Currently, we don't pause in balance_dirty_pages with cgroup v1 when a
> >>> task dirties too many pages w.r.t. the memory limit in the memcg. This
> >>> is because with cgroup v1 all the limits are checked against globally
> >>> available resources. So on a system with a large amount of memory, a
> >>> cgroup with a smaller limit can easily hit OOM if the task within the
> >>> cgroup continuously dirties pages.
> >>
> >> Page reclaim has special writeback throttling for cgroup1, see the
> >> folio_wait_writeback() in shrink_folio_list(). It's not as smooth as
> >> proper dirty throttling, but it should prevent OOMs.
> >>
> >> Is this not working anymore?
> >
> > The test is a simple dd test on a 256GB system.
> >
> > root@lp2:/sys/fs/cgroup/memory# mkdir test
> > root@lp2:/sys/fs/cgroup/memory# cd test/
> > root@lp2:/sys/fs/cgroup/memory/test# echo 120M > memory.limit_in_bytes
> > root@lp2:/sys/fs/cgroup/memory/test# echo $$ > tasks
> > root@lp2:/sys/fs/cgroup/memory/test# dd if=/dev/zero of=/home/kvaneesh/test bs=1M
> > Killed
> >
> > Will it hit folio_wait_writeback at all? Since this is sequential I/O,
> > none of the folios we are writing will be under writeback.
>
> Another way to look at this: if writeback is never started via
> balance_dirty_pages, will we find folios in shrink_folio_list() that are
> under writeback?

The flushers are started from reclaim if necessary. See this code from
shrink_inactive_list():

	/*
	 * If dirty folios are scanned that are not queued for IO, it
	 * implies that flushers are not doing their job. This can
	 * happen when memory pressure pushes dirty folios to the end of
	 * the LRU before the dirty limits are breached and the dirty
	 * data has expired. It can also happen when the proportion of
	 * dirty folios grows not through writes but through memory
	 * pressure reclaiming all the clean cache. And in some cases,
	 * the flushers simply cannot keep up with the allocation
	 * rate. Nudge the flusher threads in case they are asleep.
	 */
	if (stat.nr_unqueued_dirty == nr_taken)
		wakeup_flusher_threads(WB_REASON_VMSCAN);

It sounds like there isn't enough time for writeback to commence before
the memcg already declares OOM.

If you place a reclaim_throttle(VMSCAN_THROTTLE_WRITEBACK) after that
wakeup, does that fix the issue?
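
To make that concrete, something along these lines is what I have in
mind (untested sketch against shrink_inactive_list() in mm/vmscan.c; it
assumes the pgdat pointer that function already has in scope):

	if (stat.nr_unqueued_dirty == nr_taken) {
		wakeup_flusher_threads(WB_REASON_VMSCAN);
		/*
		 * Untested: stall the reclaimer until some writeback
		 * completes (or the throttle times out), so the
		 * just-woken flushers get a chance to queue the dirty
		 * folios for IO before the memcg runs out of
		 * reclaimable pages and declares OOM.
		 */
		reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
	}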