On 11/17/22 8:42 PM, Johannes Weiner wrote:
> Hi Aneesh,
>
> On Thu, Nov 17, 2022 at 12:24:13PM +0530, Aneesh Kumar K.V wrote:
>> Currently, we don't pause in balance_dirty_pages() with cgroup v1 when
>> we have a task dirtying too many pages w.r.t. the memory limit in the
>> memcg. This is because with cgroup v1 all the limits are checked
>> against globally available resources. So on a system with a large
>> amount of memory, a cgroup with a smaller limit can easily hit OOM if
>> the task within the cgroup continuously dirties pages.
>
> Page reclaim has special writeback throttling for cgroup1, see the
> folio_wait_writeback() in shrink_folio_list(). It's not as smooth as
> proper dirty throttling, but it should prevent OOMs.
>
> Is this not working anymore?

The test is a simple dd test on a 256GB system.

root@lp2:/sys/fs/cgroup/memory# mkdir test
root@lp2:/sys/fs/cgroup/memory# cd test/
root@lp2:/sys/fs/cgroup/memory/test# echo 120M > memory.limit_in_bytes
root@lp2:/sys/fs/cgroup/memory/test# echo $$ > tasks
root@lp2:/sys/fs/cgroup/memory/test# dd if=/dev/zero of=/home/kvaneesh/test bs=1M
Killed

Will it hit folio_wait_writeback() at all? Because this is sequential
I/O, none of the folios we are writing will be under writeback when
reclaim finds them.

>
>> Shouldn't we throttle the task based on the memcg limits in this case?
>> commit 9badce000e2c ("cgroup, writeback: don't enable cgroup writeback
>> on traditional hierarchies") indicates we ran into issues with enabling
>> cgroup writeback with v1. But we can still keep the global writeback
>> domain and check the throttling needs against memcg limits in
>> balance_dirty_pages()?
>
> Deciding when to throttle is only one side of the coin, though.
>
> The other side is selective flushing in the IO context of whoever
> generated the dirty data, and matching the rate of dirtying to the
> rate of writeback. This isn't really possible in cgroup1, as the
> domains for memory and IO control could be disjunct.
>
> For example, if a fast-IO cgroup shares memory with a slow-IO cgroup,
> what's the IO context for flushing the shared dirty data? What's the
> throttling rate you apply to dirtiers?

I am not using the I/O controller at all. Only the cpu and memory
controllers are used, and what I am observing is that, depending on the
system memory size, a container with the same memory limit will hit OOM
on some machines and not on others.

One of the challenges with the above test is that we are not able to
reclaim via shrink_folio_list(), because these are dirty file LRU pages
and we take the code path below:

	if (folio_is_file_lru(folio) &&
	    (!current_is_kswapd() ||
	     !folio_test_reclaim(folio) ||
	     !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
		......
		goto activate_locked;
	}

-aneesh
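
P.S.: For reference, the cgroup1 throttling you mention only stalls on
folios that are already under writeback. A trimmed sketch of that branch
of shrink_folio_list(), paraphrased from mm/vmscan.c around v6.1
(comments mine, details approximate):

	if (folio_test_writeback(folio)) {
		/* kswapd on a writeback-flooded node: note it, don't stall here */
		if (current_is_kswapd() &&
		    folio_test_reclaim(folio) &&
		    test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
			stat->nr_immediate += nr_pages;
			goto activate_locked;
		/* sane throttling (global/cgroup2), or first pass over folio */
		} else if (writeback_throttling_sane(sc) ||
			   !folio_test_reclaim(folio) ||
			   !may_enter_fs(folio, sc->gfp_mask)) {
			folio_set_reclaim(folio);
			stat->nr_writeback += nr_pages;
			goto activate_locked;
		/* cgroup1 memcg reclaim: wait for writeback to finish */
		} else {
			folio_unlock(folio);
			folio_wait_writeback(folio);
			/* then retry the same folio */
			list_add_tail(&folio->lru, folio_list);
			continue;
		}
	}

With a streaming dd, the folios memcg reclaim isolates are dirty but not
yet under writeback, so they never reach the folio_wait_writeback() arm;
they hit the dirty-folio check quoted above and get re-activated instead.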
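
P.P.S.: To make the balance_dirty_pages() idea from my quoted mail more
concrete, here is a purely hypothetical sketch, not actual kernel code:
memcg_dirty_exceeded() and the 1/4 ratio are made up for illustration,
and memcg_page_state() is internal to mm/memcontrol.c, so something like
this would have to live there (or grow an accessor):

	/*
	 * Illustrative only: does this memcg have too many dirty or
	 * writeback pages relative to its own (v1) hard limit?
	 */
	static bool memcg_dirty_exceeded(struct mem_cgroup *memcg)
	{
		/* hard limit in pages; PAGE_COUNTER_MAX when unlimited */
		unsigned long limit = READ_ONCE(memcg->memory.max);
		unsigned long dirty = memcg_page_state(memcg, NR_FILE_DIRTY);
		unsigned long wb = memcg_page_state(memcg, NR_WRITEBACK);

		if (limit == PAGE_COUNTER_MAX)
			return false;

		/* made-up ratio: throttle once dirty+writeback pass 1/4 */
		return dirty + wb > limit / 4;
	}

balance_dirty_pages() would pause the dirtier when this returns true,
while writeback itself keeps running against the single global domain.
That only addresses the throttling side, though; as you point out, it
does nothing to target the flushing.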