On 11/17/22 11:20 PM, Johannes Weiner wrote:
> On Thu, Nov 17, 2022 at 10:46:53PM +0530, Aneesh Kumar K V wrote:
>> On 11/17/22 10:01 PM, Johannes Weiner wrote:
>>> On Thu, Nov 17, 2022 at 09:21:10PM +0530, Aneesh Kumar K V wrote:
>>>> On 11/17/22 9:12 PM, Aneesh Kumar K V wrote:
>>>>> On 11/17/22 8:42 PM, Johannes Weiner wrote:
>>>>>> Hi Aneesh,
>>>>>>
>>>>>> On Thu, Nov 17, 2022 at 12:24:13PM +0530, Aneesh Kumar K.V wrote:
>>>>>>> Currently, we don't pause in balance_dirty_pages with cgroup v1 when we
>>>>>>> have a task dirtying too many pages w.r.t. the memory limit in the memcg.
>>>>>>> This is because with cgroup v1 all the limits are checked against global
>>>>>>> available resources. So on a system with a large amount of memory, a
>>>>>>> cgroup with a smaller limit can easily hit OOM if the task within the
>>>>>>> cgroup continuously dirties pages.
>>>>>>
>>>>>> Page reclaim has special writeback throttling for cgroup1, see the
>>>>>> folio_wait_writeback() in shrink_folio_list(). It's not as smooth as
>>>>>> proper dirty throttling, but it should prevent OOMs.
>>>>>>
>>>>>> Is this not working anymore?
>>>>>
>>>>> The test is a simple dd test on a 256GB system.
>>>>>
>>>>> root@lp2:/sys/fs/cgroup/memory# mkdir test
>>>>> root@lp2:/sys/fs/cgroup/memory# cd test/
>>>>> root@lp2:/sys/fs/cgroup/memory/test# echo 120M > memory.limit_in_bytes
>>>>> root@lp2:/sys/fs/cgroup/memory/test# echo $$ > tasks
>>>>> root@lp2:/sys/fs/cgroup/memory/test# dd if=/dev/zero of=/home/kvaneesh/test bs=1M
>>>>> Killed
>>>>>
>>>>> Will it hit folio_wait_writeback() at all? Since this is sequential I/O,
>>>>> none of the folios we are writing will be in writeback.
>>>>
>>>> Another way to look at this is: if writeback is never started via
>>>> balance_dirty_pages, will we find folios in shrink_folio_list() that are
>>>> in writeback?
>>>
>>> The flushers are started from reclaim if necessary. See this code from
>>> shrink_inactive_list():
>>>
>>> 	/*
>>> 	 * If dirty folios are scanned that are not queued for IO, it
>>> 	 * implies that flushers are not doing their job. This can
>>> 	 * happen when memory pressure pushes dirty folios to the end of
>>> 	 * the LRU before the dirty limits are breached and the dirty
>>> 	 * data has expired. It can also happen when the proportion of
>>> 	 * dirty folios grows not through writes but through memory
>>> 	 * pressure reclaiming all the clean cache. And in some cases,
>>> 	 * the flushers simply cannot keep up with the allocation
>>> 	 * rate. Nudge the flusher threads in case they are asleep.
>>> 	 */
>>> 	if (stat.nr_unqueued_dirty == nr_taken)
>>> 		wakeup_flusher_threads(WB_REASON_VMSCAN);
>>>
>>> It sounds like there isn't enough time for writeback to commence
>>> before the memcg already declares OOM.
>>>
>>> If you place a reclaim_throttle(VMSCAN_THROTTLE_WRITEBACK) after that
>>> wakeup, does that fix the issue?
>>
>> Yes, that helped. One thing I noticed is that with that reclaim_throttle(),
>> we don't end up calling folio_wait_writeback() at all, but dd was still
>> able to continue until the file system got full.
>>
>> Without that reclaim_throttle(), we do end up calling folio_wait_writeback()
>> but at some point hit OOM.
>
> Interesting. This is probably owed to the discrepancy between total
> memory and the cgroup size. The flusher might put the occasional
> cgroup page under writeback, but cgroup reclaim will still see mostly
> dirty pages and not slow down enough.
>
> Would you mind sending a patch for adding that reclaim_throttle()?
> Gated on !writeback_throttling_sane(), with a short comment explaining
> that the flushers may not issue writeback quickly enough for cgroup1
> writeback throttling to work on larger systems with small cgroups.

I will do that.
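Something along the lines below is what I have in mind (an untested
sketch against shrink_inactive_list(), reusing the existing
reclaim_throttle() and writeback_throttling_sane() helpers and the
pgdat/sc already available there; the exact comment wording still needs
work):

	if (stat.nr_unqueued_dirty == nr_taken) {
		wakeup_flusher_threads(WB_REASON_VMSCAN);
		/*
		 * For cgroup v1 we rely on waking the flushers here and
		 * later waiting on folios under writeback. On a large
		 * system with a small cgroup the flushers may not issue
		 * writeback quickly enough for that to throttle the
		 * dirtier, so throttle reclaim directly.
		 */
		if (!writeback_throttling_sane(sc))
			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
	}

-aneesh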