Re: MGLRU premature memcg OOM on slow writes

Yu Zhao <yuzhao@xxxxxxxxxx> · Tue, 12 Mar 2024 14:11:23 -0600

On Tue, Mar 12, 2024 at 02:07:04PM -0600, Yu Zhao wrote:
> On Tue, Mar 12, 2024 at 09:44:19AM -0700, Axel Rasmussen wrote:
> > On Mon, Mar 11, 2024 at 2:11 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
> > >
> > > On Sat, Mar 9, 2024 at 3:19 AM Axel Rasmussen <axelrasmussen@xxxxxxxxxx> wrote:
> > > >
> > > > On Thu, Feb 29, 2024 at 4:30 PM Chris Down <chris@xxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > Axel Rasmussen writes:
> > > > > >A couple of dumb questions. In your test, do you have any of the following
> > > > > >configured / enabled?
> > > > > >
> > > > > >/proc/sys/vm/laptop_mode
> > > > > >memory.low
> > > > > >memory.min
> > > > >
> > > > > None of these are enabled. The issue is trivially reproducible by writing to
> > > > > any slow device with memory.max enabled, but from the code it looks like MGLRU
> > > > > is also susceptible to this on global reclaim (although it's less likely due to
> > > > > page diversity).
> > > > >
> > > > > >Besides that, it looks like the place non-MGLRU reclaim wakes up the
> > > > > >flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()).
> > > > > >Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it
> > > > > >looks like it simply will not do this.
> > > > > >
> > > > > >Yosry pointed out [1], where MGLRU used to call this but stopped doing that. It
> > > > > >makes sense to me at least that doing writeback every time we age is too
> > > > > >aggressive, but doing it in evict_folios() makes some sense to me, basically to
> > > > > >copy the behavior the non-MGLRU path (shrink_inactive_list()) has.
> > > > >
> > > > > Thanks! We may also need reclaim_throttle(), depending on how you implement it.
> > > > > Current non-MGLRU behaviour on slow storage is also highly suspect in terms of
> > > > > (lack of) throttling after moving away from VMSCAN_THROTTLE_WRITEBACK, but one
> > > > > thing at a time :-)
> > > >
> > > >
> > > > Hmm, so I have a patch which I think will help with this situation,
> > > > but I'm having some trouble reproducing the problem on 6.8-rc7 (so
> > > > then I can verify the patch fixes it).
> > >
> > > We encountered the same premature OOM issue caused by numerous dirty pages.
> > > The issue disappears after we revert the commit 14aa8b2d5c2e
> > > "mm/mglru: don't sync disk for each aging cycle"
> > >
> > > To aid in replicating the issue, we've developed a straightforward
> > > script, which consistently reproduces it, even on the latest kernel.
> > > You can find the script provided below:
> > >
> > > ```
> > > #!/bin/bash
> > >
> > > MEMCG="/sys/fs/cgroup/memory/mglru"
> > > ENABLE=$1
> > >
> > > # Avoid waking up the flusher
> > > sysctl -w vm.dirty_background_bytes=$((1024 * 1024 * 1024 *4))
> > > sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024 *4))
> > >
> > > if [ ! -d ${MEMCG} ]; then
> > >         mkdir -p ${MEMCG}
> > > fi
> > >
> > > echo $$ > ${MEMCG}/cgroup.procs
> > > echo 1g > ${MEMCG}/memory.limit_in_bytes
> > >
> > > if [ $ENABLE -eq 0 ]; then
> > >         echo 0 > /sys/kernel/mm/lru_gen/enabled
> > > else
> > >         echo 0x7 > /sys/kernel/mm/lru_gen/enabled
> > > fi
> > >
> > > dd if=/dev/zero of=/data0/mglru.test bs=1M count=1023
> > > rm -rf /data0/mglru.test
> > > ```
> > >
> > > This issue disappears as well after we disable the mglru.
> > >
> > > We hope this script proves helpful in identifying and addressing the
> > > root cause. We eagerly await your insights and proposed fixes.
> > 
> > Thanks Yafang, I was able to reproduce the issue using this script.
> > 
> > Perhaps interestingly, I was not able to reproduce it with cgroupv2
> > memcgs. I know writeback semantics are quite a bit different there, so
> > perhaps that explains why.
> > 
> > Unfortunately, it also reproduces even with the commit I had in mind
> > (basically stealing the "if (all isolated pages are unqueued dirty) {
> > wakeup_flusher_threads(); reclaim_throttle(); }" from
> > shrink_inactive_list, and adding it to MGLRU's evict_folios()). So
> > I'll need to spend some more time on this; I'm planning to send
> > something out for testing next week.
> 
> Hi Chris,
> 
> My apologies for not getting back to you sooner.
> 
> And thanks everyone for all the input!
> 
> My take is that Chris' premature OOM kills were NOT really due to
> the flusher not waking up or missing throttling.
> 
> Yes, these two are among the differences between the active/inactive
> LRU and MGLRU, but their roles, IMO, are not as important as the LRU
> positions of dirty pages. The active/inactive LRU moves dirty pages
> all the way to the end of the line (reclaim happens at the front)
> whereas MGLRU moves them into the middle, during direct reclaim. The
> rationale for MGLRU was that this way those dirty pages would still
> be counted as "inactive" (or cold).
> 
> This theory can be quickly verified by comparing how much
> nr_vmscan_immediate_reclaim grows, i.e.,
> 
>   Before the copy
>     grep nr_vmscan_immediate_reclaim /proc/vmstat
>   And then after the copy
>     grep nr_vmscan_immediate_reclaim /proc/vmstat
> 
> The growth should be trivial for MGLRU and nontrivial for the
> active/inactive LRU.
> 
> If this is indeed the case, I'd appreciate very much if anyone could
> try the following (I'll try it myself too later next week).
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4255619a1a31..020f5d98b9a1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4273,10 +4273,13 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
>  	}
>  
>  	/* waiting for writeback */
> -	if (folio_test_locked(folio) || folio_test_writeback(folio) ||
> -	    (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> -		gen = folio_inc_gen(lruvec, folio, true);
> -		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> +	if (folio_test_writeback(folio) || (type == LRU_GEN_FILE && folio_test_dirty(folio))) {
> +		DEFINE_MAX_SEQ(lruvec);
> +		int old_gen, new_gen = lru_gen_from_seq(max_seq);
> +
> +		old_gen = folio_update_gen(folio, new_gen);
> +		lru_gen_update_size(lruvec, folio, old_gen, new_gen);
> +		list_move(&folio->lru, &lrugen->folios[new_gen][type][zone]);

Sorry missing one line here:

 +		folio_set_reclaim(folio);

>  		return true;
>  	}
>  
> > > > If I understand the issue right, all we should need to do is get a
> > > > slow filesystem, and then generate a bunch of dirty file pages on it,
> > > > while running in a tightly constrained memcg. To that end, I tried the
> > > > following script. But, in reality I seem to get little or no
> > > > accumulation of dirty file pages.
> > > >
> > > > I thought maybe fio does something different than rsync which you said
> > > > you originally tried, so I also tried rsync (copying /usr/bin into
> > > > this loop mount) and didn't run into an OOM situation either.
> > > >
> > > > Maybe some dirty ratio settings need tweaking or something to get the
> > > > behavior you see? Or maybe my test has a dumb mistake in it. :)
> > > >
> > > >
> > > >
> > > > #!/usr/bin/env bash
> > > >
> > > > echo 0 > /proc/sys/vm/laptop_mode || exit 1
> > > > echo y > /sys/kernel/mm/lru_gen/enabled || exit 1
> > > >
> > > > echo "Allocate disk image"
> > > > IMAGE_SIZE_MIB=1024
> > > > IMAGE_PATH=/tmp/slow.img
> > > > dd if=/dev/zero of=$IMAGE_PATH bs=1024k count=$IMAGE_SIZE_MIB || exit 1
> > > >
> > > > echo "Setup loop device"
> > > > LOOP_DEV=$(losetup --show --find $IMAGE_PATH) || exit 1
> > > > LOOP_BLOCKS=$(blockdev --getsize $LOOP_DEV) || exit 1
> > > >
> > > > echo "Create dm-slow"
> > > > DM_NAME=dm-slow
> > > > DM_DEV=/dev/mapper/$DM_NAME
> > > > echo "0 $LOOP_BLOCKS delay $LOOP_DEV 0 100" | dmsetup create $DM_NAME || exit 1
> > > >
> > > > echo "Create fs"
> > > > mkfs.ext4 "$DM_DEV" || exit 1
> > > >
> > > > echo "Mount fs"
> > > > MOUNT_PATH="/tmp/$DM_NAME"
> > > > mkdir -p "$MOUNT_PATH" || exit 1
> > > > mount -t ext4 "$DM_DEV" "$MOUNT_PATH" || exit 1
> > > >
> > > > echo "Generate dirty file pages"
> > > > systemd-run --wait --pipe --collect -p MemoryMax=32M \
> > > >         fio -name=writes -directory=$MOUNT_PATH -readwrite=randwrite \
> > > >         -numjobs=10 -nrfiles=90 -filesize=1048576 \
> > > >         -fallocate=posix \
> > > >         -blocksize=4k -ioengine=mmap \
> > > >         -direct=0 -buffered=1 -fsync=0 -fdatasync=0 -sync=0 \
> > > >         -runtime=300 -time_based