On Mon, Mar 11, 2024 at 2:11 AM Yafang Shao <laoar.shao@xxxxxxxxx> wrote:
>
> On Sat, Mar 9, 2024 at 3:19 AM Axel Rasmussen <axelrasmussen@xxxxxxxxxx> wrote:
> >
> > On Thu, Feb 29, 2024 at 4:30 PM Chris Down <chris@xxxxxxxxxxxxxx> wrote:
> > >
> > > Axel Rasmussen writes:
> > > > A couple of dumb questions. In your test, do you have any of the following
> > > > configured / enabled?
> > > >
> > > > /proc/sys/vm/laptop_mode
> > > > memory.low
> > > > memory.min
> > >
> > > None of these are enabled. The issue is trivially reproducible by writing to
> > > any slow device with memory.max set, but from the code it looks like MGLRU
> > > is also susceptible to this on global reclaim (although it's less likely due to
> > > page diversity).
> > >
> > > > Besides that, it looks like the place non-MGLRU reclaim wakes up the
> > > > flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()).
> > > > Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it
> > > > looks like it simply will not do this.
> > > >
> > > > Yosry pointed out [1], where MGLRU used to call this but stopped doing so. It
> > > > makes sense to me that doing writeback every time we age is too aggressive,
> > > > but doing it in evict_folios() seems reasonable, basically to copy the
> > > > behavior the non-MGLRU path (shrink_inactive_list()) has.
> > >
> > > Thanks! We may also need reclaim_throttle(), depending on how you implement it.
> > > Current non-MGLRU behaviour on slow storage is also highly suspect in terms of
> > > (lack of) throttling after moving away from VMSCAN_THROTTLE_WRITEBACK, but one
> > > thing at a time :-)
> >
> > Hmm, so I have a patch which I think will help with this situation,
> > but I'm having some trouble reproducing the problem on 6.8-rc7 (so
> > that I can verify the patch fixes it).
>
> We encountered the same premature OOM issue caused by numerous dirty pages.
> The issue disappears after we revert commit 14aa8b2d5c2e
> ("mm/mglru: don't sync disk for each aging cycle").
>
> To aid in replicating the issue, we've developed a straightforward
> script which consistently reproduces it, even on the latest kernel.
> You can find the script below:
>
> ```
> #!/bin/bash
>
> MEMCG="/sys/fs/cgroup/memory/mglru"
> ENABLE=$1
>
> # Avoid waking up the flusher
> sysctl -w vm.dirty_background_bytes=$((1024 * 1024 * 1024 * 4))
> sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024 * 4))
>
> if [ ! -d ${MEMCG} ]; then
>     mkdir -p ${MEMCG}
> fi
>
> echo $$ > ${MEMCG}/cgroup.procs
> echo 1g > ${MEMCG}/memory.limit_in_bytes
>
> if [ $ENABLE -eq 0 ]; then
>     echo 0 > /sys/kernel/mm/lru_gen/enabled
> else
>     echo 0x7 > /sys/kernel/mm/lru_gen/enabled
> fi
>
> dd if=/dev/zero of=/data0/mglru.test bs=1M count=1023
> rm -rf /data0/mglru.test
> ```
>
> The issue also disappears after we disable MGLRU.
>
> We hope this script proves helpful in identifying and addressing the
> root cause. We eagerly await your insights and proposed fixes.

Thanks Yafang, I was able to reproduce the issue using this script.

Perhaps interestingly, I was not able to reproduce it with cgroup v2
memcgs. I know writeback semantics are quite a bit different there, so
perhaps that explains why.

Unfortunately, it also reproduces even with the patch I had in mind
(basically stealing the "if (all isolated pages are unqueued dirty) {
wakeup_flusher_threads(); reclaim_throttle(); }" logic from
shrink_inactive_list() and adding it to MGLRU's evict_folios()).
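
Roughly, the idea was something like the sketch below. This is only a sketch
of the approach, not the actual patch; it assumes evict_folios()'s existing
locals in mm/vmscan.c (scanned, stat, sc, pgdat), and the real fix may end up
looking different:

```
/*
 * Sketch only: mirror shrink_inactive_list()'s "every isolated folio is
 * dirty but not yet queued for writeback" handling, placed in
 * evict_folios() after shrink_folio_list() has filled in 'stat'.
 */
if (stat.nr_unqueued_dirty == scanned) {
	/* The flushers are apparently not keeping up; kick them. */
	wakeup_flusher_threads(WB_REASON_VMSCAN);

	/*
	 * cgroup v1 cannot rely on balance_dirty_pages() to throttle
	 * dirtiers, so throttle reclaim directly, as the non-MGLRU
	 * path does.
	 */
	if (!writeback_throttling_sane(sc))
		reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
}
```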
So I'll need to spend some more time on this; I'm planning to send
something out for testing next week.

> >
> > If I understand the issue right, all we should need to do is get a
> > slow filesystem, and then generate a bunch of dirty file pages on it,
> > while running in a tightly constrained memcg. To that end, I tried the
> > following script. But in reality, I seem to get little or no
> > accumulation of dirty file pages.
> >
> > I thought maybe fio does something different from rsync, which you said
> > you originally tried, so I also tried rsync (copying /usr/bin into
> > this loop mount) and didn't run into an OOM situation either.
> >
> > Maybe some dirty ratio settings need tweaking or something to get the
> > behavior you see? Or maybe my test has a dumb mistake in it. :)
> >
> > #!/usr/bin/env bash
> >
> > echo 0 > /proc/sys/vm/laptop_mode || exit 1
> > echo y > /sys/kernel/mm/lru_gen/enabled || exit 1
> >
> > echo "Allocate disk image"
> > IMAGE_SIZE_MIB=1024
> > IMAGE_PATH=/tmp/slow.img
> > dd if=/dev/zero of=$IMAGE_PATH bs=1024k count=$IMAGE_SIZE_MIB || exit 1
> >
> > echo "Setup loop device"
> > LOOP_DEV=$(losetup --show --find $IMAGE_PATH) || exit 1
> > LOOP_BLOCKS=$(blockdev --getsize $LOOP_DEV) || exit 1
> >
> > echo "Create dm-slow"
> > DM_NAME=dm-slow
> > DM_DEV=/dev/mapper/$DM_NAME
> > echo "0 $LOOP_BLOCKS delay $LOOP_DEV 0 100" | dmsetup create $DM_NAME || exit 1
> >
> > echo "Create fs"
> > mkfs.ext4 "$DM_DEV" || exit 1
> >
> > echo "Mount fs"
> > MOUNT_PATH="/tmp/$DM_NAME"
> > mkdir -p "$MOUNT_PATH" || exit 1
> > mount -t ext4 "$DM_DEV" "$MOUNT_PATH" || exit 1
> >
> > echo "Generate dirty file pages"
> > systemd-run --wait --pipe --collect -p MemoryMax=32M \
> >   fio -name=writes -directory=$MOUNT_PATH -readwrite=randwrite \
> >   -numjobs=10 -nrfiles=90 -filesize=1048576 \
> >   -fallocate=posix \
> >   -blocksize=4k -ioengine=mmap \
> >   -direct=0 -buffered=1 -fsync=0 -fdatasync=0 -sync=0 \
> >   -runtime=300 -time_based
>
> --
> Regards
> Yafang