Re: [PATCH 05/24] mm/swap: move readahead policy checking into swapin_readahead

Kairui Song <ryncsn@xxxxxxxxx> · Tue, 21 Nov 2023 16:32:45 +0800

On Tue, Nov 21, 2023 at 15:41, Chris Li <chrisl@xxxxxxxxxx> wrote:
>
> On Mon, Nov 20, 2023 at 10:35 PM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> >
> > On Tue, Nov 21, 2023 at 14:18, Chris Li <chrisl@xxxxxxxxxx> wrote:
> > >
> > > On Sun, Nov 19, 2023 at 11:48 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
> > > >
> > > > From: Kairui Song <kasong@xxxxxxxxxxx>
> > > >
> > > > This makes swapin_readahead a main entry for swapin pages,
> > > > prepare for optimizations in later commits.
> > > >
> > > > This also makes swapoff able to make use of readahead checking
> > > > based on entry. Swapping off a 10G ZRAM (lzo-rle) is faster:
> > > >
> > > > Before:
> > > > time swapoff /dev/zram0
> > > > real    0m12.337s
> > > > user    0m0.001s
> > > > sys     0m12.329s
> > > >
> > > > After:
> > > > time swapoff /dev/zram0
> > > > real    0m9.728s
> > > > user    0m0.001s
> > > > sys     0m9.719s
> > > >
> > > > And what's more, because now swapoff will also make use of no-readahead
> > > > swapin helper, this also fixed a bug for no-readahead case (eg. ZRAM):
> > > > when a process that swapped out some memory previously was moved to a new
> > > > cgroup, and the original cgroup is dead, swapoff the swap device will
> > > > make the swapped in pages accounted into the process doing the swapoff
> > > > instead of the new cgroup the process was moved to.
> > > >
> > > > This can be easily reproduced by:
> > > > - Setup a ramdisk (eg. ZRAM) swap.
> > > > - Create memory cgroup A, B and C.
> > > > - Spawn process P1 in cgroup A and make it swap out some pages.
> > > > - Move process P1 to memory cgroup B.
> > > > - Destroy cgroup A.
> > > > - Do a swapoff in cgroup C.
> > > > - Swapped in pages is accounted into cgroup C.
>
> In a strange way it makes sense to charge to C.
> Swap out == free up memory.
> Swap in == consume memory.
> C turn off swap, effectively this behavior will consume a lot of memory.
> C gets charged, so if the C is out of memory, it will punish C.
> C will not be able to continue swap in memory. The problem gets under control.

Yes, I think charging either C or B makes sense in its own way. To
me, though, the current behavior is kind of counter-intuitive.

Imagine there is a cgroup PC1 with child cgroups CC1 and CC2. If a
process swapped out some memory in CC1, then moved to CC2, and CC1 is
dying, then on swapoff the charge will be moved out of PC1...

And swapoff often happens in some unlimited admin cgroup or some
cgroup for management agents.

If PC1 has a memory limit, a process in it can breach the limit easily:
we will see a process that never left PC1 with an RSS much higher
than the limit of PC1/CC1/CC2.

And if there is a limit for the management agent cgroup, the agent
will hit OOM instead of PC1.

Simply moving a process between child cgroups of the same parent
cgroup won't cause a similar issue; things only get weird when swapoff
is involved.

And actually, with multiple layers of swap, it's less risky to swapoff
a device, since the other swap devices can catch the overcommitted
memory.

Oh, and there is one more case I forgot to cover in this series:
moving a process is indeed something that does not happen very
frequently, but a process that runs in a cgroup, then exits and leaves
some shmem swapped out could be a common case.
The current behavior on swapoff will move these charges out of the
original parent cgroup too.

So maybe a more ideal solution for swapoff is: simply always charge a
dying cgroup's parent cgroup?
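
To make the idea concrete, here is a rough sketch as a toy userspace
model. This is not kernel code and not what the series implements: the
struct, the field names and the toy_* helpers are all made up for
illustration, and I am assuming "dying" can be modelled as a simple
online flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy userspace model, NOT kernel code: every name here is hypothetical. */
struct toy_cgroup {
	const char *name;
	struct toy_cgroup *parent;	/* NULL for the root */
	bool online;			/* false once the cgroup is dying */
	long charged_pages;		/* hierarchical charge counter */
};

/*
 * Pick the cgroup to charge for a swapped-in page: the cgroup recorded
 * at swapout time if it is still online, otherwise its nearest online
 * ancestor. The walk stops at the root even if the root is offline.
 */
static struct toy_cgroup *toy_charge_target(struct toy_cgroup *recorded)
{
	struct toy_cgroup *cg = recorded;

	assert(cg != NULL);
	while (!cg->online && cg->parent)
		cg = cg->parent;
	return cg;
}

/* Charge nr_pages to the chosen target and to each of its ancestors. */
static struct toy_cgroup *toy_swapin_charge(struct toy_cgroup *recorded,
					    long nr_pages)
{
	struct toy_cgroup *target = toy_charge_target(recorded);
	struct toy_cgroup *cg;

	for (cg = target; cg; cg = cg->parent)
		cg->charged_pages += nr_pages;
	return target;
}
```

With the PC1/CC1/CC2 layout above and CC1 dying, swapping in pages that
were recorded against CC1 would charge PC1 in this model, so the memory
stays under PC1's limit no matter which cgroup runs the swapoff.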

Maybe a sysctl/cmdline could be introduced to control the behavior.