On Mon, Nov 20, 2023 at 15:37, Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
> Kairui Song <ryncsn@xxxxxxxxx> writes:
>
> > From: Kairui Song <kasong@xxxxxxxxxxx>
> >
> > When a process which previously swapped some memory was moved to
> > another cgroup, and the cgroup it previous in is dead, then swapped in
> > pages will be leaked into rootcg. Previous commits fixed the bug for
> > no readahead path, this commit fix the same issue for readahead path.
> >
> > This can be easily reproduced by:
> > - Setup a SSD or HDD swap.
> > - Create memory cgroup A, B and C.
> > - Spawn process P1 in cgroup A and make it swap out some pages.
> > - Move process P1 to memory cgroup B.
> > - Destroy cgroup A.
> > - Do a swapoff in cgroup C
> > - Swapped in pages is accounted into cgroup C.
> >
> > This patch will fix it make the swapped in pages accounted in cgroup B.
>
> According to "Memory Ownership" section of
> Documentation/admin-guide/cgroup-v2.rst,
>
> "
> A memory area is charged to the cgroup which instantiated it and stays
> charged to the cgroup until the area is released. Migrating a process
> to a different cgroup doesn't move the memory usages that it
> instantiated while in the previous cgroup to the new cgroup.
> "
>
> Because we don't move the charge when we move a task from one cgroup to
> another. It's controversial which cgroup should be charged to.
> According to the above document, it's acceptable to charge to the cgroup
> C (cgroup where swapoff happens).

Hi Ying, thank you very much for the info!

It is controversial indeed, but the original behavior is rather counter-intuitive.

Imagine a cgroup P1 with two child cgroups, C1 and C2. If a process swaps out some memory in C1, then moves to C2, and C1 dies, then on swapoff the charge will be moved out of P1... And swapoff often happens in some unlimited cgroup or in a cgroup for a management agent.
If P1 has a memory limit, the limit can be breached easily: we will see a process that never left P1 with a much higher RSS than the limit of P1/C1/C2. And if there is a limit on the management agent's cgroup, the agent will hit OOM instead of P1.

Simply moving a process between child cgroups of the same parent cgroup won't cause such an issue; things only get weird when swapoff is involved.

Or maybe we should try to stay compatible and introduce a sysctl or cmdline option for this?
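
To make the P1/C1/C2 example concrete, here is a toy sketch in Python. It is not kernel code. It only encodes the two outcomes discussed in this thread: charging the swapped-in pages to the cgroup that runs swapoff, or to the cgroup the process currently lives in. The `Cgroup` class, the `simulate` function, the policy names `swapoff_caller` and `task_cgroup`, the `agent` cgroup and the figure of 100 pages are all made up for illustration.

```python
from typing import Dict, Optional


class Cgroup:
    """Toy memory cgroup that only tracks hierarchical page usage."""

    def __init__(self, name: str, parent: Optional["Cgroup"] = None) -> None:
        self.name = name
        self.parent = parent
        self.usage = 0  # pages charged here or in any descendant

    def charge(self, pages: int) -> None:
        # Charge this cgroup and every ancestor up to the root.
        node: Optional[Cgroup] = self
        while node is not None:
            node.usage += pages
            node = node.parent


def simulate(policy: str, swapped_pages: int = 100) -> Dict[str, int]:
    """Return per-cgroup usage after swapoff under a hypothetical policy."""
    root = Cgroup("root")
    p1 = Cgroup("P1", root)
    c1 = Cgroup("C1", p1)          # pages were swapped out while in C1
    c2 = Cgroup("C2", p1)          # the process has since moved here
    agent = Cgroup("agent", root)  # management cgroup that runs swapoff

    del c1  # C1 is destroyed before swapoff runs

    if policy == "swapoff_caller":
        target = agent  # charge whoever triggers the swap-in
    elif policy == "task_cgroup":
        target = c2     # charge the cgroup the process lives in now
    else:
        raise ValueError(f"unknown policy: {policy}")

    target.charge(swapped_pages)
    return {cg.name: cg.usage for cg in (root, p1, c2, agent)}


if __name__ == "__main__":
    for policy in ("swapoff_caller", "task_cgroup"):
        print(policy, simulate(policy))
```

With the `swapoff_caller` policy, P1's usage stays at 0 and the agent carries all 100 pages, even though the process never left P1. With `task_cgroup`, the pages show up under C2 and therefore count against P1's limit.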