Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS

Hui Wang <hui.wang@xxxxxxxxxxxxx> · Thu, 27 Apr 2023 15:49:19 +0800

On 4/27/23 15:03, Colin King (gmail) wrote:
On 27/04/2023 04:47, Hui Wang wrote:

On 4/27/23 09:18, Gao Xiang wrote:


On 2023/4/26 19:07, Hui Wang wrote:

On 4/26/23 16:33, Michal Hocko wrote:
[CC squashfs maintainer]

On Wed 26-04-23 13:10:30, Hui Wang wrote:
If we run stress-ng on a squashfs filesystem, the system enters a
hang-like state: stress-ng cannot finish running and the console
does not react to user input.

This issue happens on all the arm/arm64 platforms we are working on.
Through debugging, we found that it is caused by the oom handling
in the kernel.

The fs->readahead() is called between memalloc_nofs_save() and
memalloc_nofs_restore(), and squashfs_readahead() calls
alloc_page(). In this case, if there is no memory left,
out_of_memory() is called without __GFP_FS, so the oom killer is
not triggered and the process loops endlessly, waiting for others
to trigger the oom killer and release some memory. But on a system
whose whole root filesystem is built on squashfs, nearly all
userspace processes call out_of_memory() without __GFP_FS, so the
system enters a hang-like state when running stress-ng.
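
As a rough illustration of why an allocation made inside the
memalloc_nofs_save()/memalloc_nofs_restore() window arrives at
out_of_memory() without __GFP_FS, here is a small userspace model.
It is only a sketch: the model_* names, the struct and the flag
values are hypothetical and are not the kernel's definitions. It
assumes the scope simply clears the FS bit from whatever mask the
caller passes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values for this model only, not the kernel's bit layout. */
#define MODEL_GFP_IO     0x1u
#define MODEL_GFP_FS     0x2u
#define MODEL_GFP_KERNEL (MODEL_GFP_IO | MODEL_GFP_FS)

/* Hypothetical stand-in for the per-task state that records the scope. */
struct model_task {
    bool nofs_scope; /* true between the save and restore calls */
};

/* Enter a NOFS scope; return the previous state so that scopes can nest. */
static bool model_nofs_save(struct model_task *t)
{
    bool old;

    assert(t != NULL);
    old = t->nofs_scope;
    t->nofs_scope = true;
    return old;
}

/* Leave the scope by restoring the state returned from model_nofs_save(). */
static void model_nofs_restore(struct model_task *t, bool old)
{
    assert(t != NULL);
    t->nofs_scope = old;
}

/* Compute the mask the allocator would actually see for this task. */
static unsigned int model_effective_gfp(const struct model_task *t,
                                        unsigned int gfp)
{
    assert(t != NULL);
    if (t->nofs_scope)
        gfp &= ~MODEL_GFP_FS; /* the FS bit is dropped inside the scope */
    return gfp;
}
```

In this model, a request made with MODEL_GFP_KERNEL inside the scope
is seen as MODEL_GFP_IO only, which is the situation described above
for alloc_page() called from a readahead path.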

To fix it, we could trigger a kthread to call page_alloc() with
__GFP_FS before out_of_memory() returns early because __GFP_FS is
missing.
I do not think this is an appropriate way to deal with this issue.
Does it even make sense to trigger OOM killer for something like
readahead? Would it be more mindful to fail the allocation instead?
That being said should allocations from squashfs_readahead use
__GFP_RETRY_MAYFAIL instead?

Thanks for your comment. This issue can hardly be reproduced on the
ext4 filesystem because ext4->readahead() doesn't call
alloc_page(). If ext4->readahead() is changed as below, the issue
is easy to reproduce on ext4
(repeatedly run: $ stress-ng --bigheap ${num_of_cpu_threads}
--sequential 0 --timeout 30s --skip-silent --verbose)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ffbbd9626bd8..8b9db0b9d0b8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file *file, struct folio *folio)
  static void ext4_readahead(struct readahead_control *rac)
  {
         struct inode *inode = rac->mapping->host;
+       struct page *tmp_page;

         /* If the file has inline data, no need to do readahead. */
         if (ext4_has_inline_data(inode))
                 return;

+       tmp_page = alloc_page(GFP_KERNEL);
+
         ext4_mpage_readpages(inode, rac, NULL);
+
+       if (tmp_page)
+               __free_page(tmp_page);
  }


Hi Xiang and Michal,
Is it tested with a pure ext4 without any other fs background?

Basically yes. There may be a squashfs mounted for python3 in my
test environment, but stress-ng and the shared libraries it needs
are on ext4.

One could build a static version of stress-ng to remove the need for 
shared library loading at run time:

git clone https://github.com/ColinIanKing/stress-ng
cd stress-ng
make clean
STATIC=1 make -j 8

I did that already: I copied it to /home/ubuntu under uc20/uc22 and
ran it from there, and the hang no longer occurs. The folder
/home/ubuntu/ is on an ext4 filesystem, which proves the issue only
happens on squashfs.

And if I build it without STATIC=1, it hangs even when I run it from
/home/ubuntu/, because the system needs to load the shared libs from
a squashfs folder.

Thanks,

Hui.


I don't think it's true that "ext4->readahead() doesn't call
alloc_page()" since I think even ext2/ext4 uses buffer head
interfaces to read metadata (extents or old block mapping)
from its bd_inode for readahead, which indirectly allocates
some extra pages to page cache as well.

Calling alloc_page() or otherwise allocating memory in readahead() is
not a problem in itself. Suppose we have 4 processes (A, B, C and D).
Processes A, B and C enter out_of_memory() because they allocate
memory in readahead(); they loop and wait for some memory to be
released. Process D can enter out_of_memory() with __GFP_FS, so it
can trigger the oom killer; A, B and C then get the memory and
return to readahead(), and there is no system hang.

But if all 4 processes enter out_of_memory() from readahead(), they
loop and wait endlessly. No process triggers the oom killer, so
users will think the system has hung.
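
The A/B/C/D scenario can be modelled in userspace with a few lines of
C. This is only a sketch under stated assumptions: the model_* names
and flag values are hypothetical, and the predicate just mirrors the
early-return condition in out_of_memory() quoted in this thread (a
non-zero mask without the FS bit, and not a memcg oom, means no oom
kill).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag values for this model only, not the kernel's bit layout. */
#define MODEL_GFP_IO 0x1u
#define MODEL_GFP_FS 0x2u

/*
 * Mirror of the early-return condition: a caller with a non-zero mask
 * that lacks the FS bit, outside a memcg oom, cannot invoke the killer.
 */
static bool model_can_invoke_oom_killer(unsigned int gfp_mask,
                                        bool is_memcg_oom)
{
    if (gfp_mask && !(gfp_mask & MODEL_GFP_FS) && !is_memcg_oom)
        return false;
    return true;
}

/*
 * Return true if at least one of the n stalled allocators could trigger
 * the oom killer and so free memory for the others (global oom only).
 */
static bool model_any_can_kill(const unsigned int *masks, size_t n)
{
    size_t i;

    assert(n == 0 || masks != NULL);
    for (i = 0; i < n; i++) {
        if (model_can_invoke_oom_killer(masks[i], false))
            return true;
    }
    return false;
}
```

With masks { IO, IO, IO, IO|FS } (A, B, C from readahead() and D with
the FS bit) model_any_can_kill() reports that someone can make
progress. With four IO-only masks it reports that nobody can, which
corresponds to the hang-like state.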

I applied my change to ext4->readahead on linux-next and tested it
on my ubuntu classic server for arm64. I could reproduce the hang
within 1 minute, 100% of the time. I guess the issue is easy to
reproduce because this is an embedded environment: the total number
of processes in the system is very limited, nearly all userspace
processes eventually reach out_of_memory() from ext4_readahead(),
and nearly all kthreads do not reach out_of_memory() for a long
time. That leaves the system in a hang-like state (not a real hang).

And this is why I wrote a patch to let a specific kthread trigger the
oom killer forcibly (my initial patch).



The only difference here is the total number of pages to be
allocated, and the extra allocations needed for compressed data
make it worse.  So I think it depends a lot on how stressful your
stress workload is, and I'm not even sure it's a real issue, since
if you stop the stress workload the system will recover immediately
(it just may not oom directly).

Yes, it is not a real hang. All userspace processes are looping and
waiting for other processes to release or reclaim memory. And in this
case, we can't stop the stress workload since users can't control the
system through the console.

So Michal,

I don't know if you have read "[PATCH 0/1] mm/oom_kill: system enters
a state something like hang when running stress-ng". Do you know why
out_of_memory() returns immediately if __GFP_FS is not set? Could we
simply drop these lines:

     /*
      * The OOM killer does not compensate for IO-less reclaim.
      * pagefault_out_of_memory lost its gfp context so we have to
      * make sure exclude 0 mask - all other users should have at least
      * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
      * invoke the OOM killer even if it is a GFP_NOFS allocation.
      */
     if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
         return true;


Thanks,

Hui.

Thanks,
Gao Xiang