On 4/27/23 09:37, Phillip Lougher wrote:
On 27/04/2023 01:42, Hui Wang wrote:
On 4/27/23 03:34, Phillip Lougher wrote:
On 26/04/2023 20:06, Phillip Lougher wrote:
On 26/04/2023 19:26, Yang Shi wrote:
On Wed, Apr 26, 2023 at 10:38 AM Phillip Lougher
<phillip@xxxxxxxxxxxxxxx> wrote:
On 26/04/2023 17:44, Phillip Lougher wrote:
On 26/04/2023 12:07, Hui Wang wrote:
On 4/26/23 16:33, Michal Hocko wrote:
[CC squashfs maintainer]
On Wed 26-04-23 13:10:30, Hui Wang wrote:
If we run stress-ng on a squashfs filesystem, the system ends up in a hang-like state: stress-ng cannot finish running and the console no longer reacts to user input. This happens on all the arm/arm64 platforms we are working on. Through debugging, we found that the issue is introduced by the OOM handling in the kernel.

fs->readahead() is called between memalloc_nofs_save() and memalloc_nofs_restore(), and squashfs_readahead() calls alloc_page(). In this case, if there is no memory left, out_of_memory() is called without __GFP_FS, so the OOM killer is not triggered and the process loops endlessly, waiting for some other process to trigger the OOM killer and release memory. But on a system whose whole root filesystem is squashfs, nearly all userspace processes call out_of_memory() without __GFP_FS, so the system enters a hang-like state when running stress-ng.

To fix it, we could trigger a kthread to call page_alloc() with __GFP_FS before returning from out_of_memory() when __GFP_FS is not set.
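(For illustration only, a simplified sketch of the path described above, not the exact upstream code:)

/*
 * Readahead runs inside a NOFS scope, so current_gfp_context()
 * strips __GFP_FS from the allocation mask ...
 */
unsigned int flags = memalloc_nofs_save();	/* sets PF_MEMALLOC_NOFS */
mapping->a_ops->readahead(rac);			/* squashfs_readahead() -> alloc_page() */
memalloc_nofs_restore(flags);

/* ... and out_of_memory() then refuses to pick a victim (simplified): */
if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
	return true;	/* no OOM kill; the allocator keeps retrying, so the task loops */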
I do not think this is an appropriate way to deal with this issue. Does it even make sense to trigger the OOM killer for something like readahead? Would it be more mindful to fail the allocation instead? That being said, should allocations from squashfs_readahead use __GFP_RETRY_MAYFAIL instead?
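(A hedged sketch of what that suggestion might look like; the call site and error handling are assumed here, this is not a tested patch:)

/* Let the readahead-path allocation fail instead of looping forever;
 * readahead is best-effort, so giving up on this request is acceptable. */
struct page *tmp = alloc_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL);
if (!tmp)
	return;		/* skip this readahead request rather than OOM-looping */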
Thanks for your comment. This issue can hardly be reproduced on an ext4 filesystem, because ext4->readahead() doesn't call alloc_page(). After changing ext4->readahead() as below, it becomes easy to reproduce the issue on ext4 as well (repeatedly run: $ stress-ng --bigheap ${num_of_cpu_threads} --sequential 0 --timeout 30s --skip-silent --verbose):
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ffbbd9626bd8..8b9db0b9d0b8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file *file, struct folio *folio)
 static void ext4_readahead(struct readahead_control *rac)
 {
 	struct inode *inode = rac->mapping->host;
+	struct page *tmp_page;
 
 	/* If the file has inline data, no need to do readahead. */
 	if (ext4_has_inline_data(inode))
 		return;
 
+	tmp_page = alloc_page(GFP_KERNEL);
+
 	ext4_mpage_readpages(inode, rac, NULL);
+
+	if (tmp_page)
+		__free_page(tmp_page);
 }
BTW, I applied my patch to linux-next and ran the OOM stress-ng test cases overnight; there was no hang, oops or crash, so it looks like there is no big problem with using a kthread to trigger the OOM killer in this case.

And hi squashfs maintainer, I checked the filesystem code, and it looks like most filesystems do not call alloc_page() in their readahead(). Could you please help take a look at this issue? Thanks.
This will be because most filesystems don't need to do so. Squashfs is a compressed filesystem with large blocks covering much more than one page, and it decompresses these blocks in squashfs_readahead(). If __readahead_batch() does not return the full set of pages covering the Squashfs block, it allocates a temporary page for the decompressors to decompress into, to "fill in the hole".
What can be done here as far as Squashfs is concerned .... I could move the page allocation out of the readahead path (e.g. do it at mount time).

You could try this patch, which does that. Compile tested only.
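(The patch itself is not reproduced in this excerpt. A minimal sketch of the mount-time idea, with the field name assumed and error paths omitted, might look like the following; it is not the actual test patch:)

/* At mount time, e.g. in squashfs_fill_super(): reserve the spare page once. */
msblk->spare_page = alloc_page(GFP_KERNEL);	/* field name assumed for illustration */
if (msblk->spare_page == NULL)
	return -ENOMEM;

/* squashfs_readahead() would then reuse msblk->spare_page to "fill in the
 * hole" instead of calling alloc_page() on every readahead. */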
The kmalloc_array() may call alloc_page() and trigger this problem too, IIUC. Should it be pre-allocated as well?
That is a much smaller allocation, so it entirely depends on whether it is an issue or not. There are also a number of other small memory allocations in the path.

The whole point of this patch is to move the *biggest* allocation, which is the reported issue, and then see what happens. There is no point in making this test patch more involved and complex than necessary at this stage.
Phillip
Also be aware that this stress-ng-triggered issue is new, and apparently didn't occur last year. So it is reasonable to assume the issue has been introduced as a side effect of the readahead improvements. One of these is the allocation of a temporary page to decompress into, rather than falling back to decompressing entirely into a pre-allocated buffer (allocated at mount time). The small memory allocations have been there for many years.

Allocating the page at mount time effectively puts the memory allocation situation back to how it was last year, before the readahead work.
Phillip
Thanks Phillip and Yang.
And Phillip,
I tested your change, but it didn't help. According to my debugging, the OOM happens when allocating memory for the bio, at the line "struct page *page = alloc_page(GFP_NOIO);" in squashfs_bio_read(). Other filesystems just use the pages already provided through the "struct readahead_control" for the bio, but squashfs allocates new pages for the bio (maybe because squashfs is a compressed filesystem).
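(For context, a simplified excerpt of the loop in question from fs/squashfs/block.c; details such as the length/offset bookkeeping are omitted here:)

for (i = 0; i < page_count; ++i) {
	/* These pages receive compressed on-disk data, so they cannot be
	 * the page-cache pages handed in by the readahead_control. */
	struct page *page = alloc_page(GFP_NOIO);	/* the reported OOM-loop site */

	if (!page) {
		error = -ENOMEM;
		goto out_free_bio;
	}
	if (!bio_add_page(bio, page, len, offset)) {
		error = -EIO;
		goto out_free_bio;
	}
}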
Hi Phillip,
The test patch was a process of elimination; it removed the obvious change from last year.

It is also because it is a compressed filesystem: in most filesystems, what is read off disk in I/O is what ends up in the page cache. In a compressed filesystem, what is read in isn't what ends up in the page cache.
Understood.
BTW, this is not a new issue for squashfs. We have uc20 (linux-5.4 kernel) and uc22 (linux-5.15 kernel), and both have this issue. The issue already existed in squashfs_readpage() in the 5.4 kernel.
That information would have been rather useful in the initial report, and would have saved me from wasting my time. Thanks for that.
Sorry, I didn't mention it before.
Now, in the squashfs_readpage() situation, do processes hang or crash? In the squashfs_readpage() path, __GFP_NOFS should not be in effect. So is the OOM killer being invoked in this code path or not? Does alloc_page() in the bio code return NULL, and/or invoke the OOM killer, or does it get stuck? Don't keep this information to yourself so I have to guess.
In the squashfs_readpage() situation, the process still hangs; __GFP_NOFS also applies to squashfs_readpage(). Please see the call trace below, captured from linux-5.4:
[ 118.131804] wwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww current->comm = stress-ng-bighe oc->gfp_mask = 8c50 (GFP_FS is not set in gfp_mask)
[ 118.142829] ------------[ cut here ]------------
[ 118.142843] WARNING: CPU: 1 PID: 794 at mm/oom_kill.c:1097 out_of_memory+0x2dc/0x340
[ 118.142845] Modules linked in:
[ 118.142851] CPU: 1 PID: 794 Comm: stress-ng-bighe Tainted: G W 5.4.0+ #152
[ 118.142853] Hardware name: LS1028A RDB Board (DT)
[ 118.142857] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 118.142860] pc : out_of_memory+0x2dc/0x340
[ 118.142863] lr : out_of_memory+0xe4/0x340
[ 118.142865] sp : ffff8000115cb580
[ 118.142867] x29: ffff8000115cb580 x28: 0000000000000000
[ 118.142871] x27: ffffcefc623dab80 x26: 000031039de16790
[ 118.142875] x25: ffffcefc621e9878 x24: 0000000000000100
[ 118.142878] x23: ffffcefc62278000 x22: 0000000000000000
[ 118.142881] x21: ffff8000115cb6f8 x20: ffff00206272e740
[ 118.142885] x19: 0000000000000000 x18: ffffcefc622268f8
[ 118.142888] x17: 0000000000000000 x16: 0000000000000000
[ 118.142891] x15: ffffcefc621e8a38 x14: 1a9f17e4f9444a3e
[ 118.142894] x13: 0000000000000001 x12: 0000000000000400
[ 118.142897] x11: 0000000000000400 x10: 0000000000000a90
[ 118.142900] x9 : ffff8000115cb2c0 x8 : ffff00206272f230
[ 118.142903] x7 : 0000001b81dc2360 x6 : 0000000000000000
[ 118.142906] x5 : 0000000000000000 x4 : ffff00207f7db210
[ 118.142909] x3 : 0000000000000000 x2 : 0000000000000000
[ 118.142912] x1 : ffff00206272e740 x0 : 0000000000000000
[ 118.142915] Call trace:
[ 118.142919] out_of_memory+0x2dc/0x340
[ 118.142924] __alloc_pages_nodemask+0xf04/0x1090
[ 118.142928] alloc_slab_page+0x34/0x430
[ 118.142931] allocate_slab+0x474/0x500
[ 118.142935] ___slab_alloc.constprop.0+0x1e4/0x64c
[ 118.142938] __slab_alloc.constprop.0+0x54/0xb0
[ 118.142941] kmem_cache_alloc+0x31c/0x350
[ 118.142945] alloc_buffer_head+0x2c/0xac
[ 118.142948] alloc_page_buffers+0xb8/0x210
[ 118.142951] __getblk_gfp+0x180/0x39c
[ 118.142955] squashfs_read_data+0x2a4/0x6f0
[ 118.142958] squashfs_readpage_block+0x2c4/0x630
[ 118.142961] squashfs_readpage+0x5e4/0x98c
[ 118.142964] filemap_fault+0x17c/0x720
[ 118.142967] __do_fault+0x44/0x110
[ 118.142970] __handle_mm_fault+0x930/0xdac
[ 118.142973] handle_mm_fault+0xc8/0x190
[ 118.142978] do_page_fault+0x134/0x5a0
[ 118.142982] do_translation_fault+0xe0/0x108
[ 118.142985] do_mem_abort+0x54/0xb0
[ 118.142988] el0_da+0x1c/0x20
[ 118.142990] ---[ end trace c105c6721d4e890e ]---
I guess this is where __GFP_FS gets dropped in linux-5.4 (grow_dev_page() in fs/buffer.c):

gfp_mask = mapping_gfp_constraint(inode->i_mapping, ~__GFP_FS) | gfp;
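(For reference, mapping_gfp_constraint() simply ANDs the mapping's GFP mask with the given constraint, so passing ~__GFP_FS clears __GFP_FS for the buffer-head page allocations done under grow_dev_page(); roughly as defined in include/linux/pagemap.h:)

static inline gfp_t mapping_gfp_constraint(struct address_space *mapping,
					   gfp_t gfp_mask)
{
	/* with ~__GFP_FS as the constraint, __GFP_FS is masked out of the result */
	return mapping_gfp_mask(mapping) & gfp_mask;
}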
I guess that if we could use pre-allocated memory for the bio, it would help.
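(A hedged sketch of that idea, using a page mempool reserved at mount time; the field name and sizing are assumed, and this is not a tested patch:)

/* At mount time: guarantee a minimum reserve of pages for squashfs bios. */
msblk->bio_page_pool = mempool_create_page_pool(BIO_MAX_VECS, 0);	/* field name assumed */
if (msblk->bio_page_pool == NULL)
	return -ENOMEM;

/* In squashfs_bio_read(), instead of alloc_page(GFP_NOIO): */
struct page *page = mempool_alloc(msblk->bio_page_pool, GFP_NOIO);

/* ... and mempool_free(page, msblk->bio_page_pool) once the bio completes. */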
We'll see.
As far as I can see, you've made the system run out of memory, and are now complaining about the result. There's nothing unconventional about Squashfs's handling of out of memory, and most filesystems put into an out-of-memory situation will fail.
Understood. But squashfs is now used in Ubuntu Core and will be in many IoT products, so we really need to find a solution for this.
Thanks,
Hui.
Phillip
Thanks,
Hui.