Re: [PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS

On 4/26/23 16:33, Michal Hocko wrote:
[CC squashfs maintainer]

On Wed 26-04-23 13:10:30, Hui Wang wrote:
If we run stress-ng on a squashfs filesystem, the system ends up in a
hang-like state: stress-ng cannot finish running and the console stops
reacting to user input.

This issue happens on all of the arm/arm64 platforms we are working on.
Through debugging, we found that it is introduced by the OOM handling
in the kernel.

The fs->readahead() hook is called between memalloc_nofs_save() and
memalloc_nofs_restore(), and squashfs_readahead() calls alloc_page().
In this case, if there is no memory left, out_of_memory() is called
without __GFP_FS, so the OOM killer is not triggered and the process
loops endlessly, waiting for someone else to trigger the OOM killer
and free some memory. On a system whose whole root filesystem is
squashfs, nearly all userspace processes call out_of_memory() without
__GFP_FS, so the system enters a hang-like state when running
stress-ng.
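
(For illustration only, a minimal sketch of the flow described above;
example_readahead()/readahead_caller() are made-up names, this is not
the actual squashfs code:)

/* The ->readahead() hook: asks for GFP_KERNEL, but effectively runs NOFS. */
static void example_readahead(struct readahead_control *rac)
{
	/*
	 * Because the caller set PF_MEMALLOC_NOFS, current_gfp_context()
	 * strips __GFP_FS from this allocation, so on failure
	 * out_of_memory() sees no __GFP_FS and returns without killing.
	 */
	struct page *page = alloc_page(GFP_KERNEL);

	if (page)
		__free_page(page);
}

static void readahead_caller(struct readahead_control *rac)
{
	/* The page cache readahead path enters a NOFS scope ... */
	unsigned int nofs = memalloc_nofs_save();

	example_readahead(rac);		/* fs->readahead() */

	memalloc_nofs_restore(nofs);
}
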

To fix it, before out_of_memory() returns early because __GFP_FS is
missing, schedule a work item that calls alloc_page() with __GFP_FS,
so that the OOM killer gets a chance to run.
I do not think this is an appropriate way to deal with this issue.
Does it even make sense to trigger the OOM killer for something like
readahead? Would it be more mindful to fail the allocation instead?
That being said, should allocations from squashfs_readahead use
__GFP_RETRY_MAYFAIL instead?

Thanks for your comment. This issue can hardly be reproduced on an ext4 filesystem, because ext4->readahead() does not call alloc_page(). If we change ext4->readahead() as below, the issue becomes easy to reproduce on ext4 as well (repeatedly run: $stress-ng --bigheap ${num_of_cpu_threads} --sequential 0 --timeout 30s --skip-silent --verbose)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ffbbd9626bd8..8b9db0b9d0b8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3114,12 +3114,18 @@ static int ext4_read_folio(struct file *file, struct folio *folio)
 static void ext4_readahead(struct readahead_control *rac)
 {
        struct inode *inode = rac->mapping->host;
+       struct page *tmp_page;

        /* If the file has inline data, no need to do readahead. */
        if (ext4_has_inline_data(inode))
                return;

+       tmp_page = alloc_page(GFP_KERNEL);
+
        ext4_mpage_readpages(inode, rac, NULL);
+
+       if (tmp_page)
+               __free_page(tmp_page);
 }


BTW, I applied my patch to linux-next and ran the OOM stress-ng test cases overnight; there was no hang, oops or crash, so it looks like there is no big problem with using a work item to trigger the OOM killer in this case.

And hi squashfs maintainer, I checked the filesystem code, and it looks like most filesystems do not call alloc_page() in their readahead() path. Could you please help take a look at this issue? Thanks.
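
(For reference, a rough sketch of what Michal's __GFP_RETRY_MAYFAIL
suggestion could look like in a readahead path; example_readahead_alloc()
is a made-up name, not the actual squashfs code:)

/*
 * Readahead is best-effort, so let the allocation fail instead of
 * looping in the allocator and waiting for the OOM killer.
 */
static int example_readahead_alloc(void)
{
	struct page *page;

	page = alloc_page(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
	if (!page)
		return -ENOMEM;	/* just skip this readahead */

	/* ... use the page, e.g. as a decompression target, then free it ... */
	__free_page(page);
	return 0;
}
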


Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxx>
Cc: Suren Baghdasaryan <surenb@xxxxxxxxxx>
Cc: Colin Ian King <colin.i.king@xxxxxxxxx>
Cc: Yang Shi <shy828301@xxxxxxxxx>
Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
Cc: Vlastimil Babka <vbabka@xxxxxxx>
Cc: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxx>
Cc: Dan Carpenter <dan.carpenter@xxxxxxxxxx>
Signed-off-by: Hui Wang <hui.wang@xxxxxxxxxxxxx>
---
  mm/oom_kill.c | 22 +++++++++++++++++++++-
  1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 044e1eed720e..c9c38d6b8580 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1094,6 +1094,24 @@ int unregister_oom_notifier(struct notifier_block *nb)
  }
  EXPORT_SYMBOL_GPL(unregister_oom_notifier);
+/*
+ * If an OOM occurs without the __GFP_FS flag in the gfp_mask, the OOM killer
+ * will not be triggered. In this case, we call schedule_work() to run
+ * trigger_oom_killer_work(), which allocates with __GFP_FS and therefore
+ * gives the OOM killer a fair chance to run.
+ */
+static void trigger_oom_killer_work(struct work_struct *work)
+{
+	struct page *tmp_page;
+
+	/* GFP_KERNEL includes __GFP_FS, so this allocation can invoke the OOM killer */
+	tmp_page = alloc_page(GFP_KERNEL);
+	if (tmp_page)
+		__free_page(tmp_page);
+}
+
+static DECLARE_WORK(oom_trigger_work, trigger_oom_killer_work);
+
  /**
   * out_of_memory - kill the "best" process when we run out of memory
   * @oc: pointer to struct oom_control
@@ -1135,8 +1153,10 @@ bool out_of_memory(struct oom_control *oc)
  	 * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
  	 * invoke the OOM killer even if it is a GFP_NOFS allocation.
  	 */
-	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
+	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc)) {
+		schedule_work(&oom_trigger_work);
  		return true;
+	}
/*
  	 * Check if there were limitations on the allocation (only relevant for
--
2.34.1



