[PATCH 1/1] mm/oom_kill: trigger the oom killer if oom occurs without __GFP_FS

Hui Wang <hui.wang@xxxxxxxxxxxxx> · Wed, 26 Apr 2023 13:10:30 +0800

If we run the stress-ng in the filesystem of squashfs, the system
will be in a state something like hang, the stress-ng couldn't
finish running and the console couldn't react to users' input.

This issue happens on all arm/arm64 platforms we are working on,
through debugging, we found this issue is introduced by oom handling
in the kernel.

The fs->readahead() is called between memalloc_nofs_save() and
memalloc_nofs_restore(), and the squashfs_readahead() calls
alloc_page(), in this case, if there is no memory left, the
out_of_memory() will be called without __GFP_FS, then the oom killer
will not be triggered and this process will loop endlessly and wait
for others to trigger oom killer to release some memory. But for a
system with the whole root filesystem constructed by squashfs,
nearly all userspace processes will call out_of_memory() without
__GFP_FS, so we will see that the system enters a state something like
hang when running stress-ng.

To fix it, we could trigger a kthread to call page_alloc() with
__GFP_FS before returning from out_of_memory() due to without
__GFP_FS.

Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxx>
Cc: Suren Baghdasaryan <surenb@xxxxxxxxxx>
Cc: Colin Ian King <colin.i.king@xxxxxxxxx>
Cc: Yang Shi <shy828301@xxxxxxxxx>
Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
Cc: Vlastimil Babka <vbabka@xxxxxxx>
Cc: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Cc: Mel Gorman <mgorman@xxxxxxx>
Cc: Dan Carpenter <dan.carpenter@xxxxxxxxxx>
Signed-off-by: Hui Wang <hui.wang@xxxxxxxxxxxxx>
---
 mm/oom_kill.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 044e1eed720e..c9c38d6b8580 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1094,6 +1094,24 @@ int unregister_oom_notifier(struct notifier_block *nb)
 }
 EXPORT_SYMBOL_GPL(unregister_oom_notifier);
 
+/*
+ * If an oom occurs without the __GFP_FS flag in the gfp_mask, the oom killer
+ * will not be triggered. In this case, we could call schedule_work to run
+ * trigger_oom_killer_work() to trigger an oom forcibly with __GFP_FS flag,
+ * this could make the oom killer run with a fair chance.
+ */
+static void trigger_oom_killer_work(struct work_struct *work)
+{
+	struct page *tmp_page;
+
+	/* This could trigger an oom forcibly with a chance */
+	tmp_page = alloc_page(GFP_KERNEL);
+	if (tmp_page)
+		__free_page(tmp_page);
+}
+
+static DECLARE_WORK(oom_trigger_work, trigger_oom_killer_work);
+
 /**
  * out_of_memory - kill the "best" process when we run out of memory
  * @oc: pointer to struct oom_control
@@ -1135,8 +1153,10 @@ bool out_of_memory(struct oom_control *oc)
 	 * ___GFP_DIRECT_RECLAIM to get here. But mem_cgroup_oom() has to
 	 * invoke the OOM killer even if it is a GFP_NOFS allocation.
 	 */
-	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
+	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc)) {
+		schedule_work(&oom_trigger_work);
 		return true;
+	}
 
 	/*
 	 * Check if there were limitations on the allocation (only relevant for
-- 
2.34.1