__GFP_NOFAIL and oom_killer_disabled?

Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> · Sun, 22 Feb 2015 23:48:01 +0900

Andrew Morton wrote:
> And yes, I agree that sites such as xfs's kmem_alloc() should be
> passing __GFP_NOFAIL to tell the page allocator what's going on.  I
> don't think it matters a lot whether kmem_alloc() retains its retry
> loop.  If __GFP_NOFAIL is working correctly then it will never loop
> anyway...

__GFP_NOFAIL fails to work correctly if oom_killer_disabled == true.
I'm wondering how oom_killer_disable() interferes with __GFP_NOFAIL
allocation. We had race check after setting oom_killer_disabled to true
in 3.19.

---------- linux-3.19/kernel/power/process.c ----------
int freeze_processes(void)
{
(...snipped...)
        pm_wakeup_clear();
        printk("Freezing user space processes ... ");
        pm_freezing = true;
        oom_kills_saved = oom_kills_count();
        error = try_to_freeze_tasks(true);
        if (!error) {
                __usermodehelper_set_disable_depth(UMH_DISABLED);
                oom_killer_disable();

                /*
                 * There might have been an OOM kill while we were
                 * freezing tasks and the killed task might be still
                 * on the way out so we have to double check for race.
                 */
                if (oom_kills_count() != oom_kills_saved &&
                    !check_frozen_processes()) {
                        __usermodehelper_set_disable_depth(UMH_ENABLED);
                        printk("OOM in progress.");
                        error = -EBUSY;
                } else {
                        printk("done.");
                }
        }
(...snipped...)
}
---------- linux-3.19/kernel/power/process.c ----------

I worry that commit c32b3cbe0d067a9c "oom, PM: make OOM detection in
the freezer path raceless" might have opened a race window for
__alloc_pages_may_oom(__GFP_NOFAIL) allocation to fail when OOM killer
is disabled. I think something like

--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -789,7 +789,7 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 	bool ret = false;
 
 	down_read(&oom_sem);
-	if (!oom_killer_disabled) {
+	if (!oom_killer_disabled || (gfp_mask & __GFP_NOFAIL)) {
 		__out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
 		ret = true;
 	}

is needed. But such change can race with up_write() and wait_event() in
oom_killer_disable(). While the comment of oom_killer_disable() says
"The function cannot be called when there are runnable user tasks because
the userspace would see unexpected allocation failures as a result.",
aren't there still kernel threads which might do __GFP_NOFAIL allocations?
After all, don't we need to recheck after setting oom_killer_disabled to true?

---------- linux.git/kernel/power/process.c ----------
int freeze_processes(void)
{
(...snipped...)
        pm_wakeup_clear();
        pr_info("Freezing user space processes ... ");
        pm_freezing = true;
        error = try_to_freeze_tasks(true);
        if (!error) {
                __usermodehelper_set_disable_depth(UMH_DISABLED);
                pr_cont("done.");
        }
        pr_cont("\n");
        BUG_ON(in_atomic());

        /*
         * Now that the whole userspace is frozen we need to disbale
         * the OOM killer to disallow any further interference with
         * killable tasks.
         */
        if (!error && !oom_killer_disable())
                error = -EBUSY;
(...snipped...)
}
---------- linux.git/kernel/power/process.c ----------

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>