Re: [PATCH] mm: avoid livelock on !__GFP_FS allocations

David Rientjes <rientjes@xxxxxxxxxx> · Wed, 16 Nov 2011 14:48:12 -0800 (PST)

On Thu, 17 Nov 2011, Minchan Kim wrote:

> >> > diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
> >> > index fdd4263..01aa9b5 100644
> >> > --- a/kernel/power/suspend.c
> >> > +++ b/kernel/power/suspend.c
> >> > @@ -297,9 +297,11 @@ int enter_state(suspend_state_t state)
> >> >                goto Finish;
> >> >
> >> >        pr_debug("PM: Entering %s sleep\n", pm_states[state]);
> >> > +       oom_killer_disable();
> >> >        pm_restrict_gfp_mask();
> >> >        error = suspend_devices_and_enter(state);
> >> >        pm_restore_gfp_mask();
> >> > +       oom_killer_enable();
> >> >
> >> >  Finish:
> >> >        pr_debug("PM: Finishing wakeup.\n");
> >> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> > index 6e8ecb6..d8c31b7 100644
> >> > --- a/mm/page_alloc.c
> >> > +++ b/mm/page_alloc.c
> >> > @@ -2177,9 +2177,9 @@ rebalance:
> >> >         * running out of options and have to consider going OOM
> >> >         */
> >> >        if (!did_some_progress) {
> >> > -               if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> >> > -                       if (oom_killer_disabled)
> >> > +               if (oom_killer_disabled)
> >> >                                goto nopage;
> >
> > You're allowing __GFP_NOFAIL allocations to fail.
> >
> >> > +               if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
> >> >                        page = __alloc_pages_may_oom(gfp_mask, order,
> >> >                                        zonelist, high_zoneidx,
> >> >                                        nodemask, preferred_zone,
> >> >
> >>
> >> I'd prefer something like this. The whole 'gfp_allowed_flags' thing was
> >> designed to make GFP_KERNEL work during boot time where it's obviously safe to
> >> do that. I really don't think that's going to work suspend cleanly.
> >>
> >
> > Adding Rafael to the cc.
> >
> > This has been done since 2.6.34 and presumably has been working quite
> > well.  I don't have a specific objection to gfp_allowed_flags to be used
> > outside of boot since it seems plausible that there are system-level
> > contexts that would need different behavior in the page allocator and this
> > does it effectively without major surgery or a slower fastpath.  Suspend
> > is using it just like boot does before irqs are enabled, so I don't have
> > an objection to it.
> >
> 
> My point isn't using gfp_allowed_flags(maybe it's Pekka's concern) but
> why adding new special case handling code like pm_suspended_storage.
> I think we can handle the issue with oom_killer_disabled(but the naming is bad)
> 

Ignore the name of the function that Mel is introducing, it's only related 
to suspend because that's the only thing that (currently) alters 
gfp_allowed_mask after boot.  If something else were to clear __GFP_FS and 
__GFP_IO in the future, we'd simply need to rename the function.  I was 
going to ask for a comment specifically about that, but I think it's 
proximity to the function that allows gfp_allowed_mask to be altered is 
sufficient.

We'd really like to avoid the loop if __GFP_FS and __GFP_IO are not set.  
If oom_killer_disabled is set anytime that gfp_allowed_mask does not allow 
them for _all_ page allocations, then we could certainly replace the new 
pm_suspended_storage() check in should_alloc_retry() with a check on 
oom_killer_disabled.

I'll let you and Rafael work out whether that can be done or not, I just 
see pm_restrict_gfp_mask() being called in kernel/power/hibernate.c and 
kernel/power/suspend.c whereas oom_killer_disable() is only called in 
kernel/power/process.c.  oom_killer_disable() would be unnecessary if 
__GFP_FS were already cleared, so it would require some changes in the 
suspend code.

I like Mel's patch because it's easily maintainable and only depends on 
the state of reclaim and the gfp flags being passed in, which is the 
direct cause of the infinite loop.