Johannes Weiner wrote:
> On Tue, Feb 10, 2015 at 10:58:46PM +0900, Tetsuo Handa wrote:
> > (Michal is offline, asking Johannes instead.)
> >
> > Tetsuo Handa wrote:
> > > (A) The order-0 __GFP_WAIT allocation fails immediately upon OOM condition
> > > despite we didn't remove the
> > >
> > >         /*
> > >          * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
> > >          * means __GFP_NOFAIL, but that may not be true in other
> > >          * implementations.
> > >          */
> > >         if (order <= PAGE_ALLOC_COSTLY_ORDER)
> > >                 return 1;
> > >
> > > check in should_alloc_retry(). Is this what you expected?
> >
> > This behavior is caused by commit 9879de7373fcfb46 "mm: page_alloc:
> > embed OOM killing naturally into allocation slowpath". Did you apply
> > that commit with agreement to let GFP_NOIO / GFP_NOFS allocations fail
> > upon memory pressure and permit filesystems to take fs error actions?
> >
> >         /* The OOM killer does not compensate for light reclaim */
> >         if (!(gfp_mask & __GFP_FS))
> >                 goto out;
>
> The model behind the refactored code is to continue retrying the
> allocation as long as the allocator has the ability to free memory,
> i.e. if page reclaim makes progress, or the OOM killer can be used.
>
> That being said, I missed that GFP_NOFS were able to loop endlessly
> even without page reclaim making progress or the OOM killer working,
> and since it didn't fit the model I dropped it by accident.
>
> Is this a real workload you are having trouble with or an artificial
> stresstest? Because I'd certainly be willing to revert that part of
> the patch and make GFP_NOFS looping explicit if it helps you. But I
> do think the new behavior makes more sense, so I'd prefer to keep it
> if it's merely a stress test you use to test allocator performance.

I work on troubleshooting RHEL systems. This is an artificial stresstest
which I developed to try to reproduce various low memory troubles that
occurred on customers' systems.

>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8e20f9c2fa5a..f77c58ebbcfa 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2382,8 +2382,15 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	if (high_zoneidx < ZONE_NORMAL)
>  		goto out;
>  	/* The OOM killer does not compensate for light reclaim */
> -	if (!(gfp_mask & __GFP_FS))
> +	if (!(gfp_mask & __GFP_FS)) {
> +		/*
> +		 * XXX: Page reclaim didn't yield anything,
> +		 * and the OOM killer can't be invoked, but
> +		 * keep looping as per should_alloc_retry().
> +		 */
> +		*did_some_progress = 1;
>  		goto out;
> +	}

Why do you omit the out_of_memory() call for GFP_NOIO / GFP_NOFS allocations?
Thread2 doing a GFP_FS / GFP_KERNEL allocation might be waiting for Thread1
doing a GFP_NOIO / GFP_NOFS allocation to call out_of_memory() on behalf of
Thread2, as serialized by the

	/*
	 * Acquire the per-zone oom lock for each zone.  If that
	 * fails, somebody else is making progress for us.
	 */
	if (!oom_zonelist_trylock(zonelist, gfp_mask)) {
		*did_some_progress = 1;
		schedule_timeout_uninterruptible(1);
		return NULL;
	}

lock. If Thread1 calls oom_zonelist_trylock() / oom_zonelist_unlock() without
sleeping, while Thread2 sleeps for a tick after each failed
oom_zonelist_trylock(), Thread2 is unlikely to ever reach out_of_memory()
because its oom_zonelist_trylock() will usually fail. (A userspace sketch of
this starvation pattern follows the quoted hunk below.)

>  	/*
>  	 * GFP_THISNODE contains __GFP_NORETRY and we never hit this.
>  	 * Sanity check for bare calls of __GFP_THISNODE, not real OOM.
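To make the starvation pattern above concrete, here is a minimal userspace
sketch (not kernel code): a plain pthread mutex stands in for the per-zone
oom lock, Thread1 models the GFP_NOFS allocator that retries without
sleeping, and Thread2 models the GFP_KERNEL allocator that sleeps for about
one tick after every failed trylock. The thread names, counters and the
five-second run are invented for illustration only; the point is that
Thread2 gets at most one trylock attempt per tick while Thread1 cycles
through the lock continuously.

/*
 * Userspace model only -- a pthread mutex stands in for the per-zone
 * oom lock; nothing here is real kernel code.  Build with -pthread.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t oom_lock = PTHREAD_MUTEX_INITIALIZER;
static volatile unsigned long t1_wins, t2_wins;

/* Thread1: GFP_NOFS-like loop, never sleeps between attempts. */
static void *thread1(void *arg)
{
	(void)arg;
	for (;;) {
		if (pthread_mutex_trylock(&oom_lock) == 0) {
			t1_wins++;	/* skips out_of_memory(), goes to "out" */
			pthread_mutex_unlock(&oom_lock);
		}
		/* retries the allocation immediately */
	}
	return NULL;
}

/* Thread2: GFP_KERNEL-like loop, sleeps ~one tick after a failed trylock. */
static void *thread2(void *arg)
{
	(void)arg;
	for (;;) {
		if (pthread_mutex_trylock(&oom_lock) == 0) {
			t2_wins++;	/* this is where out_of_memory() would run */
			pthread_mutex_unlock(&oom_lock);
		} else {
			usleep(1000);	/* schedule_timeout_uninterruptible(1) */
		}
	}
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, thread1, NULL);
	pthread_create(&t2, NULL, thread2, NULL);
	sleep(5);
	printf("Thread1 entered the locked section %lu times, Thread2 %lu times\n",
	       t1_wins, t2_wins);
	return 0;
}

The exact counts will vary from run to run, but Thread2 pays a full tick of
backoff for every failed attempt while Thread1 is free to keep cycling
through the lock without ever calling out_of_memory().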
Though, a more serious behavior with this reproducer is (B), where the
system stalls forever without kernel messages being saved to
/var/log/messages . out_of_memory() does not select a new victim until the
coredump to the pipe can make progress, whereas the coredump to the pipe
cannot make progress until memory allocation succeeds or fails.
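If it helps to see the shape of the (B) stall, below is an abstract
userspace model of the circular wait (again not kernel code; the pipe, the
condition variable and all thread names are invented for illustration): the
victim blocks writing its core dump into a full pipe, the core dump
collector will not drain the pipe until memory is freed, and the model
"OOM killer" will not free memory until the current victim has exited.

/*
 * Abstract userspace model of the (B) stall.  Build with -pthread.
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t memory_freed = PTHREAD_COND_INITIALIZER;
static int memory_available;	/* never becomes 1 -- that is the stall */
static int pipefd[2];

/* The OOM victim: blocks once the pipe it dumps core into is full. */
static void *victim(void *arg)
{
	static char chunk[65536];
	(void)arg;
	for (;;)
		if (write(pipefd[1], chunk, sizeof(chunk)) < 0)
			break;
	return NULL;
}

/* The core dump collector: cannot drain the pipe until it gets memory. */
static void *collector(void *arg)
{
	char buf[4096];
	(void)arg;
	pthread_mutex_lock(&lock);
	while (!memory_available)
		pthread_cond_wait(&memory_freed, &lock);
	pthread_mutex_unlock(&lock);
	while (read(pipefd[0], buf, sizeof(buf)) > 0)
		;
	return NULL;
}

/* The model OOM killer: frees no memory until the current victim exits. */
static void *oom_killer(void *arg)
{
	pthread_join(*(pthread_t *)arg, NULL);
	pthread_mutex_lock(&lock);
	memory_available = 1;		/* never reached */
	pthread_cond_signal(&memory_freed);
	pthread_mutex_unlock(&lock);
	return NULL;
}

int main(void)
{
	pthread_t v, c, k;

	if (pipe(pipefd))
		return 1;
	pthread_create(&v, NULL, victim, NULL);
	pthread_create(&c, NULL, collector, NULL);
	pthread_create(&k, NULL, oom_killer, &v);
	sleep(2);
	puts("still stalled: victim -> pipe -> collector -> oom killer -> victim");
	return 0;
}

Each participant waits on the next one in the cycle, so none of them can
ever make progress.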