Re: zram OOM behavior

Luigi Semenzato <semenzato@xxxxxxxxxx> · Thu, 1 Nov 2012 10:50:42 -0700

On Wed, Oct 31, 2012 at 9:48 PM, David Rientjes <rientjes@xxxxxxxxxx> wrote:
> On Thu, 1 Nov 2012, Minchan Kim wrote:
>
>> It's not true any more.
>> 3.6 includes following code in try_to_free_pages
>>
>>         /*
>>          * Do not enter reclaim if fatal signal is pending. 1 is returned so
>>          * that the page allocator does not consider triggering OOM
>>          */
>>         if (fatal_signal_pending(current))
>>                 return 1;
>>
>> So the hunged task never go to the OOM path and could be looping forever.
>>
>
> Ah, interesting.  This is from commit 5515061d22f0 ("mm: throttle direct
> reclaimers if PF_MEMALLOC reserves are low and swap is backed by network
> storage").  Thanks for adding Mel to the cc.
>
> The oom killer specifically has logic for this condition: when calling
> out_of_memory() the first thing it does is
>
>         if (fatal_signal_pending(current))
>                 set_thread_flag(TIF_MEMDIE);
>
> to allow it access to memory reserves so that it may exit if it's having
> trouble.  But that ends up never happening because of the above code that
> Minchan has identified.
>
> So we either need to do set_thread_flag(TIF_MEMDIE) in try_to_free_pages()
> as well or revert that early return entirely; there's no justification
> given for it in the comment nor in the commit log.  I'd rather remove it
> and allow the oom killer to trigger and grant access to memory reserves
> itself if necessary.
>
> Mel, how does commit 5515061d22f0 deal with threads looping forever if
> they need memory in the exit path since the oom killer never gets called?
>
> That aside, it doesn't seem like this is the issue that Luigi is reporting
> since his patch that avoids deferring the oom killer presumably fixes the
> issue for him.  So it turns out the oom killer must be getting called.
>
> Luigi, can you try this instead?  It applies to the latest git but should
> be easily modified to apply to any 3.x kernel you're running.
> ---
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -310,26 +310,13 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
>         if (!task->mm)
>                 return OOM_SCAN_CONTINUE;
>
> -       if (task->flags & PF_EXITING) {
> +       if (task->flags & PF_EXITING && !force_kill) {
>                 /*
> -                * If task is current and is in the process of releasing memory,
> -                * allow the "kill" to set TIF_MEMDIE, which will allow it to
> -                * access memory reserves.  Otherwise, it may stall forever.
> -                *
> -                * The iteration isn't broken here, however, in case other
> -                * threads are found to have already been oom killed.
> +                * If this task is not being ptraced on exit, then wait for it
> +                * to finish before killing some other task unnecessarily.
>                  */
> -               if (task == current)
> -                       return OOM_SCAN_SELECT;
> -               else if (!force_kill) {
> -                       /*
> -                        * If this task is not being ptraced on exit, then wait
> -                        * for it to finish before killing some other task
> -                        * unnecessarily.
> -                        */
> -                       if (!(task->group_leader->ptrace & PT_TRACE_EXIT))
> -                               return OOM_SCAN_ABORT;
> -               }
> +               if (!(task->group_leader->ptrace & PT_TRACE_EXIT))
> +                       return OOM_SCAN_ABORT;
>         }
>         return OOM_SCAN_OK;
>  }
> @@ -706,11 +693,11 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>                 return;
>
>         /*
> -        * If current has a pending SIGKILL, then automatically select it.  The
> -        * goal is to allow it to allocate so that it may quickly exit and free
> -        * its memory.
> +        * If current has a pending SIGKILL or is exiting, then automatically
> +        * select it.  The goal is to allow it to allocate so that it may
> +        * quickly exit and free its memory.
>          */
> -       if (fatal_signal_pending(current)) {
> +       if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
>                 set_thread_flag(TIF_MEMDIE);
>                 return;
>         }

I tested this change with my load and it appears to also prevent the deadlocks.

I have a question though.  I thought only one process was allowed to
be in TIF_MEMDIE state, but I don't see anything that prevents this
code (before or after the change) from setting the flag in multiple
processes.  Is this a problem?

Thanks!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>