Thank you for the perfect timing. You posted this the day after I proposed
a new solution at Kernel Summit in Prague for the printk lock loop that you
experienced here. I attached the pdf that I used for that discussion
(ignore the last slide; it was left over and I never went there).

My proposal is to do something like this with printk. There are three types
of printk usages:

 1) Active printer (actively writing to the console).

 2) Waiter (sees an active printer and is the first one waiting to print).

 3) Sees an active printer and a waiter, and just adds to the log buffer
    and leaves.

(new globals)
static DEFINE_SPINLOCK(console_owner_lock);
static struct task_struct *console_owner;
static bool waiter;

console_unlock() {

	[ Assumes this part can not preempt ]
	spin_lock(&console_owner_lock);
	console_owner = current;
	spin_unlock(&console_owner_lock);

	for each message
		write message out to console

		if (READ_ONCE(waiter))
			break;

	spin_lock(&console_owner_lock);
	console_owner = NULL;
	spin_unlock(&console_owner_lock);

	[ preemption possible ]

	[ Needs to make sure waiter gets semaphore ]

	up(&console_sem);
}

Then printk can have something like:

	if (console_trylock())
		console_unlock();
	else {
		struct task_struct *owner = NULL;

		spin_lock(&console_owner_lock);
		if (waiter)
			goto out;
		WRITE_ONCE(waiter, true);
		owner = READ_ONCE(console_owner);
	out:
		spin_unlock(&console_owner_lock);

		if (owner) {
			while (!console_trylock())
				cpu_relax();
			spin_lock(&console_owner_lock);
			waiter = false;
			spin_unlock(&console_owner_lock);
			console_unlock();
		}
	}

This way, only one CPU spins waiting to print, and only if the console_lock
owner is actively printing. If the console_lock owner notices someone is
waiting to print, it stops printing, as the waiter will always continue the
prints. This balances out the printks among all the CPUs that are doing
them, and no one CPU gets stuck doing all the printks.

This would solve your issue because the next warn_alloc() caller would
become the waiter and take over the next message in the queue.
This would spread out the load of who does the actual printing, and not
have one printer be stuck doing the work.

-- Steve

On Thu, 26 Oct 2017 20:28:59 +0900
Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx> wrote:

> Commit 63f53dea0c9866e9 ("mm: warn about allocations which stall for too
> long") was a great step toward reducing the possibility of silent hangs
> caused by memory allocation stalls. But this commit reverts it, because it
> is possible to trigger OOM lockups and/or soft lockups when many threads
> concurrently call warn_alloc() (in order to warn about memory allocation
> stalls) due to the current implementation of printk(), and it is difficult
> to obtain useful information due to the limitation of the synchronous
> warning approach.
>
> The current printk() implementation flushes all pending logs using the
> context of a thread which called console_unlock(). printk() should be able
> to flush all pending logs eventually, unless somebody continues appending
> to the printk() buffer.
>
> Since warn_alloc() started appending to the printk() buffer while waiting
> for oom_kill_process() to make forward progress when oom_kill_process() is
> processing pending logs, it became possible for warn_alloc() to force
> oom_kill_process() to loop inside printk(). As a result, warn_alloc()
> significantly increased the possibility of preventing oom_kill_process()
> from making forward progress.
>
> ---------- Pseudo code start ----------
> Before warn_alloc() was introduced:
>
> retry:
> 	if (mutex_trylock(&oom_lock)) {
> 		while (atomic_read(&printk_pending_logs) > 0) {
> 			atomic_dec(&printk_pending_logs);
> 			print_one_log();
> 		}
> 		// Send SIGKILL here.
> 		mutex_unlock(&oom_lock)
> 	}
> 	goto retry;
>
> After warn_alloc() was introduced:
>
> retry:
> 	if (mutex_trylock(&oom_lock)) {
> 		while (atomic_read(&printk_pending_logs) > 0) {
> 			atomic_dec(&printk_pending_logs);
> 			print_one_log();
> 		}
> 		// Send SIGKILL here.
> 		mutex_unlock(&oom_lock)
> 	} else if (waited_for_10seconds()) {
> 		atomic_inc(&printk_pending_logs);
> 	}
> 	goto retry;
> ---------- Pseudo code end ----------
>
> Although waited_for_10seconds() becomes true only once per 10 seconds, an
> unbounded number of threads can call waited_for_10seconds() at the same
> time. Also, since the threads doing waited_for_10seconds() keep running an
> almost-busy loop, the thread doing print_one_log() can use little CPU
> resource. Therefore, this situation can be simplified like
>
> ---------- Pseudo code start ----------
> retry:
> 	if (mutex_trylock(&oom_lock)) {
> 		while (atomic_read(&printk_pending_logs) > 0) {
> 			atomic_dec(&printk_pending_logs);
> 			print_one_log();
> 		}
> 		// Send SIGKILL here.
> 		mutex_unlock(&oom_lock)
> 	} else {
> 		atomic_inc(&printk_pending_logs);
> 	}
> 	goto retry;
> ---------- Pseudo code end ----------
>
> when printk() is called faster than print_one_log() can process a log.
>
> One possible mitigation would be to introduce a new lock in order to make
> sure that no other series of printk() (either oom_kill_process() or
> warn_alloc()) can append to the printk() buffer while one series of
> printk() (either oom_kill_process() or warn_alloc()) is already in
> progress. Such serialization would also help with obtaining kernel
> messages in readable form.
>
> ---------- Pseudo code start ----------
> retry:
> 	if (mutex_trylock(&oom_lock)) {
> 		mutex_lock(&oom_printk_lock);
> 		while (atomic_read(&printk_pending_logs) > 0) {
> 			atomic_dec(&printk_pending_logs);
> 			print_one_log();
> 		}
> 		// Send SIGKILL here.
> 		mutex_unlock(&oom_printk_lock);
> 		mutex_unlock(&oom_lock)
> 	} else {
> 		if (mutex_trylock(&oom_printk_lock)) {
> 			atomic_inc(&printk_pending_logs);
> 			mutex_unlock(&oom_printk_lock);
> 		}
> 	}
> 	goto retry;
> ---------- Pseudo code end ----------
>
> But this commit does not go in that direction, for we don't want to
> introduce a new lock dependency, and we are unlikely to be able to obtain
> useful information even if we serialized oom_kill_process() and
> warn_alloc().
>
> The synchronous approach is prone to unexpected results (e.g. too late
> [1], too frequent [2], overlooked [3]). As far as I know, warn_alloc()
> never helped with providing information other than "something is going
> wrong". I want to consider an asynchronous approach which can obtain
> information during stalls about possibly relevant threads (e.g. the owner
> of oom_lock and kswapd-like threads) and serve as a trigger for actions
> (e.g. turn on/off tracepoints, ask the libvirt daemon to take a memory
> dump of a stalling KVM guest for diagnostic purposes).
>
> This commit temporarily loses the ability to report e.g. an OOM lockup
> caused by being unable to invoke the OOM killer due to a !__GFP_FS
> allocation request. But an asynchronous approach will be able to detect
> such a situation and emit a warning. Thus, let's remove warn_alloc().
>
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=192981
> [2] http://lkml.kernel.org/r/CAM_iQpWuPVGc2ky8M-9yukECtS+zKjiDasNymX7rMcBjBFyM_A@xxxxxxxxxxxxxx
> [3] commit db73ee0d46379922 ("mm, vmscan: do not loop on too_many_isolated for ever")
>
> Signed-off-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
> Reported-by: Cong Wang <xiyou.wangcong@xxxxxxxxx>
> Reported-by: yuwang.yuwang <yuwang.yuwang@xxxxxxxxxxxxxxx>
> Reported-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> Cc: Michal Hocko <mhocko@xxxxxxxxxx>
> Cc: Vlastimil Babka <vbabka@xxxxxxx>
> Cc: Mel Gorman <mgorman@xxxxxxx>
> Cc: Dave Hansen <dave.hansen@xxxxxxxxx>
> Cc: Sergey Senozhatsky <sergey.senozhatsky@xxxxxxxxx>
> Cc: Petr Mladek <pmladek@xxxxxxxx>
> ---
>  mm/page_alloc.c | 10 ----------
>  1 file changed, 10 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 97687b3..a4edfba 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3856,8 +3856,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  	enum compact_result compact_result;
>  	int compaction_retries;
>  	int no_progress_loops;
> -	unsigned long alloc_start = jiffies;
> -	unsigned int stall_timeout = 10 * HZ;
>  	unsigned int cpuset_mems_cookie;
>  	int reserve_flags;
>
> @@ -3989,14 +3987,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  	if (!can_direct_reclaim)
>  		goto nopage;
>
> -	/* Make sure we know about allocations which stall for too long */
> -	if (time_after(jiffies, alloc_start + stall_timeout)) {
> -		warn_alloc(gfp_mask & ~__GFP_NOWARN, ac->nodemask,
> -			"page allocation stalls for %ums, order:%u",
> -			jiffies_to_msecs(jiffies-alloc_start), order);
> -		stall_timeout += 10 * HZ;
> -	}
> -
>  	/* Avoid recursion of direct reclaim */
>  	if (current->flags & PF_MEMALLOC)
>  		goto nopage;
Attachment: printk-ks2017.pdf (Adobe PDF document)