----- Original Message -----
> From: "Tetsuo Handa" <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
> To: mhocko@xxxxxxxx, jstancek@xxxxxxxxxx
> Cc: tj@xxxxxxxxxx, clameter@xxxxxxx, js1304@xxxxxxxxx, arekm@xxxxxxxx, akpm@xxxxxxxxxxxxxxxxxxxx, torvalds@xxxxxxxxxxxxxxxxxxxx, linux-mm@xxxxxxxxx
> Sent: Thursday, 28 January, 2016 4:48:36 PM
> Subject: Re: [LTP] [BUG] oom hangs the system, NMI backtrace shows most CPUs in shrink_slab
>
> Tetsuo Handa wrote:
> > Inviting people who were involved in commit 373ccbe5927034b5 "mm, vmstat: allow
> > WQ concurrency to discover memory reclaim doesn't make any progress".
> >
> > In this thread, Jan hit an OOM stall where free memory does not increase
> > even after the OOM victim and dying tasks have terminated. I'm wondering why
> > such a thing can happen. Jan provided a reproducer and I tried it.
> >
> > I'm observing that the vmstat_update workqueue item remains pending forever.
> > Didn't we make sure that vmstat_update is processed when memory allocation
> > is stalling?
>
> I confirmed that a forced sleep patch solves this problem.
>
> ----------------------------------------
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 7340353..b986216 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -979,6 +979,12 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
>  	 */
>  	if (atomic_read(&nr_wb_congested[sync]) == 0 ||
>  	    !test_bit(ZONE_CONGESTED, &zone->flags)) {
> +		const struct memalloc_info *m = &current->memalloc;
> +		if (m->valid && time_after_eq(jiffies, m->start + 30 * HZ)) {
> +			pr_err("********** %s(%u) Forced sleep **********\n",
> +			       current->comm, current->pid);
> +			schedule_timeout_uninterruptible(HZ);
> +		}
>
>  		/*
>  		 * Memory allocation/reclaim might be called from a WQ
> ----------------------------------------
>
> ----------------------------------------
> [ 939.038719] Showing busy workqueues and worker pools:
> [ 939.040519] workqueue events: flags=0x0
> [ 939.042142]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
> [ 939.044350]     pending: vmpressure_work_fn(delay=20659)
> [ 939.046302]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
> [ 939.048392]     pending: vmw_fb_dirty_flush [vmwgfx](delay=42), vmstat_shepherd(delay=10)
> [ 939.050946] workqueue events_power_efficient: flags=0x80
> [ 939.052844]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
> [ 939.054980]     pending: fb_flashcursor(delay=20573)
> [ 939.056939] workqueue events_freezable_power_: flags=0x84
> [ 939.058872]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
> [ 939.060994]     in-flight: 9571:disk_events_workfn(delay=20719)
> [ 939.063069] workqueue vmstat: flags=0xc
> [ 939.064667]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
> [ 939.066795]     pending: vmstat_update(delay=20016)
> [ 939.068752] workqueue xfs-eofblocks/sda1: flags=0xc
> [ 939.070546]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
> [ 939.072675]     pending: xfs_eofblocks_worker(delay=5574)
> [ 939.074660] pool 4: cpus=2 node=0 flags=0x0 nice=0 workers=8 idle: 20 10098 10100 505 10099 10364 10363
> [ 948.026046] ********** a.out(10423) Forced sleep **********
> [ 948.036318] ********** a.out(10424) Forced sleep **********
> [ 948.323267] ********** kworker/2:3(9571) Forced sleep **********
> [ 949.030045] a.out invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
> [ 949.032320] a.out cpuset=/ mems_allowed=0
> [ 949.033976] CPU: 3 PID: 10423 Comm: a.out Not tainted 4.4.0+ #39
> ----------------------------------------
> [ 1255.809372] Showing busy workqueues and worker pools:
> [ 1255.811163] workqueue events: flags=0x0
> [ 1255.812744]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
> [ 1255.814877]     pending: vmpressure_work_fn(delay=10713)
> [ 1255.816837]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
> [ 1255.818960]     pending: vmw_fb_dirty_flush [vmwgfx](delay=42)
> [ 1255.821025] workqueue events_power_efficient: flags=0x80
> [ 1255.822937]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
> [ 1255.825136]     pending: fb_flashcursor(delay=20673)
> [ 1255.827069] workqueue events_freezable_power_: flags=0x84
> [ 1255.828953]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
> [ 1255.831050]     in-flight: 20:disk_events_workfn(delay=20777)
> [ 1255.833063]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
> [ 1255.835168]     pending: disk_events_workfn(delay=7)
> [ 1255.837084] workqueue vmstat: flags=0xc
> [ 1255.838707]   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
> [ 1255.840827]     pending: vmstat_update(delay=19787)
> [ 1255.842794] pool 4: cpus=2 node=0 flags=0x0 nice=0 workers=3 idle: 9571 10098
> [ 1265.036032] ********** kworker/2:0(20) Forced sleep **********
> [ 1265.038131] a.out invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
> [ 1265.041018] a.out cpuset=/ mems_allowed=0
> [ 1265.043008] CPU: 2 PID: 10622 Comm: a.out Not tainted 4.4.0+ #39
> ----------------------------------------
>
> In the post "[PATCH 1/2] mm, oom: introduce oom reaper", Andrew Morton said
> that "schedule_timeout() in state TASK_RUNNING doesn't do anything".
>
> Looking at commit 373ccbe5927034b5, it is indeed using schedule_timeout(1)
> instead of schedule_timeout_*(1). What!? We meant to force the kworker to
> sleep, but the kworker did not sleep at all? Then, that explains why the
> forced sleep patch above solves the OOM livelock.
>
> Jan, can you reproduce your problem with the patch below applied?

I took v4.5-rc1, applied your memalloc patch and then the patch below.

I have mixed results so far. The first attempt hung after ~15 minutes, the
second is still running (for 12+ hours). The way it hung is different from
previous runs; I don't recall seeing messages like these before:
  SLUB: Unable to allocate memory on node -1 (gfp=0x2000000)
  NMI watchdog: Watchdog detected hard LOCKUP on cpu 0

Full log from the run that hung:
http://jan.stancek.eu/tmp/oom_hangs/console.log.4-v4.5-rc1_and_wait_iff_congested_patch.txt

I'll let it run through the weekend.

Regards,
Jan

>
> ----------
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index 7340353..cbe6f0b 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -989,7 +989,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
>  	 * here rather than calling cond_resched().
>  	 */
>  	if (current->flags & PF_WQ_WORKER)
> -		schedule_timeout(1);
> +		schedule_timeout_uninterruptible(1);
>  	else
>  		cond_resched();
>
> ----------
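
A note on Andrew's remark quoted above: schedule_timeout() only sleeps if the
caller has already moved the task out of TASK_RUNNING; called while still
TASK_RUNNING it arms a timer, calls schedule(), and effectively returns
without waiting. schedule_timeout_uninterruptible() is the thin wrapper that
sets the state first, which is the whole difference the one-line change above
exercises. A minimal sketch of that wrapper, essentially as the kernel defines
it (kernel/time/timer.c); shown only to make the distinction concrete, not
part of either patch:

----------
/*
 * Sketch only: why schedule_timeout(1) from a TASK_RUNNING task does not
 * sleep, while schedule_timeout_uninterruptible(1) does.
 */
signed long schedule_timeout_uninterruptible(signed long timeout)
{
	/* Leave TASK_RUNNING first, so schedule() really dequeues the task... */
	__set_current_state(TASK_UNINTERRUPTIBLE);
	/*
	 * ...and schedule_timeout() then blocks until the timer fires.
	 * Without the state change the task stays runnable and the call
	 * returns almost immediately, so the PF_WQ_WORKER path in
	 * wait_iff_congested() never actually slept and the workqueue pool
	 * never got a chance to run other pending items such as
	 * vmstat_update.
	 */
	return schedule_timeout(timeout);
}
----------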