Michal Hocko wrote:
> So the OOM blocked task is sitting in the page fault caused by clearing
> the user buffer. According to your debugging patch this should be
> GFP_HIGHUSER_MOVABLE | __GFP_ZERO allocation which is the case where we
> retry without failing most of the time.

Oops, my debugging patch had a bug. I wanted to print p->gfp_flags but
was printing (p->gfp_flags & __GFP_WAIT). I retested with a fix and the
result is http://I-love.SAKURA.ne.jp/tmp/serial-20141230-ab-3.txt.xz .

static void print_memalloc_info(const struct task_struct *p)
{
	const gfp_t gfp = p->gfp_flags;

	/*
	 * __alloc_pages_nodemask() doesn't use smp_wmb() between
	 * updating ->gfp_start and ->gfp_flags. But reading a stale
	 * ->gfp_start value harms nothing but printing a bogus
	 * duration. The correct duration will be printed the next
	 * time this function is called.
	 */
	if (unlikely(gfp & __GFP_WAIT))
		printk(KERN_INFO "MemAlloc: %ld jiffies on 0x%x\n",
		       jiffies - p->gfp_start, gfp);
}

> That being said this doesn't look like a live lock or a lockup. System
> should recover from this state but it might take a lot of time (there
> are hundreds of tasks waiting on the i_mutex lock, each will try to
> allocate and fail and OOM victims will have to get out of the kernel and
> die). I am not sure we can do much about that from the allocator POV. A
> possible way would be refraining from the reclaim efforts when it is
> clear that nothing is really reclaimable. But I suspect this would be
> tricky to get right.

Indeed, this is not a livelock: the task holding the mutex is doing a
!__GFP_FS allocation and is making progress, albeit too slowly to be
worth waiting for, and the "waited for" lines eventually go away.

[  121.017797] b.out           R  running task        0  9999   9982 0x00000088
[  121.019750] MemAlloc: 30542 jiffies on 0x102005a
[  223.486701] b.out           R  running task        0 10008   9982 0x00000080
[  223.488642] MemAlloc: 12242 jiffies on 0x102005a
[  415.695635] b.out           R  running task        0 10013   9982 0x00000080
[  415.697578] MemAlloc: 108210 jiffies on 0x102005a
[  960.228134] b.out           R  running task        0 10013   9982 0x00000080
[  960.230179] MemAlloc: 652090 jiffies on 0x102005a

> > where I think a.out cannot die within reasonable duration due to b.out .
>
> I am not sure you can have any reasonable time expectation with such a
> huge contention on a single file. Even killing the task manually would
> take quite some time I suspect. Sure, memory pressure makes it all much
> worse.

This is not specific to the OOM-killer case, but I wish the stall would
end within 10 seconds, because my customers use a watchdog timeout of
11 seconds with a watchdog keep-alive interval of 2 seconds. I also
wish there were a way to record that the process which is supposed to
perform the watchdog keep-alive operation was unexpectedly blocked in
memory allocation for many seconds. My gfp_start patch works for that
purpose.

> > but I think we need to be prepared for cases where sending SIGKILL to
> > all threads sharing the same memory does not help.
>
> Sure, unkillable tasks are a problem which we have to handle. Having
> GFP_KERNEL allocations looping without way out contributes to this which
> is sad but your current data just show that sometimes it might take ages
> to finish even without that going on.

Can't we replace mutex_lock() / wait_for_completion() with killable
versions where it is safe, in order to reduce the number of unkillable
waits? I think replacing the mutex_lock() in
xfs_file_buffered_aio_write() with the killable version is possible,
because data written by a buffered write is not guaranteed to be
flushed until sync() / fsync() / fdatasync() returns. A minimal sketch
of the pattern follows.
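Something like the following, assuming a simplified buffered-write path
(my_buffered_write() and the helper do_buffered_write() are
hypothetical names, not actual XFS code); wait_for_completion() callers
could similarly switch to wait_for_completion_killable():

#include <linux/fs.h>
#include <linux/mutex.h>

/*
 * Hypothetical buffered-write path showing the killable-lock pattern:
 * a task that has received SIGKILL (e.g. from the OOM killer) backs
 * out with -EINTR instead of sleeping unkillably on lock contention.
 */
static ssize_t my_buffered_write(struct file *file,
				 const char __user *buf,
				 size_t count, loff_t *ppos)
{
	struct inode *inode = file_inode(file);
	ssize_t ret;

	/* Returns nonzero if interrupted by a fatal signal. */
	if (mutex_lock_killable(&inode->i_mutex))
		return -EINTR;

	/*
	 * do_buffered_write() is a hypothetical helper that performs
	 * the actual copy into the page cache under i_mutex.
	 */
	ret = do_buffered_write(file, buf, count, ppos);

	mutex_unlock(&inode->i_mutex);
	return ret;
}

Backing out with -EINTR should be safe here because nothing guarantees
the buffered data reaches disk until sync() / fsync() / fdatasync()
returns; the killed task simply unwinds and dies.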
And can't we detect unkillable TIF_MEMDIE tasks, for example by
checking the task's ->state a while after TIF_MEMDIE was set? My
sysctl_memdie_timeout_jiffies patch works for that purpose; a rough
sketch of the idea follows.
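The sketch below illustrates the idea only and is not the actual
sysctl_memdie_timeout_jiffies patch; the per-task field ->memdie_start
(recording when TIF_MEMDIE was set) and the caller that walks the task
list are assumed:

#include <linux/sched.h>
#include <linux/jiffies.h>
#include <linux/printk.h>

/* Assumed sysctl knob; default to 10 seconds. */
static unsigned long sysctl_memdie_timeout_jiffies = 10 * HZ;

/*
 * Report an OOM victim that has not exited within the timeout.
 * p->memdie_start is a hypothetical field holding the jiffies value
 * at the time TIF_MEMDIE was set.
 */
static void check_memdie_stall(struct task_struct *p)
{
	if (!test_tsk_thread_flag(p, TIF_MEMDIE))
		return;
	/*
	 * A victim sleeping in TASK_UNINTERRUPTIBLE cannot react to
	 * the SIGKILL sent by the OOM killer.
	 */
	if (p->state == TASK_UNINTERRUPTIBLE &&
	    time_after(jiffies,
		       p->memdie_start + sysctl_memdie_timeout_jiffies))
		printk(KERN_WARNING
		       "MemDie: %s (%d) stuck for %lu jiffies\n",
		       p->comm, task_pid_nr(p),
		       jiffies - p->memdie_start);
}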