Re: [RFC PATCH -v2] mm, oom: introduce oom reaper

Michal Hocko wrote:
> On Sun 29-11-15 01:10:10, Tetsuo Handa wrote:
> > Tetsuo Handa wrote:
> > > > Users of mmap_sem which need it for write should be carefully reviewed
> > > > to use _killable waiting as much as possible and reduce allocations
> > > > requests done with the lock held to absolute minimum to reduce the risk
> > > > even further.
> > > 
> > > It will be nice if we can have down_write_killable()/down_read_killable().
> > 
> > It will be nice if we can also have __GFP_KILLABLE.
> 
> Well, we already do this implicitly because OOM killer will
> automatically do mark_oom_victim if it has fatal_signal_pending and then
> __alloc_pages_slowpath fails the allocation if the memory reserves do
> not help to finish the allocation.

I don't think so because !__GFP_FS && !__GFP_NOFAIL allocations do not do
mark_oom_victim() even if fatal_signal_pending() is true because
out_of_memory() is not called.

Also, __GFP_KILLABLE is helpful even before the kernel declares OOM because
users can give up earlier when memory allocation is thrashing (i.e. it allows
users who find that memory allocation is too slow to wait for to kill
processes before the kernel declares OOM).
I would like to use __GFP_KILLABLE in the TOMOYO security module because we
use GFP_NOFS allocations when checking permissions for access requests from
user space (because some LSM hooks are GFP_KERNEL unsafe), where failing
GFP_NOFS allocations without invoking the OOM killer can result in
unrecoverable failure (e.g. unexpected termination of critical processes).

Anyway, __GFP_KILLABLE is outside the scope of this thread, so I will stop here for now.



> > Although currently it can't
> > be perfect because reclaim functions called from __alloc_pages_slowpath() use
> > unkillable waits, starting from just bail out as with __GFP_NORETRY when
> > fatal_signal_pending(current) is true will be helpful.
> > 
> > So far I'm hitting no problem with testers except the one using mmap()/munmap().
> > 
> > I think that cmpxchg() was not needed.
> 
> It is not needed right now but I would rather not depend on the oom
> mutex here. This is not a hot path where an atomic would add an
> overhead.

The current patch can allow oom_reaper() to call mmdrop(mm) before
wake_oom_reaper() calls atomic_inc(&mm->mm_count), because a sequence like

  oom_reaper() (a realtime thread)         wake_oom_reaper() (current thread)         Current OOM victim

  oom_reap_vmas(mm); /* mm = Previous OOM victim */
  WRITE_ONCE(mm_to_reap, NULL);
                                           old_mm = cmpxchg(&mm_to_reap, NULL, mm); /* mm = Current OOM victim */
                                           if (!old_mm) {
  wait_event_freezable(oom_reaper_wait, (mm = READ_ONCE(mm_to_reap)));
  oom_reap_vmas(mm); /* mm = Current OOM victim, undo atomic_inc(&mm->mm_count) done by oom_kill_process() */
  WRITE_ONCE(mm_to_reap, NULL);
                                                                                      exit and release mm
                                           atomic_inc(&mm->mm_count); /* mm = Current OOM victim */
                                           wake_up(&oom_reaper_wait);

  wait_event_freezable(oom_reaper_wait, (mm = READ_ONCE(mm_to_reap))); /* mm = Next OOM victim */

is possible.

If you are serious about execution ordering, we should protect mm_to_reap
using smp_mb__after_atomic_inc(), rcu_assign_pointer()/rcu_dereference() etc.
in addition to my patch.
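
To illustrate the ordering concern, here is a minimal user-space sketch using
C11 atomics. It is not kernel code and not the patch under discussion: the
names fake_mm, queue_for_reaper() and reap_one() are hypothetical, and the
sketch only assumes that the waker takes its reference before it publishes
the pointer, so the reaper can never drop a reference that has not been taken
yet.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mm_struct; only the refcount matters. */
struct fake_mm {
	atomic_int mm_count;
};

/* Single-slot mailbox playing the role of mm_to_reap. */
static _Atomic(struct fake_mm *) mm_to_reap;

/*
 * Waker side: pin the mm first, then publish it.
 * Returns 1 if the mm was queued, 0 if the slot was already occupied.
 */
static int queue_for_reaper(struct fake_mm *mm)
{
	struct fake_mm *expected = NULL;

	atomic_fetch_add(&mm->mm_count, 1);
	if (atomic_compare_exchange_strong(&mm_to_reap, &expected, mm))
		return 1;
	/* Slot busy: undo the reference taken above. */
	atomic_fetch_sub(&mm->mm_count, 1);
	return 0;
}

/*
 * Reaper side: take the mm out of the slot and drop the reference
 * that the waker took on the reaper's behalf.
 */
static struct fake_mm *reap_one(void)
{
	struct fake_mm *mm = atomic_exchange(&mm_to_reap, NULL);

	if (mm) {
		int old = atomic_fetch_sub(&mm->mm_count, 1);

		/* The waker's reference must already exist at this point. */
		assert(old >= 1);
	}
	return mm;
}
```

In this sketch the reaper empties the slot before reaping rather than after,
which is a simplification compared with the sequence shown above.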



But what I don't like is that the current patch cannot handle the trap
explained below. What about marking the current OOM victim unkillable by
setting victim->signal->oom_score_adj to OOM_SCORE_ADJ_MIN and clearing the
victim's TIF_MEMDIE flag when the victim is still alive one second after
oom_reap_vmas() has completed? That way, my worry (2) at
http://lkml.kernel.org/r/201510121543.EJF21858.LtJFHOOOSQVMFF@xxxxxxxxxxxxxxxxxxx
(though this trap is not a mmap_sem livelock) will be gone. That is,
holding the victim's task_struct rather than the victim's mm would work better.
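
The idea can be modelled in user space roughly as follows. This is only a
sketch under assumptions: struct fake_victim, its fields, and
give_up_on_victim() are hypothetical and do not exist in the kernel; only the
value of OOM_SCORE_ADJ_MIN (-1000) is taken from the kernel. The grace period
is passed in as a parameter (one second in the proposal above).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define OOM_SCORE_ADJ_MIN (-1000)

/* Hypothetical, simplified victim state; not the kernel's task_struct. */
struct fake_victim {
	int oom_score_adj;
	bool memdie;               /* stands in for TIF_MEMDIE */
	bool alive;
	unsigned long reaped_at_ms; /* when the reaper finished with this victim */
};

/*
 * If the victim is still alive grace_ms after reaping completed, exclude it
 * from further OOM selection and clear its victim flag so that another task
 * can be chosen. Returns true when the victim was given up on.
 */
static bool give_up_on_victim(struct fake_victim *v,
			      unsigned long now_ms, unsigned long grace_ms)
{
	assert(v != NULL);

	if (!v->alive || now_ms - v->reaped_at_ms < grace_ms)
		return false;

	v->oom_score_adj = OOM_SCORE_ADJ_MIN;
	v->memdie = false;
	return true;
}
```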

---------- oom-write.c start ----------
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(int argc, char *argv[])
{
	unsigned long size;
	char *buf = NULL;
	unsigned long i;
	/* Start 10 children that append to /tmp/file via their stdout. */
	for (i = 0; i < 10; i++) {
		if (fork() == 0) {
			close(1);
			open("/tmp/file", O_WRONLY | O_CREAT | O_APPEND, 0600);
			execl("./write", "./write", NULL);
			_exit(1);
		}
	}
	/* Double the buffer until realloc() fails, keeping the last size that worked. */
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	sleep(5);
	/* Will cause OOM due to overcommit */
	for (i = 0; i < size; i += 4096)
		buf[i] = 0;
	pause();
	return 0;
}
---------- oom-write.c end ----------

----- write.asm start -----
; nasm -f elf write.asm && ld -s -m elf_i386 -o write write.o
section .text
    CPU 386
    global _start
_start:
; while (write(1, buf, 4096) == 4096);
    mov eax, 4 ; NR_write
    mov ebx, 1
    mov ecx, _start - 96
    mov edx, 4096
    int 0x80
    cmp eax, 4096
    je _start
; pause();
    mov eax, 29 ; NR_pause
    int 0x80
; _exit(0);
    mov eax, 1 ; NR_exit
    mov ebx, 0
    int 0x80
----- write.asm end -----

What is happening with this trap:

  (1) out_of_memory() chose oom-write(3805), which consumed the most memory.
  (2) oom_kill_process() chose the first write(3806), which is one of the
      children of oom-write(3805).
  (3) oom_reaper() reclaimed write(3806)'s memory, which amounted to only
      a few pages.
  (4) out_of_memory() chose oom-write(3805) again.
  (5) oom_kill_process() chose the second write(3807), which is one of the
      children of oom-write(3805).
  (6) oom_reaper() reclaimed write(3807)'s memory, which amounted to only
      a few pages.
  (7) The second write(3807) is blocked by an unkillable mutex held by the
      first write(3806), and the first write(3806) is waiting for the second
      write(3807) to release more memory even after oom_reaper() completed.
  (8) Eventually the first write(3806) terminated successfully, but the
      second write(3807) remained stuck.
  (9) irqbalance(1710) got memory before the second write(3807)
      could make forward progress.

----------
[   78.157198] oom-write invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|GFP_ZERO)
(...snipped...)
[   78.325409] [ 3805]  1000  3805   541715   357876     708       6        0             0 oom-write
[   78.327978] [ 3806]  1000  3806       39        1       3       2        0             0 write
[   78.330149] [ 3807]  1000  3807       39        1       3       2        0             0 write
[   78.332167] [ 3808]  1000  3808       39        1       3       2        0             0 write
[   78.334488] [ 3809]  1000  3809       39        1       3       2        0             0 write
[   78.336471] [ 3810]  1000  3810       39        1       3       2        0             0 write
[   78.338414] [ 3811]  1000  3811       39        1       3       2        0             0 write
[   78.340709] [ 3812]  1000  3812       39        1       3       2        0             0 write
[   78.342711] [ 3813]  1000  3813       39        1       3       2        0             0 write
[   78.344727] [ 3814]  1000  3814       39        1       3       2        0             0 write
[   78.346613] [ 3815]  1000  3815       39        1       3       2        0             0 write
[   78.348829] Out of memory: Kill process 3805 (oom-write) score 808 or sacrifice child
[   78.350818] Killed process 3806 (write) total-vm:156kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[   78.455314] oom-write invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|GFP_ZERO)
(...snipped...)
[   78.631333] [ 3805]  1000  3805   541715   361440     715       6        0             0 oom-write
[   78.633802] [ 3807]  1000  3807       39        1       3       2        0             0 write
[   78.635977] [ 3808]  1000  3808       39        1       3       2        0             0 write
[   78.638325] [ 3809]  1000  3809       39        1       3       2        0             0 write
[   78.640463] [ 3810]  1000  3810       39        1       3       2        0             0 write
[   78.642837] [ 3811]  1000  3811       39        1       3       2        0             0 write
[   78.644924] [ 3812]  1000  3812       39        1       3       2        0             0 write
[   78.646990] [ 3813]  1000  3813       39        1       3       2        0             0 write
[   78.649039] [ 3814]  1000  3814       39        1       3       2        0             0 write
[   78.651242] [ 3815]  1000  3815       39        1       3       2        0             0 write
[   78.653326] Out of memory: Kill process 3805 (oom-write) score 816 or sacrifice child
[   78.655235] Killed process 3807 (write) total-vm:156kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
[   88.776446] MemAlloc-Info: 1 stalling task, 1 dying task, 1 victim task.
[   88.778228] MemAlloc: systemd-journal(481) seq=17 gfp=0x24280ca order=0 delay=10000
[   88.780158] MemAlloc: write(3807) uninterruptible dying victim
(...snipped...)
[   98.915687] MemAlloc-Info: 8 stalling task, 1 dying task, 1 victim task.
[   98.917888] MemAlloc: kthreadd(2) seq=12 gfp=0x27000c0 order=2 delay=14885 uninterruptible
[   98.920297] MemAlloc: systemd-journal(481) seq=17 gfp=0x24280ca order=0 delay=20139
[   98.922652] MemAlloc: irqbalance(1710) seq=3 gfp=0x24280ca order=0 delay=16231
[   98.924874] MemAlloc: vmtoolsd(1908) seq=1 gfp=0x2400240 order=0 delay=20044
[   98.927043] MemAlloc: pickup(3680) seq=1 gfp=0x2400240 order=0 delay=10230 uninterruptible
[   98.929405] MemAlloc: nmbd(3713) seq=1 gfp=0x2400240 order=0 delay=14716
[   98.931559] MemAlloc: oom-write(3805) seq=12718 gfp=0x24280ca order=0 delay=14887
[   98.933843] MemAlloc: write(3806) seq=29813 gfp=0x2400240 order=0 delay=14887 uninterruptible exiting
[   98.936460] MemAlloc: write(3807) uninterruptible dying victim
(...snipped...)
[  140.356230] MemAlloc-Info: 9 stalling task, 1 dying task, 1 victim task.
[  140.358448] MemAlloc: kthreadd(2) seq=12 gfp=0x27000c0 order=2 delay=56326 uninterruptible
[  140.360979] MemAlloc: systemd-journal(481) seq=17 gfp=0x24280ca order=0 delay=61580 uninterruptible
[  140.363716] MemAlloc: irqbalance(1710) seq=3 gfp=0x24280ca order=0 delay=57672
[  140.365983] MemAlloc: vmtoolsd(1908) seq=1 gfp=0x2400240 order=0 delay=61485 uninterruptible
[  140.368521] MemAlloc: pickup(3680) seq=1 gfp=0x2400240 order=0 delay=51671 uninterruptible
[  140.371128] MemAlloc: nmbd(3713) seq=1 gfp=0x2400240 order=0 delay=56157 uninterruptible
[  140.373548] MemAlloc: smbd(3734) seq=1 gfp=0x27000c0 order=2 delay=48147
[  140.375722] MemAlloc: oom-write(3805) seq=12718 gfp=0x24280ca order=0 delay=56328 uninterruptible
[  140.378647] MemAlloc: write(3806) seq=29813 gfp=0x2400240 order=0 delay=56328 exiting
[  140.381695] MemAlloc: write(3807) uninterruptible dying victim
(...snipped...)
[  150.493557] MemAlloc-Info: 7 stalling task, 1 dying task, 1 victim task.
[  150.495725] MemAlloc: kthreadd(2) seq=12 gfp=0x27000c0 order=2 delay=66463
[  150.497897] MemAlloc: systemd-journal(481) seq=17 gfp=0x24280ca order=0 delay=71717 uninterruptible
[  150.500490] MemAlloc: vmtoolsd(1908) seq=1 gfp=0x2400240 order=0 delay=71622 uninterruptible
[  150.502940] MemAlloc: pickup(3680) seq=1 gfp=0x2400240 order=0 delay=61808
[  150.505122] MemAlloc: nmbd(3713) seq=1 gfp=0x2400240 order=0 delay=66294 uninterruptible
[  150.507521] MemAlloc: smbd(3734) seq=1 gfp=0x27000c0 order=2 delay=58284
[  150.509678] MemAlloc: oom-write(3805) seq=12718 gfp=0x24280ca order=0 delay=66465 uninterruptible
[  150.512333] MemAlloc: write(3807) uninterruptible dying victim
----------
The complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20151205.txt.xz .
