On Wed, Aug 09, 2017 at 08:35:36AM +0900, Tetsuo Handa wrote:
> I don't think so. We spent a lot of time in order to remove possible locations
> which can lead to failing to invoke the OOM killer when out_of_memory() is called.

The connection between failing to invoke the OOM killer and the OOM reaper is not clear to me. I assume you mean failing to kill the task after the OOM killer has been invoked through out_of_memory(). You should always see "%s: Kill process %d (%s) score %u or sacrifice child\n" in the logs; the invocation itself should never be an issue, and it's unrelated to the OOM reaper.

> Since RHEL7 changed default filesystem from ext4 to xfs, OOM related problems
> became much easier to occur, for xfs involves many kernel threads where
> TIF_MEMDIE based access to memory reserves cannot work among relevant threads.

I could reproduce similar issues where the TIF_MEMDIE task was hung on fs locks held by kernel threads in ext4 too, but those should have been solved by other means.

> Judging from my experience at a support center, it is too difficult for customers
> to report OOM hangs. It requires customers to stand by in front of the console
> twenty-four seven so that we get SysRq-t etc. whenever an OOM related problem is
> suspected. We can't ask customers for such effort. There is no report does not mean
> OOM hang is not occurring without artificial memory stress tests.

The printk above is likely to show in the logs after reboot, but I agree that in the cloud a node hanging on OOM is probably hidden, and there are all sorts of management provisions possible to prevent hitting a real OOM in the first place, for example memcg. Still, having no apparent customer complaints is, I think, significant, because it means they easily tackle the problem by other means, be it watchdogs, or preventing it in the first place with memcg.

I'm not saying it's a minor issue; to me it's totally annoying if my system hangs on OOM, so it should be reliable in practice. I'm only not sure if tackling the OOM issues with a big hammer that still cannot guarantee anything 100% is justified, considering the complexity it brings to the VM core while there's still no guarantee of not hanging.

> The OOM reaper does not need to free memory fast enough, for the OOM killer
> does not select the second task for kill until the OOM reaper sets
> MMF_OOM_SKIP or __mmput() sets MMF_OOM_SKIP.

Right, there's no need to be fast there.

> I think that the main point of the OOM reaper nowadays are that
> "how can we allow the OOM reaper to take mmap_sem for read (because
> khugepaged might take mmap_sem of the OOM victim for write)"

The main point of the OOM reaper is to avoid killing more tasks: not just because doing so would be a false positive, but also because even if we kill more tasks, they may all be stuck on the same fs locks held by kernel threads that cannot be killed and loop asking for more memory. So the OOM reaper tends to reduce the risk of OOM hangs, but sure enough it cannot guarantee perfection either. Incidentally, the OOM reaper still has a timeout where it gives up and moves on to kill another task.

khugepaged doesn't allocate memory while holding the mmap_sem for writing. It's not exactly clear how khugepaged is the problem in the below dump, because 3163 is also definitely holding the mmap_sem for reading and cannot release it independently of khugepaged. However, khugepaged could try to grab it for writing, and the fairness provisions of the rwsem would then prevent down_read_trylock from succeeding.
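For reference, the reaper's retry/timeout logic is roughly the following (a simplified sketch from memory of upstream mm/oom_kill.c, not the verbatim code; reference counting, locking against exit and tracing dropped):

	/* simplified sketch of the upstream oom_reaper retry logic */
	static void oom_reap_task(struct task_struct *tsk)
	{
		int attempts = 0;
		struct mm_struct *mm = tsk->signal->oom_mm;

		/*
		 * __oom_reap_task_mm() starts with
		 * down_read_trylock(&mm->mmap_sem) and fails if it can't get
		 * it, so one reader stuck in the page fault plus one writer
		 * queued behind it are enough to defeat every attempt.
		 */
		while (attempts++ < MAX_OOM_REAP_RETRIES &&
		       !__oom_reap_task_mm(tsk, mm))
			schedule_timeout_idle(HZ/10);

		if (attempts > MAX_OOM_REAP_RETRIES) {
			/* this is the message and lock dump in the report below */
			pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
				task_pid_nr(tsk), tsk->comm);
			debug_show_all_locks();
		}

		/* hide this mm from the OOM killer so it can pick another task */
		set_bit(MMF_OOM_SKIP, &mm->flags);
	}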
There's nothing specific about khugepaged here: you can do a pthread_create() to create a thread in your a.out program and then call mmap/munmap in a loop (no need to touch any memory). Eventually you'll get the page fault in your a.out process holding the mmap_sem for reading and the child thread trying to take it for writing, which should be enough to block the OOM reaper entirely, with the child stuck in D state.
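A minimal sketch of such a reproducer (my reconstruction from the description above, untested; the allocation size is a placeholder and would need tuning to exceed free RAM plus swap so the OOM killer actually fires):

	/* build with: gcc -pthread -o a.out repro.c */
	#include <pthread.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>

	/* child thread: each mmap/munmap takes the mmap_sem for writing */
	static void *mmap_munmap_loop(void *arg)
	{
		(void)arg;
		for (;;) {
			void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			if (p != MAP_FAILED)
				munmap(p, 4096); /* no need to touch the memory */
		}
		return NULL;
	}

	int main(void)
	{
		pthread_t thread;
		/* placeholder size: must exceed free RAM + swap to hit OOM */
		size_t size = 4UL << 30;
		char *buf = malloc(size);

		if (!buf)
			return 1;

		pthread_create(&thread, NULL, mmap_munmap_loop, NULL);

		/*
		 * Fault the anon memory in: sooner or later a page fault
		 * ends up in reclaim while holding the mmap_sem for reading,
		 * with the sibling thread queued on it for writing, and the
		 * down_read_trylock in the OOM reaper keeps failing.
		 */
		for (;;)
			memset(buf, 1, size);
	}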
I already have a patch in my tree that lets exit_mmap and the OOM reaper take down pagetables concurrently, serialized only by the PT lock (upstream the OOM reaper can only run before exit_mmap starts, while mm_users is still > 0). This lets the OOM reaper run even if mm_users of the TIF_MEMDIE task already reached 0. However, to avoid taking the mmap_sem in __oom_reap_task_mm for reading you would need to do the opposite of upstream, and then it would only solve OOM hangs between the last mmput and exit_mmap.

To zap pagetables without mmap_sem I think quite some overhaul is needed (likely much bigger than the one required to fix the memory and coredump corruption). If that is done, it should be done to run MADV_DONTNEED without mmap_sem, if anything. Increased OOM reaper accuracy wouldn't be enough of a motivation to justify such an increase in complexity and constant fast-path overhead (be it releasing vmas with RCU through callbacks with delayed freeing, or anything else required to drop the mmap_sem while still allowing the OOM reaper to run while mm_users is still > 0). It'd be quite challenging to do that, because the vma bits are also protected by the mmap_sem, and with RCU you can only replace rbtree nodes, not rebalance the tree.

Assuming we do all that work and slow down the fast paths further, just for the OOM reaper, what would then happen if the hung process had no anonymous memory to free and instead ran on shmem only? Would we be back to square one and hang with the below dump? What if we fixed xfs instead, to get rid of the below problem? Wouldn't the OOM reaper then become irrelevant, whether it's removed or not?

> ----------
> [ 493.787997] Out of memory: Kill process 3163 (a.out) score 739 or sacrifice child
> [ 493.791708] Killed process 3163 (a.out) total-vm:4268108kB, anon-rss:2754236kB, file-rss:0kB, shmem-rss:0kB
> [ 494.838382] oom_reaper: unable to reap pid:3163 (a.out)
> [ 494.847768]
> [ 494.847768] Showing all locks held in the system:
> [ 494.861357] 1 lock held by oom_reaper/59:
> [ 494.865903] #0: (tasklist_lock){.+.+..}, at: [<ffffffff9f0c202d>] debug_show_all_locks+0x3d/0x1a0
> [ 494.872934] 1 lock held by khugepaged/63:
> [ 494.877426] #0: (&mm->mmap_sem){++++++}, at: [<ffffffff9f1d5a4d>] khugepaged+0x99d/0x1af0
> [ 494.884165] 3 locks held by kswapd0/75:
> [ 494.888628] #0: (shrinker_rwsem){++++..}, at: [<ffffffff9f16c638>] shrink_slab.part.44+0x48/0x2b0
> [ 494.894125] #1: (&type->s_umount_key#30){++++++}, at: [<ffffffff9f1f30f6>] trylock_super+0x16/0x50
> [ 494.898328] #2: (&pag->pag_ici_reclaim_lock){+.+.-.}, at: [<ffffffffc03aeafd>] xfs_reclaim_inodes_ag+0x3ad/0x4d0 [xfs]
> [ 494.902703] 3 locks held by kworker/u128:31/387:
> [ 494.905404] #0: ("writeback"){.+.+.+}, at: [<ffffffff9f08ddcc>] process_one_work+0x1fc/0x480
> [ 494.909237] #1: ((&(&wb->dwork)->work)){+.+.+.}, at: [<ffffffff9f08ddcc>] process_one_work+0x1fc/0x480
> [ 494.913205] #2: (&type->s_umount_key#30){++++++}, at: [<ffffffff9f1f30f6>] trylock_super+0x16/0x50
> [ 494.916954] 1 lock held by xfsaild/sda1/422:
> [ 494.919288] #0: (&xfs_nondir_ilock_class){++++--}, at: [<ffffffffc03b8828>] xfs_ilock_nowait+0x148/0x240 [xfs]
> [ 494.923470] 1 lock held by systemd-journal/491:
> [ 494.926102] #0: (&(&ip->i_mmaplock)->mr_lock){++++++}, at: [<ffffffffc03b85da>] xfs_ilock+0x11a/0x1b0 [xfs]
> [ 494.929942] 1 lock held by gmain/745:
> [ 494.932368] #0: (&(&ip->i_mmaplock)->mr_lock){++++++}, at: [<ffffffffc03b85da>] xfs_ilock+0x11a/0x1b0 [xfs]
> [ 494.936505] 1 lock held by tuned/1009:
> [ 494.938856] #0: (&(&ip->i_mmaplock)->mr_lock){++++++}, at: [<ffffffffc03b85da>] xfs_ilock+0x11a/0x1b0 [xfs]
> [ 494.942824] 2 locks held by agetty/982:
> [ 494.944900] #0: (&tty->ldisc_sem){++++.+}, at: [<ffffffff9f78503f>] ldsem_down_read+0x1f/0x30
> [ 494.948244] #1: (&ldata->atomic_read_lock){+.+...}, at: [<ffffffff9f4108bf>] n_tty_read+0xbf/0x8e0
> [ 494.952118] 1 lock held by sendmail/984:
> [ 494.954408] #0: (&(&ip->i_mmaplock)->mr_lock){++++++}, at: [<ffffffffc03b85da>] xfs_ilock+0x11a/0x1b0 [xfs]
> [ 494.958370] 5 locks held by a.out/3163:
> [ 494.960544] #0: (&mm->mmap_sem){++++++}, at: [<ffffffff9f05ca34>] __do_page_fault+0x154/0x4c0
> [ 494.964191] #1: (shrinker_rwsem){++++..}, at: [<ffffffff9f16c638>] shrink_slab.part.44+0x48/0x2b0
> [ 494.967922] #2: (&type->s_umount_key#30){++++++}, at: [<ffffffff9f1f30f6>] trylock_super+0x16/0x50
> [ 494.971548] #3: (&pag->pag_ici_reclaim_lock){+.+.-.}, at: [<ffffffffc03ae7fe>] xfs_reclaim_inodes_ag+0xae/0x4d0 [xfs]
> [ 494.975644] #4: (&xfs_nondir_ilock_class){++++--}, at: [<ffffffffc03b8580>] xfs_ilock+0xc0/0x1b0 [xfs]
> [ 494.979194] 1 lock held by a.out/3164:
> [ 494.981220] #0: (&mm->mmap_sem){++++++}, at: [<ffffffff9f076d05>] do_exit+0x175/0xbb0
> [ 494.984448] 1 lock held by a.out/3165:
> [ 494.986554] #0: (&mm->mmap_sem){++++++}, at: [<ffffffff9f076d05>] do_exit+0x175/0xbb0
> [ 494.989841] 1 lock held by a.out/3166:
> [ 494.992089] #0: (&mm->mmap_sem){++++++}, at: [<ffffffff9f076d05>] do_exit+0x175/0xbb0
> [ 494.995388] 1 lock held by a.out/3167:
> [ 494.997420] #0: (&mm->mmap_sem){++++++}, at: [<ffffffff9f076d05>] do_exit+0x175/0xbb0
> ----------
>
> collapse_huge_page at mm/khugepaged.c:1001
> (inlined by) khugepaged_scan_pmd at mm/khugepaged.c:1209
> (inlined by) khugepaged_scan_mm_slot at mm/khugepaged.c:1728
> (inlined by) khugepaged_do_scan at mm/khugepaged.c:1809
> (inlined by) khugepaged at mm/khugepaged.c:1854
>
> and "how can we close race between checking MMF_OOM_SKIP and doing last alloc_page_from_freelist()
> attempt (because that race allows needlessly selecting the second task for kill)" in addition to
> "how can we close race between unmap_page_range() and the page faults with retry fallback".

Yes. And "how is the OOM reaper guaranteed not to run while coredumping is already starting" should be added to the above list of things to fix or explain.

I'm just questioning if all this energy isn't better spent fixing XFS with a memory reserve in xfs_reclaim_inode for kmem_alloc (like we have mempools for bio), dropping the OOM reaper and leaving the VM fast paths alone.

> The subject of this thread is "how can we close race between unmap_page_range()
> and the page faults with retry fallback". Are you suggesting that we should remove
> the OOM reaper so that we don't need to change page faults and/or __mmput() paths?

Well, certainly if it's not fixed, I think we'd be better off removing it, because the risk of a hang is preferable to the risk of memory corruption or corrupted core dumps. If it were as simple as it currently is, it would be nice to have, but doing it safely, without the risk of corrupting memory and coredumps and without slowing down the VM fast paths, sounds like overkill. Last but not least, it hides the reproducibility of issues like the hang you posted above, which I think it can't do anything about even if you remove khugepaged...

... unless we drop the mmap_sem from MADV_DONTNEED, but that's not easily feasible if unmap_page_range has to run while mm_users may still be > 0. Doing more VM changes that are OOM reaper specific doesn't seem attractive to me. I'd prefer it if we could fix the issues in xfs the old fashioned way, so that we don't end up in a hang again if, after all that work, the TIF_MEMDIE task happens to have 0 anon memory allocated.

Thanks,
Andrea