Michal!

On Wed, Mar 23 2022 at 10:17, Michal Hocko wrote:
> Let me skip over the futex part, which I need to digest, and only focus
> on the oom side of things for clarification.

The most important thing to know about futexes is: They are cursed.

> On Tue 22-03-22 23:43:18, Thomas Gleixner wrote:
> [...]
>> > While some places can be handled by changing uninterruptible waiting
>> > to killable there are places which are not really fixable, e.g. lock
>> > chain dependency which leads to memory allocation.
>>
>> I'm not following. Which lock chain dependency causes memory allocation?
>
> Consider an oom victim is blocked on a lock or waiting for an event to
> happen, but the lock holder is stuck allocating or the wake up depends on
> an allocation. Many sleeping locks are doing GFP_KERNEL allocations.

Fair enough.

>> That will prevent the enforced race in most cases and allow the exiting
>> and/or killed processes to clean up themselves. Not pretty, but it
>> should significantly reduce the chance of the reaper winning the race
>> with the exiting and/or killed process.
>>
>> It's not going to work when the problem is combined with a heavy VM
>> overload situation which keeps a guest (or one/some of its vCPUs) away
>> from being scheduled.

See below for a discussion of guarantees.

>> If it failed to do so when the sleep returns, then you still can reap
>> it.
>
> Yes, this is certainly an option. Please note that the oom_reaper is not
> the only way to trigger this. The process_mrelease syscall performs the
> same operation from userspace. Arguably process_mrelease could be used
> sanely/correctly because the userspace oom killer can do proactive
> cleanup steps before going to the final SIGKILL & process_mrelease. One
> way would be to send SIGTERM in the first step and allow the victim to
> perform its cleanup.

A potential staged approach would be:

     Send SIGTERM
     wait some time
     Send SIGKILL
     wait some time
     sys_process_mrelease()

Needs proper documentation though.

>> That said, the robust list is no guarantee. It's a best effort approach
>> which works well most of the time, at least for the "normal" issues
>> where a task holding a futex dies unexpectedly. But there is no
>> guarantee that it works under all circumstances, e.g. OOM.
>
> OK, so this is an important note. I am all fine with documenting this
> restriction. It is not like oom victims couldn't cause other disruptions
> by leaving inconsistent/stale state behind.

Correct. Futexes are a small part of the overall damage.

>> Wrong. The whole point of robust lists is to handle the "normal" case
>> gracefully. A process being OOM killed is _NOT_ in the "normal"
>> category.
>>
>> Neither is it "normal" that a VM is scheduled out long enough to miss a
>> 1 second deadline. That might be considered normal by cloud folks, but
>> that's absolutely not normal from an OS POV. Again, that's not an OS
>> problem, that's an operator/admin problem.
>
> Thanks for this clarification. I would tend to agree. Following the
> previous example, oom victims can leave inconsistent state behind which
> can influence other processes. I am wondering what kind of expectations
> we can make about the lock protected state when the holder of the lock
> has been interrupted at a random place in the critical section.

Right. If a futex is released by the exit cleanup, it is marked with
FUTEX_OWNER_DIED. User space is supposed to handle this.
pthread_mutex_lock() returns EOWNERDEAD to the caller if the owner died
bit is set. It's the caller's responsibility to deal with potentially
corrupted or inconsistent state.
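For illustration, a minimal user space sketch with a robust pthread
mutex (error handling trimmed, and the actual recovery of the protected
state is of course application specific):

#include <pthread.h>
#include <errno.h>

static pthread_mutex_t lock;

static void init_lock(void)
{
        pthread_mutexattr_t attr;

        pthread_mutexattr_init(&attr);
        /*
         * Without PTHREAD_MUTEX_ROBUST the owner died information is
         * not propagated to the other waiters at all.
         */
        pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
        pthread_mutex_init(&lock, &attr);
        pthread_mutexattr_destroy(&attr);
}

static int lock_and_recover(void)
{
        int ret = pthread_mutex_lock(&lock);

        if (ret == EOWNERDEAD) {
                /*
                 * The previous owner died with the lock held, so the
                 * protected state might be inconsistent. Validate or
                 * reinitialize that state here (application specific),
                 * then mark the mutex consistent again.
                 */
                pthread_mutex_consistent(&lock);
                ret = 0;
        }
        return ret;
}

If the state cannot be repaired, the caller can unlock without calling
pthread_mutex_consistent(), which makes all subsequent lock attempts
fail with ENOTRECOVERABLE.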
>> So let me summarize what I think needs to be done in priority order:
>>
>> #1 Delay the oom reaper so the normal case of a process being able to
>>    exit is not subject to a pointless race to death.
>>
>> #2 If #1 does not result in the desired outcome, reap the mm (which is
>>    what we have now).
>>
>> #3 If it's expected that #2 will allow the stuck process to make
>>    progress on the way towards cleanup, then do not reap any VMA
>>    containing a robust list head of a thread which belongs to the
>>    exiting and/or killed process.
>>
>> The remaining cases, i.e. the lock chain example I pointed out above or
>> the stuck forever task, are going to be rare and fall under the
>> collateral damage and no guarantee rule.
>
> I do agree that delaying the oom_reaper wake up is the simplest thing to
> do at this stage and it could catch most of the failures. We still have
> the process_mrelease syscall case, but I guess we can document this as a
> caveat in the man page.

Yes. The user space oom killer should better know what it is doing :)

Thanks,

        tglx