On Tue, May 4, 2021 at 6:26 PM Suren Baghdasaryan <surenb@xxxxxxxxxx> wrote:
>
> On Tue, May 4, 2021 at 5:37 PM Shakeel Butt <shakeelb@xxxxxxxxxx> wrote:
> >
> > On Wed, Apr 21, 2021 at 7:29 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
> >
> > > [...]
> > > > > What if the pool is depleted?
> > > >
> > > > This would mean that either the estimate of the mempool size is
> > > > bad or the oom-killer is buggy and leaking memory.
> > > >
> > > > I am open to any design directions for the mempool or some other
> > > > way where we can provide a notion of memory guarantee to the
> > > > oom-killer.
> > > OK, thanks for the clarification. There will certainly be hard
> > > problems to sort out[1], but the overall idea makes sense to me and
> > > it sounds like a much better approach than an OOM-specific solution.
> > >
> > > [1] - how the pool is going to be replenished without hitting all
> > > the potential reclaim problems (thus dependencies on all other
> > > tasks directly/indirectly), yet without relying on any background
> > > workers to do that on the task's behalf without proper accounting
> > > etc...
> > > --
> >
> > I am currently contemplating between two paths here:
> >
> > First, the mempool, exposed through either prctl or a new syscall.
> > Users would need to trace their userspace oom-killer (or whatever
> > their use case is) to find an appropriate mempool size and
> > periodically refill the mempool if allowed by the state of the
> > machine. The challenge here is to find a good value for the mempool
> > size and to coordinate the refilling of the mempools.
> >
> > Second is a mix of Roman's and Peter's suggestions, but much more
> > simplified: a very simple watchdog with a kill-list of processes. If
> > userspace does not pet the watchdog within a specified time, it kills
> > all the processes in the kill-list. The challenge here is to
> > maintain/update the kill-list.
>
> IIUC this solution is designed to identify cases when oomd/lmkd got
> stuck while allocating memory due to memory shortages and therefore
> can't feed the watchdog. In such a case the kernel goes ahead and
> kills some processes to free up memory and unblock the blocked
> process. Effectively this would limit the time such a process stays
> stuck to the duration of the watchdog timeout. If my understanding of
> this proposal is correct,

Your understanding is indeed correct.

> then I see the following downsides:
> 1. oomd/lmkd are still not prevented from getting stuck; it just
> limits the duration of this blocked state. Delaying kills when memory
> pressure is high, even for a short duration, is very undesirable.

Yes, I agree.

> I think
> having mempool reserves could address this issue better if it can
> always guarantee memory availability (not sure if it's possible in
> practice).

I think "mempool ... always guarantee memory availability" is something
I should quantify with some experiments.

> 2. What would be the performance overhead of this watchdog? To limit
> the duration a process can stay blocked to a small enough value, we
> would have to use quite a small timeout, which means oomd/lmkd would
> have to wake up quite often to feed the watchdog. Frequent wakeups on
> a battery-powered system are not a good idea.

This is indeed the downside, i.e. the tradeoff between acceptable stall
and wakeup frequency.

> 3. What if oomd/lmkd gets stuck for some memory-unrelated reason and
> can't feed the watchdog? In such a scenario the kernel would assume
> that it is stuck due to memory shortages and would go on a killing
> spree.

This is correct, but IMHO a killing spree is not worse than oomd/lmkd
getting stuck for some other reason.

> If there is a sure way to identify when a process gets stuck
> due to memory shortages then this could work better.

Hmm, are you suggesting looking at the stack traces of the userspace
oom-killer or at some metrics related to it? That would complicate the
code.

> 4. Additional complexity of keeping the list of potential victims in
> the kernel. Maybe we can simply reuse oom_score to choose the best
> victims?

Your point about the additional complexity is correct. Regarding
oom_score, I think you meant oom_score_adj. I would avoid putting more
policies/complexity in the kernel, but I get your point that the
simplest watchdog might not be helpful at all.
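
Just to make the "reuse the existing knobs" direction concrete: the
agent could pre-mark its preferred victims itself through the existing
/proc/<pid>/oom_score_adj interface, so that whatever kernel-side kill
eventually fires picks them first and no victim list needs to live in
the kernel. A rough sketch only; the PIDs are made up and error
handling is minimal:

#include <stdio.h>
#include <sys/types.h>

/* Write adj (valid range [-1000, 1000]) to /proc/<pid>/oom_score_adj. */
static int mark_victim(pid_t pid, int adj)
{
	char path[64];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/oom_score_adj", (int)pid);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d", adj);
	return fclose(f);
}

int main(void)
{
	/* Hypothetical kill-list maintained entirely in userspace. */
	pid_t kill_list[] = { 1234, 5678 };
	unsigned int i;

	for (i = 0; i < sizeof(kill_list) / sizeof(kill_list[0]); i++)
		mark_victim(kill_list[i], 1000);	/* most preferred victims */

	return 0;
}

This keeps the victim policy in userspace; the only new piece the
kernel would need is the watchdog itself.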

> Thanks,
> Suren.
> >
> > I would prefer the direction which oomd and lmkd are open to adopt.
> >
> > Any suggestions?
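
To make the first option a bit more concrete, below is roughly how I
imagine the userspace side could look. Note that PR_SET_MEMPOOL and its
semantics are entirely made up for illustration (no such prctl exists
today), and the reserve size would have to come from tracing the agent
as described above:

#include <stdio.h>
#include <sys/prctl.h>

#define PR_SET_MEMPOOL	1000	/* hypothetical prctl command */

int main(void)
{
	/*
	 * Reserve 2MiB for this task; the actual value would come from
	 * tracing the agent's allocations on its monitoring/kill path.
	 */
	if (prctl(PR_SET_MEMPOOL, 2UL * 1024 * 1024, 0, 0, 0))
		perror("PR_SET_MEMPOOL (hypothetical)");

	/*
	 * ... normal oomd/lmkd main loop, periodically asking the kernel
	 * to refill the reserve when the machine is not under pressure ...
	 */
	return 0;
}

The hard parts remain picking that size and deciding when a refill is
allowed, as noted above.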