On Wed, Apr 21, 2021 at 06:26:37AM -0700, Shakeel Butt wrote:
> On Tue, Apr 20, 2021 at 7:58 PM Roman Gushchin <guro@xxxxxx> wrote:
> > >
[...]
> > >
> > > Michal has suggested ALLOC_OOM which is less risky.
> >
> > The problem is that even if you serve the oom daemon task with pages
> > from a reserve/custom pool, it doesn't guarantee anything, because the task
> > can still wait for a long time on some mutex, taken by another process
> > throttled somewhere in the reclaim.
>
> I am assuming here by mutex you are referring to locks which
> oom-killer might have to take to read metrics or any possible lock
> which oom-killer might have to take which some other process can take
> too.
>
> Have you observed this situation happening with oomd on production?

I'm not aware of any oomd-specific issues. I can't say for sure that they
don't exist, but so far it hasn't been a problem for us. Maybe it's because
you tend to have less pagecache (as I understand), or maybe it comes down
to specific oomd policies/settings.

We did have different pains with mmap_sem, though: atop and similar
programs, where reading process data stalled on mmap_sem for a long time.

Thanks!
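
(For readers outside the thread: the atop-style stalls above typically come from reading files like /proc/<pid>/smaps, which walk the process's VMA list under mmap_sem/mmap_lock and so can block behind a writer stuck in reclaim. /proc/<pid>/status serves counters such as VmRSS from per-mm atomic counters, which is much less likely to stall. A hypothetical userspace sketch of the safer approach — parse_status and rss_kb are illustrative names, not part of any real tool:)

```python
# Hypothetical sketch: read memory stats from /proc/<pid>/status
# rather than /proc/<pid>/smaps, to avoid blocking on mmap_lock.

def parse_status(text):
    """Parse /proc/<pid>/status-style 'Key:\tvalue' lines into a dict."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            stats[key] = value.strip()
    return stats

def rss_kb(pid):
    """Return VmRSS in kB for pid, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/status") as f:
            value = parse_status(f.read()).get("VmRSS")
    except OSError:
        return None
    return int(value.split()[0]) if value else None
```

(The trade-off: status gives only aggregate counters; per-mapping detail still requires smaps and the lock it takes.)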