On Wed, Apr 21, 2021 at 12:23 AM Michal Hocko <mhocko@xxxxxxxx> wrote:
>
[...]
> > In our observation the global reclaim is very non-deterministic at the
> > tail and dramatically impacts the reliability of the system. We are
> > looking for a solution which is independent of the global reclaim.
>
> I believe it is worth pursuing a solution that would make the memory
> reclaim more predictable. I have seen direct reclaim memory throttling
> in the past. For some reason which I haven't tried to examine this has
> become less of a problem with newer kernels. Maybe the memory access
> patterns have changed or those problems got replaced by other issues but
> an excessive throttling is definitely something that we want to address
> rather than work around by some user visible APIs.
>

I agree we want to address the excessive throttling, but that has to be
done for everyone on the machine and, most importantly, it is a moving
target. The reclaim code continues to evolve, and in addition it has
callbacks into a diverse set of subsystems.

The user-visible API is for one specific use case, i.e. the oom-killer,
which will indirectly help in reducing the excessive throttling.

[...]
> > So, the suggestion is to have a per-task flag to (1) indicate to not
> > throttle and (2) fail allocations easily on significant memory
> > pressure.
> >
> > For (1), the challenge I see is that there are a lot of places in the
> > reclaim code paths where a task can get throttled. There are
> > filesystems that block/throttle in slab shrinking. Any process can get
> > blocked on an unrelated page or inode writeback within reclaim.
> >
> > For (2), I am not sure how to deterministically define "significant
> > memory pressure". One idea is to follow the __GFP_NORETRY semantics
> > and along with (1) the userspace oom-killer will see ENOMEM more
> > reliably than getting stuck in the reclaim.
>
> Some of the interfaces (e.g. seq_file uses GFP_KERNEL reclaim strength)
> could be more relaxed and rather fail than OOM kill but wouldn't your
> OOM handler be effectively dysfunctional when not able to collect data
> to make a decision?
>

Yes, it would be. Roman is suggesting a precomputed kill-list (pidfds
ready to send SIGKILL); whenever the oom-killer gets ENOMEM, it would
fall back to the kill-list. We are still contemplating the ways and side
effects of preferentially returning ENOMEM in the slowpath for the
oom-killer, as well as the complexity of maintaining the kill-list and
keeping it up to date.

thanks,
Shakeel
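
For illustration only, the kill-list fallback could look roughly like
the userspace sketch below. It is not a proposal and not anyone's actual
implementation. KillList, handle_pressure and choose_victim are made-up
names. os.pidfd_open() and signal.pidfd_send_signal() are real, but need
Linux 5.3+ and Python 3.9+. A real handler would more likely be written
in C with its memory locked, and nothing here shows that the kill path
stays allocation-free under pressure.

```python
import errno
import os
import signal


class KillList:
    """Pre-opened pidfds for victims, highest priority first (hypothetical)."""

    def __init__(self):
        self._entries = []  # list of (pid, pidfd)

    def add(self, pid):
        # Open the pidfd ahead of time, while memory is still available.
        self._entries.append((pid, os.pidfd_open(pid)))

    def kill_next(self):
        """SIGKILL the next victim; return its pid, or None if the list is empty."""
        while self._entries:
            pid, fd = self._entries.pop(0)
            try:
                signal.pidfd_send_signal(fd, signal.SIGKILL)
                return pid
            except ProcessLookupError:
                continue  # victim is already gone; try the next entry
            finally:
                os.close(fd)
        return None


def handle_pressure(kill_list, choose_victim):
    """Pick a victim from live data; on ENOMEM fall back to the kill-list.

    choose_victim is a hypothetical callable that collects data (e.g. from
    /proc) and returns a pid. It may raise OSError with errno ENOMEM.
    """
    try:
        pid = choose_victim()
    except OSError as e:
        if e.errno != errno.ENOMEM:
            raise
        return kill_list.kill_next()
    os.kill(pid, signal.SIGKILL)
    return pid
```

The open question from above remains: add() has to be called again
whenever priorities change or victims exit, which is the maintenance
cost of keeping the list up to date.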