On Tue 20-04-21 09:04:21, Shakeel Butt wrote:
> On Mon, Apr 19, 2021 at 11:46 PM Michal Hocko <mhocko@xxxxxxxx> wrote:
> >
> > On Mon 19-04-21 18:44:02, Shakeel Butt wrote:
> [...]
> > > memory.min. However a new allocation from the userspace oom-killer
> > > can still get stuck in the reclaim, and a policy-rich oom-killer
> > > does trigger new allocations through syscalls or even the heap.
> >
> > Can you be more specific please?
> >
>
> To decide when to kill, the oom-killer has to read a lot of metrics.
> It has to open a lot of files to read them and there will definitely
> be new allocations involved in those operations. For example, reading
> memory.stat does a page-size allocation. Similarly, to perform an
> action the oom-killer may have to read the cgroup.procs file, which
> again allocates internally.

True, but many of those can be avoided by opening the file early. At
least seq_file based ones will not allocate later if the output size
doesn't increase, which should be the case for many of them. I think
it is a general improvement to push those that allocate during read to
an open-time allocation.

> Regarding sophisticated oom policy, I can give one example of our
> cluster-level policy. For robustness, many user-facing jobs run a lot
> of instances in a cluster to handle failures. Such jobs are tolerant
> to some amount of failures, but they still have requirements to not
> let the number of running instances fall below some threshold.
> Normally killing such jobs is fine, but we do want to make sure that
> we do not violate their cluster-level agreement. So, the userspace
> oom-killer may dynamically need to confirm whether such a job can be
> killed.

What kind of data do you need to examine to make those decisions?

> [...]
> > > To reliably solve this problem, we need to give guaranteed memory
> > > to the userspace oom-killer.
> >
> > There is nothing like that. Even memory reserves are a finite
> > resource which can be consumed, as they are shared with other users
> > who are not necessarily coordinated. So before we start discussing
> > making this even more muddy by handing over memory reserves to
> > userspace, we should really examine whether pre-allocation is
> > something that will not work.
> >
>
> We actually explored whether we can restrict the oom-killer to
> syscalls which do not do memory allocations. We concluded that is not
> practical and not maintainable. Whatever list we can come up with
> will be outdated soon. In addition, converting all the must-have
> syscalls to not do allocations is not possible/practical.

I am definitely curious to learn more.

[...]
> > > 2. Mempool
> > >
> > > The idea is to preallocate a mempool with a given amount of memory
> > > for the userspace oom-killer. Preferably this will be per-thread,
> > > and the oom-killer can preallocate a mempool for its specific
> > > threads. The core page allocator can check, before going to the
> > > reclaim path, whether the task has private access to the mempool
> > > and return a page from it if so.
> >
> > Could you elaborate some more on how this would be controlled from
> > userspace? A dedicated syscall? A driver?
> >
>
> I was thinking of simply prctl(SET_MEMPOOL, bytes) to assign a
> mempool to a thread (not shared between threads) and
> prctl(RESET_MEMPOOL) to free the mempool.

I am not a great fan of prctl. It has become a dumping ground for all
kinds of unrelated functionality. But let's say this is a minor detail
at this stage.
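Just to make sure we are talking about the same interface, here is a
rough userspace sketch of how I read the proposal. The
PR_SET_MEMPOOL/PR_RESET_MEMPOOL names, their values and the helper
functions are all made up for illustration; no such prctl commands
exist in any kernel today:

#include <stddef.h>
#include <stdio.h>
#include <errno.h>
#include <sys/prctl.h>

/*
 * Hypothetical prctl commands matching the proposal above. Neither
 * the names nor the values exist anywhere; illustration only.
 */
#define PR_SET_MEMPOOL          64
#define PR_RESET_MEMPOOL        65

/* Reserve a private pool for this thread before any memory pressure. */
static int oom_killer_reserve(size_t pool_bytes)
{
        if (prctl(PR_SET_MEMPOOL, pool_bytes, 0, 0, 0)) {
                perror("PR_SET_MEMPOOL");
                return -errno;
        }
        return 0;
}

/* Give the reserved memory back once the monitoring loop is done. */
static void oom_killer_release(void)
{
        prctl(PR_RESET_MEMPOOL, 0, 0, 0, 0);
}

int main(void)
{
        /* e.g. reserve 8MiB for the oom-killer thread */
        if (oom_killer_reserve(8UL << 20))
                return 1;

        /* ... metric reading and kill decisions would run here ... */

        oom_killer_release();
        return 0;
}

Whatever the final calling convention, the more important questions
are about the semantics.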
So you are proposing to have a per-mm mempool that would be used as a
fallback for an allocation which cannot make forward progress, right?
Would that pool be preallocated and sitting idle? What kind of
allocations would be allowed to use the pool? What if the pool is
depleted?
--
Michal Hocko
SUSE Labs