On Fri 25-05-18 15:18:11, David Rientjes wrote:
[...]
> Let's see what Mike and Aneesh say, because they may object to using
> VM_FAULT_OOM because there's no way to guarantee that we'll come under
> the limit of hugetlb_cgroup as a result of the oom. My assumption is
> that we use VM_FAULT_SIGBUS since oom killing will not guarantee that
> the allocation can succeed.

Yes. And the lack of hugetlb awareness in the oom killer is another
reason. There is absolutely no reason to kill a task when somebody has
misconfigured the hugetlb pool.

> But now a process can get a SIGBUS if its hugetlb pages are not
> allocatable or its under a limit imposed by hugetlb_cgroup that it's
> not aware of. Faulting hugetlb pages is certainly risky business these
> days...

It always has been, and I am afraid it always will be, unless somebody
reimplements the current code to be NUMA aware, for example (it is just
too easy to drain the per-node reserves...).

> Perhaps the optimal solution for reaching hugetlb_cgroup limits is to
> induce an oom kill from within the hugetlb_cgroup itself? Otherwise
> the unlucky process to fault their hugetlb pages last gets SIGBUS.

Hmm, so you expect that the killed task would simply return pages to
the pool? Wouldn't that require a hugetlb cgroup OOM killer that cares
only about the hugetlb reservations of tasks? Is that worth all the
effort and the additional code?
-- 
Michal Hocko
SUSE Labs