On Tue 10-10-17 10:17:33, Johannes Weiner wrote:
> On Tue, Oct 10, 2017 at 11:14:30AM +0200, Michal Hocko wrote:
> > On Mon 09-10-17 16:26:13, Johannes Weiner wrote:
> > > It's consistent in the sense that only page faults enable the memcg
> > > OOM killer. It's not the type of memory that decides, it's whether the
> > > allocation context has a channel to communicate an error to userspace.
> > >
> > > Whether userspace is able to handle -ENOMEM from syscalls was a voiced
> > > concern at the time this patch was merged, although there haven't been
> > > any reports so far,
> >
> > Well, I remember reports about MAP_POPULATE breaking or at least having
> > an unexpected behavior.
>
> Hm, that slipped past me. Did we do something about these? Or did they
> fix userspace?

Well, it was mostly LTP complaining. I tried to fix that but Linus was
against it, so we just documented that this is possible and that
MAP_POPULATE is not a guarantee.

> > Well, we should be able to do that with the oom_reaper. At least for v2
> > which doesn't have synchronous userspace oom killing.
>
> I don't see how the OOM reaper is a guarantee as long as we have this:
>
> 	if (!down_read_trylock(&mm->mmap_sem)) {
> 		ret = false;
> 		trace_skip_task_reaping(tsk->pid);
> 		goto unlock_oom;
> 	}

And we will simply mark the victim MMF_OOM_SKIP and hide it from the oom
killer if we fail to get the mmap_sem after several attempts. This will
allow a new victim to be found. So we shouldn't deadlock.

> What do you mean by 'v2'?

cgroup v2, because the legacy memcg allowed a sync wait for the oom
killer and that would be a bigger problem from deep callchains for
obvious reasons.

> > > > c) Overcharge kmem to oom memcg and queue an async memcg limit checker,
> > > >    which will oom kill if needed.
> > >
> > > This makes the most sense to me. Architecturally, I imagine this would
> > > look like b), with an OOM handler at the point of return to userspace,
> > > except that we'd overcharge instead of retrying the syscall.
> >
> > I do not think we should break the hard limit semantic if possible. We
> > can currently allow that for allocations which are very short term (oom
> > victims) or too important to fail but allowing that for kmem charges in
> > general sounds like too easy to runaway.
>
> I'm not sure there is a convenient way out of this.
>
> If we want to respect the hard limit AND guarantee allocation success,
> the OOM killer has to free memory reliably - which it doesn't. But if
> it did, we could also break the limit temporarily and have the OOM
> killer replenish the pool before that userspace app can continue. The
> allocation wouldn't have to be short-lived, since memory is fungible.

If we can guarantee the oom killer is started then we can allow
temporary access to reserves, which is already implemented even for
memcg. The thing is we do not invoke the oom killer...
-- 
Michal Hocko
SUSE Labs