On Tue, Oct 10, 2017 at 04:24:34PM +0200, Michal Hocko wrote:
> On Tue 10-10-17 10:17:33, Johannes Weiner wrote:
> > On Tue, Oct 10, 2017 at 11:14:30AM +0200, Michal Hocko wrote:
> > > On Mon 09-10-17 16:26:13, Johannes Weiner wrote:
> > > > It's consistent in the sense that only page faults enable the memcg
> > > > OOM killer. It's not the type of memory that decides, it's whether the
> > > > allocation context has a channel to communicate an error to userspace.
> > > >
> > > > Whether userspace is able to handle -ENOMEM from syscalls was a voiced
> > > > concern at the time this patch was merged, although there haven't been
> > > > any reports so far,
> > >
> > > Well, I remember reports about MAP_POPULATE breaking or at least having
> > > an unexpected behavior.
> >
> > Hm, that slipped past me. Did we do something about these? Or did they
> > fix userspace?
>
> Well it was mostly LTP complaining. I have tried to fix that but Linus
> was against so we just documented that this is possible and MAP_POPULATE
> is not a guarantee.

Okay, makes sense. I wouldn't really count that as a regression.

> > > Well, we should be able to do that with the oom_reaper. At least for v2
> > > which doesn't have synchronous userspace oom killing.
> >
> > I don't see how the OOM reaper is a guarantee as long as we have this:
> >
> > 	if (!down_read_trylock(&mm->mmap_sem)) {
> > 		ret = false;
> > 		trace_skip_task_reaping(tsk->pid);
> > 		goto unlock_oom;
> > 	}
>
> And we will simply mark the victim MMF_OOM_SKIP and hide it from the oom
> killer if we fail to get the mmap_sem after several attempts. This will
> allow to find a new victim. So we shouldn't deadlock.

It's less likely to deadlock, but not exactly deadlock-free. There
might not BE any other mm's holding significant amounts of memory.

> > What do you mean by 'v2'?
>
> cgroup v2 because the legacy memcg allowed sync wait for the oom killer
> and that would be a bigger problem from a deep callchains for obvious
> reasons.

Actually, the async oom killing code isn't dependent on cgroup
version. cgroup1 doesn't wait inside the charge context, either.

> > > > > c) Overcharge kmem to oom memcg and queue an async memcg limit checker,
> > > > >    which will oom kill if needed.
> > > >
> > > > This makes the most sense to me. Architecturally, I imagine this would
> > > > look like b), with an OOM handler at the point of return to userspace,
> > > > except that we'd overcharge instead of retrying the syscall.
> > >
> > > I do not think we should break the hard limit semantic if possible. We
> > > can currently allow that for allocations which are very short term (oom
> > > victims) or too important to fail but allowing that for kmem charges in
> > > general sounds like too easy to runaway.
> >
> > I'm not sure there is a convenient way out of this.
> >
> > If we want to respect the hard limit AND guarantee allocation success,
> > the OOM killer has to free memory reliably - which it doesn't. But if
> > it did, we could also break the limit temporarily and have the OOM
> > killer replenish the pool before that userspace app can continue. The
> > allocation wouldn't have to be short-lived, since memory is fungible.
>
> If we can guarantee the oom killer is started then we can allow temporal
> access to reserves which is already implemented even for memcg. The
> thing is we do not invoke the oom killer...

You lost me here. Which reserves?

All I'm saying is that, when the syscall-context fails to charge, we
should do mem_cgroup_oom() to set up the async OOM killer, let the
charge succeed over the hard limit - since the OOM killer will most
likely get us back below the limit - then mem_cgroup_oom_synchronize()
before the syscall returns to userspace.

That would avoid returning -ENOMEM from syscalls without the risk of
the hard limit deadlocking - at the risk of sometimes overrunning the
hard limit, but that seems like the least problematic behavior out of
the three.
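
To make the ordering concrete, here is a toy userspace model of that
flow. It is only a sketch under stated assumptions: the toy_* names and
the "reclaimable" field are made up for illustration, and nothing here
is the kernel's actual mem_cgroup_oom() or mem_cgroup_oom_synchronize()
implementation. It shows just the sequence: overcharge, remember the
OOM context, resolve it at the return to userspace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; all names below are hypothetical, not kernel API. */
struct toy_memcg {
	long usage;		/* pages currently charged */
	long limit;		/* hard limit, in pages */
	long reclaimable;	/* pages a kill would free (assumption) */
};

struct toy_task {
	struct toy_memcg *oom_memcg;	/* set when a charge overran the limit */
};

/*
 * Charge from syscall context: never fails, may overrun the hard limit.
 * Recording the memcg stands in for what mem_cgroup_oom() would set up.
 */
static void toy_charge_syscall(struct toy_task *t, struct toy_memcg *cg,
			       long pages)
{
	assert(pages >= 0);
	cg->usage += pages;
	if (cg->usage > cg->limit)
		t->oom_memcg = cg;
}

/*
 * Run before returning to userspace; stands in for the role of
 * mem_cgroup_oom_synchronize(). Returns true if a kill was needed.
 */
static bool toy_return_to_user(struct toy_task *t)
{
	struct toy_memcg *cg = t->oom_memcg;

	if (!cg)
		return false;
	t->oom_memcg = NULL;

	if (cg->usage <= cg->limit)
		return false;	/* usage already dropped back below the limit */

	/* "Kill" a victim: release whatever it was holding. */
	cg->usage -= cg->reclaimable;
	cg->reclaimable = 0;
	return true;
}
```

Note that the model inherits the weakness discussed above: if
"reclaimable" is too small, usage stays over the limit after the kill.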