On 2014/3/6 10:52, David Rientjes wrote:
> On Wed, 5 Mar 2014, Andrew Morton wrote:
>
>>> This patchset introduces a standard interface through memcg that allows
>>> both of these conditions to be handled in the same clean way: users
>>> define memory.oom_reserve_in_bytes to define the reserve and this
>>> amount is allowed to be overcharged to the process handling the oom
>>> condition's memcg.  If used with the root memcg, this amount is allowed
>>> to be allocated below the per-zone watermarks for root processes that
>>> are handling such conditions (only root may write to
>>> cgroup.event_control for the root memcg).
>>
>> If process A is trying to allocate memory, cannot do so and the
>> userspace oom-killer is invoked, there must be means via which process
>> A waits for the userspace oom-killer's action.
>
> It does so by relooping in the page allocator waiting for memory to be
> freed, just like it would if the kernel oom killer were called and
> process A were waiting for the oom kill victim, process B, to exit.  We
> don't have the ability to put it on a waitqueue because we don't touch
> the freeing hotpath.  The userspace oom handler may not even
> necessarily kill anything; it may be able to free its own memory and
> start throttling other processes, for example.
>
>> And there must be
>> fallbacks which occur if the userspace oom killer fails to clear the
>> oom condition, or times out.
>
> I agree completely and proposed this before as
> memory.oom_delay_millisecs at http://lwn.net/Articles/432226, which we
> use internally when memory can't be freed or a memcg's limit cannot be
> expanded.  I guess it makes more sense alongside the rest of this
> patchset now; I can add it as an additional patch next time around.
>
>> Would be interested to see a description of how all this works.
>
> There's an article for LWN also being developed on this topic.  As
> mentioned in that article, I think it would be best to generalize a lot
> of the common functions and the eventfd handling entirely into a
> library.  I've attached an example implementation that just invokes a
> function to handle the situation.
>
> For Google's usecase specifically, at the root memcg level (system oom)
> we want to do priority-based memcg killing.  We want to kill from
> within a memcg hierarchy that has the lowest priority relative to other
> memcgs.  This cannot be implemented with /proc/pid/oom_score_adj today.
> Those priorities may also change depending on whether a memcg hierarchy
> is "overlimit", i.e. its limit has been increased temporarily because
> it has hit a memcg oom and additional memory is readily available on
> the system.
>
> So why not just introduce a memcg tunable that specifies a priority?
> Well, it's not that simple.  Other users will want to implement
> different policies on system oom (think of things like the existing
> panic_on_oom or oom_kill_allocating_task sysctls).  I introduced
> oom_kill_allocating_task originally for SGI because they wanted a fast
> oom kill rather than an expensive tasklist scan: the allocating task
> itself is rather irrelevant, it was just the unlucky task that was
> allocating at the moment that oom was triggered.  What's guaranteed is
> that current in that case will always free memory from under oom (it's
> not a member of some other mempolicy or cpuset that would be needlessly
> killed).  Both sysctls could trivially be reimplemented in userspace
> with this feature.
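For readers following along, here is a minimal sketch of what
registering such a handler could look like.  This is not David's
attached implementation: it assumes the memory controller is mounted at
/sys/fs/cgroup/memory, uses the memory.oom_reserve_in_bytes file
proposed by this patchset together with the existing cgroup eventfd
registration through cgroup.event_control, and omits all error
handling:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define MEMCG	"/sys/fs/cgroup/memory"

int main(void)
{
	char buf[64];
	uint64_t count;
	int efd, ofd, cfd;

	/* Reserve 32MB this handler may overcharge while its memcg is oom. */
	ofd = open(MEMCG "/memory.oom_reserve_in_bytes", O_WRONLY);
	write(ofd, "33554432", 8);
	close(ofd);

	/* Register an eventfd to be notified of oom conditions. */
	efd = eventfd(0, 0);
	ofd = open(MEMCG "/memory.oom_control", O_RDONLY);
	cfd = open(MEMCG "/cgroup.event_control", O_WRONLY);
	snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
	write(cfd, buf, strlen(buf));
	close(cfd);

	for (;;) {
		/* Blocks until the memcg (or the system, for root) is oom. */
		read(efd, &count, sizeof(count));
		/* Policy goes here: kill by priority, free, throttle, ... */
	}
}

The body of that loop is also where the panic_on_oom or
oom_kill_allocating_task policies David mentions could be reimplemented
in userspace.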
>
> I have other customers who don't run in a memcg environment at all;
> they simply reattach all processes to root and delete all other
> memcgs.  These customers are only concerned about system oom conditions
> and want to do something "interesting" before a process is killed.
> Some want to log the VM statistics as an artifact to examine later,
> some want to examine heap profiles, others can start throttling and
> freeing memory rather than kill anything.  All of this is impossible
> today because the kernel oom killer will simply kill something
> immediately and any stats we collect afterwards don't represent the oom
> condition.  The heap profiles are lost, throttling is useless, etc.
>
> Jianguo (cc'd) may also have usecases not described here.
>

I want to log memory usage, like slabinfo, vmalloc info, page-cache
info, etc., before killing anything.  A sketch of what that could look
like is appended below the quoted text.

>> It is unfortunate that this feature is memcg-only.  Surely it could
>> also be used by non-memcg setups.  Would like to see at least a
>> detailed description of how this will all be presented and
>> implemented.  We should aim to make the memcg and non-memcg userspace
>> interfaces and user-visible behaviour as similar as possible.
>
> It's memcg-only because it can handle both system and memcg oom
> conditions with the same clean interface.  It would be possible to
> implement only system oom condition handling through procfs (a little
> sloppy since it needs to register the eventfd), but then a userspace
> oom handler would need to determine which interface to use based on
> whether it was running in a memcg or non-memcg environment.  I
> implemented this feature with userspace in mind: I didn't want it to
> need two different implementations to do the same thing depending on
> memcg.  The way it is written, a userspace oom handler does not know
> (nor need to care) whether it is constrained by the amount of system
> RAM or by a memcg limit.  It can simply write the reserve to its
> memcg's memory.oom_reserve_in_bytes, attach to memory.oom_control, and
> be done.
>
> This does mean that memcg needs to be enabled for the support, though.
> This is already done on most distributions; the cgroup just needs to be
> mounted.  Would it be better to duplicate the interface in two
> different spots depending on CONFIG_MEMCG?  I didn't think so, and I
> think the idea of a userspace library that takes care of this
> registration (and mounting, perhaps) proposed on LWN would be the best
> of both worlds.
>
>> Patches 1, 2, 3 and 5 appear to be independent and useful so I think
>> I'll cherrypick those, OK?
>
> Ok!  I'm hoping that the PF_MEMPOLICY bit that is removed in those
> patches is at least temporarily reserved for PF_OOM_HANDLER introduced
> here; I removed it purposefully :)
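As promised above, a sketch of the logging I have in mind: dump the
relevant /proc files while the oom condition is still live, before any
kill decision is made.  log_file() is a trivial helper invented here
for illustration, and snapshot_oom_state() would be called from the
eventfd loop of a handler like the one sketched earlier in this mail:

#include <stdio.h>

/* Copy a /proc file to stderr (or wherever the artifact should live). */
static void log_file(const char *path)
{
	char line[256];
	FILE *f = fopen(path, "r");

	if (!f)
		return;
	while (fgets(line, sizeof(line), f))
		fputs(line, stderr);
	fclose(f);
}

/* Called on oom notification, before anything is killed. */
static void snapshot_oom_state(void)
{
	log_file("/proc/meminfo");
	log_file("/proc/slabinfo");	/* may require root */
	log_file("/proc/vmallocinfo");	/* root only */
}

Page-cache state is visible in /proc/meminfo (Cached, Dirty,
Writeback); a handler constrained by a memcg limit could read its
memcg's memory.stat instead.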