Re: [RFC v3 0/3] vmpressure_fd: Linux VM pressure notifications

On 11/16/2012 01:25 AM, David Rientjes wrote:
> On Thu, 15 Nov 2012, Anton Vorontsov wrote:
> 
>> Hehe, you're saying that we have to have cgroups=y. :) But some folks were
>> deliberately asking us to make the cgroups optional.
>>
> 
> Enabling just CONFIG_CGROUPS (which is enabled by default) and no other 
> cgroup controller increases the size of the kernel text by less than 0.3% 
> with x86_64 defconfig:
> 
>    text	   data	    bss	    dec	    hex	filename
> 10330039	1038912	1118208	12487159	 be89f7	vmlinux.disabled
> 10360993	1041624	1122304	12524921	 bf1d79	vmlinux.enabled
> 
> I understand that users with minimally-enabled configs for an optimized 
> memory footprint will have a higher percentage because their kernel is 
> already smaller (~1.8% increase for allnoconfig), but I think the cost of 
> enabling the cgroups code to be able to mount a vmpressure cgroup (which 
> I'd rename to be "mempressure" to be consistent with "memcg" but it's only 
> an opinion) is relatively small and allows for a much more maintainable 
> and extendable feature to be included: it already provides the 
> cgroup.event_control interface that supports eventfd, which makes 
> implementation much easier.  It also makes writing a library on top of 
> the cgroup much easier because of the standardization.
> 
> I'm more concerned about what to do with the memcg memory thresholds and 
> whether they can be replaced with this new cgroup.  If so, then we'll have 
> to figure out how to map those triggers to use the new cgroup's interface 
> in a way that doesn't break current users that open and pass the fd of 
> memory.usage_in_bytes to cgroup.event_control for memcg.
> 
>> OK, here is what I can try to do:
>>
>> - Implement memory pressure cgroup as you described, by doing so we'd make
>>   the thing play well with cpusets and memcg;
>>
>> - This will be eventfd()-based;
>>
> 
> Should be based on cgroup.event_control, see how memcg interfaces its 
> memory thresholds with this in Documentation/cgroups/memory.txt.
> 
>> - Once done, we will have a solution for pretty much every major use-case
>>   (i.e. servers, desktops and Android, they all have cgroups enabled);
>>
> 
> Excellent!  I'd be interested in hearing anybody else's opinions, 
> especially those from the memcg world, so we make sure that everybody is 
> happy with the API that you've described.
> 
Just CC'd them all.

My personal take:

Most people hate memcg due to the cost it imposes. I've already
demonstrated that, with some effort, it doesn't necessarily have to be
so (http://lwn.net/Articles/517634/).

The one thing I missed in that work was precisely notifications. If you
can come up with a good notification scheme that *lives* in memcg, but
does not *depend* on the memcg infrastructure, I personally think it
could be a big win.

Doing this in memcg has the advantage that the "per-group" vs "global"
is automatically solved, since the root memcg is just another name for
"global".

I honestly like your low/high/oom scheme better than memcg's
"threshold-in-bytes". I would also point out that those thresholds are
*far* from exact: due to the per-cpu stock charging mechanism they can
be off by as much as O(#cpus). So far, nobody has complained. So in
theory it should be possible to convert memcg to low/high/oom while
still accepting writes in bytes, which would be snapped to the closest
bucket.

Another thing from one of your e-mails, that may shift you in the memcg
direction:

"2. The last time I checked, cgroups memory controller did not (and I
guess still does not) account kernel-owned slabs. I asked several
times why so, but nobody answered."

It should, now, in the latest -mm, although it won't do per-group
reclaim (yet).

I am also failing to see how cpusets would be involved here. I
understand that you may have free memory in terms of size, but still be
further restricted by a cpuset. But I also think that having multiple
entry points for this buys us nothing at all. So the choices I see are:

1) If cpuset + memcg are comounted, take this into account when deciding
low / high / oom. This is yet another advantage over the "threshold in
bytes" interface: you can transparently take other issues into account
while keeping the interface the same.

2) If they are not, just ignore this effect.

The fallback in 2) sounds harsh, but I honestly think this is the price
to pay for the insanity of mounting those things in different
hierarchies, and we do have a plan to bring all those things together
eventually anyway. If you have two cgroups dealing with memory and set
them up in orthogonal ways, I really can't see how we can bring sanity
to that. So just admitting and unleashing the insanity may be better, if
it sharpens our urge to fix it. It worked for Batman, why wouldn't it
work for us?

--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html