Dave Hansen <hansendc@xxxxxxxxxx> writes:

> On Mon, 2007-03-12 at 23:41 +0100, Herbert Poetzl wrote:
>>
>> let me give a real world example here:
>>
>> - typical guest with 600MB disk space
>> - about 100MB guest specific data (not shared)
>> - assumed that 80% of the libs/tools are used
>
> I get the general idea here, but I just don't think those numbers are
> very accurate.  My laptop has a bunch of gunk open (xterm, evolution,
> firefox, xchat, etc...).  I ran this command:
>
> lsof | egrep '/(usr/|lib.*\.so)' | awk '{print $9}' | sort | uniq | xargs du -Dcs
>
> and got:
>
> 113840  total
>
> On a web/database server that I have (ps aux | wc -l == 128), I just ran
> the same:
>
> 39168   total
>
> That's assuming that all of the libraries are fully read in and
> populated, just by their on-disk sizes.  Is that not a reasonable
> measure of the kinds of things that we can expect to be shared in a
> vserver?  If so, it's a long way from 400MB.
>
> Could you try a similar measurement on some of your machines?  Perhaps
> mine are just weird.

Think shell scripts and the like.  From what I have seen I would agree
that it is typical for application code not to dominate application
memory usage.  However, on the flip side it is not uncommon for
application code to dominate disk usage.  Some of us have giant music,
video, or code databases that consume a lot of disk space, but in many
instances servers don't have enormous chunks of private files, and even
when they do they share the files from the distribution.  The result is
that there are a lot of unmapped pages cached in the page cache for
rarely run executables, cached just in case we need them.

So while Herbert's numbers may be a little off, the general principle
that the entire system does better if you can share the page cache is
very real.  That the page cache isn't accounted for here isn't terribly
important; we still get the global benefit.
> I don't doubt this, but doing this two-level page-out thing for
> containers/vservers over their limits is surely something that we
> should consider farther down the road, right?

It is what the current Linux VM does.  There is removing a page from
processes, and then there is writing it out to disk.  I think the
normal term is second chance replacement.  The idea is that once you
remove a page from being mapped you let it age a little before it is
paged back in.  This allows pages in high demand to avoid being written
to disk; all they incur are minor, not major, fault costs.

> It's important to you, but you're obviously not doing any of the
> mainline coding, right?

Tread carefully here.  Herbert may not be doing a lot of mainline
coding or extremely careful review of potential patches, but he does
seem to have a decent grasp of the basic issues, plus a reasonable
amount of experience, so it is worth listening to what he says.  In
addition, Herbert does seem to be doing some testing of the mainline
code as we get it going.  So he is contributing.

>> > What are the consequences if this isn't done?  Doesn't
>> > a loaded system eventually have all of its pages used
>> > anyway, so won't this always be a temporary situation?
>>
>> let's consider a quite limited guest (or several
>> of them) which have a 'RAM' limit of 64MB and
>> additional 64MB of 'virtual swap' assigned ...
>>
>> if they use roughly 96MB (memory footprint) then
>> having this 'fluffy' optimization will keep them
>> running without any effect on the host side, but
>> without, they will continuously swap in and out
>> which will affect not only the host, but also the
>> other guests ...

Ugh.  You really want swap > RAM here, because there are real cases
where you are swapping even though all of your pages in RAM could be
cached in the page cache.  96MB with 64MB RSS and 64MB swap is almost a
sure way to hit your swap page limit and die.
> All workloads that use $limit+1 pages of memory will always pay the
> price, right? :)

They should.  When you remove an anonymous page from the page tables it
needs to be allocated a swap slot and placed in the swap cache.  Once
you do that it can sit in the page cache like any file-backed page.  So
the container that hits $limit+1 should get the paging pressure and a
lot more minor faults.  However, we still want to globally write things
to disk and optimize that as we do right now.

Eric
_______________________________________________
Containers mailing list
Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linux-foundation.org/mailman/listinfo/containers