Re: [PATCH 0/8] idle page tracking / working set estimation

Michel Lespinasse <walken@xxxxxxxxxx> · Thu, 22 Sep 2011 18:23:27 -0700

On Thu, Sep 22, 2011 at 4:13 PM, Andrew Morton <akpm@xxxxxxxxxx> wrote:
> On Fri, 16 Sep 2011 20:39:05 -0700
> Michel Lespinasse <walken@xxxxxxxxxx> wrote:
>
>> Please comment on the following patches (which are against the v3.0 kernel).
>> We are using these to collect memory utilization statistics for each cgroup
>> across many machines, and optimize job placement accordingly.
>
> Please consider updating /proc/kpageflags with the three new page
> flags.  If "yes": update.  If "no": explain/justify.

The PG_stale flag should probably be exported that way. I'll make sure
to add this, thanks for the suggestion!

I am not sure about PG_young and PG_idle since they indicate young
bits have been cleared in PTE(s) pointing to the page since the last
page_referenced() call. This seems rather internal - we don't export
PTE young bits in kpageflags currently, nor do we export anything that
would depend on when page_referenced() was last called.

> Which prompts the obvious: the whole feature could have been mostly
> implemented in userspace, using kpageflags.  Some additional kernel
> support would presumably be needed, but I'm not sure how much.
>
> If you haven't already done so, please sketch down what that
> infrastructure would look like and have a think about which approach is
> preferable?

kpageflags does not currently do a page_referenced() call to export
PTE young flags. For a userspace approach, we would have to add that.
Also we would want to actually clear the PTE young bits so that the
page doesn't show up as young again on the next kpageflags read - and,
we wouldn't want to affect the normal LRU algorithms while doing this,
so we'd end up introducing the same PG_young and PG_idle flags. The
next issue would be to find out which cgroup an idle page belongs to -
this could be done by adding a new kpagecgroup file, I suppose. Given
the above, we'd have the necessary components for a userspace approach
- but, the only part that we would really be able to remove from the
kernel side is the loop that scans physical pages and tallies the idle
ones into a per-cgroup count.
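
For illustration only, the tally loop that would move to userspace could
look roughly like the sketch below. It assumes the flag words and cgroup
ids have already been read into memory, one entry per pfn. The idle bit
position (HYP_KPF_IDLE_BIT) and the kpagecgroup data are hypothetical:
neither exists in kpageflags today.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit position for an "idle" flag in a kpageflags-style word. */
#define HYP_KPF_IDLE_BIT 40

/*
 * Tally idle pages per cgroup.
 *
 * flags[pfn]   64-bit flag word for that pfn (one u64 per page, as in
 *              /proc/kpageflags)
 * cgroup[pfn]  cgroup id owning that pfn (from the hypothetical
 *              kpagecgroup file)
 * idle_count   output array of ncgroups counters, zeroed by the caller
 */
void tally_idle_pages(const uint64_t *flags, const uint32_t *cgroup,
                      size_t npages, uint64_t *idle_count, size_t ncgroups)
{
    const uint64_t idle_mask = UINT64_C(1) << HYP_KPF_IDLE_BIT;

    for (size_t pfn = 0; pfn < npages; pfn++) {
        if (!(flags[pfn] & idle_mask))
            continue;               /* page was referenced, not idle */
        if (cgroup[pfn] < ncgroups) /* skip ids we are not tracking */
            idle_count[cgroup[pfn]]++;
    }
}
```

Everything else (clearing the PTE young bits, keeping the LRU unaffected,
and exporting the owning cgroup) would still have to live in the kernel.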

> What bugs me a bit about the proposal is its cgroups-centricity.  The
> question "how much memory is my application really using" comes up
> again and again.  It predates cgroups.  One way to answer that question
> is to force a massive amount of swapout on the entire machine, then let
> the system recover and take a look at your app's RSS two minutes later.
> This is very lame.
>
> It's a legitimate requirement, and the kstaled infrastructure puts a
> lot of things in place to answer it well.  But as far as I can tell it
> doesn't quite get over the line.  Then again, maybe it _does_ get
> there: put the application into a memcg all of its own, just for
> instrumentation purposes and then use kstaled to monitor it?

Yes, this is what I would recommend in this situation - create a
memory cgroup, move the application into it, and see what kstaled
reports.

> <later> OK, I'm surprised to discover that kstaled is doing a physical
> scan and not a virtual one.  I assume it works, but I don't know why.
> But it makes the above requirement harder, methinks.

The reason for the physical scan is that a virtual scan would have
some limitations:
- it would only report memory that's virtually mapped - we do want
file pages to be classified as idle or not, regardless of how the file
gets accessed
- it may not work well with jobs that involve short lived processes.

> How does all this code get along with hugepages, btw?

They should get along now that Andrea updated get_page and
get_page_unless_zero to avoid the race with THP tail page splitting.

However, you're reminding me that I forgot to include the patch that
would make the accounting correct when we encounter a THP page (we
want to report the entire page as idle rather than just the first 4K,
and increment pfn appropriately for the page size)...
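
As a rough sketch of the intended accounting (this is not the missing
patch itself), the scan would credit a huge page with all of its base
pages and then step past it. The struct page_info type and its fields
are hypothetical stand-ins for what the kernel reads from struct page:

```c
#include <stddef.h>

/* Hypothetical per-pfn summary; entries for tail pages are never examined. */
struct page_info {
    unsigned order; /* 0 for a 4K page, 9 for a 2M THP head on x86 */
    int idle;       /* nonzero if the (whole) page was found idle */
};

/* Return the number of idle base pages found in pages[0..npages). */
size_t count_idle_base_pages(const struct page_info *pages, size_t npages)
{
    size_t idle = 0;
    size_t pfn = 0;

    while (pfn < npages) {
        size_t span = (size_t)1 << pages[pfn].order;

        if (span > npages - pfn)    /* clamp at the end of the range */
            span = npages - pfn;
        if (pages[pfn].idle)
            idle += span;           /* whole page, not just the first 4K */
        pfn += span;                /* skip over the tail pages */
    }
    return idle;
}
```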

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .