Re: [PATCH 13/40] autonuma: CPU follow memory algorithm


 





On 07/01/2012 03:37, Dor Laor wrote:
On 06/30/2012 09:58 AM, Nai Xia wrote:
On Sat, Jun 30, 2012 at 1:48 PM, Dor Laor <dlaor@xxxxxxxxxx> wrote:
On 06/30/2012 05:43 AM, Nai Xia wrote:

On Sat, Jun 30, 2012 at 9:23 AM, Andrea Arcangeli <aarcange@xxxxxxxxxx>
wrote:

On Sat, Jun 30, 2012 at 04:01:50AM +0800, Nai Xia wrote:

On Sat, Jun 30, 2012 at 2:53 AM, Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
wrote:

On Fri, 2012-06-29 at 12:51 -0400, Dor Laor wrote:

The previous comments were not shouts but the mother of all NAKs.


I never said any such thing. I just said why should I bother reading
your stuff if you're ignoring most my feedback anyway.

If you want to read that as a NAK, not my problem.


Hey guys, can I say NAK to these patches?

Now I am aware that this sampling algorithm is completely broken, if
we take a few seconds to see what it is trying to solve:

We all know that the LRU tries to answer the question "what pages were
recently accessed?", so it's enough to use pte bits to approximate that.


I made an example about the active list to try to explain why your
example is still going to work fine.

After it becomes active (from inactive) and it's being a referenced
active page, it won't become _very_active_ or _very_very_active_ or
more no matter how many more times you look up the pagecache.

The LRU order wasn't relevant here.

However, the numa balancing problem is fundamentally like this:

In some time unit,

W = pages_accessed * average_page_access_frequency

We are trying to move the process to the node having max W, right?
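[The placement goal in Nai's formula above can be sketched as follows -- a toy model with hypothetical names, not code from the patch:]

```python
def best_node(per_node_stats):
    """per_node_stats: node id -> (pages_accessed, avg_access_frequency).

    Pick the node maximizing W = pages_accessed * avg_access_frequency,
    i.e. the node the process should be moved to under this model.
    """
    return max(per_node_stats,
               key=lambda n: per_node_stats[n][0] * per_node_stats[n][1])

# A few hot pages can outweigh many cold ones:
stats = {0: (1000, 0.2),   # many pages, rarely touched -> W = 200
         1: (400, 1.0)}    # fewer pages, touched often  -> W = 400
# best_node(stats) -> 1
```

[Nai's objection is precisely that the sampling only estimates the first factor, not the second.]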


First of all, the mm_autonuma statistics are not a function of time,
and there is no page access frequency in there.

mm_autonuma is static information collected by knuma_scand from the
pagetables. That's static and 100% accurate on the whole process and
definitely not generated by the numa hinting page faults. I could shut
off all numa hinting page faults permanently and still generate the
mm_autonuma information identically.

There's a knob in /sys/kernel/mm/autonuma/knuma_scand/working_set that
you can enable if you want to use a "runtime" and not static
information for the mm_autonuma too, but that's not the default for
now (but I think it may be a better default, there wasn't enough time
to test this yet)
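[As a toy model of what the static per-node statistics look like -- hypothetical names; knuma_scand itself walks pagetables in the kernel, this only illustrates the idea of counting a process's pages per NUMA node independently of any fault sampling:]

```python
from collections import Counter

def mm_stats(page_to_node):
    """page_to_node: page address -> NUMA node id, as a scanner would
    discover it from the pagetables.  The result is a static per-node
    page count, independent of the numa hinting page faults."""
    return Counter(page_to_node.values())

pages = {0x1000: 0, 0x2000: 0, 0x3000: 1}
# mm_stats(pages) -> Counter({0: 2, 1: 1})
```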

The task_autonuma (thread) statistics are the only thing that is
sampled by default, at a 10sec interval (the interval is tunable via
sysfs too, and 10sec is likely too aggressive, 30sec sounds better;
we're eventually going to make it dynamic anyway)

So even if you were right, the thread statistics only kick in to
balance threads against threads of the same process; most of the time
what's more important are the mm_autonuma statistics.

But in reality the thread statistics also works perfectly for the job,
as an approximation of the NUMA memory footprint of the thread (vs the
other threads). And then the rest of the memory slowly follows
whatever node CPUs I placed the thread (even if that's not the
absolutely best one at all times).

Andrea's patch can only approximate the pages_accessed number in a
time unit (the scan interval); I don't think it can catch even 1% of
average_page_access_frequency on a busy workload. Blindly assuming
that all pages' average_page_access_frequency is the same seems
broken to me.


All we need is an approximation to take a better than random decision,
even if you get it 1% right, it's still better than 0% right by going
blind. Your 1% is too pessimistic, in my tests the thread statistics
are more like >90% correct in average (I monitor them with the debug
mode constantly).

If this 1% right happens once in a million samples, who cares, it's
not going to run measurably slower anyway (and it will still be better
than picking a 0% right node).

What you're saying is that because the active list in the pagecache
won't differentiate between 10 cache hits and 20 cache hits, we should
drop the active list, stop activating pages and just treat them all
the same, because in some unlucky access pattern the active list
may only get right 1% of the working set. But there's a reason why the
active list exists despite it may get things wrong in some corner case
and possibly leave the large amount of pages accessed infrequently in
the inactive list forever (even if it gets things only 1% right in
those worst cases, it's still better than 0% right and no active list
at all).

To say it in another way, you may still crash with the car even if
you're careful, but do you think it's better to watch at the street or
to drive blindfolded?

numa/sched drives blindfolded, autonuma watches around every 10sec
very carefully for the best next turn to take with the car and to
avoid obstacles, you can imagine who wins.

Watching the street carefully every 10sec doesn't mean the next moment
a missile won't hit your car to make you crash, you're still having
better chances not to crash than by driving blindfolded.

numa/sched pretends to compete without collecting information for the
NUMA thread memory footprint (task_autonuma, sampled with an
exponential backoff at 10sec intervals), and without process
information (full static information from the pagetables, not
sampled). No matter how you compute stuff, if you've nothing
meaningful in input to your algorithm you lose. And it looks like you
believe that you can take better decisions with nothing in input to
your NUMA placement algorithm, because my thread info (task_autonuma)
isn't 100% perfect at all times and it can't predict the future. The
alternative is to get that information from syscalls, but even
ignoring the -ENOMEM from split_vma, that will lead to userland bugs
and overall the task_autonuma information may be more reliable in the
end, even if it's sampled using an exponential backoff.

Also note the exponential backoff thing, it's not really the last
interval, it's the last interval plus half the previous interval plus
1/4 the previous interval etc... and we can trivially control the
decay.
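[That decay can be sketched as a simple exponentially weighted sum -- a sketch with hypothetical names, not the kernel code; the halving divisor is the decay knob mentioned above:]

```python
def backoff_update(old_total, new_sample):
    # Halve the accumulated history, then add the latest interval's
    # sample, so after n intervals the total is
    #   sample_n + sample_{n-1}/2 + sample_{n-2}/4 + ...
    return old_total / 2 + new_sample

total = 0.0
for sample in (8, 8, 8):          # three identical 10sec intervals
    total = backoff_update(total, sample)
# total == 8 + 8/2 + 8/4 == 14.0
```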

All we need is to get a direction and knowing _exactly_ what the task
did over the last 10 seconds (even if it can't predict the future of
what the thread will do in the next 1sec), is all we need to get a
direction. After we take the direction then the memory will follow so
we cannot care less what it does in the next second because that will
follow the CPU (after a while, last_nid anti-false-sharing logic
permitting), and at least we'll know for sure that the memory accessed
in the last 10sec is already local and that defines the best node to
schedule the thread.

I don't mean there's no room for improvement in the way the input data
can be computed, and even in the way the input data can be generated,
the exponential backoff decay can be tuned too, I just tried to do the
simplest computations on the data to make the workloads converge fast
and you're welcome to contribute.

But I believe the task_autonuma information is extremely valuable and
we can trust it very much knowing we'll get a great placement. The
concern you have isn't invalid, but it's a very minor one and the
sampling rate effects you are concerned about, while real, they're
lost in the noise in practice.


Well, I am not convinced by all these words. And I will surely
NOT follow your reasoning of "having information is always better
than nothing". We all know that a badly biased balancing is worse
than randomness: at least randomness means average, fair play, ...
With all these uncertain things, I think only a comprehensive survey
of real-world workloads can tell whether my concern is significant or not.

So I think my suggestion to you is: show the world some solid and sound

^^^^^^^^^^^^^
real-world proof that your approximation is >90% accurate, just like


The cover letter contained a link to the performance:
https://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120530.pdf

Yes, I saw this. But if you consider this already a solid and
comprehensive proof, you win; I have no other words to say.

No one says there is a proof; on the contrary, I said it's possible to beat any heuristic algorithm, and Andrea explained the LRU is one too.

You asked above for real world example and that's what Andrea was trying to achieve (note that it includes tiny regression w/ parallel kernel compile on tmpfs).



It includes specJBB, kernel build, a cpu hog in guests, and a handful
of unit tests.

I'm sure anyone can beat most kernel algorithms with some pathological case
including LRU and CFS. The only way to improve the numa balancing stuff is

As I already said, the pathological cases for LRU have been well
understood for decades, so they are quite valid to ignore, and every
programmer has been taught to avoid them. And this conclusion took
much time from many many talented brains.

Who are these programmers you talk about? The average Java programmer is clueless w.r.t. memory allocation.
Even w/ KVM we have the issue of a double swap storm when both the guest and the host have the same page on their LRU lists.


But the problem with this algorithm is not. And you are drawing a
hasty conclusion about it without bothering to do comprehensive research.

"Collect the data from a wide range of pages occasionally, and then do
condensed computing on a small set of pages" looks like a very common
practice to me. But again, if you simply label this as "minor",
I have no other words to say.

to sample more, meaning faulting more == larger overhead.

Are you sure you really want to compete on sampling speed with
CPU-intensive workloads?

You didn't understand my point - I was saying exactly this - it's not worth sampling more because it carries a huge overhead.
Pls don't be that fast on the 'send' trigger :)


OK, I think I'll stop discussing this topic now. Without strict and
comprehensive research on this topic, further arguments seem to me to
be purely based on imagination.

And I have no interest in beating any of your fancy algorithms; it
wouldn't bring me 1G$. I am just curious about the truth.

No one said it's fancy besides you. I actually proposed a way to relax such false migrations in my previous reply.


Basically, anyone has the right to laugh, if W = x * y and you only

Laughing while discussing numa code is underestimated!
Let's not continue to spam the list, I think we've all made our points,
Dor

Note, my laughing is based on your attitude of "anyone can beat
anything", which is also underestimated. I remember that despite my
criticism of the sampling I was quite friendly to Andrea, so why the
acidness/bitterness?

But you are right. We all made our points. Let's stop it.


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

