Re: [PATCH 13/40] autonuma: CPU follow memory algorithm

Nai Xia <nai.xia@xxxxxxxxx> · Sat, 30 Jun 2012 10:43:41 +0800

On Sat, Jun 30, 2012 at 9:23 AM, Andrea Arcangeli <aarcange@xxxxxxxxxx> wrote:
> On Sat, Jun 30, 2012 at 04:01:50AM +0800, Nai Xia wrote:
>> On Sat, Jun 30, 2012 at 2:53 AM, Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:
>> > On Fri, 2012-06-29 at 12:51 -0400, Dor Laor wrote:
>> >> The previous comments were not shouts but the mother of all NAKs.
>> >
>> > I never said any such thing. I just said why should I bother reading
>> > your stuff if you're ignoring most my feedback anyway.
>> >
>> > If you want to read that as a NAK, not my problem.
>>
>> Hey guys, Can I say NAK to these patches ?
>>
>> Now I aware that this sampling algorithm is completely broken, if we take
>> a few seconds to see what it is trying to solve:
>>
>> We all know that LRU is try to solve the question of "what are the
>> pages recently accessed?",
>> so its engouth to use pte bits to approximate.
>
> I made an example about the active list to try to explain it why your
> example is still going to work fine.
>
> After it becomes active (from inactive) and it's being a referenced
> active page, it won't become _very_active_ or _very_very_active_ or
> more no matter how many more times you look up the pagecache.
>
> The LRU order wasn't relevant here.
>
>> However, the numa balancing problem is fundamentally like this:
>>
>> In some time unit,
>>
>>       W = pages_accessed  *  average_page_access_frequence
>>
>> We are trying to move process to the node having max W,  right?
>
> First of all, the mm_autonuma statistics are not in function of time
> and there is no page access frequency there.
>
> mm_autonuma is static information collected by knuma_scand from the
> pagetables. That's static and 100% accurate on the whole process and
> definitely not generated by the numa hinting page faults. I could shut
> off all numa hinting page faults permanently and still generate the
> mm_autonuma information identically.
>
> There's a knob in /sys/kernel/mm/autonuma/knuma_scand/working_set that
> you can enable if you want to use a "runtime" and not static
> information for the mm_autonuma too, but that's not the default for
> now (but I think it may be a better default, there wasn't enough time
> to test this yet)
>
> The task_autonuma (thread) statistics are the only thing that is
> sampled by default in a 10sec interval (the interval tunable too with
> sysfs, and 10sec is likely too aggressive, 30sec sounds better, we're
> eventually going to make it dynamic anyway)
>
> So even if you were right, the thread statistics only kicks in to
> balance threads against threads of the same process, most of the time
> what's more important are the mm_autonuma statistics.
>
> But in reality the thread statistics also works perfectly for the job,
> as an approximation of the NUMA memory footprint of the thread (vs the
> other threads). And then the rest of the memory slowly follows
> whatever node CPUs I placed the thread (even if that's not the
> absolutely best one at all times).
>
>> Andrea's patch can only approximate the pages_accessed number in a
>> time unit(scan interval),
>> I don't think it can catch even 1% of  average_page_access_frequence
>> on a busy workload.
>> Blindly assuming that all the pages'  average_page_access_frequence is
>> the same is seemly
>> broken to me.
>
> All we need is an approximation to take a better than random decision,
> even if you get it 1% right, it's still better than 0% right by going
> blind. Your 1% is too pessimistic, in my tests the thread statistics
> are more like >90% correct in average (I monitor them with the debug
> mode constantly).
>
> If this 1% right, happens one a million samples, who cares, it's not
> going to run measurably slower anyway (and it will still be better
> than picking a 0% right node).
>
> What you're saying is that because the active list in the pagecache
> won't differentiate between 10 cache hits and 20 cache hits, we should
> drop the active list and stop activating pages and just threat them
> all the same because in some unlucky access pattern, the active list
> may only get right 1% of the working set. But there's a reason why the
> active list exists despite it may get things wrong in some corner case
> and possibly leave the large amount of pages accessed infrequently in
> the inactive list forever (even if it gets things only 1% right in
> those worst cases, it's still better than 0% right and no active list
> at all).
>
> To say it in another way, you may still crash with the car even if
> you're careful, but do you think it's better to watch at the street or
> to drive blindfolded?
>
> numa/sched drives blindfolded, autonuma watches around every 10sec
> very carefully for the best next turn to take with the car and to
> avoid obstacles, you can imagine who wins.
>
> Watching the street carefully every 10sec doesn't mean the next moment
> a missile won't hit your car to make you crash, you're still having
> better chances not to crash than by driving blindfolded.
>
> numa/sched pretends to compete without collecting information for the
> NUMA thread memory footprint (task_autonuma, sampled with a
> exponential backoff at 10sec intervals), and without process
> information (full static information from the pagetables, not
> sampled). No matter how you compute stuff, if you've nothing
> meaningful in input to your algorithm you lose. And it looks like you
> believe that you can take better decisions with nothing in input to
> your NUMA placement algorithm, because my thread info (task_autonuma)
> isn't 100% perfect at all times and it can't predict the future. The
> alternative is to get that information from syscalls, but even
> ignoring the -ENOMEM from split_vma, that will lead to userland bugs
> and overall the task_autonuma information may be more reliable in the
> end, even if it's sampled using an exponential backoff.
>
> Also note the exponential backoff thing, it's not really the last
> interval, it's the last interval plus half the previous interval plus
> 1/4 the previous interval etc... and we can trivially control the
> decay.
>
> All we need is to get a direction and knowing _exactly_ what the task
> did over the last 10 seconds (even if it can't predict the future of
> what the thread will do in the next 1sec), is all we need to get a
> direction. After we take the direction then the memory will follow so
> we cannot care less what it does in the next second because that will
> follow the CPU (after a while, last_nid anti-false-sharing logic
> permitting), and at least we'll know for sure that the memory accessed
> in the last 10sec is already local and that defines the best node to
> schedule the thread.
>
> I don't mean there's no room for improvement in the way the input data
> can be computed, and even in the way the input data can be generated,
> the exponential backoff decay can be tuned too, I just tried to do the
> simplest computations on the data to make the workloads converge fast
> and you're welcome to contribute.
>
> But I believe the task_autonuma information is extremely valuable and
> we can trust it very much knowing we'll get a great placement. The
> concern you have isn't invalid, but it's a very minor one and the
> sampling rate effects you are concerned about, while real, they're
> lost in the noise in practice.

Well, I think I am not convinced by your this many words. And surely
I  will NOT follow your reasoning of "Having information is always
good than nothing".  We all know that  an illy biased balancing is worse
than randomness:  at least randomness means "average, fair play, ...".
With all uncertain things, I think only a comprehensive survey
of real world workloads can tell if my concern is significant or not.

So I think my suggestion to you is:  Show world some solid and sound
real world proof that your approximation is > 90% accurate, just like
the pioneers already did to LRU(This problem is surely different from
LRU. ).  Tons of words, will not do this.

Thanks,

Nai

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href