Re: [PATCH 13/40] autonuma: CPU follow memory algorithm

Nai Xia <nai.xia@xxxxxxxxx> · Sat, 30 Jun 2012 16:23:03 +0800

On Sat, Jun 30, 2012 at 1:48 PM, Dor Laor <dlaor@xxxxxxxxxx> wrote:
> On 06/30/2012 05:43 AM, Nai Xia wrote:
>>
>> On Sat, Jun 30, 2012 at 9:23 AM, Andrea Arcangeli <aarcange@xxxxxxxxxx>
>> wrote:
>>>
>>> On Sat, Jun 30, 2012 at 04:01:50AM +0800, Nai Xia wrote:
>>>>
>>>> On Sat, Jun 30, 2012 at 2:53 AM, Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
>>>> wrote:
>>>>>
>>>>> On Fri, 2012-06-29 at 12:51 -0400, Dor Laor wrote:
>>>>>>
>>>>>> The previous comments were not shouts but the mother of all NAKs.
>>>>>
>>>>>
>>>>> I never said any such thing. I just said why should I bother reading
>>>>> your stuff if you're ignoring most my feedback anyway.
>>>>>
>>>>> If you want to read that as a NAK, not my problem.
>>>>
>>>>
>>>> Hey guys, Can I say NAK to these patches ?
>>>>
>>>> Now I aware that this sampling algorithm is completely broken, if we
>>>> take
>>>> a few seconds to see what it is trying to solve:
>>>>
>>>> We all know that LRU is try to solve the question of "what are the
>>>> pages recently accessed?",
>>>> so its engouth to use pte bits to approximate.
>>>
>>>
>>> I made an example about the active list to try to explain it why your
>>> example is still going to work fine.
>>>
>>> After it becomes active (from inactive) and it's being a referenced
>>> active page, it won't become _very_active_ or _very_very_active_ or
>>> more no matter how many more times you look up the pagecache.
>>>
>>> The LRU order wasn't relevant here.
>>>
>>>> However, the numa balancing problem is fundamentally like this:
>>>>
>>>> In some time unit,
>>>>
>>>>       W = pages_accessed  *  average_page_access_frequence
>>>>
>>>> We are trying to move process to the node having max W,  right?
>>>
>>>
>>> First of all, the mm_autonuma statistics are not in function of time
>>> and there is no page access frequency there.
>>>
>>> mm_autonuma is static information collected by knuma_scand from the
>>> pagetables. That's static and 100% accurate on the whole process and
>>> definitely not generated by the numa hinting page faults. I could shut
>>> off all numa hinting page faults permanently and still generate the
>>> mm_autonuma information identically.
>>>
>>> There's a knob in /sys/kernel/mm/autonuma/knuma_scand/working_set that
>>> you can enable if you want to use a "runtime" and not static
>>> information for the mm_autonuma too, but that's not the default for
>>> now (but I think it may be a better default, there wasn't enough time
>>> to test this yet)
>>>
>>> The task_autonuma (thread) statistics are the only thing that is
>>> sampled by default in a 10sec interval (the interval tunable too with
>>> sysfs, and 10sec is likely too aggressive, 30sec sounds better, we're
>>> eventually going to make it dynamic anyway)
>>>
>>> So even if you were right, the thread statistics only kicks in to
>>> balance threads against threads of the same process, most of the time
>>> what's more important are the mm_autonuma statistics.
>>>
>>> But in reality the thread statistics also works perfectly for the job,
>>> as an approximation of the NUMA memory footprint of the thread (vs the
>>> other threads). And then the rest of the memory slowly follows
>>> whatever node CPUs I placed the thread (even if that's not the
>>> absolutely best one at all times).
>>>
>>>> Andrea's patch can only approximate the pages_accessed number in a
>>>> time unit(scan interval),
>>>> I don't think it can catch even 1% of  average_page_access_frequence
>>>> on a busy workload.
>>>> Blindly assuming that all the pages'  average_page_access_frequence is
>>>> the same is seemly
>>>> broken to me.
>>>
>>>
>>> All we need is an approximation to take a better than random decision,
>>> even if you get it 1% right, it's still better than 0% right by going
>>> blind. Your 1% is too pessimistic, in my tests the thread statistics
>>> are more like >90% correct in average (I monitor them with the debug
>>> mode constantly).
>>>
>>> If this 1% right, happens one a million samples, who cares, it's not
>>> going to run measurably slower anyway (and it will still be better
>>> than picking a 0% right node).
>>>
>>> What you're saying is that because the active list in the pagecache
>>> won't differentiate between 10 cache hits and 20 cache hits, we should
>>> drop the active list and stop activating pages and just threat them
>>> all the same because in some unlucky access pattern, the active list
>>> may only get right 1% of the working set. But there's a reason why the
>>> active list exists despite it may get things wrong in some corner case
>>> and possibly leave the large amount of pages accessed infrequently in
>>> the inactive list forever (even if it gets things only 1% right in
>>> those worst cases, it's still better than 0% right and no active list
>>> at all).
>>>
>>> To say it in another way, you may still crash with the car even if
>>> you're careful, but do you think it's better to watch at the street or
>>> to drive blindfolded?
>>>
>>> numa/sched drives blindfolded, autonuma watches around every 10sec
>>> very carefully for the best next turn to take with the car and to
>>> avoid obstacles, you can imagine who wins.
>>>
>>> Watching the street carefully every 10sec doesn't mean the next moment
>>> a missile won't hit your car to make you crash, you're still having
>>> better chances not to crash than by driving blindfolded.
>>>
>>> numa/sched pretends to compete without collecting information for the
>>> NUMA thread memory footprint (task_autonuma, sampled with a
>>> exponential backoff at 10sec intervals), and without process
>>> information (full static information from the pagetables, not
>>> sampled). No matter how you compute stuff, if you've nothing
>>> meaningful in input to your algorithm you lose. And it looks like you
>>> believe that you can take better decisions with nothing in input to
>>> your NUMA placement algorithm, because my thread info (task_autonuma)
>>> isn't 100% perfect at all times and it can't predict the future. The
>>> alternative is to get that information from syscalls, but even
>>> ignoring the -ENOMEM from split_vma, that will lead to userland bugs
>>> and overall the task_autonuma information may be more reliable in the
>>> end, even if it's sampled using an exponential backoff.
>>>
>>> Also note the exponential backoff thing, it's not really the last
>>> interval, it's the last interval plus half the previous interval plus
>>> 1/4 the previous interval etc... and we can trivially control the
>>> decay.
>>>
>>> All we need is to get a direction and knowing _exactly_ what the task
>>> did over the last 10 seconds (even if it can't predict the future of
>>> what the thread will do in the next 1sec), is all we need to get a
>>> direction. After we take the direction then the memory will follow so
>>> we cannot care less what it does in the next second because that will
>>> follow the CPU (after a while, last_nid anti-false-sharing logic
>>> permitting), and at least we'll know for sure that the memory accessed
>>> in the last 10sec is already local and that defines the best node to
>>> schedule the thread.
>>>
>>> I don't mean there's no room for improvement in the way the input data
>>> can be computed, and even in the way the input data can be generated,
>>> the exponential backoff decay can be tuned too, I just tried to do the
>>> simplest computations on the data to make the workloads converge fast
>>> and you're welcome to contribute.
>>>
>>> But I believe the task_autonuma information is extremely valuable and
>>> we can trust it very much knowing we'll get a great placement. The
>>> concern you have isn't invalid, but it's a very minor one and the
>>> sampling rate effects you are concerned about, while real, they're
>>> lost in the noise in practice.
>>
>>
>> Well, I think I am not convinced by your this many words. And surely
>> I  will NOT follow your reasoning of "Having information is always
>> good than nothing".  We all know that  an illy biased balancing is worse
>> than randomness:  at least randomness means "average, fair play, ...".
>> With all uncertain things, I think only a comprehensive survey
>> of real world workloads can tell if my concern is significant or not.
>>
>> So I think my suggestion to you is:  Show world some solid and sound
>> real world proof that your approximation is > 90% accurate, just like
>
>
> The cover letter contained a link to the performance:
> https://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120530.pdf
>
> It includes, specJbb, kernelbuild, cpuHog in guests, and handful of units
> tests.
>
> I'm sure anyone can beat most kernel algorithm with some pathological case
> including LRU and CFS. The only way to improve the numa balancing stuff is
> to sample more, meaning faulting more == larger overhead.
>
> Maybe its worth to add a measurement that if we've done too many bounding of
> a particular page to stop scan that page for a while. It's an optimization
> that needs to be prove it worth in real life.
>

Oh, sorry, I think I forgot few last comments in my last post:

In case you really can take my advice and do comprehensive research,
try to make sure that you compare the result of your fancy sampling algorithm
with this simple logic:

   "Blindly select a node and bind the process and move all pages to it."

Stupid it may sound, I highly suspect it can approach the benchmarks
you already did.

If that's really the truth, then all the sampling and weighting stuff can
be cut off.

Thanks,

Nai

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href