On Sat, Jun 30, 2012 at 1:48 PM, Dor Laor <dlaor@xxxxxxxxxx> wrote: > On 06/30/2012 05:43 AM, Nai Xia wrote: >> >> On Sat, Jun 30, 2012 at 9:23 AM, Andrea Arcangeli <aarcange@xxxxxxxxxx> >> wrote: >>> >>> On Sat, Jun 30, 2012 at 04:01:50AM +0800, Nai Xia wrote: >>>> >>>> On Sat, Jun 30, 2012 at 2:53 AM, Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> >>>> wrote: >>>>> >>>>> On Fri, 2012-06-29 at 12:51 -0400, Dor Laor wrote: >>>>>> >>>>>> The previous comments were not shouts but the mother of all NAKs. >>>>> >>>>> >>>>> I never said any such thing. I just said why should I bother reading >>>>> your stuff if you're ignoring most my feedback anyway. >>>>> >>>>> If you want to read that as a NAK, not my problem. >>>> >>>> >>>> Hey guys, Can I say NAK to these patches ? >>>> >>>> Now I aware that this sampling algorithm is completely broken, if we >>>> take >>>> a few seconds to see what it is trying to solve: >>>> >>>> We all know that LRU is try to solve the question of "what are the >>>> pages recently accessed?", >>>> so its engouth to use pte bits to approximate. >>> >>> >>> I made an example about the active list to try to explain it why your >>> example is still going to work fine. >>> >>> After it becomes active (from inactive) and it's being a referenced >>> active page, it won't become _very_active_ or _very_very_active_ or >>> more no matter how many more times you look up the pagecache. >>> >>> The LRU order wasn't relevant here. >>> >>>> However, the numa balancing problem is fundamentally like this: >>>> >>>> In some time unit, >>>> >>>> W = pages_accessed * average_page_access_frequence >>>> >>>> We are trying to move process to the node having max W, right? >>> >>> >>> First of all, the mm_autonuma statistics are not in function of time >>> and there is no page access frequency there. >>> >>> mm_autonuma is static information collected by knuma_scand from the >>> pagetables. That's static and 100% accurate on the whole process and >>> definitely not generated by the numa hinting page faults. I could shut >>> off all numa hinting page faults permanently and still generate the >>> mm_autonuma information identically. >>> >>> There's a knob in /sys/kernel/mm/autonuma/knuma_scand/working_set that >>> you can enable if you want to use a "runtime" and not static >>> information for the mm_autonuma too, but that's not the default for >>> now (but I think it may be a better default, there wasn't enough time >>> to test this yet) >>> >>> The task_autonuma (thread) statistics are the only thing that is >>> sampled by default in a 10sec interval (the interval tunable too with >>> sysfs, and 10sec is likely too aggressive, 30sec sounds better, we're >>> eventually going to make it dynamic anyway) >>> >>> So even if you were right, the thread statistics only kicks in to >>> balance threads against threads of the same process, most of the time >>> what's more important are the mm_autonuma statistics. >>> >>> But in reality the thread statistics also works perfectly for the job, >>> as an approximation of the NUMA memory footprint of the thread (vs the >>> other threads). And then the rest of the memory slowly follows >>> whatever node CPUs I placed the thread (even if that's not the >>> absolutely best one at all times). >>> >>>> Andrea's patch can only approximate the pages_accessed number in a >>>> time unit(scan interval), >>>> I don't think it can catch even 1% of average_page_access_frequence >>>> on a busy workload. >>>> Blindly assuming that all the pages' average_page_access_frequence is >>>> the same is seemly >>>> broken to me. >>> >>> >>> All we need is an approximation to take a better than random decision, >>> even if you get it 1% right, it's still better than 0% right by going >>> blind. Your 1% is too pessimistic, in my tests the thread statistics >>> are more like >90% correct in average (I monitor them with the debug >>> mode constantly). >>> >>> If this 1% right, happens one a million samples, who cares, it's not >>> going to run measurably slower anyway (and it will still be better >>> than picking a 0% right node). >>> >>> What you're saying is that because the active list in the pagecache >>> won't differentiate between 10 cache hits and 20 cache hits, we should >>> drop the active list and stop activating pages and just threat them >>> all the same because in some unlucky access pattern, the active list >>> may only get right 1% of the working set. But there's a reason why the >>> active list exists despite it may get things wrong in some corner case >>> and possibly leave the large amount of pages accessed infrequently in >>> the inactive list forever (even if it gets things only 1% right in >>> those worst cases, it's still better than 0% right and no active list >>> at all). >>> >>> To say it in another way, you may still crash with the car even if >>> you're careful, but do you think it's better to watch at the street or >>> to drive blindfolded? >>> >>> numa/sched drives blindfolded, autonuma watches around every 10sec >>> very carefully for the best next turn to take with the car and to >>> avoid obstacles, you can imagine who wins. >>> >>> Watching the street carefully every 10sec doesn't mean the next moment >>> a missile won't hit your car to make you crash, you're still having >>> better chances not to crash than by driving blindfolded. >>> >>> numa/sched pretends to compete without collecting information for the >>> NUMA thread memory footprint (task_autonuma, sampled with a >>> exponential backoff at 10sec intervals), and without process >>> information (full static information from the pagetables, not >>> sampled). No matter how you compute stuff, if you've nothing >>> meaningful in input to your algorithm you lose. And it looks like you >>> believe that you can take better decisions with nothing in input to >>> your NUMA placement algorithm, because my thread info (task_autonuma) >>> isn't 100% perfect at all times and it can't predict the future. The >>> alternative is to get that information from syscalls, but even >>> ignoring the -ENOMEM from split_vma, that will lead to userland bugs >>> and overall the task_autonuma information may be more reliable in the >>> end, even if it's sampled using an exponential backoff. >>> >>> Also note the exponential backoff thing, it's not really the last >>> interval, it's the last interval plus half the previous interval plus >>> 1/4 the previous interval etc... and we can trivially control the >>> decay. >>> >>> All we need is to get a direction and knowing _exactly_ what the task >>> did over the last 10 seconds (even if it can't predict the future of >>> what the thread will do in the next 1sec), is all we need to get a >>> direction. After we take the direction then the memory will follow so >>> we cannot care less what it does in the next second because that will >>> follow the CPU (after a while, last_nid anti-false-sharing logic >>> permitting), and at least we'll know for sure that the memory accessed >>> in the last 10sec is already local and that defines the best node to >>> schedule the thread. >>> >>> I don't mean there's no room for improvement in the way the input data >>> can be computed, and even in the way the input data can be generated, >>> the exponential backoff decay can be tuned too, I just tried to do the >>> simplest computations on the data to make the workloads converge fast >>> and you're welcome to contribute. >>> >>> But I believe the task_autonuma information is extremely valuable and >>> we can trust it very much knowing we'll get a great placement. The >>> concern you have isn't invalid, but it's a very minor one and the >>> sampling rate effects you are concerned about, while real, they're >>> lost in the noise in practice. >> >> >> Well, I think I am not convinced by your this many words. And surely >> I will NOT follow your reasoning of "Having information is always >> good than nothing". We all know that an illy biased balancing is worse >> than randomness: at least randomness means "average, fair play, ...". >> With all uncertain things, I think only a comprehensive survey >> of real world workloads can tell if my concern is significant or not. >> >> So I think my suggestion to you is: Show world some solid and sound >> real world proof that your approximation is > 90% accurate, just like > > > The cover letter contained a link to the performance: > https://www.kernel.org/pub/linux/kernel/people/andrea/autonuma/autonuma_bench-20120530.pdf > > It includes, specJbb, kernelbuild, cpuHog in guests, and handful of units > tests. > > I'm sure anyone can beat most kernel algorithm with some pathological case > including LRU and CFS. The only way to improve the numa balancing stuff is > to sample more, meaning faulting more == larger overhead. > > Maybe its worth to add a measurement that if we've done too many bounding of > a particular page to stop scan that page for a while. It's an optimization > that needs to be prove it worth in real life. > Oh, sorry, I think I forgot few last comments in my last post: In case you really can take my advice and do comprehensive research, try to make sure that you compare the result of your fancy sampling algorithm with this simple logic: "Blindly select a node and bind the process and move all pages to it." Stupid it may sound, I highly suspect it can approach the benchmarks you already did. If that's really the truth, then all the sampling and weighting stuff can be cut off. Thanks, Nai -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href