Hi Rik,

On Thu, Jul 05, 2012 at 02:07:06PM -0400, Rik van Riel wrote:
> Once the first thread gets a NUMA pagefault on a
> particular page, the page is made present in the
> page tables and NO OTHER THREAD will get NUMA
> page faults.

Oh this is a great question, thanks for raising it.

> That means when trying to compare the weighting
> of NUMA accesses between different threads in a
> 10 second interval, we only know THE FIRST FAULT.

The task_autonuma statistics don't include only the first fault every
10sec: by the time we compare stuff in the scheduler we have a trail
of fault history that decays with an exponential backoff (it can be
tuned to decay slower, right now it goes down pretty aggressively
with a shift right).

static void cpu_follow_memory_pass(struct task_struct *p,
				   struct task_autonuma *task_autonuma,
				   unsigned long *task_numa_fault)
{
	int nid;

	for_each_node(nid)
		task_numa_fault[nid] >>= 1;
	task_autonuma->task_numa_fault_tot >>= 1;
}

So depending on which thread is the first at every pass, that
information will accumulate fine over multiple task_autonuma
structures if the fault counts are large.

> We have no information on whether any other threads
> tried to access the same page, because we do not
> get faults more frequently.

If all threads access the same pages and there is false sharing, the
last_nid logic will eventually trigger. You're right that maybe for
two or three passes in a row it's the same thread getting to the same
page first, but eventually another thread will get there and the
last_nid will change. The more nodes and the more threads, the less
likely it is that the same thread gets there first every time. The
moment another thread on a different node accesses the page, any
pending migration is aborted if the page is still in the
page_autonuma LRU, and the autonuma_last_nid will have to be
reconfirmed before we migrate the page anywhere again.

> Not only do we not get use frequency information,
> we may not get the information on which threads use
> which memory, at all.
>
> Somehow Andrea's code still seems to work.

It works when the process fits in the node, because we use the
mm_autonuma statistics when comparing the percentage of memory
utilization per node (the so-called w_other/w_nid/w_cpu_nid) of
threads belonging to different processes. This alone solves all false
sharing if the process fits in the node, so the above issue becomes
irrelevant (we have already converged without using task_autonuma).

Now if the process doesn't fit in the node and there is false
sharing, that sharing will be accounted with a smaller factor, and it
will be accounted for in its original memory location thanks to the
last_nid logic. The memory will not be migrated because of the
last_nid logic (statistically speaking). Some spillover will
definitely materialize, but it won't be significant as long as the
NUMA thrashing is not enormous. If the NUMA thrashing is unlimited,
well, that workload is impossible to optimize, it's impossible to
converge anywhere, and the best would be to do MADV_INTERLEAVE.

But note that we're only talking about memory with page_mapcount=1
here: shared memory will never generate a single migration spillover
or NUMA hinting page fault.
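For clarity, the last_nid filter I keep referring to boils down to
something like this sketch (simplified, not the exact kernel code,
and the helper name is made up; only the autonuma_last_nid field
matches the real structure): a page only becomes a migration
candidate after two NUMA hinting faults in a row from the same node,
so falsely shared pages bouncing between nodes statistically never
pass the check.

/* Sketch only: confirm migration on two same-node faults in a row. */
static bool last_nid_confirms_migration(struct page_autonuma *page_autonuma,
					int this_nid)
{
	if (page_autonuma->autonuma_last_nid != this_nid) {
		/* first fault from this node: record it, don't migrate yet */
		page_autonuma->autonuma_last_nid = this_nid;
		return false;
	}
	/* two faults in a row from the same node: migration confirmed */
	return true;
}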
> How much sense does the following code still make,
> considering we may never get all the info on which
> threads use which memory?

It is required to handle the case of threads that have local memory
when the threads don't fit in a single node. That is the only case
involving more threads than CPUs in the node that we can perfectly
converge. This scenario is handled optimally thanks to the
mm == p->mm code below.

There can be false sharing too, as long as there is some local memory
to converge on (it may be impossible to converge on the falsely
shared regions even if we accounted them more aggressively). I don't
exclude that the reduced accounting of falsely shared memory you are
asking about may actually be beneficial. The more threads are
involved in the false sharing, the more the accounting of the falsely
shared regions will be reduced, and that may help to converge without
mistakes. The more threads are involved in the false sharing, the
more likely it is impossible to converge on the falsely shared
memory.

Last but not least, this is what the hardware gives us. It looks like
good enough information to me, but I'm just trying to live with the
only information we can collect from the hardware efficiently.

> +		/*
> +		 * Generate the w_nid/w_cpu_nid from the
> +		 * pre-computed mm/task_numa_weight[] and
> +		 * compute w_other using the w_m/w_t info
> +		 * collected from the other process.
> +		 */
> +		if (mm == p->mm) {
> +			if (w_t > w_t_t)
> +				w_t_t = w_t;
> +			w_other = w_t*AUTONUMA_BALANCE_SCALE/w_t_t;
> +			w_nid = task_numa_weight[nid];
> +			w_cpu_nid = task_numa_weight[cpu_nid];
> +			w_type = W_TYPE_THREAD;
>
> Andrea, what is the real reason your code works?

I tried to explain above, but it's getting too long again, and I
wouldn't know which part to drop. If it's too messy, ignore it and
I'll try again later.

PS. this stuff isn't fixed in stone. I'm not saying this is the best
data collection or the best way to compute the data; I believe it's
close to the absolute minimum amount of info, and the minimum
computation on that data, required to perform as well as hard
bindings in the majority of workloads.
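PPS. in case the fixed point math in the quoted code isn't obvious,
here is a standalone toy example of the w_other computation (the
fault counts are invented, and I'm assuming a scale of 1000 here just
for the example; only the formula matches the quoted code):

#include <stdio.h>

#define AUTONUMA_BALANCE_SCALE 1000	/* example value */

int main(void)
{
	/* invented fault history for one task */
	unsigned long w_t = 300;	/* NUMA hinting faults on this node */
	unsigned long w_t_t = 1200;	/* total faults across all nodes */

	/*
	 * Same formula as the quoted code: scale the per-node fault
	 * count to a fixed point fraction of the total, so weights
	 * from different tasks become directly comparable.
	 */
	unsigned long w_other = w_t * AUTONUMA_BALANCE_SCALE / w_t_t;

	/* prints w_other = 250/1000, i.e. 25% of the faults were local */
	printf("w_other = %lu/%d\n", w_other, AUTONUMA_BALANCE_SCALE);
	return 0;
}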