On Mon, Jun 11, 2018 at 12:04:34PM +0200, Jirka Hladky wrote:
> Hi Mel,
>
> your suggestion about the commit which has caused the regression was right
> - it's indeed this commit:
>
> 2c83362734dad8e48ccc0710b5cd2436a0323893
>
> The question now is what can be done to improve the results. I have made
> stream to run longer and I see that data are moved very slowly from NODE#1
> to NODE#0.
>

Ok, this is somewhat expected although I suspect the scan rate slowed a lot
in the early phase of the program and that's why the migration is slow --
a slow scan means fewer samples, so it takes longer to get past the 2-pass
filter.

> The process has started on NODE#1 where all memory has been allocated.
> Right after the start, the process has been moved to NODE#0 but only part
> of the memory has been moved to that node. numa_preferred_nid has stayed 1
> for 30 seconds. The numa_preferred_nid has changed to 0 at
> 2018-Jun-09_03h35m58s and most of the memory has been finally reallocated.
> See the logs below.
>
> Could we try to make numa_preferred_nid to change faster?
>

What catches us is that each element in itself makes sense, it's just not
a universal win. The identified patch makes a reasonable choice in that
fork shouldn't necessarily spread across the machine as it hurts short-lived
or communicating processes. Unfortunately, if a load is NUMA-aware and the
processes are independent then automatic NUMA balancing has to take action,
which means there is a period of time where performance is sub-optimal.
Similarly, the load balancer is making a reasonable decision when a socket
gets overloaded. Fixing any part of it for STREAM will end up regressing
something else.

The numa_preferred_nid can probably be changed faster by adjusting the scan
rate. Unfortunately, it comes with the penalty that system CPU overhead
will be higher and stalls in the process will increase to handle the PTE
updates and the subsequent faults.
This might help STREAM but anything that is latency sensitive will be hurt.
Worse, if a socket is over-saturated and there is a high frequency of
cross-node migrations to load balance then the scan rate might always stay
at the max frequency and a very high cost is incurred, so we end up with
another class of regression.

Srikar Dronamra did have a series with two patches that increase the scan
rate when there is a cross-node migration. It may be the case that it also
has the impact of changing numa_preferred_nid faster but it has a real risk
of introducing regressions. Still, for the purposes of testing you might be
interested in testing the following two patches?

Srikar Dronamra [PATCH 17/19] sched/numa: Pass destination cpu as a parameter to migrate_task_rq
Srikar Dronamra [PATCH 18/19] sched/numa: Reset scan rate whenever task moves across nodes

-- 
Mel Gorman
SUSE Labs
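
For anyone comparing kernels with and without those patches, the slow move of memory from NODE#1 to NODE#0 described above can be watched from userspace. What follows is only a rough sketch, not the tooling used for the logs in this thread: it sums the per-node page counts (the `N<node>=<pages>` fields) from `/proc/<pid>/numa_maps`. The function names `pages_per_node` and `watch` are made up for illustration, and the counts are in pages of each mapping's own page size, so mixed hugepage/regular mappings are not directly comparable.

```python
import re
import time
from collections import Counter

# A numa_maps token such as "N0=12" means 12 pages of that mapping on node 0.
_NODE_FIELD = re.compile(r"N(\d+)=(\d+)")


def pages_per_node(numa_maps_text):
    """Sum the N<node>=<pages> fields over every mapping in numa_maps."""
    totals = Counter()
    for line in numa_maps_text.splitlines():
        for token in line.split():
            match = _NODE_FIELD.fullmatch(token)
            if match:
                totals[int(match.group(1))] += int(match.group(2))
    return dict(totals)


def watch(pid, interval_s=1.0):
    """Print per-node page totals for `pid` until the process goes away."""
    path = f"/proc/{pid}/numa_maps"
    while True:
        try:
            with open(path) as f:
                totals = pages_per_node(f.read())
        except (FileNotFoundError, ProcessLookupError):
            break  # the process has exited
        print(time.strftime("%H:%M:%S"), sorted(totals.items()), flush=True)
        time.sleep(interval_s)
```

Calling `watch(pid)` on the STREAM process, with enough privileges to read its `numa_maps`, prints one timestamped line per interval. The point at which the node 0 total overtakes node 1 gives a rough figure for how long the migration took on each kernel.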