On Wed, 15 Apr 2020 16:14:38 +0800 "Huang\, Ying" <ying.huang@xxxxxxxxx> wrote: > Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx> writes: > > > On Tue, Apr 14, 2020 at 04:19:51PM +0800, Huang Ying wrote: > >> In current AutoNUMA implementation, the page tables of the processes > >> are scanned periodically to trigger the NUMA hint page faults. The > >> scanning runs in the context of the processes, so will delay the > >> running of the processes. In a test with 64 threads pmbench memory > >> accessing benchmark on a 2-socket server machine with 104 logical CPUs > >> and 256 GB memory, there are more than 20000 latency outliers that are > >> > 1 ms in 3600s run time. These latency outliers are almost all > >> caused by the AutoNUMA page table scanning. Because they almost all > >> disappear after applying this patch to scan the page tables > >> asynchronously. > >> > >> Because there are idle CPUs in system, the asynchronous running page > >> table scanning code can run on these idle CPUs instead of the CPUs the > >> workload is running on. > >> > >> So on system with enough idle CPU time, it's better to scan the page > >> tables asynchronously to take full advantages of these idle CPU time. > >> Another scenario which can benefit from this is to scan the page > >> tables on some service CPUs of the socket, so that the real workload > >> can run on the isolated CPUs without the latency outliers caused by > >> the page table scanning. > >> > >> But it's not perfect to scan page tables asynchronously too. For > >> example, on system without enough idle CPU time, the CPU time isn't > >> scheduled fairly because the page table scanning is charged to the > >> workqueue thread instead of the process/thread it works for. And > >> although the page tables are scanned for the target process, it may > >> run on a CPU that is not in the cpuset of the target process. > >> > >> One possible solution is to let the system administrator to choose the > >> better behavior for the system via a sysctl knob (implemented in the > >> patch). But it's not perfect too. Because every user space knob adds > >> maintenance overhead. > >> > >> A better solution may be to back-charge the CPU time to scan the page > >> tables to the process/thread, and find a way to run the work on the > >> proper cpuset. After some googling, I found there's some discussion > >> about this as in the following thread, > >> > >> https://lkml.org/lkml/2019/6/13/1321 > >> > >> So this patch may be not ready to be merged by upstream yet. It > >> quantizes the latency outliers caused by the page table scanning in > >> AutoNUMA. And it provides a possible way to resolve the issue for > >> users who cares about it. And it is a potential customer of the work > >> related to the cgroup-aware workqueue or other asynchronous execution > >> mechanisms. > >> > > > > The caveats you list are the important ones and the reason why it was > > not done asynchronously. In an earlier implementation all the work was > > done by a dedicated thread and ultimately abandoned. > > > > There is no guarantee there is an idle CPU available and one that is > > local to the thread that should be doing the scanning. Even if there is, > > it potentially prevents another task from scheduling on an idle CPU and > > similarly other workqueue tasks may be delayed waiting on the scanner. The > > hiding of the cost is also problematic because the CPU cost is hidden > > and mixed with other unrelated workqueues. It also has the potential > > to mask bugs. Lets say for example there is a bug whereby a task is > > scanning excessively, that can be easily missed when the work is done by > > a workqueue. > > Do you think something like cgroup-aware workqueue is a solution deserve > to be tried when it's available? It will not hide the scanning cost, > because the CPU time will be charged to the original cgroup or task. > Although the other tasks may be disturbed, cgroup can provide some kind > of management via cpusets. > > > While it's just an opinion, my preference would be to focus on reducing > > the cost and amount of scanning done -- particularly for threads. For > > example, all threads operate on the same address space but there can be > > significant overlap where all threads are potentially scanning the same > > areas or regions that the thread has no interest in. One option would be > > to track the highest and lowest pages accessed and only scan within > > those regions for example. The tricky part is that library pages may > > create very wide windows that render the tracking useless but it could > > at least be investigated. > > In general, I think it's good to reduce the scanning cost. I think the main idea of DAMON[1] might be able to applied here. Have you considered it? [1] https://lore.kernel.org/linux-mm/20200406130938.14066-1-sjpark@xxxxxxxxxx/ Thanks, SeongJae Park > > Why do you think there will be overlap between the threads of a process? > If my understanding were correctly, the threads will scan one by one > instead of simultaneously. And how to determine whether a vma need to > be scanned or not? For example, there may be only a small portion of > pages been accessed in a vma, but they may be accessed remotely and > consumes quite some inter-node bandwidth, so need to be migrated. > > Best Regards, > Huang, Ying >