Mel Gorman <mgorman@xxxxxxx> writes:

> On Wed, Mar 31, 2021 at 07:20:09PM +0800, Huang, Ying wrote:
>> Mel Gorman <mgorman@xxxxxxx> writes:
>>
>> > On Mon, Mar 29, 2021 at 02:26:51PM +0800, Huang Ying wrote:
>> >> For NUMA balancing, in the hint page fault handler, the faulting
>> >> page will be migrated to the accessing node if necessary. During
>> >> the migration, the TLB will be shot down on all CPUs that the
>> >> process has run on recently, because in the hint page fault
>> >> handler the PTE is made accessible before the migration is
>> >> attempted. The overhead of a TLB shootdown is high, so it is
>> >> better to avoid it if possible. In fact, if we delay mapping the
>> >> page in the PTE until after the migration, the shootdown can be
>> >> avoided. This is what this patch does.
>> >>
>> >
>> > Why would the overhead be high? It was previously inaccessible, so
>> > it's only parallel accesses making forward progress that trigger
>> > the need for a flush.
>>
>> Sorry, I don't understand this. Although the page is inaccessible,
>> the threads may access other pages, so TLB flushing is still
>> necessary.
>>
>
> You assert the overhead of TLB shootdown is high and yes, it can be
> very high, but you also said "the benchmark score has no visible
> changes", indicating the TLB shootdown cost is not a major problem
> for the workload. It does not mean we should ignore it though.
>
>> > <SNIP the parts that are not problematic>
>> >
>> > If migration is attempted, then the time until the migration PTE
>> > is created is variable. The page has to be isolated from the LRU
>> > so there could be contention on the LRU lock, a new page has to be
>> > allocated and that allocation potentially has to enter the page
>> > allocator slow path etc. During that time, parallel threads make
>> > forward progress but with the patch, multiple threads potentially
>> > attempt the allocation and fail instead of doing real work.
>>
>> If my understanding of the code is correct, only the first thread
>> will attempt the isolation and allocation, because
>> TestClearPageLRU() is called in
>>
>>   migrate_misplaced_page()
>>     numamigrate_isolate_page()
>>       isolate_lru_page()
>>
>> and migrate_misplaced_page() will return 0 immediately if
>> TestClearPageLRU() returns false. Then the second thread will make
>> the page accessible and make forward progress.
>>
>
> Ok, that's true. While additional work is done, the cost is
> reasonably low -- lower than I initially imagined and with fewer
> side-effects.
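
For reference, the early exit I am describing is roughly the following
(a much simplified sketch, not the actual mainline code;
try_numa_migrate() and do_migration() are made-up names standing in
for migrate_misplaced_page() and the work it does after a successful
isolation):

/* Simplified sketch only, not the real migrate_misplaced_page(). */
static int try_numa_migrate(struct page *page, int target_nid)
{
	/*
	 * isolate_lru_page() is built on TestClearPageLRU(): only the
	 * first thread to reach this point clears PG_lru and wins the
	 * isolation.  Any concurrent thread fails here and returns 0,
	 * so its do_numa_page() just maps the page and makes forward
	 * progress.
	 */
	if (isolate_lru_page(page))
		return 0;	/* lost the race, page not migrated */

	/* Only the winning thread pays for the allocation and copy. */
	return do_migration(page, target_nid);	/* made-up helper */
}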
>
>> But there's still some timing difference between the original and
>> patched kernel. We have several choices to reduce the difference.
>>
>> 1. Check PageLRU() with the PTL held in do_numa_page()
>>
>> If PageLRU() returns false, do_numa_page() can make the page
>> accessible first, so the second thread will make the page accessible
>> earlier.
>>
>> 2. Try to lock the page with the PTL held in do_numa_page()
>>
>> If the try-locking succeeds, it's the first thread, so it can delay
>> the mapping. If the try-locking fails, it may be the second thread,
>> so it will make the page accessible first. We would need to teach
>> migrate_misplaced_page() to work with the page locked, which
>> lengthens the time the page is held locked. Is that a problem?
>>
>> 3. Check page_count() with the PTL held in do_numa_page()
>>
>> The first thread will call get_page() in numa_migrate_prep(), so if
>> the second thread can detect that, it can make the page accessible
>> first. The difficulty is that it appears hard to identify the
>> expected page_count() for file pages. For anonymous pages, that is
>> much easier, so at least if a page passes the following test, we can
>> delay the mapping:
>>
>>   PageAnon(page) && page_count(page) == page_mapcount(page) + !!PageSwapCache(page)
>>
>> This will disable the optimization for file pages, but it may be
>> good enough?
>>
>> Which one do you think is better? Maybe the first one is good
>> enough?
>>
>
> The first one is probably the most straightforward, but it's more
> important to figure out why interrupts were higher with at least one
> workload when the exact opposite is expected. Investigating which of
> options 1-3 is best and whether it's worth the duplicated check could
> be done as a separate patch.
>
>> > You should consider the following question -- is the potential
>> > saving of an IPI transmission enough to offset the cost of
>> > parallel accesses not making forward progress while one migration
>> > is set up and having different migration attempts collide?
>> >
>> > I have tests running just in case but I think the answer may be
>> > "no". So far only one useful test has completed (specjbb2005 with
>> > one VM per NUMA node) and it showed a mix of small gains and
>> > losses but with *higher* interrupts, contrary to what was expected
>> > from the changelog.
>>
>> That is hard to understand. Maybe it is caused by the bug you
>> pointed out (about was_writable)?
>>
>
> It's possible and I could not figure out what the rationale behind
> the change was :/
>
> Fix it and run it through your tests to make sure it works as you
> expect. Assuming it passes your tests and it's posted, I'll read it
> again and run it through a battery of tests. If it shows that
> interrupts are lower and is either neutral or improves performance in
> enough cases then I think it'll be ok. Even if it's only neutral in
> terms of performance but interrupts are lower, it'll be acceptable.

Will do it. Thanks a lot for your help!

Best Regards,
Huang, Ying
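
P.S. For reference, option 1 above would amount to something like the
following in do_numa_page(), with the PTL still held (an untested
sketch; the surrounding code is elided and the map_page_now label is
only a placeholder for the existing PTE-restore path):

	/*
	 * Untested sketch of option 1: if the page has already been
	 * isolated from the LRU, another thread is migrating it, so
	 * make the PTE accessible immediately rather than delaying the
	 * mapping until after our own migration attempt, which would
	 * fail anyway.
	 */
	if (!PageLRU(page))
		goto map_page_now;	/* placeholder label */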