I went spelunking to try to find out more about this issue, and I
discovered it's Aneesh's fault from 2017 ...

On Thu, Nov 09, 2023 at 09:47:24PM +0800, zhangpeng (AS) wrote:
> Hi everyone,
>
> There is a performance issue that has been bothering us recently.
> This problem can reproduce in the latest mainline version (Linux 6.6).
>
> We use mlockall(MCL_CURRENT | MCL_FUTURE) in the user mode process
> to avoid performance problems caused by major fault.
>
> There is a stage in numa fault which will set pte as 0 in do_numa_page() :
> ptep_modify_prot_start() will clear the vmf->pte, until
> ptep_modify_prot_commit() assign a value to the vmf->pte.
>
> For the data segment of the user-mode program, the global variable area
> is a private mapping. After the pagecache is loaded, the private
> anonymous page is generated after the COW is triggered. Mlockall can
> lock COW pages (anonymous pages), but the original file pages cannot
> be locked and may be reclaimed. If the global variable (private anon page)
> is accessed when vmf->pte is zero which is concurrently set by numa fault,
> a file page fault will be triggered.
>
> At this time, the original private file page may have been reclaimed.
> If the page cache is not available at this time, a major fault will be
> triggered and the file will be read, causing additional overhead.
>
> Our problem scenario is as follows:
>
> task 1                                task 2
> ------                                ------
> /* scan global variables */
> do_numa_page()
>   spin_lock(vmf->ptl)
>   ptep_modify_prot_start()
>   /* set vmf->pte as null */
>                                       /* Access global variables */
>                                       handle_pte_fault()
>                                         /* no pte lock */
>                                         do_pte_missing()
>                                           do_fault()
>                                             do_read_fault()
>   ptep_modify_prot_commit()
>   /* ptep update done */
>   pte_unmap_unlock(vmf->pte, vmf->ptl)
>                                               do_fault_around()
>                                               __do_fault()
>                                                 filemap_fault()
>                                                   /* page cache is not available
>                                                   and a major fault is triggered */
>                                                   do_sync_mmap_readahead()
>                                                   /* page_not_uptodate and goto
>                                                   out_retry. */
>
> Is there any way to avoid such a major fault?
>
> --
> Best Regards,
> Peng
>
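
For anyone who wants to check whether a process is seeing this, here is a
rough userspace sketch. It is not taken from the report: the helper names
(major_faults, lock_all_memory, touch_and_count) are made up for
illustration, and it only counts major faults with getrusage(). It does not
try to provoke the race with NUMA balancing, so a delta of zero proves
nothing on its own.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Major faults taken so far by this process, or -1 if getrusage() fails. */
static long major_faults(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_SELF, &ru) != 0)
		return -1;
	return ru.ru_majflt;
}

/*
 * Same locking call as in the report. Returns 0 on success or -errno;
 * it commonly fails without CAP_IPC_LOCK or a large RLIMIT_MEMLOCK.
 */
static int lock_all_memory(void)
{
	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
		return -errno;
	return 0;
}

/*
 * Hypothetical helper: read one byte from each page of [buf, buf + len)
 * and return how many major faults that caused, or -1 on error.
 * 'page' is the page size in bytes, supplied by the caller.
 */
static long touch_and_count(const volatile char *buf, size_t len, size_t page)
{
	long before = major_faults();
	long after;
	size_t i;

	for (i = 0; i < len; i += page)
		(void)buf[i];	/* volatile read, so the access is not elided */

	after = major_faults();
	if (before < 0 || after < 0)
		return -1;
	return after - before;
}
```

The idea would be to call lock_all_memory() at startup, then periodically
run touch_and_count() over the global data in question. With everything
locked one would expect the delta to stay at zero, so a nonzero value after
mlockall() is at least consistent with the scenario described above.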