On 11/10/2023 6:54 AM, Yang Shi wrote: > On Thu, Nov 9, 2023 at 5:48 AM zhangpeng (AS) <zhangpeng362@xxxxxxxxxx> wrote: >> >> Hi everyone, >> >> There is a performance issue that has been bothering us recently. >> This problem can reproduce in the latest mainline version (Linux 6.6). >> >> We use mlockall(MCL_CURRENT | MCL_FUTURE) in the user mode process >> to avoid performance problems caused by major fault. >> >> There is a stage in numa fault which will set pte as 0 in do_numa_page() : >> ptep_modify_prot_start() will clear the vmf->pte, until >> ptep_modify_prot_commit() assign a value to the vmf->pte. >> >> For the data segment of the user-mode program, the global variable area >> is a private mapping. After the pagecache is loaded, the private >> anonymous page is generated after the COW is triggered. Mlockall can >> lock COW pages (anonymous pages), but the original file pages cannot >> be locked and may be reclaimed. If the global variable (private anon page) >> is accessed when vmf->pte is zero which is concurrently set by numa fault, >> a file page fault will be triggered. >> >> At this time, the original private file page may have been reclaimed. >> If the page cache is not available at this time, a major fault will be >> triggered and the file will be read, causing additional overhead. >> >> Our problem scenario is as follows: >> >> task 1 task 2 >> ------ ------ >> /* scan global variables */ >> do_numa_page() >> spin_lock(vmf->ptl) >> ptep_modify_prot_start() >> /* set vmf->pte as null */ >> /* Access global variables */ >> handle_pte_fault() >> /* no pte lock */ >> do_pte_missing() >> do_fault() >> do_read_fault() >> ptep_modify_prot_commit() >> /* ptep update done */ >> pte_unmap_unlock(vmf->pte, vmf->ptl) >> do_fault_around() >> __do_fault() >> filemap_fault() >> /* page cache is not available >> and a major fault is triggered */ >> do_sync_mmap_readahead() >> /* page_not_uptodate and goto >> out_retry. */ >> >> Is there any way to avoid such a major fault? > > IMHO I don't think it is a bug. The man page quoted by Willy says "All > mapped pages are guaranteed to be resident in RAM when the call > returns successfully", but the later COW already made the file page > unmapped, right? The PTE pointed to the COW'ed anon page. > Hypothetically if we kept the file page mlocked and unmapped, > munlock() would have not munlocked the file page at all, it would be > mlocked in memory forever. But in this case, even the COW page is mlocked. There is small window that PTE is set to null in do_numa_page(). data segment access (it's to COW page which has nothing to do with original page cache) happens in this small window will trigger filemap_fault() to fault in original page cache. I had thought to do double check whether vmf->pte is NULL in do_read_fault(). But it's not reliable enough. Matthew's idea to use protnone to block both hardware accessing and do_pte_missing() looks more promising to me. Regards Yin, Fengwei > >> >> -- >> Best Regards, >> Peng >>