On 11/14/2023 9:41 AM, Yang Shi wrote:
> On Thu, Nov 9, 2023 at 5:57 PM Yin, Fengwei <fengwei.yin@xxxxxxxxx> wrote:
>>
>>
>>
>> On 11/10/2023 6:54 AM, Yang Shi wrote:
>>> On Thu, Nov 9, 2023 at 5:48 AM zhangpeng (AS) <zhangpeng362@xxxxxxxxxx> wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> There is a performance issue that has been bothering us recently.
>>>> This problem can reproduce in the latest mainline version (Linux 6.6).
>>>>
>>>> We use mlockall(MCL_CURRENT | MCL_FUTURE) in the user mode process
>>>> to avoid performance problems caused by major fault.
>>>>
>>>> There is a stage in numa fault which will set pte as 0 in do_numa_page() :
>>>> ptep_modify_prot_start() will clear the vmf->pte, until
>>>> ptep_modify_prot_commit() assign a value to the vmf->pte.
>>>>
>>>> For the data segment of the user-mode program, the global variable area
>>>> is a private mapping. After the pagecache is loaded, the private
>>>> anonymous page is generated after the COW is triggered. Mlockall can
>>>> lock COW pages (anonymous pages), but the original file pages cannot
>>>> be locked and may be reclaimed. If the global variable (private anon page)
>>>> is accessed when vmf->pte is zero which is concurrently set by numa fault,
>>>> a file page fault will be triggered.
>>>>
>>>> At this time, the original private file page may have been reclaimed.
>>>> If the page cache is not available at this time, a major fault will be
>>>> triggered and the file will be read, causing additional overhead.
>>>>
>>>> Our problem scenario is as follows:
>>>>
>>>> task 1                                 task 2
>>>> ------                                 ------
>>>> /* scan global variables */
>>>> do_numa_page()
>>>>   spin_lock(vmf->ptl)
>>>>   ptep_modify_prot_start()
>>>>   /* set vmf->pte as null */
>>>>                                        /* Access global variables */
>>>>                                        handle_pte_fault()
>>>>                                          /* no pte lock */
>>>>                                          do_pte_missing()
>>>>                                            do_fault()
>>>>                                              do_read_fault()
>>>>   ptep_modify_prot_commit()
>>>>   /* ptep update done */
>>>>   pte_unmap_unlock(vmf->pte, vmf->ptl)
>>>>                                                do_fault_around()
>>>>                                                __do_fault()
>>>>                                                  filemap_fault()
>>>>                                                    /* page cache is not available
>>>>                                                    and a major fault is triggered */
>>>>                                                    do_sync_mmap_readahead()
>>>>                                                    /* page_not_uptodate and goto
>>>>                                                    out_retry. */
>>>>
>>>> Is there any way to avoid such a major fault?
>>>
>>> IMHO I don't think it is a bug. The man page quoted by Willy says "All
>>> mapped pages are guaranteed to be resident in RAM when the call
>>> returns successfully", but the later COW already made the file page
>>> unmapped, right? The PTE pointed to the COW'ed anon page.
>>> Hypothetically if we kept the file page mlocked and unmapped,
>>> munlock() would have not munlocked the file page at all, it would be
>>> mlocked in memory forever.
>> But in this case, even the COW page is mlocked. There is small window
>> that PTE is set to null in do_numa_page(). data segment access (it's to
>> COW page which has nothing to do with original page cache) happens in
>> this small window will trigger filemap_fault() to fault in original
>> page cache.
>
> Yes, my point is this may not break the mlockall, but the potential
> optimization by avoiding the major fault may still stand.

Totally agree.

Regards
Yin, Fengwei

>
>>
>> I had thought to do double check whether vmf->pte is NULL in do_read_fault().
>> But it's not reliable enough.
>>
>> Matthew's idea to use protnone to block both hardware accessing and
>> do_pte_missing() looks more promising to me.
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Peng
>>>>
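
For anyone who wants to watch for the symptom from user space, here is a minimal sketch, not the reporter's actual program. It locks memory the same way the report does, dirties a global array so its pages are COWed into private anonymous pages, and reads the process's major-fault counter with getrusage(). The array globals[], its size, the 4096-byte page size, and the helper names lock_all_memory(), major_faults() and touch_globals() are all assumptions made for illustration. It does not reproduce the race itself; that still needs NUMA balancing scanning the range and the original file pages being reclaimed.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Hypothetical stand-in for the program's global variable area.
 * It is initialized data, so it starts out as a private file mapping
 * of the binary until the first write triggers COW.
 * Assumption: 4096-byte pages. */
#define PAGE_BYTES  4096
#define GLOBALS_LEN (64 * PAGE_BYTES)
static volatile unsigned char globals[GLOBALS_LEN] = { 1 };

/* Lock current and future mappings, as in the report.
 * Returns 0 on success, -1 on failure (for example when
 * RLIMIT_MEMLOCK is too low for an unprivileged process). */
int lock_all_memory(void)
{
    return mlockall(MCL_CURRENT | MCL_FUTURE) == 0 ? 0 : -1;
}

/* Major faults taken by this process so far, or -1 on error. */
long major_faults(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_majflt;
}

/* Write one byte in every page of globals[] so that each page is
 * COWed into a private anonymous page. Returns the number of pages
 * touched. */
size_t touch_globals(void)
{
    size_t pages = 0;

    for (size_t i = 0; i < GLOBALS_LEN; i += PAGE_BYTES) {
        globals[i] = (unsigned char)(globals[i] + 1);
        pages++;
    }
    return pages;
}
```

A caller would run lock_all_memory() and touch_globals() once at startup, then sample major_faults() before and after the latency-sensitive loop that reads the globals. A counter that keeps growing after startup would be consistent with the window described above, though other file-backed mappings can raise it as well.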