On Thu, Nov 9, 2023 at 5:57 PM Yin, Fengwei <fengwei.yin@xxxxxxxxx> wrote:
>
>
>
> On 11/10/2023 6:54 AM, Yang Shi wrote:
> > On Thu, Nov 9, 2023 at 5:48 AM zhangpeng (AS) <zhangpeng362@xxxxxxxxxx> wrote:
> >>
> >> Hi everyone,
> >>
> >> There is a performance issue that has been bothering us recently.
> >> This problem can reproduce in the latest mainline version (Linux 6.6).
> >>
> >> We use mlockall(MCL_CURRENT | MCL_FUTURE) in the user mode process
> >> to avoid performance problems caused by major fault.
> >>
> >> There is a stage in numa fault which will set pte as 0 in do_numa_page() :
> >> ptep_modify_prot_start() will clear the vmf->pte, until
> >> ptep_modify_prot_commit() assign a value to the vmf->pte.
> >>
> >> For the data segment of the user-mode program, the global variable area
> >> is a private mapping. After the pagecache is loaded, the private
> >> anonymous page is generated after the COW is triggered. Mlockall can
> >> lock COW pages (anonymous pages), but the original file pages cannot
> >> be locked and may be reclaimed. If the global variable (private anon page)
> >> is accessed when vmf->pte is zero which is concurrently set by numa fault,
> >> a file page fault will be triggered.
> >>
> >> At this time, the original private file page may have been reclaimed.
> >> If the page cache is not available at this time, a major fault will be
> >> triggered and the file will be read, causing additional overhead.
> >>
> >> Our problem scenario is as follows:
> >>
> >> task 1                      task 2
> >> ------                      ------
> >> /* scan global variables */
> >> do_numa_page()
> >>   spin_lock(vmf->ptl)
> >>   ptep_modify_prot_start()
> >>   /* set vmf->pte as null */
> >>                             /* Access global variables */
> >>                             handle_pte_fault()
> >>                               /* no pte lock */
> >>                               do_pte_missing()
> >>                                 do_fault()
> >>                                   do_read_fault()
> >>   ptep_modify_prot_commit()
> >>   /* ptep update done */
> >>   pte_unmap_unlock(vmf->pte, vmf->ptl)
> >>                                     do_fault_around()
> >>                                     __do_fault()
> >>                                       filemap_fault()
> >>                                         /* page cache is not available
> >>                                         and a major fault is triggered */
> >>                                         do_sync_mmap_readahead()
> >>                                         /* page_not_uptodate and goto
> >>                                         out_retry. */
> >>
> >> Is there any way to avoid such a major fault?
> >
> > IMHO I don't think it is a bug. The man page quoted by Willy says "All
> > mapped pages are guaranteed to be resident in RAM when the call
> > returns successfully", but the later COW already made the file page
> > unmapped, right? The PTE pointed to the COW'ed anon page.
> > Hypothetically if we kept the file page mlocked and unmapped,
> > munlock() would have not munlocked the file page at all, it would be
> > mlocked in memory forever.
> But in this case, even the COW page is mlocked. There is small window
> that PTE is set to null in do_numa_page(). data segment access (it's to
> COW page which has nothing to do with original page cache) happens in
> this small window will trigger filemap_fault() to fault in original
> page cache.

Yes, my point is this may not break the mlockall, but the potential
optimization by avoiding the major fault may still stand.

>
> I had thought to do double check whether vmf->pte is NULL in do_read_fault().
> But it's not reliable enough.
>
> Matthew's idea to use protnone to block both hardware accessing and
> do_pte_missing() looks more promising to me.
>
>
> Regards
> Yin, Fengwei
>
> >
> >>
> >> --
> >> Best Regards,
> >> Peng
> >>
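
In case it helps anyone observe this from user space, below is a minimal
sketch (not taken from the original report) of how the major-fault count
could be watched around accesses to a data-segment global after
mlockall(MCL_CURRENT | MCL_FUTURE). The helper names major_faults(),
lock_all_memory() and touch_globals(), and the global_buf array, are
hypothetical and exist only for illustration. The sketch assumes Linux and
uses getrusage() to read ru_majflt. On its own it does not force the
do_numa_page() window; hitting that presumably also needs NUMA balancing
to be active and the backing page cache to have been reclaimed.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Hypothetical writable global: lives in the data segment, which is a
 * private file mapping that becomes an anonymous page on first write (COW). */
static unsigned char global_buf[1 << 16] = { 1 };

/* Major faults taken by this process so far, or -1 if getrusage() fails. */
static long major_faults(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_majflt;
}

/* Lock current and future mappings, as in the report. Returns 0 on success
 * and -1 on failure (for example when RLIMIT_MEMLOCK is too low). */
static int lock_all_memory(void)
{
    return mlockall(MCL_CURRENT | MCL_FUTURE);
}

/* Write to every page of global_buf 'rounds' times and return the number of
 * major faults taken while doing so, or -1 if the counter is unavailable. */
static long touch_globals(int rounds)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t step = page > 0 ? (size_t)page : 4096; /* assumed fallback size */
    volatile unsigned char *p = global_buf;
    long before = major_faults();

    for (int r = 0; r < rounds; r++)
        for (size_t i = 0; i < sizeof(global_buf); i += step)
            p[i]++;

    long after = major_faults();
    return (before < 0 || after < 0) ? -1 : after - before;
}
```

A caller would run lock_all_memory() once at startup and then call
touch_globals() in a loop. With mlockall() in effect, a nonzero return
value would be consistent with the major fault described above, though
other causes are possible.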