On 11/10/2023 11:39 AM, Kefeng Wang wrote: > > > On 2023/11/10 9:57, Yin, Fengwei wrote: >> >> >> On 11/10/2023 6:54 AM, Yang Shi wrote: >>> On Thu, Nov 9, 2023 at 5:48 AM zhangpeng (AS) <zhangpeng362@xxxxxxxxxx> wrote: >>>> >>>> Hi everyone, >>>> >>>> There is a performance issue that has been bothering us recently. >>>> This problem can reproduce in the latest mainline version (Linux 6.6). >>>> >>>> We use mlockall(MCL_CURRENT | MCL_FUTURE) in the user mode process >>>> to avoid performance problems caused by major fault. >>>> >>>> There is a stage in numa fault which will set pte as 0 in do_numa_page() : >>>> ptep_modify_prot_start() will clear the vmf->pte, until >>>> ptep_modify_prot_commit() assign a value to the vmf->pte. >>>> >>>> For the data segment of the user-mode program, the global variable area >>>> is a private mapping. After the pagecache is loaded, the private >>>> anonymous page is generated after the COW is triggered. Mlockall can >>>> lock COW pages (anonymous pages), but the original file pages cannot >>>> be locked and may be reclaimed. If the global variable (private anon page) >>>> is accessed when vmf->pte is zero which is concurrently set by numa fault, >>>> a file page fault will be triggered. >>>> >>>> At this time, the original private file page may have been reclaimed. >>>> If the page cache is not available at this time, a major fault will be >>>> triggered and the file will be read, causing additional overhead. >>>> >>>> Our problem scenario is as follows: >>>> >>>> task 1 task 2 >>>> ------ ------ >>>> /* scan global variables */ >>>> do_numa_page() >>>> spin_lock(vmf->ptl) >>>> ptep_modify_prot_start() >>>> /* set vmf->pte as null */ >>>> /* Access global variables */ >>>> handle_pte_fault() >>>> /* no pte lock */ >>>> do_pte_missing() >>>> do_fault() >>>> do_read_fault() >>>> ptep_modify_prot_commit() >>>> /* ptep update done */ >>>> pte_unmap_unlock(vmf->pte, vmf->ptl) >>>> do_fault_around() >>>> __do_fault() >>>> filemap_fault() >>>> /* page cache is not available >>>> and a major fault is triggered */ >>>> do_sync_mmap_readahead() >>>> /* page_not_uptodate and goto >>>> out_retry. */ >>>> >>>> Is there any way to avoid such a major fault? >>> >>> IMHO I don't think it is a bug. The man page quoted by Willy says "All >>> mapped pages are guaranteed to be resident in RAM when the call >>> returns successfully", but the later COW already made the file page >>> unmapped, right? The PTE pointed to the COW'ed anon page. >>> Hypothetically if we kept the file page mlocked and unmapped, >>> munlock() would have not munlocked the file page at all, it would be >>> mlocked in memory forever. >> But in this case, even the COW page is mlocked. There is small window >> that PTE is set to null in do_numa_page(). data segment access (it's to >> COW page which has nothing to do with original page cache) happens in >> this small window will trigger filemap_fault() to fault in original >> page cache. >> >> I had thought to do double check whether vmf->pte is NULL in do_read_fault(). >> But it's not reliable enough. >> >> Matthew's idea to use protnone to block both hardware accessing and >> do_pte_missing() looks more promising to me. > > Actual, we could revert the following patch to avoid this issue, > but this workaroud from ppc... > > commit cee216a696b2004017a5ecb583366093d90b1568 > Author: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxxxxxxx> > Date: Fri Feb 24 14:59:13 2017 -0800 > > mm/autonuma: don't use set_pte_at when updating protnone ptes > > Architectures like ppc64, use privilege access bit to mark pte non > accessible. This implies that kernel can do a copy_to_user to an > address marked for numa fault. This also implies that there can be a > parallel hardware update for the pte. set_pte_at cannot be used in such > scenarios. Hence switch the pte update to use ptep_get_and_clear and > set_pte_at combination. Oh. This means the protnone doesn't work for PPC. > >> >> >> Regards >> Yin, Fengwei >> >>> >>>> >>>> -- >>>> Best Regards, >>>> Peng >>>>