On 11/10/23 9:20 AM, Yin, Fengwei wrote: > > > On 11/10/2023 11:39 AM, Kefeng Wang wrote: >> >> >> On 2023/11/10 9:57, Yin, Fengwei wrote: >>> >>> >>> On 11/10/2023 6:54 AM, Yang Shi wrote: >>>> On Thu, Nov 9, 2023 at 5:48 AM zhangpeng (AS) <zhangpeng362@xxxxxxxxxx> wrote: >>>>> >>>>> Hi everyone, >>>>> >>>>> There is a performance issue that has been bothering us recently. >>>>> This problem can reproduce in the latest mainline version (Linux 6.6). >>>>> >>>>> We use mlockall(MCL_CURRENT | MCL_FUTURE) in the user mode process >>>>> to avoid performance problems caused by major fault. >>>>> >>>>> There is a stage in numa fault which will set pte as 0 in do_numa_page() : >>>>> ptep_modify_prot_start() will clear the vmf->pte, until >>>>> ptep_modify_prot_commit() assign a value to the vmf->pte. >>>>> >>>>> For the data segment of the user-mode program, the global variable area >>>>> is a private mapping. After the pagecache is loaded, the private >>>>> anonymous page is generated after the COW is triggered. Mlockall can >>>>> lock COW pages (anonymous pages), but the original file pages cannot >>>>> be locked and may be reclaimed. If the global variable (private anon page) >>>>> is accessed when vmf->pte is zero which is concurrently set by numa fault, >>>>> a file page fault will be triggered. >>>>> >>>>> At this time, the original private file page may have been reclaimed. >>>>> If the page cache is not available at this time, a major fault will be >>>>> triggered and the file will be read, causing additional overhead. >>>>> >>>>> Our problem scenario is as follows: >>>>> >>>>> task 1 task 2 >>>>> ------ ------ >>>>> /* scan global variables */ >>>>> do_numa_page() >>>>> spin_lock(vmf->ptl) >>>>> ptep_modify_prot_start() >>>>> /* set vmf->pte as null */ >>>>> /* Access global variables */ >>>>> handle_pte_fault() >>>>> /* no pte lock */ >>>>> do_pte_missing() >>>>> do_fault() >>>>> do_read_fault() >>>>> ptep_modify_prot_commit() >>>>> /* ptep update done */ >>>>> pte_unmap_unlock(vmf->pte, vmf->ptl) >>>>> do_fault_around() >>>>> __do_fault() >>>>> filemap_fault() >>>>> /* page cache is not available >>>>> and a major fault is triggered */ >>>>> do_sync_mmap_readahead() >>>>> /* page_not_uptodate and goto >>>>> out_retry. */ >>>>> >>>>> Is there any way to avoid such a major fault? >>>> >>>> IMHO I don't think it is a bug. The man page quoted by Willy says "All >>>> mapped pages are guaranteed to be resident in RAM when the call >>>> returns successfully", but the later COW already made the file page >>>> unmapped, right? The PTE pointed to the COW'ed anon page. >>>> Hypothetically if we kept the file page mlocked and unmapped, >>>> munlock() would have not munlocked the file page at all, it would be >>>> mlocked in memory forever. >>> But in this case, even the COW page is mlocked. There is small window >>> that PTE is set to null in do_numa_page(). data segment access (it's to >>> COW page which has nothing to do with original page cache) happens in >>> this small window will trigger filemap_fault() to fault in original >>> page cache. >>> >>> I had thought to do double check whether vmf->pte is NULL in do_read_fault(). >>> But it's not reliable enough. >>> >>> Matthew's idea to use protnone to block both hardware accessing and >>> do_pte_missing() looks more promising to me. >> >> Actual, we could revert the following patch to avoid this issue, >> but this workaroud from ppc... >> >> commit cee216a696b2004017a5ecb583366093d90b1568 >> Author: Aneesh Kumar K.V <aneesh.kumar@xxxxxxxxxxxxxxxxxx> >> Date: Fri Feb 24 14:59:13 2017 -0800 >> >> mm/autonuma: don't use set_pte_at when updating protnone ptes >> >> Architectures like ppc64, use privilege access bit to mark pte non >> accessible. This implies that kernel can do a copy_to_user to an >> address marked for numa fault. This also implies that there can be a >> parallel hardware update for the pte. set_pte_at cannot be used in such >> scenarios. Hence switch the pte update to use ptep_get_and_clear and >> set_pte_at combination. > Oh. This means the protnone doesn't work for PPC. > > That is correct. I am yet to read the full thread. Can we make ptep_modify_prot_start() not to mark pte = 0 ? One of the requirement for powerpc is to mark it hardware invalid such that not TLB entries get inserted after that. Other options is to get a proper pte_update API for generic kernel so that architectures can do this without marking the pte invalid. -aneesh