On 11/10/2023 1:32 PM, Huang, Ying wrote: > Matthew Wilcox <willy@xxxxxxxxxxxxx> writes: > >> On Thu, Nov 09, 2023 at 09:47:24PM +0800, zhangpeng (AS) wrote: >>> There is a stage in numa fault which will set pte as 0 in do_numa_page() : >>> ptep_modify_prot_start() will clear the vmf->pte, until >>> ptep_modify_prot_commit() assign a value to the vmf->pte. >> >> [...] >> >>> Our problem scenario is as follows: >>> >>> task 1 task 2 >>> ------ ------ >>> /* scan global variables */ >>> do_numa_page() >>> spin_lock(vmf->ptl) >>> ptep_modify_prot_start() >>> /* set vmf->pte as null */ >>> /* Access global variables */ >>> handle_pte_fault() >>> /* no pte lock */ >>> do_pte_missing() >>> do_fault() >>> do_read_fault() >>> ptep_modify_prot_commit() >>> /* ptep update done */ >>> pte_unmap_unlock(vmf->pte, vmf->ptl) >>> do_fault_around() >>> __do_fault() >>> filemap_fault() >>> /* page cache is not available >>> and a major fault is triggered */ >>> do_sync_mmap_readahead() >>> /* page_not_uptodate and goto >>> out_retry. */ >>> >>> Is there any way to avoid such a major fault? >> >> Yes, this looks like a bug. >> >> It seems to me that the easiest way to fix this is not to zero the pte >> but to make it protnone? That would send task 2 into do_numa_page() >> where it would take the ptl, then check pte_same(), see that it's >> changed and goto out, which will end up retrying the fault. > > There are other places in the kernel where the PTE is cleared, for > example, move_ptes() in mremap.c. IIUC, we need to audit all them. > > Another possible solution is to check PTE again with PTL held before > reading in file data. This will increase the overhead of major fault > path. Is it acceptable? What if we check the PTE without page table lock acquired? Regards Yin, Fengwei > >> I'm not particularly expert at page table manipulation, so I'll let >> somebody who is propose an actual patch. Or you could try to do it? > > -- > Best Regards, > Huang, Ying