On Wed, Mar 05, 2025 at 11:46:41AM +0100, David Hildenbrand wrote: > On 05.03.25 11:21, Dev Jain wrote: > > In __handle_mm_fault(), > > > > 1. Why is there a barrier() for the PUD logic? > > Good question. It was added in > > commit a00cc7d9dd93d66a3fb83fc52aa57a4bec51c517 > Author: Matthew Wilcox <willy@xxxxxxxxxxxxx> > Date: Fri Feb 24 14:57:02 2017 -0800 > > mm, x86: add support for PUD-sized transparent hugepages > > Maybe it was an alternative to performing a READ_ONCE(*vmf.pud). I was monkey-see, monkey-do. Here's the corresponding code as it existed at the time: } else { pmd_t orig_pmd = *vmf.pmd; barrier(); if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) { vmf.flags |= FAULT_FLAG_SIZE_PMD; vs what I added: } else { pud_t orig_pud = *vmf.pud; barrier(); if (pud_trans_huge(orig_pud) || pud_devmap(orig_pud)) { At some point, somebody added pmdp_get_lockless() and did not add a corresponding pudp_get_lockless(). And it was ... Hugh in 26e1a0c3277d If you want to add a pudp_get_lockless(), I doubt anyone will object, but it's probably pointless churn. > Maybe it was done that way, because pudp_get_lockless() does not exist. And > it would likely not be required, because on architectures where > ptep_get_lockless() does some magic (see below, mostly 32bit), PUD THP are > not applicable. > > > > 2. For the PMD logic, in the if block, we use *vmf.pmd, and in the else block > > we use pmdp_get_lockless(); what if someone changes the pmd just when we > > have begun processing the conditions in the if block, fail in the if block > > and then the else block operates on a different pmd value. Shouldn't we cache > > the value of the pmd and operate on a single consistent value until we take the > > lock and then finally check using pxd_same() and friends? > > The pmd_none(*vmf.pmd) is fine. create_huge_pmd() must be able to deal with > races, because we are not holding any locks. > > We only have to use pmdp_get_lockless() when we want to effectively perform > a READ_ONCE(), and make sure that we read something "reasonable" that we can > operate on, even with concurrent changes. (e.g., not read a garbage PFN just > because of some concurrent changes) > > We'll store the value in vmf.orig_pmd, on which we'll operate and try to > detect later changes using pmd_same(). So we really want something > consistent in there. > > See the description above ptep_get_lockless(), why we cannot simply do a > READ_ONCE on architectures where a PTE cannot be read atomically (e.g., 8 > byte PTEs on 32bit architecture). > > -- > Cheers, > > David / dhildenb >