On Thu, Jun 10, 2021 at 11:37:14PM -0700, Hugh Dickins wrote:
> On Thu, 10 Jun 2021, Jason Gunthorpe wrote:
> > On Thu, Jun 10, 2021 at 12:06:17PM +0300, Kirill A. Shutemov wrote:
> > > On Wed, Jun 09, 2021 at 11:38:11PM -0700, Hugh Dickins wrote:
> > > > page_vma_mapped_walk() cleanup: use pmd_read_atomic() with barrier()
> > > > instead of READ_ONCE() for pmde: some architectures (e.g. i386 with PAE)
> > > > have a multi-word pmd entry, for which READ_ONCE() is not good enough.
> > > > 
> > > > Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
> > > > Cc: <stable@xxxxxxxxxxxxxxx>
> > > > ---
> > > >  mm/page_vma_mapped.c | 5 ++++-
> > > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > > > index 7c0504641fb8..973c3c4e72cc 100644
> > > > --- a/mm/page_vma_mapped.c
> > > > +++ b/mm/page_vma_mapped.c
> > > > @@ -182,13 +182,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> > > >  	pud = pud_offset(p4d, pvmw->address);
> > > >  	if (!pud_present(*pud))
> > > >  		return false;
> > > > +
> > > >  	pvmw->pmd = pmd_offset(pud, pvmw->address);
> > > >  	/*
> > > >  	 * Make sure the pmd value isn't cached in a register by the
> > > >  	 * compiler and used as a stale value after we've observed a
> > > >  	 * subsequent update.
> > > >  	 */
> > > > -	pmde = READ_ONCE(*pvmw->pmd);
> > > > +	pmde = pmd_read_atomic(pvmw->pmd);
> > > > +	barrier();
> > > > +
> > > 
> > > Hm. It makes me wonder if barrier() has to be part of pmd_read_atomic().
> > > mm/hmm.c uses the same pattern as you are and I tend to think that the
> > > rest of pmd_read_atomic() users may be broken.
> > > 
> > > Am I wrong?
> > 
> > I agree with you, something called _atomic should not require the
> > caller to provide barriers.
> > 
> > I think the issue is simply that the two implementations of
> > pmd_read_atomic() should use READ_ONCE() internally, no?
> 
> I've had great difficulty coming up with answers for you.
> 
> This patch was based on two notions I've had lodged in my mind
> for several years:
> 
> 1) that pmd_read_atomic() is the creme-de-la-creme for reading a pmd_t
> atomically (even if the pmd_t spans multiple words); but reading again
> after all this time the comment above it, it seems to be more specialized
> than I'd thought (biggest selling point being for when you want to check
> pmd_none(), which we don't). And has no READ_ONCE() or barrier() inside,
> so really needs that barrier() to be as safe as the previous READ_ONCE().

IMHO it is a simple bug that the 64 bit version of this is not marked
with READ_ONCE() internally. It is reading data without a lock, and
AFAIK our modern understanding of the memory model is that this
requires READ_ONCE():

diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index e896ebef8c24cb..0bf1fdec928e71 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -75,7 +75,7 @@ static inline void native_set_pte(pte_t *ptep, pte_t pte)
 static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 {
 	pmdval_t ret;
 	u32 *tmp = (u32 *)pmdp;
 
-	ret = (pmdval_t) (*tmp);
+	ret = READ_ONCE(*tmp);
 	if (ret) {
@@ -84,7 +84,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 		 * or we can end up with a partial pmd.
 		 */
 		smp_rmb();
-		ret |= ((pmdval_t)*(tmp + 1)) << 32;
+		ret |= ((pmdval_t)READ_ONCE(*(tmp + 1))) << 32;
 	}
 
 	return (pmd_t) { ret };
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 46b13780c2c8c9..c8c7a3307d2773 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1228,7 +1228,7 @@ static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
 	 * only going to work, if the pmdval_t isn't larger than
 	 * an unsigned long.
 	 */
-	return *pmdp;
+	return READ_ONCE(*pmdp);
 }
 #endif

> 2) the barrier() in mm_find_pmd(), that replaced an earlier READ_ONCE(),
> because READ_ONCE() did not work (did not give the necessary guarantee?
> or did not build?)
> on architectures with multiple word pmd_ts e.g. i386 PAE.

This is really interesting, the git history e37c69827063 ("mm: replace
ACCESS_ONCE with READ_ONCE or barriers") says the READ_ONCE was dropped
here "because it doesn't work on non-scalar types" due to a (now 8 year
old) gcc bug.

According to the gcc bug, READ_ONCE() on anything that is a scalar-sized
struct triggers GCC to ignore the READ_ONCE'ness. To work around this
bug READ_ONCE() can never be used on any of the struct-protected page
table elements.

While I am not 100% sure, it looks like this is a pre gcc 4.9 bug, and
since gcc 4.9 is now the minimum required compiler this is all just
cruft. Use READ_ONCE() here too...

> But I've now come across some changes that Will Deacon made last year:
> the include/asm-generic/rwonce.h READ_ONCE() now appears to allow for
> native word type *or* type sizeof(long long) (e.g. i386 PAE) - given
> "a strong prevailing wind" anyway :) And 8e958839e4b9 ("sparc32: mm:
> Restructure sparc32 MMU page-table layout") put an end to sparc32's
> typedef struct { unsigned long pmdv[16]; } pmd_t.

Indeed, that is good news.

> As to your questions about pmd_read_atomic() usage elsewhere:
> please don't force me to think so hard! (And you've set me half-
> wondering, whether there are sneaky THP transitions, perhaps of the
> "unstable" kind, that page_vma_mapped_walk() should be paying more
> attention to: but for sanity's sake I won't go there, not now.)

If I recall (and I didn't recheck this right now), the only thing
pmd_read_atomic() provides is the special property that if the pmd's
flags are observed to point to a pte table then pmd_read_atomic() will
reliably return the pte table pointer.

Otherwise it returns the flags and a garbage pointer, because under the
THP protocol a PMD pointing at a page can be converted to a PTE table
if you hold the mmap sem in read mode.
Upgrading a PTE table to a PMD page requires mmap sem in write mode, so
once a PTE table is observed 'locklessly' the value is stable.. Or at
least so says the documentation.

pmd_read_atomic() is badly named, tricky to use, and missing the
READ_ONCE() because it is so old..

As far as page_vma_mapped_walk() being correct: everything except
is_pmd_migration_entry() is fine to my eye, and I simply don't know the
rules around migration entries to know right/wrong. I suspect
manipulating a migration entry probably requires holding the mmap_sem
in write mode??

There is some list of changes to the page table that require holding
the mmap sem in write mode that I've never seen documented..

Jason