On Thu, May 25, 2023 at 03:35:01PM -0700, Hugh Dickins wrote: > On Wed, 24 May 2023, Peter Xu wrote: > > On Sun, May 21, 2023 at 09:49:45PM -0700, Hugh Dickins wrote: > > > Use pmdp_get_lockless() in preference to READ_ONCE(*pmdp), to get a more > > > reliable result with PAE (or READ_ONCE as before without PAE); and remove > > > the unnecessary extra barrier()s which got left behind in its callers. > > > > Pure question: does it mean that some of below path (missing barrier() > > ones) could have problem when CONFIG_PAE, hence this can be seen as a > > (potential) bug fix? > > I don't think so; or at least, I am not claiming that this fixes any. > > It really depends on what use is made of the pmdval afterwards, and > I've not checked through them. The READ_ONCE()s which were there, > were good enough to make sure that the compiler did not reevaluate > the pmdval later on, with perhaps a confusingly different result. > > But, at least in the x86 PAE case, they were not good enough to ensure > that the two halves of the entry match up; and, sad to say, nor is that > absolutely guaranteed by these conversions to pmdp_get_lockless() - > because of the "HOWEVER" below. PeterZ's comments in linux/pgtable.h > are well worth reading through. Yes exactly - that's one major thing of my confusion on using {ptep|pmdp}_get_lockless(). In irqoff ctx, AFAICT we can see a totally messed up pte/pmd with present bit set if extremely unlucky. E.g. it can race with something like "DONTNEED (contains tlbflush) then a POPULATE_WRITE" so we can have "present -> present" conversion of pte when reading, so we can read half pfn1 and then the other half pfn2. The other confusing thing on this _lockless trick on PAE is, I think it _might_ go wrong with devmap.. The problem is here we assumed even if high & low may not match, we still can rely on most pte/pmd checks are done only on low bits (except _none() check) to guarantee at least the checks are still atomic on low bits. But it seems to me it's not true anymore if with pmd_trans_huge() after devmap introduced, e.g.: static inline int pmd_trans_huge(pmd_t pmd) { return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE; } #define _PAGE_PSE (_AT(pteval_t, 1) << _PAGE_BIT_PSE) #define _PAGE_BIT_PSE 7 /* 4 MB (or 2MB) page */ #define _PAGE_DEVMAP (_AT(u64, 1) << _PAGE_BIT_DEVMAP) #define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4 #define _PAGE_BIT_SOFTW4 58 /* available for programmer */ So after devmap with CONFIG_PAE, pmd_trans_huge() checks more than low bits but also high bits. I didn't go further to check whether there can be any real issue but IIUC that's not expected when the low/high trick introduced (originally introduced in commit e585513b76f7b05d sololy for x86 PAE fast-gup only). > > You might question why I made these changes at all: some days > I question them too. Better though imperfect? Or deceptive? I think it's probably a separate topic to address in all cases, so I think this patch still make it slightly better on barrier() which I agree: Acked-by: Peter Xu <peterx@xxxxxxxxxx> Thanks, -- Peter Xu