On Thu, May 19, 2022 at 02:06:25PM -0700, Zach O'Keefe wrote:
> Thanks again for the review, Peter.
>
> On Wed, May 18, 2022 at 11:41 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >
> > On Wed, May 04, 2022 at 02:44:25PM -0700, Zach O'Keefe wrote:
> > > +static int find_pmd_or_thp_or_none(struct mm_struct *mm,
> > > +				   unsigned long address,
> > > +				   pmd_t **pmd)
> > > +{
> > > +	pmd_t pmde;
> > > +
> > > +	*pmd = mm_find_pmd_raw(mm, address);
> > > +	if (!*pmd)
> > > +		return SCAN_PMD_NULL;
> > > +
> > > +	pmde = pmd_read_atomic(*pmd);
> >
> > It seems to be correct on using the atomic fetcher here.  Though irrelevant
> > to this patchset.. does it also mean that we miss that on mm_find_pmd()?  I
> > meant a separate fix like this one:
> >
> > ---8<---
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index 69416072b1a6..61309718640f 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -785,7 +785,7 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> >          * without holding anon_vma lock for write.  So when looking for a
> >          * genuine pmde (in which to find pte), test present and !THP together.
> >          */
> > -       pmde = *pmd;
> > +       pmde = pmd_read_atomic(pmd);
> >         barrier();
> >         if (!pmd_present(pmde) || pmd_trans_huge(pmde))
> >                 pmd = NULL;
> > ---8<---
> >
> > As otherwise it seems it's also prone to PAE race conditions when reading
> > pmd out, but I could be missing something.
> >
>
> This is a good question. I took some time to look into this, but it's
> very complicated and unfortunately I couldn't reach a conclusion in
> the time I allotted to myself. My working (unverified) assumption is
> that mm_find_pmd() is called in places where it doesn't care if the
> pmd isn't read atomically.

Frankly I am still not sure on that.
Say the immediate check in mm_find_pmd() is pmd_trans_huge() afterwards,
and AFAICT that is:

static inline int pmd_trans_huge(pmd_t pmd)
{
	return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
}

Where we have:

#define _PAGE_BIT_PSE		7	/* 4 MB (or 2MB) page */
#define _PAGE_BIT_DEVMAP	_PAGE_BIT_SOFTW4
#define _PAGE_BIT_SOFTW4	58	/* available for programmer */

I think it means we're checking both the lower (bit 7) and upper (bit 58) of
the 32bits values of the whole 64bit pmd on PAE, and I can't explain how
that'll not require atomicity..

pmd_read_atomic() plays with a magic that we'll either:

  - Get atomic results for none or pgtable pmds (which is stable), or
  - Can get not-atomic result for thps (unstable)

However here we're checking exactly for thps.. I think we don't need a
stable PFN but we'll need stable _PAGE_PSE|_PAGE_DEVMAP bits.

It should not be like that when pmd_read_atomic() was introduced because
that is 2012 (in 2012, 26c191788f18129) and AFAIU DEVMAP came in 2016+.
IOW I had a feeling that pmd_read_atomic() used to be bug-free but after
devmap it's harder to be justified at least, it seems to me.

So far I don't see a clean way to fix it but use atomic64_read() because
pmd_read_atomic() isn't atomic for thp checkings using pmd_trans_huge()
afaict, however the last piece of puzzle is atomic64_read() is failing
somehow on Xen and that's where we come up with the pmd_read_atomic() trick
to read lower then upper 32bits:

  commit e4eed03fd06578571c01d4f1478c874bb432c815
  Author: Andrea Arcangeli <aarcange@xxxxxxxxxx>
  Date:   Wed Jun 20 12:52:57 2012 -0700

    thp: avoid atomic64_read in pmd_read_atomic for 32bit PAE

    In the x86 32bit PAE CONFIG_TRANSPARENT_HUGEPAGE=y case while holding the
    mmap_sem for reading, cmpxchg8b cannot be used to read pmd contents under
    Xen.

I believe Andrea has a solid reason to do so at that time, but I'm still
not sure why Xen won't work with atomic64_read() since it's not mentioned
in the commit..
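
To make the concern above concrete, here is a minimal user-space sketch
(not kernel code, and unverified against real hardware).  It only models the
bit arithmetic: a 64bit pmd value is assembled from a lower half taken from
one value and an upper half taken from another, as could happen if the entry
changes between the two 32bit loads.  All sketch_* / SKETCH_* names are
hypothetical and exist only for this illustration; the bit positions are the
ones quoted above.  The model also ignores the early return that
pmd_read_atomic() does when the lower half reads as zero.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions as quoted above; the SKETCH_ prefix marks them as local to this example. */
#define SKETCH_PAGE_BIT_PSE	7
#define SKETCH_PAGE_BIT_DEVMAP	58
#define SKETCH_PAGE_PSE		(1ULL << SKETCH_PAGE_BIT_PSE)
#define SKETCH_PAGE_DEVMAP	(1ULL << SKETCH_PAGE_BIT_DEVMAP)

/* Same test as the quoted pmd_trans_huge(), applied to a plain 64bit value. */
static int sketch_trans_huge(uint64_t pmdval)
{
	return (pmdval & (SKETCH_PAGE_PSE | SKETCH_PAGE_DEVMAP)) == SKETCH_PAGE_PSE;
}

/*
 * Hypothetical model of a torn read: the lower 32 bits come from the value
 * seen before a concurrent update, the upper 32 bits from the value after it.
 */
static uint64_t sketch_torn_read(uint64_t before, uint64_t after)
{
	uint32_t low = (uint32_t)before;
	uint32_t high = (uint32_t)(after >> 32);

	return ((uint64_t)high << 32) | low;
}
```

In this model, a value with both PSE and DEVMAP set that is concurrently
cleared to zero can be observed with PSE set (lower half, old) and DEVMAP
clear (upper half, new), so sketch_trans_huge() reports true even though
neither the old nor the new value would.  Whether such an interleaving is
actually reachable in the kernel is exactly the open question here.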
Please go ahead with your patchset without being blocked by this, because
AFAICT pmd_read_atomic() is already better than pmde=*pmdp anyway, and it
seems a more general problem even if it existed or I could also have missed
something.

> If so, does that also mean MADV_COLLAPSE is
> safe? I'm not sure. These i386 PAE + THP racing issues were most
> recently discussed when considering if READ_ONCE() should be used
> instead of pmd_read_atomic() [1].
>
> [1] https://lore.kernel.org/linux-mm/594c1f0-d396-5346-1f36-606872cddb18@xxxxxxxxxx/
>
> > > +
> > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > +	/* See comments in pmd_none_or_trans_huge_or_clear_bad() */
> > > +	barrier();
> > > +#endif
> > > +	if (!pmd_present(pmde))
> > > +		return SCAN_PMD_NULL;
> > > +	if (pmd_trans_huge(pmde))
> > > +		return SCAN_PMD_MAPPED;
> >
> > Would it be safer to check pmd_bad()?  I think not all mm pmd paths check
> > that, frankly I don't really know what's the major cause of a bad pmd
> > (either software bugs or corrupted mem), but just to check with you,
> > because potentially a bad pmd can be read as SCAN_SUCCEED and go through.
> >
>
> Likewise, I'm not sure what the cause of "bad pmds" is.
>
> Do you mean to check pmd_bad() instead of pmd_trans_huge()? I.e. b/c a
> pmd-mapped thp counts as "bad" (at least on x86 since PSE set) or do
> you mean to additionally check pmd_bad() after the pmd_trans_huge()
> check?
>
> If it's the former, I'd say we can't claim !pmd_bad() == memory
> already backed by thps / our job here is done.
>
> If it's the latter, I don't see it hurting much (but I can't argue
> intelligently about why it's needed) and can include the check in v6.

The latter.  I don't think it's strongly necessary, but looks good to have
if you also agree.

Thanks,

-- 
Peter Xu