On Thu, Mar 12, 2020 at 05:02:18PM +0000, Steven Price wrote:
> On 12/03/2020 16:37, Jason Gunthorpe wrote:
> > On Thu, Mar 12, 2020 at 04:16:33PM +0000, Steven Price wrote:
> > > > Actually, while you are looking at this, do you think we should be
> > > > adding at least READ_ONCE in the pagewalk.c walk_* functions? The
> > > > multiple references of pmd, pud, etc without locking seems sketchy to
> > > > me.
> > >
> > > I agree it seems worrying. I'm not entirely sure whether the holding of
> > > mmap_sem is sufficient,
> >
> > I looked at this question, and at least for PMD, mmap_sem is not
> > sufficient. I didn't easily figure it out for the other ones
> >
> > I'm guessing if PMD is not safe then none of them are.
> >
> > > this isn't something that I changed so I've just
> > > been hoping that it's sufficient since it seems to have been working
> > > (whether that's by chance because the compiler didn't generate multiple
> > > reads I've no idea). For walking the kernel's page tables the lack of
> > > READ_ONCE is also not great, but at least for PTDUMP we don't care too much
> > > about accuracy and it should be crash proof because there's no RCU grace
> > > period. And again the code I was replacing didn't have any special
> > > protection.
> > >
> > > I can't see any harm in updating the code to include READ_ONCE and I'm happy
> > > to review a patch.
> >
> > The reason I ask is because hmm's walkers often have this pattern
> > where they get the pointer and then de-ref it (again) then
> > immediately have to recheck the 'again' conditions of the walker
> > itself because the re-read may have given a different value.
> >
> > Having the walker deref the pointer and pass the value into the ops
> > for use rather than repeatedly de-refing an unlocked value seems like
> > a much safer design to me.
>
> Yeah that sounds like a good idea.

I'm looking at this now..

The PUD is also changing under the read mmap_sem - and I was able to
think up some race-condition bugs related to this. Have some patches
now..

However, I haven't been able to understand why walk_page_range()
doesn't check pud_present() or pmd_present() before calling
pmd_offset_map() or pte_offset_map().

As far as I can see a non-present entry has a swap entry encoded in
it, and thus it seems like it is a bad idea to pass a non-present
entry to the two map functions. I think those should only be called
when the entry points to the next level in the page table (so there
is something to map?)

I see you added !present tests for the !vma case, but why only there?

Is this a bug? Do you know how it works?

Is it something that was missed when people added non-present PUDs and
PMDs?

Thanks,
Jason
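
As a rough illustration of the two points discussed in this mail (read
each entry exactly once and hand the value to the ops, and only descend
when that value is present), here is a minimal userspace sketch. It is
not kernel code and not a proposed patch: every toy_* name, the entry
encoding and the table sizes are made up for the example, and
toy_read_once() is only a stand-in for READ_ONCE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy model of a two-level table; NOT the kernel's layout. */
#define TOY_PRESENT      0x1ULL
#define TOY_ENTRIES      4
#define TOY_LOWER_TABLES 2

typedef uint64_t toy_entry_t;

static toy_entry_t toy_lower[TOY_LOWER_TABLES][TOY_ENTRIES];

/* Stand-in for READ_ONCE(): a single volatile load of the entry. */
static toy_entry_t toy_read_once(const toy_entry_t *p)
{
	return *(const volatile toy_entry_t *)p;
}

static int toy_entry_present(toy_entry_t e)
{
	return (e & TOY_PRESENT) != 0;
}

/*
 * Only meaningful for a present entry: the bits above bit 0 select a
 * lower table. A non-present entry carries an unrelated payload there,
 * which is why it must never reach this function.
 */
static const toy_entry_t *toy_lower_table(toy_entry_t e)
{
	assert(toy_entry_present(e));
	assert((e >> 1) < TOY_LOWER_TABLES);
	return toy_lower[e >> 1];
}

struct toy_walk_ops {
	/* Gets the snapshot value, never a pointer to the live entry. */
	void (*leaf_entry)(toy_entry_t val, void *priv);
};

/* Returns the number of upper entries skipped as non-present. */
static int toy_walk(const toy_entry_t *upper,
		    const struct toy_walk_ops *ops, void *priv)
{
	int skipped = 0;

	for (size_t i = 0; i < TOY_ENTRIES; i++) {
		/* One read; every later decision uses this local copy. */
		toy_entry_t val = toy_read_once(&upper[i]);

		if (!toy_entry_present(val)) {
			skipped++;	/* empty or swap-like: nothing to map */
			continue;
		}

		const toy_entry_t *lower = toy_lower_table(val);

		for (size_t j = 0; j < TOY_ENTRIES; j++) {
			toy_entry_t leaf = toy_read_once(&lower[j]);

			if (toy_entry_present(leaf))
				ops->leaf_entry(leaf, priv);
		}
	}
	return skipped;
}

static void toy_count_leaf(toy_entry_t val, void *priv)
{
	(void)val;
	(*(int *)priv)++;
}

/* Builds a small made-up layout and returns the number of leaves visited. */
int toy_demo(int *skipped)
{
	toy_entry_t upper[TOY_ENTRIES] = {
		(0ULL << 1) | TOY_PRESENT,	/* present, lower table 0 */
		0,				/* empty */
		0xABCD0,			/* non-present, swap-like payload */
		(1ULL << 1) | TOY_PRESENT,	/* present, lower table 1 */
	};
	struct toy_walk_ops ops = { .leaf_entry = toy_count_leaf };
	int visited = 0;

	memset(toy_lower, 0, sizeof(toy_lower));
	toy_lower[0][0] = 0x1001;	/* present */
	toy_lower[0][2] = 0x2001;	/* present */
	toy_lower[0][3] = 0x500;	/* non-present payload */
	toy_lower[1][1] = 0x3001;	/* present */

	*skipped = toy_walk(upper, &ops, &visited);
	return visited;
}
```

With this made-up layout, toy_demo() visits 3 leaves and skips 2 upper
entries. If the present check were dropped, the 0xABCD0 entry would be
decoded as a table index and trip the bounds assert, which is the toy
equivalent of handing a non-present entry to a map function.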