On 12/03/2020 16:37, Jason Gunthorpe wrote:
On Thu, Mar 12, 2020 at 04:16:33PM +0000, Steven Price wrote:
Actually, while you are looking at this, do you think we should be
adding at least READ_ONCE in the pagewalk.c walk_* functions? The
multiple references of pmd, pud, etc without locking seems sketchy to
me.
I agree it seems worrying. I'm not entirely sure whether the holding of
mmap_sem is sufficient,
I looked at this question, and at least for PMD, mmap_sem is not
sufficient. I didn't easily figure it out for the other ones.
I'm guessing if PMD is not safe then none of them are.
this isn't something that I changed so I've just
been hoping that it's sufficient since it seems to have been working
(whether that's by chance because the compiler didn't generate multiple
reads I've no idea). For walking the kernel's page tables the lack of
READ_ONCE is also not great, but at least for PTDUMP we don't care too much
about accuracy and it should be crash proof because there's no RCU grace
period. And again the code I was replacing didn't have any special
protection.
I can't see any harm in updating the code to include READ_ONCE and I'm happy
to review a patch.
The reason I ask is because hmm's walkers often have this pattern
where they get the pointer and then de-ref it (again) then
immediately have to recheck the 'again' conditions of the walker
itself because the re-read may have given a different value.
Having the walker deref the pointer and pass the value into the ops
for use rather than repeatedly de-refing an unlocked value seems like
a much safer design to me.
Yeah that sounds like a good idea.
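
To make the "read once, pass the value" design concrete, here is a rough
userspace sketch, not kernel code. The names pmd_val_t, read_once_pmd(),
walk_ops_sketch, walk_pmd_range_sketch() and the flag bits are all made up
for illustration; in the kernel the single read would be READ_ONCE() on the
real pmd_t, and the real struct mm_walk_ops callbacks take a pointer today.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t pmd_val_t;            /* hypothetical stand-in for pmd_t */

#define PMD_PRESENT 0x1ULL             /* hypothetical flag bits */
#define PMD_HUGE    0x2ULL

/* Userspace stand-in for READ_ONCE(): the volatile access forces exactly
 * one load, so the compiler cannot re-read the entry behind our back. */
static inline pmd_val_t read_once_pmd(const pmd_val_t *p)
{
    return *(const volatile pmd_val_t *)p;
}

struct walk_ops_sketch {
    /* The callback gets the snapshot by value, never the table pointer,
     * so it cannot accidentally de-ref the unlocked entry again. */
    int (*pmd_entry)(pmd_val_t pmd, unsigned long addr, void *priv);
};

static int walk_pmd_range_sketch(pmd_val_t *table, size_t nr,
                                 unsigned long start, unsigned long step,
                                 const struct walk_ops_sketch *ops,
                                 void *priv)
{
    for (size_t i = 0; i < nr; i++) {
        /* Single read; every later decision uses this local copy. */
        pmd_val_t pmd = read_once_pmd(&table[i]);

        if (!(pmd & PMD_PRESENT))
            continue;

        int ret = ops->pmd_entry(pmd, start + i * step, priv);
        if (ret)
            return ret;
    }
    return 0;
}

/* Example callback: count entries that are marked huge. */
static int count_huge(pmd_val_t pmd, unsigned long addr, void *priv)
{
    (void)addr;
    if (pmd & PMD_HUGE)
        ++*(size_t *)priv;
    return 0;
}
```

With this shape the walker and the callback always agree on what the entry
was, so the callback has no need to recheck the walker's conditions after a
second read.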
If this also implicitly relies on an RCU grace period then it is also
missing RCU locking...
True - I'm not 100% sure in what situations a page table entry can be
freed. Anshuman has added locking to deal with memory hotplug[1]. I
believe this is sufficient.
[1] bf2b59f60ee1 ("arm64/mm: Hold memory hotplug lock while walking for
kernel page table dump")
I also didn't quite understand why walk_pte_range() skipped locking
the pte in the no_vma case - I don't get why vma would be related to
locking here.
The no_vma case is for walking the kernel's page tables and they may
have entries that are not backed by struct page, so there isn't
(reliably) a PTE lock to take.
I also saw that hmm open coded the pte walk, presumably for
performance, so I was thinking of adding some kind of pte_range()
callback to avoid the expensive indirect function call per pte, but
hmm also can't have the pmd locked...
Yeah the callback per PTE is a bit heavy because of the indirect
function call. I'm not sure how to optimise it beyond open coding at the
PMD level. One option would be to provide helper functions to make it a
bit more generic.
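
As a starting point for discussion only, one possible shape is sketched
below in plain userspace C. Everything here is an assumption: the
pte_range member, range_ops_sketch, walk_pte_table_sketch() and pte_val_t
are hypothetical names, and locking is deliberately not modelled. The idea
is just that the walker makes one indirect call per PTE table when a range
callback is provided, and falls back to the per-entry callback otherwise.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t pte_val_t;            /* hypothetical stand-in for pte_t */

#define PTE_PRESENT 0x1ULL             /* hypothetical flag bit */

struct range_ops_sketch {
    /* Called once per entry: one indirect call per PTE. */
    int (*pte_entry)(pte_val_t pte, unsigned long addr, void *priv);
    /* Called once per table: the callee loops over the entries itself. */
    int (*pte_range)(const pte_val_t *ptes, size_t nr,
                     unsigned long addr, void *priv);
};

static int walk_pte_table_sketch(const pte_val_t *ptes, size_t nr,
                                 unsigned long addr, unsigned long page_size,
                                 const struct range_ops_sketch *ops,
                                 void *priv)
{
    /* Prefer the batch callback: a single indirect call for the table. */
    if (ops->pte_range)
        return ops->pte_range(ptes, nr, addr, priv);

    if (!ops->pte_entry)
        return 0;

    for (size_t i = 0; i < nr; i++) {
        /* Read each entry once and hand the value to the callback. */
        pte_val_t pte = *(const volatile pte_val_t *)&ptes[i];
        int ret = ops->pte_entry(pte, addr + i * page_size, priv);
        if (ret)
            return ret;
    }
    return 0;
}

/* Per-entry example: count present entries one call at a time. */
static int count_present_entry(pte_val_t pte, unsigned long addr, void *priv)
{
    (void)addr;
    if (pte & PTE_PRESENT)
        ++*(size_t *)priv;
    return 0;
}

/* Batch example: same counting, but the loop lives in the callback. */
static int count_present_range(const pte_val_t *ptes, size_t nr,
                               unsigned long addr, void *priv)
{
    (void)addr;
    for (size_t i = 0; i < nr; i++) {
        pte_val_t pte = *(const volatile pte_val_t *)&ptes[i];
        if (pte & PTE_PRESENT)
            ++*(size_t *)priv;
    }
    return 0;
}
```

This leaves open who takes (or skips) the PTE lock around the range call.
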
Do you have an idea of what pte_range() would look like?
Steve