On 9/26/19 10:16 PM, Linus Torvalds wrote:
> On Thu, Sep 26, 2019 at 1:09 PM Thomas Hellström (VMware)
> <thomas_os@xxxxxxxxxxxx> wrote:
>> That said, if people are OK with me modifying the assert in
>> pud_trans_huge_lock() and make __walk_page_range non-static, it should
>> probably be possible to make it work, yes.
> I don't think you need to modify that assert at all.
>
> That thing only exists when there's a "pud_entry" op in the walker,
> and then you absolutely need to have that mmap_lock.
>
> As far as I can tell, you fundamentally only ever work on a pte level
> in your address space walker already and actually have a WARN_ON() on
> the pud_huge thing, so no pud entry can possibly apply.

Well, we're working on supporting huge puds and pmds in the graphics
VMAs, although in the write-notify cases we're looking at here, we would
probably want to split them down to PTE level. But if we would want to
extend that or if we would want to make this interface general, we'd
probably want to support also a pud_entry callback.

Looking at zap_pud_range() which when called from unmap_mapping_pages()
uses identical locking (no mmap_sem), it seems we should be able to get
away with i_mmap_lock(), making sure the whole page table doesn't
disappear under us. So it's not clear to me why the mmap_sem is strictly
needed here. Better to sort those restrictions out now rather than when
huge entries start appearing.

> So no, the assert in pud_trans_huge_lock() does not seem to be a
> reason not to just use the existing page table walkers.
>
> And once you get rid of the walking, what is left? Just the "iterate
> over the inode mappings" part. Which could just be done in
> mm/pagewalk.c, and then you don't even need to remove the static.
>
> So making it be just another walking in pagewalk.c would seem to be
> the simplest model.
>
> Call it "walk_page_mapping()". And talk extensively about how the
> locking differs a lot from the usual "walk_page_vma()" things.

Sure. Can do that.
Thanks,
Thomas
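
For readers following the thread, the "iterate over the inode mappings" part discussed above can be illustrated with a small userspace toy model. This is only a sketch under stated assumptions, not the kernel code and not the patch that came out of this discussion: struct toy_vma, toy_walk_page_mapping(), toy_count_pages() and TOY_PAGE_SHIFT are all hypothetical names, the VMAs live in a plain array that is scanned linearly, and no locking is modelled at all. It shows only the arithmetic of clamping a file page range to each mapping and converting it to a virtual address range that a page table walker could then be pointed at.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Hypothetical, simplified stand-in for a VMA that maps part of a file. */
struct toy_vma {
    unsigned long vm_start; /* first virtual address of the mapping */
    unsigned long vm_end;   /* one past the last virtual address */
    unsigned long vm_pgoff; /* file page offset mapped at vm_start */
};

/* Called once per mapping with the clamped address range [start, end). */
typedef int (*toy_range_fn)(const struct toy_vma *vma,
                            unsigned long start, unsigned long end,
                            void *priv);

/*
 * For every mapping that overlaps file pages [first_index, first_index + nr),
 * compute the matching virtual address range and hand it to fn.
 * Stops at, and returns, the first nonzero value returned by fn.
 */
int toy_walk_page_mapping(const struct toy_vma *vmas, size_t nr_vmas,
                          unsigned long first_index, unsigned long nr,
                          toy_range_fn fn, void *priv)
{
    unsigned long last_index = first_index + nr; /* exclusive */
    size_t i;

    assert(fn != NULL);

    for (i = 0; i < nr_vmas; i++) {
        const struct toy_vma *vma = &vmas[i];
        unsigned long vma_first = vma->vm_pgoff;
        unsigned long vma_last = vma_first +
            ((vma->vm_end - vma->vm_start) >> TOY_PAGE_SHIFT);
        /* Clamp the requested file range to what this mapping covers. */
        unsigned long lo = first_index > vma_first ? first_index : vma_first;
        unsigned long hi = last_index < vma_last ? last_index : vma_last;
        unsigned long start, end;
        int err;

        if (lo >= hi)
            continue; /* no overlap with this mapping */

        /* Translate file page offsets back to virtual addresses. */
        start = vma->vm_start + ((lo - vma_first) << TOY_PAGE_SHIFT);
        end = vma->vm_start + ((hi - vma_first) << TOY_PAGE_SHIFT);

        err = fn(vma, start, end, priv);
        if (err)
            return err;
    }
    return 0;
}

/* Example callback: add the number of pages in [start, end) to *priv. */
int toy_count_pages(const struct toy_vma *vma,
                    unsigned long start, unsigned long end, void *priv)
{
    (void)vma;
    *(unsigned long *)priv += (end - start) >> TOY_PAGE_SHIFT;
    return 0;
}
```

In this toy, passing toy_count_pages as the callback counts how many mapped pages fall inside the requested file range across all mappings. An in-kernel version would presumably find the overlapping VMAs through the address_space rather than a linear scan and would rely on the i_mmap locking discussed above instead of mmap_sem, which is exactly the difference the thread asks to have documented.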