On Fri, Dec 20, 2024 at 08:06:48PM +0100, David Hildenbrand wrote: > On 20.12.24 20:01, David Hildenbrand wrote: > > On 17.12.24 06:12, Alistair Popple wrote: > > > In preparation for using insert_page() for DAX, enhance > > > insert_page_into_pte_locked() to handle establishing writable > > > mappings. Recall that DAX returns VM_FAULT_NOPAGE after installing a > > > PTE which bypasses the typical set_pte_range() in finish_fault. > > > > > > Signed-off-by: Alistair Popple <apopple@xxxxxxxxxx> > > > Suggested-by: Dan Williams <dan.j.williams@xxxxxxxxx> > > > > > > --- > > > > > > Changes since v2: > > > > > > - New patch split out from "mm/memory: Add dax_insert_pfn" > > > --- > > > mm/memory.c | 45 +++++++++++++++++++++++++++++++++++++-------- > > > 1 file changed, 37 insertions(+), 8 deletions(-) > > > > > > diff --git a/mm/memory.c b/mm/memory.c > > > index 06bb29e..cd82952 100644 > > > --- a/mm/memory.c > > > +++ b/mm/memory.c > > > @@ -2126,19 +2126,47 @@ static int validate_page_before_insert(struct vm_area_struct *vma, > > > } > > > static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte, > > > - unsigned long addr, struct page *page, pgprot_t prot) > > > + unsigned long addr, struct page *page, > > > + pgprot_t prot, bool mkwrite) > > > { > > > struct folio *folio = page_folio(page); > > > + pte_t entry = ptep_get(pte); > > > pte_t pteval; > > > - if (!pte_none(ptep_get(pte))) > > > - return -EBUSY; > > > + if (!pte_none(entry)) { > > > + if (!mkwrite) > > > + return -EBUSY; > > > + > > > + /* > > > + * For read faults on private mappings the PFN passed in may not > > > + * match the PFN we have mapped if the mapped PFN is a writeable > > > + * COW page. In the mkwrite case we are creating a writable PTE > > > + * for a shared mapping and we expect the PFNs to match. If they > > > + * don't match, we are likely racing with block allocation and > > > + * mapping invalidation so just skip the update. > > > + */ > > > > Would it make sense to instead have here > > > > /* See insert_pfn(). */ > > > > But ... > > > > > + if (pte_pfn(entry) != page_to_pfn(page)) { > > > + WARN_ON_ONCE(!is_zero_pfn(pte_pfn(entry))); > > > + return -EFAULT; > > > + } > > > + entry = maybe_mkwrite(entry, vma); > > > + entry = pte_mkyoung(entry); > > > + if (ptep_set_access_flags(vma, addr, pte, entry, 1)) > > > + update_mmu_cache(vma, addr, pte); > > > > ... I am not sure if we want the above at all. Someone inserted a page, > > which is refcounted + mapcounted already. > > > > Now you ignore that and do like the second insertion "worked" ? > > > > No, that feels wrong, I suspect you will run into refcount+mapcount issues. > > > > If there is already something, inserting must fail IMHO. If you want to > > change something to upgrade write permissions, then a different > > interface should be used. > > Ah, now I realize that the early exit saves you because we won't adjust the > refcount +mapcount. Right. > I still wonder if that really belongs in here, I would prefer to not play > such tricks to upgrade write permissions if possible. As you have pointed out this was all inspired (ie. mostly copied) from the existing insert_pfn() implementation which is used from vmf_insert_mixed{_mkwrite}(). I agree a different interface to upgrade permissions would be nice. However it's tricky because in general callers of these functions (eg. FS DAX) aren't aware if the page is already mapped by a PTE/PMD. They only know a fault has occured and the faulting permissions. This wouldn't be impossible to fix - the mm does provide vm_ops->page_mkwrite() for permission upgrades. The difficulty is that most filesystems that support FS DAX (ie. ext4, XFS) don't treat a vm_ops->page_mkwrite() call any differently from a vm_ops->fault() call due to write fault. Therefore the FS DAX code is unaware of whether or not this is a permission upgrade or initial writeable mapping of the page in the VMA. A further issue in there is currently no vm_ops->huge_mkwrite() callback. Obviously this could all be plumbed through the MM/FS layers, but that would require a separate patch series. Given the current implementation has no issues beyond the cosmetic I'd rather not delay this series any longer, especially as the cosmetic defect is largely pre-existing (vmf_insert_mixed{_mkwrite}() could have equally had a separate upgrade interface). > -- > > Cheers, > > David / dhildenb >