On Tue, 8 Mar 2022 13:24:19 +0100 David Hildenbrand <david@xxxxxxxxxx> wrote: [...] > > From 1e51e8a93894f87c0a4d0e908391e0628ae56afe Mon Sep 17 00:00:00 2001 > From: David Hildenbrand <david@xxxxxxxxxx> > Date: Tue, 8 Mar 2022 12:51:26 +0100 > Subject: [PATCH] mm/gup: fix buffered I/O on s390x with pagefaults disabled > > On s390x, we actually need a pte_mkyoung() / pte_mkdirty() instead of > going via the page and leaving the PTE unmodified. E.g., if we only > mark the page accessed via mark_page_accessed() when doing a FOLL_TOUCH, > we'll miss to clear the HW invalid bit in the pte and subsequent accesses > via the MMU would still require a pagefault. > > Otherwise, buffered I/O will loop forever because it will keep stumling > over the set HW invalid bit, requiring a page fault. > > Reported-by: Andreas Gruenbacher <agruenba@xxxxxxxxxx> > Signed-off-by: David Hildenbrand <david@xxxxxxxxxx> > --- > mm/gup.c | 32 +++++++++++++++++++++++++------- > 1 file changed, 25 insertions(+), 7 deletions(-) > > diff --git a/mm/gup.c b/mm/gup.c > index a9d4d724aef7..de3311feb377 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -587,15 +587,33 @@ static struct page *follow_page_pte(struct vm_area_struct *vma, > } > } > if (flags & FOLL_TOUCH) { > - if ((flags & FOLL_WRITE) && > - !pte_dirty(pte) && !PageDirty(page)) > - set_page_dirty(page); > /* > - * pte_mkyoung() would be more correct here, but atomic care > - * is needed to avoid losing the dirty bit: it is easier to use > - * mark_page_accessed(). > + * We have to be careful with updating the PTE on architectures > + * that have a HW dirty bit: while updating the PTE we might > + * lose that bit again and we'd need an atomic update: it is > + * easier to leave the PTE untouched for these architectures. > + * > + * s390x doesn't have a hw referenced / dirty bit and e.g., sets > + * the hw invalid bit in pte_mkold(), to catch further > + * references. We have to update the PTE here to e.g., clear the > + * invalid bit; otherwise, callers that rely on not requiring > + * an MMU fault once GUP(FOLL_TOUCH) succeeded will loop forever > + * because the page won't actually be accessible via the MMU. > */ > - mark_page_accessed(page); > + if (IS_ENABLED(CONFIG_S390)) { > + pte = pte_mkyoung(pte); > + if (flags & FOLL_WRITE) > + pte = pte_mkdirty(pte); > + if (!pte_same(pte, *ptep)) { > + set_pte_at(vma->vm_mm, address, ptep, pte); > + update_mmu_cache(vma, address, ptep); > + } > + } else { > + if ((flags & FOLL_WRITE) && > + !pte_dirty(pte) && !PageDirty(page)) > + set_page_dirty(page); > + mark_page_accessed(page); > + } > } > if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) { > /* Do not mlock pte-mapped THP */ Thanks David, your analysis looks valid, at least it seems that you found a scenario where we would have HW invalid bit set due to pte_mkold() in ptep_clear_flush_young(), and still GUP would find and return that page, IIUC. I think pte handling should be similar to pmd handling in follow_trans_huge_pmd() -> touch_pmd(), or cow_user_page() (see comment on software "accessed" bits), which is more or less what your patch does. Some possible concerns: - set_page_dirty() would not be done any more for s390, is that intended and ok? - using set_pte_at() here seems a bit dangerous, as I'm not sure if this will always only operate on invalid PTEs. Using it on active valid PTEs could result in TLB issues because of missing flush. Also not sure about kvm impact. Using ptep_set_access_flags() seems safer, again similar to touch_pmd() and also cow_user_page(). Looking at cow_user_page(), I also wonder if the arch_faults_on_old_pte() logic could be used here. I must admit that I did not really understand the "losing the dirty bit" part of the comment, but it seems that we might need to not only check for arch_faults_on_old_pte(), but also for something like "arch_faults_for_dirty_pte". Last but not least, IIUC, this issue should affect all archs that return true on arch_faults_on_old_pte(). After all, the basic problem seems to be that a pagefault is required for PTEs marked as old, in combination with GUP still returning a valid page. So maybe this should not be restricted to IS_ENABLED(CONFIG_S390).