On 08.03.22 13:24, David Hildenbrand wrote:
> On 08.03.22 13:11, David Hildenbrand wrote:
>> On 08.03.22 09:37, David Hildenbrand wrote:
>>> On 08.03.22 09:21, David Hildenbrand wrote:
>>>> On 08.03.22 00:18, Linus Torvalds wrote:
>>>>> On Mon, Mar 7, 2022 at 2:52 PM Andreas Gruenbacher <agruenba@xxxxxxxxxx> wrote:
>>>>>>
>>>>>> After generic_file_read_iter() returns a short or empty read, we fault
>>>>>> in some pages with fault_in_iov_iter_writeable(). This succeeds, but
>>>>>> the next call to generic_file_read_iter() returns -EFAULT and we're
>>>>>> not making any progress.
>>>>>
>>>>> Since this is s390-specific, I get the very strong feeling that the
>>>>>
>>>>>   fault_in_iov_iter_writeable ->
>>>>>     fault_in_safe_writeable ->
>>>>>       __get_user_pages_locked ->
>>>>>         __get_user_pages
>>>>>
>>>>> path somehow successfully finds the page, despite it not being
>>>>> properly accessible in the page tables.
>>>>
>>>> As raised offline already, I suspect
>>>>
>>>> shrink_active_list()
>>>>  ->page_referenced()
>>>>   ->page_referenced_one()
>>>>    ->ptep_clear_flush_young_notify()
>>>>     ->ptep_clear_flush_young()
>>>>
>>>> which results on s390x in:
>>>>
>>>> static inline pte_t pte_mkold(pte_t pte)
>>>> {
>>>> 	pte_val(pte) &= ~_PAGE_YOUNG;
>>>> 	pte_val(pte) |= _PAGE_INVALID;
>>>> 	return pte;
>>>> }
>>>>
>>>> static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>> 					    unsigned long addr, pte_t *ptep)
>>>> {
>>>> 	pte_t pte = *ptep;
>>>>
>>>> 	pte = ptep_xchg_direct(vma->vm_mm, addr, ptep, pte_mkold(pte));
>>>> 	return pte_young(pte);
>>>> }
>>>>
>>>> _PAGE_INVALID is the actual HW bit, _PAGE_PRESENT is a
>>>> pure SW bit.
>>>> AFAIU, pte_present() still holds:
>>>>
>>>> static inline int pte_present(pte_t pte)
>>>> {
>>>> 	/* Bit pattern: (pte & 0x001) == 0x001 */
>>>> 	return (pte_val(pte) & _PAGE_PRESENT) != 0;
>>>> }
>>>>
>>>> pte_mkyoung() will revert that action:
>>>>
>>>> static inline pte_t pte_mkyoung(pte_t pte)
>>>> {
>>>> 	pte_val(pte) |= _PAGE_YOUNG;
>>>> 	if (pte_val(pte) & _PAGE_READ)
>>>> 		pte_val(pte) &= ~_PAGE_INVALID;
>>>> 	return pte;
>>>> }
>>>>
>>>> and pte_modify() will adjust it properly again:
>>>>
>>>> /*
>>>>  * The following pte modification functions only work if
>>>>  * pte_present() is true. Undefined behaviour if not..
>>>>  */
>>>> static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
>>>> {
>>>> 	pte_val(pte) &= _PAGE_CHG_MASK;
>>>> 	pte_val(pte) |= pgprot_val(newprot);
>>>> 	/*
>>>> 	 * newprot for PAGE_NONE, PAGE_RO, PAGE_RX, PAGE_RW and PAGE_RWX
>>>> 	 * has the invalid bit set, clear it again for readable, young pages
>>>> 	 */
>>>> 	if ((pte_val(pte) & _PAGE_YOUNG) && (pte_val(pte) & _PAGE_READ))
>>>> 		pte_val(pte) &= ~_PAGE_INVALID;
>>>> 	/*
>>>> 	 * newprot for PAGE_RO, PAGE_RX, PAGE_RW and PAGE_RWX has the page
>>>> 	 * protection bit set, clear it again for writable, dirty pages
>>>> 	 */
>>>> 	if ((pte_val(pte) & _PAGE_DIRTY) && (pte_val(pte) & _PAGE_WRITE))
>>>> 		pte_val(pte) &= ~_PAGE_PROTECT;
>>>> 	return pte;
>>>> }
>>>>
>>>> Which leaves me wondering if there is a way in GUP whereby
>>>> we would lookup that page and not clear _PAGE_INVALID,
>>>> resulting in GUP succeeding but faults via the MMU still
>>>> faulting on _PAGE_INVALID.
>>>
>>> follow_page_pte() has this piece of code:
>>>
>>> 	if (flags & FOLL_TOUCH) {
>>> 		if ((flags & FOLL_WRITE) &&
>>> 		    !pte_dirty(pte) && !PageDirty(page))
>>> 			set_page_dirty(page);
>>> 		/*
>>> 		 * pte_mkyoung() would be more correct here, but atomic care
>>> 		 * is needed to avoid losing the dirty bit: it is easier to use
>>> 		 * mark_page_accessed().
>>> 		 */
>>> 		mark_page_accessed(page);
>>> 	}
>>>
>>> Which at least to me suggests that, although the page is marked accessed
>>> and GUP succeeds, the PTE might still have _PAGE_INVALID set after GUP
>>> succeeded.
>>>
>>> On s390x, there is no HW dirty bit, so we might just be able to do a
>>> proper pte_mkyoung() here instead of the mark_page_accessed().
>>
>> Something hacky like this should be able to show whether what I suspect
>> is the case. It compiles, but I didn't actually test it.
>
> That would be the alternative that also takes the mkdirty into account:
>
>
> From 1e51e8a93894f87c0a4d0e908391e0628ae56afe Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <david@xxxxxxxxxx>
> Date: Tue, 8 Mar 2022 12:51:26 +0100
> Subject: [PATCH] mm/gup: fix buffered I/O on s390x with pagefaults disabled
>
> On s390x, we actually need a pte_mkyoung() / pte_mkdirty() instead of
> going via the page and leaving the PTE unmodified. E.g., if we only
> mark the page accessed via mark_page_accessed() when doing a FOLL_TOUCH,
> we'll fail to clear the HW invalid bit in the PTE, and subsequent
> accesses via the MMU would still require a pagefault.
>
> Otherwise, buffered I/O will loop forever because it will keep stumbling
> over the set HW invalid bit, requiring a page fault.
>
> Reported-by: Andreas Gruenbacher <agruenba@xxxxxxxxxx>
> Signed-off-by: David Hildenbrand <david@xxxxxxxxxx>
> ---
>  mm/gup.c | 32 +++++++++++++++++++++++++-------
>  1 file changed, 25 insertions(+), 7 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index a9d4d724aef7..de3311feb377 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -587,15 +587,33 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
>  		}
>  	}
>  	if (flags & FOLL_TOUCH) {
> -		if ((flags & FOLL_WRITE) &&
> -		    !pte_dirty(pte) && !PageDirty(page))
> -			set_page_dirty(page);
>  		/*
> -		 * pte_mkyoung() would be more correct here, but atomic care
> -		 * is needed to avoid losing the dirty bit: it is easier to use
> -		 * mark_page_accessed().
> +		 * We have to be careful with updating the PTE on architectures
> +		 * that have a HW dirty bit: while updating the PTE we might
> +		 * lose that bit again and we'd need an atomic update: it is
> +		 * easier to leave the PTE untouched for these architectures.
> +		 *
> +		 * s390x doesn't have a hw referenced / dirty bit and e.g., sets
> +		 * the hw invalid bit in pte_mkold(), to catch further
> +		 * references. We have to update the PTE here to e.g., clear the
> +		 * invalid bit; otherwise, callers that rely on not requiring
> +		 * an MMU fault once GUP(FOLL_TOUCH) succeeded will loop forever
> +		 * because the page won't actually be accessible via the MMU.
>  		 */
> -		mark_page_accessed(page);
> +		if (IS_ENABLED(CONFIG_S390)) {
> +			pte = pte_mkyoung(pte);
> +			if (flags & FOLL_WRITE)
> +				pte = pte_mkdirty(pte);
> +			if (!pte_same(pte, *ptep)) {
> +				set_pte_at(vma->vm_mm, address, ptep, pte);
> +				update_mmu_cache(vma, address, ptep);

I just reproduced without this fix and then tried to reproduce with this
fix (however, replacing pte_same() + set_pte_at() by a
ptep_set_access_flags()). Seems to do the trick for me.

I'll figure out the best way to handle IS_ENABLED(CONFIG_S390) before
posting an official fix.

-- 
Thanks,

David / dhildenb