On 08.03.22 13:11, David Hildenbrand wrote:
> On 08.03.22 09:37, David Hildenbrand wrote:
>> On 08.03.22 09:21, David Hildenbrand wrote:
>>> On 08.03.22 00:18, Linus Torvalds wrote:
>>>> On Mon, Mar 7, 2022 at 2:52 PM Andreas Gruenbacher <agruenba@xxxxxxxxxx> wrote:
>>>>>
>>>>> After generic_file_read_iter() returns a short or empty read, we fault
>>>>> in some pages with fault_in_iov_iter_writeable(). This succeeds, but
>>>>> the next call to generic_file_read_iter() returns -EFAULT and we're
>>>>> not making any progress.
>>>>
>>>> Since this is s390-specific, I get the very strong feeling that the
>>>>
>>>>   fault_in_iov_iter_writeable ->
>>>>     fault_in_safe_writeable ->
>>>>       __get_user_pages_locked ->
>>>>         __get_user_pages
>>>>
>>>> path somehow successfully finds the page, despite it not being
>>>> properly accessible in the page tables.
>>>
>>> As raised offline already, I suspect
>>>
>>> shrink_active_list()
>>>  ->page_referenced()
>>>   ->page_referenced_one()
>>>    ->ptep_clear_flush_young_notify()
>>>     ->ptep_clear_flush_young()
>>>
>>> which results on s390x in:
>>>
>>> static inline pte_t pte_mkold(pte_t pte)
>>> {
>>> 	pte_val(pte) &= ~_PAGE_YOUNG;
>>> 	pte_val(pte) |= _PAGE_INVALID;
>>> 	return pte;
>>> }
>>>
>>> static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>> 					    unsigned long addr, pte_t *ptep)
>>> {
>>> 	pte_t pte = *ptep;
>>>
>>> 	pte = ptep_xchg_direct(vma->vm_mm, addr, ptep, pte_mkold(pte));
>>> 	return pte_young(pte);
>>> }
>>>
>>> _PAGE_INVALID is the actual HW bit, _PAGE_PRESENT is a pure SW bit.
>>> AFAIU, pte_present() still holds:
>>>
>>> static inline int pte_present(pte_t pte)
>>> {
>>> 	/* Bit pattern: (pte & 0x001) == 0x001 */
>>> 	return (pte_val(pte) & _PAGE_PRESENT) != 0;
>>> }
>>>
>>> pte_mkyoung() will revert that action:
>>>
>>> static inline pte_t pte_mkyoung(pte_t pte)
>>> {
>>> 	pte_val(pte) |= _PAGE_YOUNG;
>>> 	if (pte_val(pte) & _PAGE_READ)
>>> 		pte_val(pte) &= ~_PAGE_INVALID;
>>> 	return pte;
>>> }
>>>
>>> and pte_modify() will adjust it properly again:
>>>
>>> /*
>>>  * The following pte modification functions only work if
>>>  * pte_present() is true. Undefined behaviour if not..
>>>  */
>>> static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
>>> {
>>> 	pte_val(pte) &= _PAGE_CHG_MASK;
>>> 	pte_val(pte) |= pgprot_val(newprot);
>>> 	/*
>>> 	 * newprot for PAGE_NONE, PAGE_RO, PAGE_RX, PAGE_RW and PAGE_RWX
>>> 	 * has the invalid bit set, clear it again for readable, young pages
>>> 	 */
>>> 	if ((pte_val(pte) & _PAGE_YOUNG) && (pte_val(pte) & _PAGE_READ))
>>> 		pte_val(pte) &= ~_PAGE_INVALID;
>>> 	/*
>>> 	 * newprot for PAGE_RO, PAGE_RX, PAGE_RW and PAGE_RWX has the page
>>> 	 * protection bit set, clear it again for writable, dirty pages
>>> 	 */
>>> 	if ((pte_val(pte) & _PAGE_DIRTY) && (pte_val(pte) & _PAGE_WRITE))
>>> 		pte_val(pte) &= ~_PAGE_PROTECT;
>>> 	return pte;
>>> }
>>>
>>> Which leaves me wondering if there is a way in GUP whereby we would
>>> look up that page and not clear _PAGE_INVALID, resulting in GUP
>>> succeeding but faults via the MMU still faulting on _PAGE_INVALID.
>>
>> follow_page_pte() has this piece of code:
>>
>> 	if (flags & FOLL_TOUCH) {
>> 		if ((flags & FOLL_WRITE) &&
>> 		    !pte_dirty(pte) && !PageDirty(page))
>> 			set_page_dirty(page);
>> 		/*
>> 		 * pte_mkyoung() would be more correct here, but atomic care
>> 		 * is needed to avoid losing the dirty bit: it is easier to use
>> 		 * mark_page_accessed().
>> 		 */
>> 		mark_page_accessed(page);
>> 	}
>>
>> Which at least to me suggests that, although the page is marked
>> accessed and GUP succeeds, the PTE might still have _PAGE_INVALID set
>> after GUP succeeded.
>>
>> On s390x, there is no HW dirty bit, so we might just be able to do a
>> proper pte_mkyoung() here instead of the mark_page_accessed().
>
> Something hacky like this should be able to show if what I suspect is
> the case. It compiles, but I didn't actually test it.

That would be the alternative that also takes the mkdirty into account: