On 08.03.22 00:18, Linus Torvalds wrote:
> On Mon, Mar 7, 2022 at 2:52 PM Andreas Gruenbacher <agruenba@xxxxxxxxxx> wrote:
>>
>> After generic_file_read_iter() returns a short or empty read, we fault
>> in some pages with fault_in_iov_iter_writeable(). This succeeds, but
>> the next call to generic_file_read_iter() returns -EFAULT and we're
>> not making any progress.
>
> Since this is s390-specific, I get the very strong feeling that the
>
>   fault_in_iov_iter_writeable ->
>     fault_in_safe_writeable ->
>       __get_user_pages_locked ->
>         __get_user_pages
>
> path somehow successfully finds the page, despite it not being
> properly accessible in the page tables.

As raised offline already, I suspect

shrink_active_list()
  ->page_referenced()
    ->page_referenced_one()
      ->ptep_clear_flush_young_notify()
        ->ptep_clear_flush_young()

which on s390x results in:

static inline pte_t pte_mkold(pte_t pte)
{
	pte_val(pte) &= ~_PAGE_YOUNG;
	pte_val(pte) |= _PAGE_INVALID;
	return pte;
}

static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
					    unsigned long addr, pte_t *ptep)
{
	pte_t pte = *ptep;

	pte = ptep_xchg_direct(vma->vm_mm, addr, ptep, pte_mkold(pte));
	return pte_young(pte);
}

_PAGE_INVALID is the actual HW bit, _PAGE_PRESENT is a pure SW bit. AFAIU, pte_present() still holds:

static inline int pte_present(pte_t pte)
{
	/* Bit pattern: (pte & 0x001) == 0x001 */
	return (pte_val(pte) & _PAGE_PRESENT) != 0;
}

pte_mkyoung() will revert that action:

static inline pte_t pte_mkyoung(pte_t pte)
{
	pte_val(pte) |= _PAGE_YOUNG;
	if (pte_val(pte) & _PAGE_READ)
		pte_val(pte) &= ~_PAGE_INVALID;
	return pte;
}

and pte_modify() will adjust it properly again:

/*
 * The following pte modification functions only work if
 * pte_present() is true. Undefined behaviour if not..
 */
static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
{
	pte_val(pte) &= _PAGE_CHG_MASK;
	pte_val(pte) |= pgprot_val(newprot);
	/*
	 * newprot for PAGE_NONE, PAGE_RO, PAGE_RX, PAGE_RW and PAGE_RWX
	 * has the invalid bit set, clear it again for readable, young pages
	 */
	if ((pte_val(pte) & _PAGE_YOUNG) && (pte_val(pte) & _PAGE_READ))
		pte_val(pte) &= ~_PAGE_INVALID;
	/*
	 * newprot for PAGE_RO, PAGE_RX, PAGE_RW and PAGE_RWX has the page
	 * protection bit set, clear it again for writable, dirty pages
	 */
	if ((pte_val(pte) & _PAGE_DIRTY) && (pte_val(pte) & _PAGE_WRITE))
		pte_val(pte) &= ~_PAGE_PROTECT;
	return pte;
}

Which leaves me wondering if there is a way in GUP whereby we would look up that page and not clear _PAGE_INVALID, resulting in GUP succeeding but faults via the MMU still faulting on _PAGE_INVALID.

-- 
Thanks,

David / dhildenb