On 08.03.22 09:37, David Hildenbrand wrote: > On 08.03.22 09:21, David Hildenbrand wrote: >> On 08.03.22 00:18, Linus Torvalds wrote: >>> On Mon, Mar 7, 2022 at 2:52 PM Andreas Gruenbacher <agruenba@xxxxxxxxxx> wrote: >>>> >>>> After generic_file_read_iter() returns a short or empty read, we fault >>>> in some pages with fault_in_iov_iter_writeable(). This succeeds, but >>>> the next call to generic_file_read_iter() returns -EFAULT and we're >>>> not making any progress. >>> >>> Since this is s390-specific, I get the very strong feeling that the >>> >>> fault_in_iov_iter_writeable -> >>> fault_in_safe_writeable -> >>> __get_user_pages_locked -> >>> __get_user_pages >>> >>> path somehow successfully finds the page, despite it not being >>> properly accessible in the page tables. >> >> As raised offline already, I suspect >> >> shrink_active_list() >> ->page_referenced() >> ->page_referenced_one() >> ->ptep_clear_flush_young_notify() >> ->ptep_clear_flush_young() >> >> which results on s390x in: >> >> static inline pte_t pte_mkold(pte_t pte) >> { >> pte_val(pte) &= ~_PAGE_YOUNG; >> pte_val(pte) |= _PAGE_INVALID; >> return pte; >> } >> >> static inline int ptep_test_and_clear_young(struct vm_area_struct *vma, >> unsigned long addr, pte_t *ptep) >> { >> pte_t pte = *ptep; >> >> pte = ptep_xchg_direct(vma->vm_mm, addr, ptep, pte_mkold(pte)); >> return pte_young(pte); >> } >> >> >> _PAGE_INVALID is the actual HW bit, _PAGE_PRESENT is a >> pure SW bit. AFAIU, pte_present() still holds: >> >> static inline int pte_present(pte_t pte) >> { >> /* Bit pattern: (pte & 0x001) == 0x001 */ >> return (pte_val(pte) & _PAGE_PRESENT) != 0; >> } >> >> >> pte_mkyoung() will revert that action: >> >> static inline pte_t pte_mkyoung(pte_t pte) >> { >> pte_val(pte) |= _PAGE_YOUNG; >> if (pte_val(pte) & _PAGE_READ) >> pte_val(pte) &= ~_PAGE_INVALID; >> return pte; >> } >> >> >> and pte_modify() will adjust it properly again: >> >> /* >> * The following pte modification functions only work if >> * pte_present() is true. Undefined behaviour if not.. >> */ >> static inline pte_t pte_modify(pte_t pte, pgprot_t newprot) >> { >> pte_val(pte) &= _PAGE_CHG_MASK; >> pte_val(pte) |= pgprot_val(newprot); >> /* >> * newprot for PAGE_NONE, PAGE_RO, PAGE_RX, PAGE_RW and PAGE_RWX >> * has the invalid bit set, clear it again for readable, young pages >> */ >> if ((pte_val(pte) & _PAGE_YOUNG) && (pte_val(pte) & _PAGE_READ)) >> pte_val(pte) &= ~_PAGE_INVALID; >> /* >> * newprot for PAGE_RO, PAGE_RX, PAGE_RW and PAGE_RWX has the page >> * protection bit set, clear it again for writable, dirty pages >> */ >> if ((pte_val(pte) & _PAGE_DIRTY) && (pte_val(pte) & _PAGE_WRITE)) >> pte_val(pte) &= ~_PAGE_PROTECT; >> return pte; >> } >> >> >> >> Which leaves me wondering if there is a way in GUP whereby >> we would lookup that page and not clear _PAGE_INVALID, >> resulting in GUP succeeding but faults via the MMU still >> faulting on _PAGE_INVALID. > > > follow_page_pte() has this piece of code: > > if (flags & FOLL_TOUCH) { > if ((flags & FOLL_WRITE) && > !pte_dirty(pte) && !PageDirty(page)) > set_page_dirty(page); > /* > * pte_mkyoung() would be more correct here, but atomic care > * is needed to avoid losing the dirty bit: it is easier to use > * mark_page_accessed(). > */ > mark_page_accessed(page); > } > > Which at least to me suggests that, although the page is marked accessed and GUP > succeeds, that the PTE might still have _PAGE_INVALID set after we succeeded GUP. > > > On s390x, there is no HW dirty bit, so we might just be able to do a proper > pte_mkyoung() here instead of the mark_page_accessed(). > Something hacky like this should be able to show if what I suspect is the case. It compiles, but I didn't actually test it.