On 08.03.22 18:26, Linus Torvalds wrote:
> On Tue, Mar 8, 2022 at 12:21 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>>
>> As raised offline already, I suspect
>>
>> shrink_active_list()
>> ->page_referenced()
>>  ->page_referenced_one()
>>   ->ptep_clear_flush_young_notify()
>>    ->ptep_clear_flush_young()
>>
>> which results on s390x in:
>>
>> static inline pte_t pte_mkold(pte_t pte)
>> {
>> 	pte_val(pte) &= ~_PAGE_YOUNG;
>> 	pte_val(pte) |= _PAGE_INVALID;
>> 	return pte;
>> }
>
> Yeah, that looks likely.
>
> It looks to me like GUP just doesn't care about _PAGE_INVALID on s390,
> and happily looks up that page despite it not being "present" as far
> as hardware is concerned.
>
> Your actual patch looks pretty nasty, though. We avoid marking it
> accessed on purpose (to avoid atomicity issues wrt hw-dirty bits etc),
> but still, that patch makes me go "there has to be a better way".

It certainly only works if we don't have hw dirty bits that might get
set concurrently -- for example, on s390x there is no such requirement.

As raised by Gerald, arch_faults_for_dirty_pte (and existing
arch_faults_on_old_pte) might be one option to get rid of the s390x
special-casing, and detect any arch that might update the dirty bit
concurrently.

Interestingly, mm/huge_memory.c:touch_pmd() doesn't seem to care about
concurrent dirty-bit updates by the hardware. Hmm.

But, of course, I'm open for alternatives, maybe we could adjust
fault_in_safe_writeable() to not use GUP as raised by you in the other
reply.

-- 
Thanks,

David / dhildenb
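
For illustration only, here is a minimal userspace sketch of the mismatch
discussed above: after a pte_mkold()-style update, an entry can still look
"present" to a software walker while being invalid for hardware. All names
(demo_pte_t, DEMO_PAGE_*, demo_*()) and the bit values are hypothetical
placeholders, not the definitions from arch/s390, and the sketch assumes
the software walker only tests a software present bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout for the demo; NOT the real s390x values. */
#define DEMO_PAGE_PRESENT 0x001u /* software bit: entry describes a page */
#define DEMO_PAGE_YOUNG   0x004u /* software bit: recently referenced */
#define DEMO_PAGE_INVALID 0x400u /* hardware bit: any access faults */

typedef struct { uint64_t val; } demo_pte_t;

/* Mirrors the quoted pte_mkold(): clear YOUNG and set INVALID. */
static demo_pte_t demo_pte_mkold(demo_pte_t pte)
{
	pte.val &= ~(uint64_t)DEMO_PAGE_YOUNG;
	pte.val |= DEMO_PAGE_INVALID;
	return pte;
}

/* View of a software walker that only tests the software present bit. */
static bool demo_sw_present(demo_pte_t pte)
{
	return (pte.val & DEMO_PAGE_PRESENT) != 0;
}

/* View of the hardware: an entry with INVALID set cannot be accessed. */
static bool demo_hw_accessible(demo_pte_t pte)
{
	return (pte.val & DEMO_PAGE_INVALID) == 0;
}
```

Starting from an entry with PRESENT and YOUNG set, demo_pte_mkold() yields
an entry for which demo_sw_present() is still true while
demo_hw_accessible() is false, so the next hardware access faults even
though a software lookup succeeded.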