Re: Buffered I/O broken on s390x with page faults disabled (gfs2)

David Hildenbrand <david@xxxxxxxxxx> · Tue, 8 Mar 2022 18:47:56 +0100

On 08.03.22 18:26, Linus Torvalds wrote:
> On Tue, Mar 8, 2022 at 12:21 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>>
>> As raised offline already, I suspect
>>
>> shrink_active_list()
>> ->page_referenced()
>>  ->page_referenced_one()
>>   ->ptep_clear_flush_young_notify()
>>    ->ptep_clear_flush_young()
>>
>> which results on s390x in:
>>
>> static inline pte_t pte_mkold(pte_t pte)
>> {
>>         pte_val(pte) &= ~_PAGE_YOUNG;
>>         pte_val(pte) |= _PAGE_INVALID;
>>         return pte;
>> }
> 
> Yeah, that looks likely.
> 
> It looks to me like GUP just doesn't care about _PAGE_INVALID on s390,
> and happily looks up that page despite it not being "present" as far
> as hardware is concerned.
> 
> Your actual patch looks pretty nasty, though. We avoid marking it
> accessed on purpose (to avoid atomicity issues wrt hw-dirty bits etc),
> but still, that patch makes me go "there has to be a better way".
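
To make the mismatch concrete, here is a minimal userspace sketch of what
I think is going on. It is not kernel code: the bit values are made up
(not the real s390x definitions), and all model_* names are hypothetical
and exist only for this illustration. The one thing taken from the real
code is the shape of pte_mkold() quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit positions, for illustration only. */
#define MODEL_PAGE_PRESENT 0x1UL
#define MODEL_PAGE_YOUNG   0x2UL
#define MODEL_PAGE_INVALID 0x4UL

typedef struct { unsigned long val; } model_pte_t;

/* Same shape as the s390x pte_mkold() quoted above: clear young, set invalid. */
static inline model_pte_t model_pte_mkold(model_pte_t pte)
{
        pte.val &= ~MODEL_PAGE_YOUNG;
        pte.val |= MODEL_PAGE_INVALID;
        return pte;
}

/* Hardware view: any access through an invalid PTE faults. */
static inline bool model_hw_access_faults(model_pte_t pte)
{
        return (pte.val & MODEL_PAGE_INVALID) != 0;
}

/* Software walk that only checks "present" and ignores the invalid bit. */
static inline bool model_sw_lookup_succeeds(model_pte_t pte)
{
        return (pte.val & MODEL_PAGE_PRESENT) != 0;
}

/* Walks through the sequence: young PTE, aged by reclaim, then looked up. */
static inline void model_selfcheck(void)
{
        model_pte_t pte = { MODEL_PAGE_PRESENT | MODEL_PAGE_YOUNG };

        assert(!model_hw_access_faults(pte));
        pte = model_pte_mkold(pte);
        assert(model_hw_access_faults(pte));
        assert(model_sw_lookup_succeeds(pte));
}
```

In this model, once the PTE has been aged, the software lookup still
succeeds while an access with page faults disabled would fault. If the
lookup never clears the invalid bit, retrying cannot make progress.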

It certainly only works if there are no hw dirty bits that might get
set concurrently -- on s390x, for example, that is not a concern.

As raised by Gerald, a new arch_faults_for_dirty_pte() (next to the
existing arch_faults_on_old_pte()) might be one option to get rid of the
s390x special-casing and to detect any arch that might update the dirty
bit concurrently.

Interestingly, mm/huge_memory.c:touch_pmd() doesn't seem to care about
concurrent dirty-bit updates by the hardware. Hmm.

But, of course, I'm open to alternatives. Maybe we could adjust
fault_in_safe_writeable() to not use GUP, as you suggested in the other
reply.

-- 
Thanks,

David / dhildenb