Re: [PATCH 1/1] mm: handle swap page faults if the faulting page can be locked

Alistair Popple <apopple@xxxxxxxxxx> · Tue, 18 Apr 2023 09:33:48 +1000

Suren Baghdasaryan <surenb@xxxxxxxxxx> writes:

> On Sun, Apr 16, 2023 at 6:06 PM Alistair Popple <apopple@xxxxxxxxxx> wrote:
>>
>>
>> Matthew Wilcox <willy@xxxxxxxxxxxxx> writes:
>>
>> > On Fri, Apr 14, 2023 at 11:00:43AM -0700, Suren Baghdasaryan wrote:
>> >> When page fault is handled under VMA lock protection, all swap page
>> >> faults are retried with mmap_lock because folio_lock_or_retry
>> >> implementation has to drop and reacquire mmap_lock if folio could
>> >> not be immediately locked.
>> >> Instead of retrying all swapped page faults, retry only when folio
>> >> locking fails.
>> >
>> > Reviewed-by: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
>> >
>> > Let's just review what can now be handled under the VMA lock instead of
>> > the mmap_lock, in case somebody knows better than me that it's not safe.
>> >
>> >  - We can call migration_entry_wait().  This will wait for PG_locked to
>> >    become clear (in migration_entry_wait_on_locked()).  As previously
>> >    discussed offline, I think this is safe to do while holding the VMA
>> >    locked.
>>
>> Do we even need to be holding the VMA locked while in
>> migration_entry_wait()? My understanding is we're just waiting for
>> PG_locked to be cleared so we can return with a reasonable chance the
>> migration entry is gone. If for example it has been unmapped or
>> protections downgraded we will simply refault.
>
> If we drop VMA lock before migration_entry_wait() then we would need
> to lock_vma_under_rcu again after the wait. In which case it might be
> simpler to retry the fault with some special return code to indicate
> that VMA lock is not held anymore and we want to retry without taking
> mmap_lock. I think it's similar to the last options Matthew suggested
> earlier. In which case we can reuse the same retry mechanism for both
> cases, here and in __folio_lock_or_retry.

Good point. Agree there is no reason to re-take the VMA lock after the
wait, although in this case we shouldn't need to retry the fault
(ie. return VM_FAULT_RETRY). Just skip calling vma_end_read() on the way
out to userspace.

>>
>> >  - We can call remove_device_exclusive_entry().  That calls
>> >    folio_lock_or_retry(), which will fail if it can't get the VMA lock.
>>
>> Looks ok to me.
>>
>> >  - We can call pgmap->ops->migrate_to_ram().  Perhaps somebody familiar
>> >    with Nouveau and amdkfd could comment on how safe this is?
>>
>> Currently this won't work because drives assume mmap_lock is held during
>> pgmap->ops->migrate_to_ram(). Primarily this is because
>> migrate_vma_setup()/migrate_vma_pages() is used to handle the fault and
>> that asserts mmap_lock is taken in walk_page_range() and also
>> migrate_vma_insert_page().
>>
>> So I don't think we can call that case without mmap_lock.
>>
>> At a glance it seems it should be relatively easy to move to using
>> lock_vma_under_rcu(). Drivers will need updating as well though because
>> migrate_vma_setup() is called outside of fault handling paths so drivers
>> will currently take mmap_lock rather than vma lock when looking up the
>> vma. See for example nouveau_svmm_bind().
>
> Thanks for the pointers, Alistair! It does look like we need to be
> more careful with the migrate_to_ram() path. For now I can fallback to
> retrying with mmap_lock for this case, like with do with all cases
> today. Afterwards this path can be made ready for working under VMA
> lock and we can remove that retry. Does that sound good?

Sounds good to me. Fixing that shouldn't be too difficult but will need
changes to at least Nouveau and amdkfd (and hmm-tests obviously). Happy
to look at doing that if/when this change makes it in. Thanks.

>>
>> >  - I believe we can't call handle_pte_marker() because we exclude UFFD
>> >    VMAs earlier.
>> >  - We can call swap_readpage() if we allocate a new folio.  I haven't
>> >    traced through all this code to tell if it's OK.
>> >
>> > So ... I believe this is all OK, but we're definitely now willing to
>> > wait for I/O from the swap device while holding the VMA lock when we
>> > weren't before.  And maybe we should make a bigger deal of it in the
>> > changelog.
>> >
>> > And maybe we shouldn't just be failing the folio_lock_or_retry(),
>> > maybe we should be waiting for the folio lock with the VMA locked.
>>