On Fri, Apr 14, 2023 at 12:48:54PM -0700, Suren Baghdasaryan wrote: > > - We can call migration_entry_wait(). This will wait for PG_locked to > > become clear (in migration_entry_wait_on_locked()). As previously > > discussed offline, I think this is safe to do while holding the VMA > > locked. Just to be clear, this particular use of PG_locked is not during I/O, it's during page migration. This is a few orders of magnitude different. > > - We can call swap_readpage() if we allocate a new folio. I haven't > > traced through all this code to tell if it's OK. ... whereas this will wait for I/O. If we decide that's not OK, we'll need to test for FAULT_FLAG_VMA_LOCK and bail out of this path. > > So ... I believe this is all OK, but we're definitely now willing to > > wait for I/O from the swap device while holding the VMA lock when we > > weren't before. And maybe we should make a bigger deal of it in the > > changelog. > > > > And maybe we shouldn't just be failing the folio_lock_or_retry(), > > maybe we should be waiting for the folio lock with the VMA locked. > > Wouldn't that cause holding the VMA lock for the duration of swap I/O > (something you said we want to avoid in the previous paragraph) and > effectively undo d065bd810b6d ("mm: retry page fault when blocking on > disk transfer") for VMA locks? I'm not certain we want to avoid holding the VMA lock for the duration of an I/O. Here's how I understand the rationale for avoiding holding the mmap_lock while we perform I/O (before the existence of the VMA lock): - If everybody is doing page faults, there is no specific problem; we all hold the lock for read and multiple page faults can be handled in parallel. - As soon as one thread attempts to manipulate the tree (eg calls mmap()), all new readers must wait (as the rwsem is fair), and the writer must wait for all existing readers to finish. That's potentially milliseconds for an I/O during which time all page faults stop. Now we have the per-VMA lock, faults which can be handled without taking the mmap_lock can still be satisfied, as long as that VMA is not being modified. It is rare for a real application to take a page fault on a VMA which is being modified. So modifications to the tree will generally not take VMA locks on VMAs which are currently handling faults, and new faults will generally not find a VMA which is write-locked. When we find a locked folio (presumably for I/O, although folios are locked for other reasons), if we fall back to taking the mmap_lock for read, we increase contention on the mmap_lock and make the page fault wait on any mmap() operation. If we simply sleep waiting for the I/O, we make any mmap() operation _which touches this VMA_ wait for the I/O to complete. But I think that's OK, because new page faults can continue to be serviced ... as long as they don't need to take the mmap_lock. So ... I think what we _really_ want here is ... +++ b/mm/filemap.c @@ -1690,7 +1690,8 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait) bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm, unsigned int flags) { - if (fault_flag_allow_retry_first(flags)) { + if (!(flags & FAULT_FLAG_VMA_LOCK) && + fault_flag_allow_retry_first(flags)) { /* * CAUTION! In this case, mmap_lock is not released * even though return 0. @@ -1710,7 +1711,8 @@ bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm, ret = __folio_lock_killable(folio); if (ret) { - mmap_read_unlock(mm); + if (!(flags & FAULT_FLAG_VMA_LOCK)) + mmap_read_unlock(mm); return false; } } else {