On Fri, Apr 14, 2023 at 1:32 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Fri, Apr 14, 2023 at 12:48:54PM -0700, Suren Baghdasaryan wrote:
> > > - We can call migration_entry_wait(). This will wait for PG_locked to
> > >   become clear (in migration_entry_wait_on_locked()). As previously
> > >   discussed offline, I think this is safe to do while holding the VMA
> > >   locked.
>
> Just to be clear, this particular use of PG_locked is not during I/O,
> it's during page migration. This is a few orders of magnitude
> different.
>
> > > - We can call swap_readpage() if we allocate a new folio. I haven't
> > >   traced through all this code to tell if it's OK.
>
> ... whereas this will wait for I/O. If we decide that's not OK, we'll
> need to test for FAULT_FLAG_VMA_LOCK and bail out of this path.
>
> > > So ... I believe this is all OK, but we're definitely now willing to
> > > wait for I/O from the swap device while holding the VMA lock when we
> > > weren't before. And maybe we should make a bigger deal of it in the
> > > changelog.
> > >
> > > And maybe we shouldn't just be failing the folio_lock_or_retry(),
> > > maybe we should be waiting for the folio lock with the VMA locked.
> >
> > Wouldn't that cause holding the VMA lock for the duration of swap I/O
> > (something you said we want to avoid in the previous paragraph) and
> > effectively undo d065bd810b6d ("mm: retry page fault when blocking on
> > disk transfer") for VMA locks?
>
> I'm not certain we want to avoid holding the VMA lock for the duration
> of an I/O. Here's how I understand the rationale for avoiding holding
> the mmap_lock while we perform I/O (before the existence of the VMA lock):
>
>  - If everybody is doing page faults, there is no specific problem;
>    we all hold the lock for read and multiple page faults can be handled
>    in parallel.
>  - As soon as one thread attempts to manipulate the tree (eg calls
>    mmap()), all new readers must wait (as the rwsem is fair), and the
>    writer must wait for all existing readers to finish. That's
>    potentially milliseconds for an I/O during which time all page faults
>    stop.
>
> Now we have the per-VMA lock, faults which can be handled without taking
> the mmap_lock can still be satisfied, as long as that VMA is not being
> modified. It is rare for a real application to take a page fault on a
> VMA which is being modified.
>
> So modifications to the tree will generally not take VMA locks on VMAs
> which are currently handling faults, and new faults will generally not
> find a VMA which is write-locked.
>
> When we find a locked folio (presumably for I/O, although folios are
> locked for other reasons), if we fall back to taking the mmap_lock
> for read, we increase contention on the mmap_lock and make the page
> fault wait on any mmap() operation.

Do you mean we increase mmap_lock contention by holding the mmap_lock
from the start of the pagefault retry until we drop it in
__folio_lock_or_retry()?

> If we simply sleep waiting for the
> I/O, we make any mmap() operation _which touches this VMA_ wait for
> the I/O to complete. But I think that's OK, because new page faults
> can continue to be serviced ... as long as they don't need to take
> the mmap_lock.

Ok, so we will potentially block VMA writers for the duration of the
I/O... Stupid question: why was this a bigger problem for mmap_lock?
Potentially our address space can consist of only one anon VMA, so
locking that VMA vs the mmap_lock should be the same from a swap
pagefault POV.
Maybe mmap_lock is taken for write in some other important cases when
the VMA lock is not needed?

> So ... I think what we _really_ want here is ...
>
> +++ b/mm/filemap.c
> @@ -1690,7 +1690,8 @@ static int __folio_lock_async(struct folio *folio, struct wait_page_queue *wait)
>  bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
>                             unsigned int flags)
>  {
> -        if (fault_flag_allow_retry_first(flags)) {
> +        if (!(flags & FAULT_FLAG_VMA_LOCK) &&
> +            fault_flag_allow_retry_first(flags)) {
>                  /*
>                   * CAUTION! In this case, mmap_lock is not released
>                   * even though return 0.
> @@ -1710,7 +1711,8 @@ bool __folio_lock_or_retry(struct folio *folio, struct mm_struct *mm,
>
>                  ret = __folio_lock_killable(folio);
>                  if (ret) {
> -                        mmap_read_unlock(mm);
> +                        if (!(flags & FAULT_FLAG_VMA_LOCK))
> +                                mmap_read_unlock(mm);
>                          return false;
>                  }
>          } else {
>
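
For completeness, here is roughly what the "bail out of this path" option
for the swap_readpage() case could look like. This is only an untested
sketch on my side; the exact placement inside do_swap_page(), and whether
the per-VMA lock is dropped here or by the arch fault handler once it sees
VM_FAULT_RETRY, would still need to be decided:

        /* Early in do_swap_page(), before any path that can start swap I/O */
        if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
                /*
                 * Don't start (or wait for) swap I/O while holding only
                 * the per-VMA lock; fail the fault so it gets retried
                 * under the mmap_lock instead.
                 */
                ret = VM_FAULT_RETRY;
                goto out;
        }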