On Fri, Oct 4, 2024 at 6:22 AM Chris Li <chrisl@xxxxxxxxxx> wrote:
>
> On Thu, Sep 26, 2024 at 2:20 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> >
> > From: Barry Song <v-songbaohua@xxxxxxxx>
> >
> > Commit 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > introduced an unconditional one-tick sleep when `swapcache_prepare()`
> > fails, which has led to reports of UI stuttering on latency-sensitive
> > Android devices. To address this, we can use a waitqueue to wake up
> > tasks that fail `swapcache_prepare()` sooner, instead of always
> > sleeping for a full tick. While tasks may occasionally be woken by an
> > unrelated `do_swap_page()`, this method is preferable to two scenarios:
> > rapid re-entry into page faults, which can cause livelocks, and
> > multiple millisecond sleeps, which visibly degrade user experience.
> >
> > Oven's testing shows that a single waitqueue resolves the UI
> > stuttering issue. If a 'thundering herd' problem becomes apparent
> > later, a waitqueue hash similar to `folio_wait_table[PAGE_WAIT_TABLE_SIZE]`
> > for page bit locks can be introduced.
> >
> > Fixes: 13ddaf26be32 ("mm/swap: fix race when skipping swapcache")
> > Cc: Kairui Song <kasong@xxxxxxxxxxx>
> > Cc: "Huang, Ying" <ying.huang@xxxxxxxxx>
> > Cc: Yu Zhao <yuzhao@xxxxxxxxxx>
> > Cc: David Hildenbrand <david@xxxxxxxxxx>
> > Cc: Chris Li <chrisl@xxxxxxxxxx>
> > Cc: Hugh Dickins <hughd@xxxxxxxxxx>
> > Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
> > Cc: Matthew Wilcox (Oracle) <willy@xxxxxxxxxxxxx>
> > Cc: Michal Hocko <mhocko@xxxxxxxx>
> > Cc: Minchan Kim <minchan@xxxxxxxxxx>
> > Cc: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > Cc: SeongJae Park <sj@xxxxxxxxxx>
> > Cc: Kalesh Singh <kaleshsingh@xxxxxxxxxx>
> > Cc: Suren Baghdasaryan <surenb@xxxxxxxxxx>
> > Cc: <stable@xxxxxxxxxxxxxxx>
> > Reported-by: Oven Liyang <liyangouwen1@xxxxxxxx>
> > Tested-by: Oven Liyang <liyangouwen1@xxxxxxxx>
> > Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
> > ---
> >  mm/memory.c | 13 +++++++++++--
> >  1 file changed, 11 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 2366578015ad..6913174f7f41 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4192,6 +4192,8 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> >  }
> >  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > +static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
> > +
> >  /*
> >   * We enter with non-exclusive mmap_lock (to exclude vma changes,
> >   * but allow concurrent faults), and pte mapped but not yet locked.
> > @@ -4204,6 +4206,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  {
> >  	struct vm_area_struct *vma = vmf->vma;
> >  	struct folio *swapcache, *folio = NULL;
> > +	DECLARE_WAITQUEUE(wait, current);
> >  	struct page *page;
> >  	struct swap_info_struct *si = NULL;
> >  	rmap_t rmap_flags = RMAP_NONE;
> > @@ -4302,7 +4305,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  					 * Relax a bit to prevent rapid
> >  					 * repeated page faults.
> >  					 */
> > +					add_wait_queue(&swapcache_wq, &wait);
> >  					schedule_timeout_uninterruptible(1);
> > +					remove_wait_queue(&swapcache_wq, &wait);
>
> There is only one "swapcache_wq", if we don't care about the memory
> overhead, ideally should be per swap entry that fails to grab the
> HAS_CACHE bit and has one wait queue. Currently all swap entries using
> one wait queue will likely cause other swap entries (if any) get wait
> up then find out the swap entry it cares hasn't been served yet.
>

Even page bit locks don't have one waitqueue per page, and I believe
that case has much more serious contention than swap-in. Page bit locks
depend on a waitqueue hash to reduce unrelated wake-ups. If a process
is woken up by an unrelated do_swap_page() and its swapcache has not
been released, it will sleep again after re-checking
swapcache_prepare(). Too many unrelated wake-ups would just be a
'thundering herd', not a livelock.

> Another thing to consider is that, if we are using a wait queue, the
> 1ms is not relevant any more. It can be longer than 1ms and it is
> getting waited up by the wait queue anyway. Here you might use
> indefinitely sleep to reduce the unnecessary wait up and the
> complexity of the timer.

Not quite sure what you mean by 1ms. In embedded systems we never use
1000HZ; the typical/maximum HZ is 250, so one tick is at least 4ms.

Not quite sure what you mean by "indefinitely sleep" either. My
understanding is that we can't poll the result of swapcache_prepare(),
as the winner process which succeeds in swapcache_prepare() will drop
the swap slots.

>
> >  					goto out_page;
> >  				}
> >  				need_clear_cache = true;
> > @@ -4609,8 +4614,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  		pte_unmap_unlock(vmf->pte, vmf->ptl);
> >  out:
> >  	/* Clear the swap cache pin for direct swapin after PTL unlock */
> > -	if (need_clear_cache)
> > +	if (need_clear_cache) {
> >  		swapcache_clear(si, entry, nr_pages);
> > +		wake_up(&swapcache_wq);
>
> Agree with Ying that here the common path will need to take a lock to
> wait up the wait queue. waitqueue_active() might be a good candidate.
>
> Chris
>
>
> > +	}
> >  	if (si)
> >  		put_swap_device(si);
> >  	return ret;
> > @@ -4625,8 +4632,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  		folio_unlock(swapcache);
> >  		folio_put(swapcache);
> >  	}
> > -	if (need_clear_cache)
> > +	if (need_clear_cache) {
> >  		swapcache_clear(si, entry, nr_pages);
> > +		wake_up(&swapcache_wq);
> > +	}
> >  	if (si)
> >  		put_swap_device(si);
> >  	return ret;
> > --
> > 2.34.1
> >

Thanks
Barry
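
For readers following the thread, the "sleep for at most one tick, but
return early when the winner wakes us" behaviour of the patch can be
sketched in userspace with a condition variable. This is only an
analogy under stated assumptions, not kernel code: cache_prepare(),
cache_clear_and_wake(), cache_wait(), cache_wq and has_cache are
hypothetical stand-ins for swapcache_prepare(), swapcache_clear() plus
wake_up(), the add_wait_queue()/schedule_timeout_uninterruptible()
sequence, swapcache_wq and the HAS_CACHE bit of a single swap entry.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical userspace stand-ins; none of these are kernel APIs. */
static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cache_wq = PTHREAD_COND_INITIALIZER; /* role of swapcache_wq */
static bool has_cache; /* role of the HAS_CACHE bit for one entry */

/* Analogue of swapcache_prepare(): true if the caller became the winner. */
static bool cache_prepare(void)
{
	bool won;

	pthread_mutex_lock(&cache_lock);
	won = !has_cache;
	if (won)
		has_cache = true;
	pthread_mutex_unlock(&cache_lock);
	return won;
}

/* Analogue of swapcache_clear() followed by wake_up(&swapcache_wq). */
static void cache_clear_and_wake(void)
{
	pthread_mutex_lock(&cache_lock);
	has_cache = false;
	/* Wake every waiter, as wake_up() does for non-exclusive waiters. */
	pthread_cond_broadcast(&cache_wq);
	pthread_mutex_unlock(&cache_lock);
}

/*
 * Loser path: wait once, for at most timeout_ms, and return early if
 * woken. Returns true if the flag is clear on return. Like the patch,
 * this does not loop; the caller simply retries after any wake-up.
 */
static bool cache_wait(long timeout_ms)
{
	struct timespec deadline;
	bool clear;

	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	pthread_mutex_lock(&cache_lock);
	if (has_cache)
		pthread_cond_timedwait(&cache_wq, &cache_lock, &deadline);
	clear = !has_cache;
	pthread_mutex_unlock(&cache_lock);
	return clear;
}
```

A loser would call cache_wait() after cache_prepare() fails and then
retry, which mirrors re-entering the page fault. A waiter woken for an
unrelated entry just sees the flag still set and waits again on the
next attempt, which is the 'thundering herd' cost discussed above
rather than a livelock.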