On Fri, 22 Dec 2023, Chris Li wrote:

> On Fri, Dec 22, 2023 at 11:52:08AM -0800, Andrew Morton wrote:
> > On Thu, 21 Dec 2023 22:25:39 -0800 Chris Li <chrisl@xxxxxxxxxx> wrote:
> > >
> > > We discovered that 1% of swap page faults take 100us+, while 50% of
> > > swap faults complete in under 20us.
> > >
> > > Further investigation shows that for the long-tail cases a large
> > > portion of the time is spent in the free_swap_slots() function.
> > >
> > > The percpu cache of swap slots is freed in a batch of 64 entries
> > > inside free_swap_slots(). These cache entries are accumulated
> > > from previous page faults, which may not be related to the current
> > > process.
> > >
> > > Doing the batch free in the page fault handler causes longer
> > > tail latencies and penalizes the current process.
> > >
> > > Move free_swap_slots() outside of the swapin page fault handler into an
> > > async work queue to avoid such long tail latencies.
> >
> > This will require a larger amount of total work than the current
>
> Yes, there will be a tiny little bit of extra overhead to schedule the job
> onto the other work queue.

How do you quantify the impact of the delayed swap_entry_free()?

Since the free and memcg uncharge are now delayed, is there not the
possibility that we stay under memory pressure for longer? (Assuming at
least some users are swapping because of memory pressure.)

I would assume that since the free and uncharge itself is delayed, in the
pathological case we'd actually be swapping *more* until the async worker
can run.