On 02/19/25 at 04:34pm, Kairui Song wrote:
> On Wed, Feb 19, 2025 at 3:54 PM Baoquan He <bhe@xxxxxxxxxx> wrote:
>
> Hi Baoquan,
>
> Thanks for the review!
>
> >
> > On 02/15/25 at 01:57am, Kairui Song wrote:
> > > From: Kairui Song <kasong@xxxxxxxxxxx>
> > >
> > > The current allocation workflow first traverses the plist with a
> > > global lock held; after choosing a device, it uses the percpu
> > > cluster on that swap device. This commit moves the percpu cluster
> > > variable out of being tied to individual swap devices, making it a
> > > global percpu variable that is used directly for allocation as a
> > > fast path.
> > >
> > > The global percpu cluster variable will never point to an HDD
> > > device, and allocation on HDD devices is still globally serialized.
> > >
> > > This improves the allocator performance and prepares for removal of
> > > the slot cache in later commits. There shouldn't be much observable
> > > behavior change, except one thing: this changes how swap device
> > > allocation rotation works.
> > >
> > > Currently, each allocation rotates the plist, and because of the
> > > existence of the slot cache (64 entries), swap devices of the same
> > > priority are rotated for every 64 entries consumed. High order
> > > allocations are different: they bypass the slot cache, so the swap
> > > device is rotated for every 16K, 32K, or up to 2M allocation.
> > >
> > > The rotation rule was never clearly defined or documented; it was
> > > also changed several times without being mentioned.
> > >
> > > After this commit, once the slot cache is gone in later commits,
> > > swap device rotation will happen for every consumed cluster.
> > > Ideally, non-HDD devices will be rotated once 2M of space has been
> > > consumed for each order, which seems
> >
> > This breaks the rule where the high priority swap device is always
> > used for allocation as long as it has free space. After this patch,
> > allocation will try the percpu cluster first, which may point to a
> > lower priority device, even though a higher priority device has free
> > space. However, this only happens when the higher priority device is
> > exhausted, so it is not the generic case. If this is expected, it may
> > need to be mentioned in the log or docs somewhere at least.
>
> Hmm, actually this rule was already broken if you are very strict
> about it. The current percpu slot cache does a pre-allocation, so the
> high priority device will be removed from the plist while some CPU's
> slot cache is still holding usable entries.

Ah, right, I didn't think about this point.

>
> If the high priority device is exhausted, some CPU's percpu cluster
> will indeed point to a low priority device, and keep using it until
> the percpu cluster is drained. I think this should be OK: the high
> priority device is already full, so the amount of swapout that falls
> back to the low priority device is only a performance issue. I think
> it's a tiny change for a rare case.

Agreed, thanks for the explanation.

>
> >
> > > reasonable. HDD devices are rotated for every allocation regardless
> > > of the allocation order, which should be OK and trivial.
> > >
> > > Signed-off-by: Kairui Song <kasong@xxxxxxxxxxx>
> > > ---
> > >  include/linux/swap.h |  11 ++--
> > >  mm/swapfile.c        | 120 +++++++++++++++++++++++++++----------------
> > >  2 files changed, 79 insertions(+), 52 deletions(-)
> > ......
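The snipped @@ -116 hunk quoted below is where the patch introduces the
global percpu variable. As a rough sketch of the shape the commit
message describes (field names, layout, and initializers here are
illustrative only, not the literal patch content): one globally shared,
per-CPU cache of the last used cluster, with one slot per allocation
order, guarded by a local lock:

	#include <linux/local_lock.h>
	#include <linux/percpu.h>
	#include <linux/swap.h>

	/*
	 * Illustrative sketch: a single global percpu cluster cache,
	 * no longer tied to an individual swap device. Per the commit
	 * message, it never points to an HDD device.
	 */
	struct percpu_swap_cluster {
		struct swap_info_struct *si;		/* device the cached cluster is on */
		unsigned long offset[SWAP_NR_ORDERS];	/* next offset per order, 0 = none */
		local_lock_t lock;			/* protects this CPU's copy */
	};

	static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
		.si = NULL,
		.offset = { 0 },
		.lock = INIT_LOCAL_LOCK(lock),
	};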
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index ae3bd0a862fc..791cd7ed5bdf 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -116,6 +116,18 @@ static atomic_t proc_poll_event = ATOMIC_INIT(0);
> > >
> > ......snip....
> > > int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> > > {
> > >  	int order = swap_entry_order(entry_order);
> > > @@ -1211,19 +1251,28 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> > >  	int n_ret = 0;
> > >  	int node;
> > >
> > > +	/* Fast path using percpu cluster */
> > > +	local_lock(&percpu_swap_cluster.lock);
> > > +	n_ret = swap_alloc_fast(swp_entries,
> > > +				SWAP_HAS_CACHE,
> > > +				order, n_goal);
> > > +	if (n_ret == n_goal)
> > > +		goto out;
> > > +
> > > +	n_goal = min_t(int, n_goal - n_ret, SWAP_BATCH);
> >
> > Here, the behaviour is changed too. In the old code, a partial
> > allocation would jump out and return. In this patch, you try the
> > percpu cluster first, then call scan_swap_map_slots() to try its
> > best, and jump out even if only a partial allocation succeeded. But
> > the allocation from scan_swap_map_slots() could happen on a different
> > si device, which looks bizarre. Do you think we need to reconsider
> > the design?
>
> Right, that's a behavior change, but it only temporarily affects the
> slot cache. get_swap_pages will only be called with size > 1 when
> order == 0, and only by the slot cache. (Large order allocations
> always use size == 1; other users only use order == 0 && size == 1.)
> So I didn't notice it, as this series is removing the slot cache.

Right, I am reviewing patch 6 and noticed this is temporary.

>
> The partial side effect would be "returned slots will be from
> different devices" and "slot_cache may get drained faster as
> get_swap_pages may return fewer slots when the percpu cluster is
> drained". It might be a performance issue but seems slight and
> trivial; the slot cache can still work. And the next commit will just
> remove the slot cache, so the problem will be gone. I think I can add
> a comment about it here?

Sounds good. Adding a comment can avoid other people being confused
when checking patch 5. Thanks.
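To make the discussed ordering concrete, here is a hedged sketch of the
control flow in question, with the kind of clarifying comment Kairui
agreed to add. The fast-path lines follow the quoted diff; the slow
path is simplified and the plist traversal is elided, so this is not
the literal patch:

	int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
	{
		int order = swap_entry_order(entry_order);
		int n_ret;

		/* Fast path: allocate from this CPU's cached percpu cluster. */
		local_lock(&percpu_swap_cluster.lock);
		n_ret = swap_alloc_fast(swp_entries, SWAP_HAS_CACHE, order, n_goal);
		if (n_ret == n_goal)
			goto out;

		/*
		 * Slow path: scan devices on the plist for the rest of the
		 * batch. The remaining slots may come from a different device
		 * than the fast-path ones, and fewer than n_goal slots may be
		 * returned once the percpu cluster drains. Only the slot
		 * cache (order == 0, n_goal > 1) can observe a mixed or short
		 * batch, and it is removed by the next commit.
		 */
		n_goal = min_t(int, n_goal - n_ret, SWAP_BATCH);
		/* ... plist traversal filling swp_entries[n_ret..] elided ... */
	out:
		local_unlock(&percpu_swap_cluster.lock);
		return n_ret;
	}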