Chris Li <chrisl@xxxxxxxxxx> writes:

> On Wed, Aug 7, 2024 at 12:59 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>>
>> Hi, Chris,
>>
>> Chris Li <chrisl@xxxxxxxxxx> writes:
>>
>> > This is the short term solution "swap cluster order" listed in my
>> > "Swap Abstraction" discussion, slide 8, at the recent LSF/MM
>> > conference.
>> >
>> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
>> > orders" was introduced, it only allocated the mTHP swap entries
>> > from the new empty cluster list. This has a fragmentation issue
>> > reported by Barry.
>> >
>> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@xxxxxxxxxxxxxx/
>> >
>> > The reason is that all the empty clusters have been exhausted while
>> > there are plenty of free swap entries in clusters that are not
>> > 100% free.
>> >
>> > Remember the swap allocation order in the cluster. Keep track of a
>> > per-order nonfull cluster list for later allocation.
>> >
>> > This series gives the swap SSD allocation a new separate code path
>> > from the HDD allocation. The new allocator uses the cluster lists
>> > only and no longer does a global scan of swap_map[] without a lock.
>>
>> This sounds good. Can we use the SSD allocation method for HDD too?
>> We may not need a swap entry allocator optimized for HDD.
>
> Yes, that is the plan as well. That way we can completely get rid of
> the old scan_swap_map_slots() code.

Good!

> However, considering the size of the series, let's focus on the
> cluster allocation path first, and get it tested and reviewed.

OK.

> For HDD optimization, mostly just the new block allocation portion
> needs a separate code path from the new cluster allocator, to avoid
> the per-CPU allocation. Allocating from the nonfull list doesn't need
> to change either.

I suggest not considering HDD optimization at all. Just use the SSD
algorithm to simplify things.

>>
>> Hi, Hugh,
>>
>> What do you think about this?
>>
>> > This streamlines the swap allocation for SSD. The code matches the
>> > execution flow much better.
>> >
>> > User impact: for users that allocate and free mixed-order mTHP swap
>> > entries, it greatly improves the success rate of the mTHP swap
>> > allocation after the initial phase.
>> >
>> > It also performs faster when the swapfile is close to full, because
>> > the allocator can get a nonfull cluster from a list rather than
>> > scanning a lot of swap_map entries.
>>
>> Do you have some test results to prove this? Or which test below can
>> prove this?
>
> The two zram tests are already proving this. The system time
> improvement is about 2% on my low CPU count machine. Kairui has a
> higher core count machine and the difference is higher there. The
> theory is that a higher CPU count has higher contention.

I will interpret this as: the performance is better in theory, but
there are almost no measurable results so far.

> The 2% system time number does not sound like much. But consider
> these two factors:
> 1) The swap allocator only takes a small percentage of the overall
>    workload.
> 2) The new allocator does more work.
> The old allocator has a time tick budget. It will abort and fail to
> find an entry when it runs out of time budget, even though there are
> still some free entries on the swapfile.

What is the time tick budget you mentioned?

> The new allocator can get to the last few free swap entries if any
> are available. If not, the new allocator will work harder on swap
> cache reclaim.
>
> From the swap cache reclaim aspect, it is very hard to optimize the
> swap cache reclaim in the old allocation path because the scan
> position is randomized. The full list and frag list are both designed
> to help reduce repeated reclaim attempts on the swap cache.
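Just to double-check my understanding of the per-order nonfull cluster
list, below is a small compilable toy model I put together while
reading the cover letter. The names (struct cluster_lists,
pick_cluster(), NR_ORDERS, etc.) are my own illustration, not taken
from your patches, and I am ignoring the full and frag lists here for
simplicity.

/*
 * Toy model of the per-order nonfull cluster lists, compile with:
 *   gcc -Wall -o cluster_model cluster_model.c
 */
#include <stdio.h>

#define NR_ORDERS     5    /* illustrative: orders 0..4 */
#define CLUSTER_SLOTS 512  /* entries per cluster, like SWAPFILE_CLUSTER */

struct cluster {
        int order;              /* order this cluster currently serves */
        int used;               /* allocated entries in the cluster */
        struct cluster *next;
};

struct cluster_lists {
        struct cluster *free;                   /* completely empty clusters */
        struct cluster *nonfull[NR_ORDERS];     /* partially used, kept per order */
};

/* Pop the head of a singly linked list. */
static struct cluster *pop(struct cluster **head)
{
        struct cluster *c = *head;

        if (c)
                *head = c->next;
        return c;
}

/*
 * Policy sketch: an order-N request first tries a nonfull cluster that
 * already serves order N, and only then consumes a completely empty
 * cluster. Allocating mTHP orders from empty clusters only is what
 * exhausted them and caused the fragmentation issue.
 */
static struct cluster *pick_cluster(struct cluster_lists *l, int order)
{
        struct cluster *c = pop(&l->nonfull[order]);

        if (!c) {
                c = pop(&l->free);
                if (c)
                        c->order = order;
        }
        return c;
}

int main(void)
{
        struct cluster empty = { .order = -1, .used = 0, .next = NULL };
        struct cluster half = { .order = 2, .used = CLUSTER_SLOTS / 2, .next = NULL };
        struct cluster_lists lists = { .free = &empty };
        struct cluster *c;

        lists.nonfull[2] = &half;

        c = pick_cluster(&lists, 2);
        printf("order-2 request served by the %s cluster (used = %d)\n",
               c == &half ? "nonfull" : "empty", c->used);
        return 0;
}

That is, an order-N request only consumes a completely empty cluster
when no nonfull cluster of that order is available, which should take
the pressure off the empty cluster list. Please correct me if the
series does something different.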
[snip]

--
Best Regards,
Huang, Ying