On Thu, Aug 8, 2024 at 1:38 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
> Chris Li <chrisl@xxxxxxxxxx> writes:
>
> > On Wed, Aug 7, 2024 at 12:59 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >>
> >> Hi, Chris,
> >>
> >> Chris Li <chrisl@xxxxxxxxxx> writes:
> >>
> >> > This is the short-term solution "swap cluster order" listed
> >> > in my "Swap Abstraction" discussion slide 8 in the recent
> >> > LSF/MM conference.
> >> >
> >> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> >> > orders" was introduced, it only allocated the mTHP swap entries
> >> > from the new empty cluster list. That has a fragmentation issue
> >> > reported by Barry.
> >> >
> >> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@xxxxxxxxxxxxxx/
> >> >
> >> > The reason is that all the empty clusters have been exhausted while
> >> > there are plenty of free swap entries in the clusters that are
> >> > not 100% free.
> >> >
> >> > Remember the swap allocation order in the cluster.
> >> > Keep track of the per-order non-full cluster lists for later allocation.
> >> >
> >> > This series gives the swap SSD allocation a new separate code path
> >> > from the HDD allocation. The new allocator uses the cluster lists only
> >> > and no longer does a global scan of swap_map[] without a lock.
> >>
> >> This sounds good. Can we use the SSD allocation method for HDD too?
> >> We may not need a swap entry allocator optimized for HDD.
> >
> > Yes, that is the plan as well. That way we can completely get rid of
> > the old scan_swap_map_slots() code.
>
> Good!
>
> > However, considering the size of the series, let's focus on the
> > cluster allocation path first, get it tested and reviewed.
>
> OK.
>
> > For HDD optimization, mostly just the new block allocation portion
> > needs a separate code path from the new cluster allocator so that it
> > does not do the per-CPU allocation. Allocating from the non-full list
> > doesn't need to change too much.
>
> I suggest not considering HDD optimization at all. Just use the SSD
> algorithm to simplify.

Adding a global next-allocating CI rather than the per-CPU next CI
pointer is pretty trivial as well. It is just a different way to fetch
the next cluster pointer.

>
> >>
> >> Hi, Hugh,
> >>
> >> What do you think about this?
> >>
> >> > This streamlines the swap allocation for SSD. The code matches the
> >> > execution flow much better.
> >> >
> >> > User impact: for users that allocate and free mixed-order mTHP swap
> >> > entries, it greatly improves the success rate of the mTHP swap
> >> > allocation after the initial phase.
> >> >
> >> > It also performs faster when the swapfile is close to full, because the
> >> > allocator can get a non-full cluster from a list rather than scanning
> >> > a lot of swap_map entries.
> >>
> >> Do you have some test results to prove this? Or which test below can
> >> prove this?
> >
> > The two zram tests are already proving this. The system time
> > improvement is about 2% on my low CPU count machine.
> > Kairui has a higher core count machine and the difference is higher
> > there. The theory is that a higher CPU count has higher contention.
>
> I will interpret this as the performance is better in theory. But
> there are almost no measurable results so far.

I am trying to understand why you don't see the performance improvement
in the zram setup in my cover letter as a measurable result?

>
> > The 2% system time number does not sound like much. But consider these
> > two factors:
> > 1) The swap allocator only takes a small percentage of the overall workload.
> > 2) The new allocator does more work.
> > The old allocator has a time tick budget. It will abort and fail to
> > find an entry when it runs out of time budget, even though there are
> > still some free entries on the swapfile.
>
> What is the time tick budget you mentioned?

I was under the impression that the previous swap entry allocation code
will not scan 100% of the swapfile even if there is only one entry left.
Please let me know if my understanding is not correct.

		/* time to take a break? */
		if (unlikely(--latency_ration < 0)) {
			if (n_ret)
				goto done;
			spin_unlock(&si->lock);
			cond_resched();
			spin_lock(&si->lock);
			latency_ration = LATENCY_LIMIT;
		}

>
> > The new allocator can get to the last few free swap entries if they
> > are available. If not, then the new swap allocator will work harder on
> > swap cache reclaim.
> >
> > From the swap cache reclaim aspect, it is very hard to optimize the
> > swap cache reclaim in the old allocation path because the scan
> > position is randomized.
> > The full list and frag list are both designed to help reduce repeated
> > reclaim attempts on the swap cache.
>
> [snip]
>
> --
> Best Regards,
> Huang, Ying
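
[Editorial sketch, not part of the thread: a minimal userspace illustration of
the per-order non-full cluster idea discussed above. The struct layout, the
free_clusters / nonfull_clusters[] names, and the constants are assumptions
made for the demo only; they are not the kernel patch itself.]

/*
 * Illustrative userspace sketch of per-order non-full cluster lists.
 * Names and sizes here are assumptions for the demo, not the patch.
 */
#include <stdio.h>

#define DEMO_MAX_ORDER		4	/* assumed max mTHP swap order */
#define SLOTS_PER_CLUSTER	512	/* e.g. 2MB cluster of 4KB slots */

struct cluster {
	struct cluster *next;	/* singly linked empty/non-full list link */
	int order;		/* order this cluster currently serves */
	int free_slots;		/* free slots remaining in the cluster */
};

static struct cluster *free_clusters;	/* completely empty clusters */
static struct cluster *nonfull_clusters[DEMO_MAX_ORDER + 1];

static struct cluster *pop(struct cluster **head)
{
	struct cluster *ci = *head;

	if (ci)
		*head = ci->next;
	return ci;
}

static void push(struct cluster **head, struct cluster *ci)
{
	ci->next = *head;
	*head = ci;
}

/*
 * Allocate (1 << order) slots.  Prefer a non-full cluster that already
 * serves this order; only take a fresh empty cluster when none exists.
 * This is what keeps high-order allocations succeeding after the empty
 * cluster list is exhausted.
 */
static struct cluster *alloc_cluster(int order)
{
	int nr = 1 << order;
	struct cluster *ci = pop(&nonfull_clusters[order]);

	if (!ci) {
		ci = pop(&free_clusters);
		if (!ci)
			return NULL;	/* the real allocator would try reclaim */
		ci->order = order;
		ci->free_slots = SLOTS_PER_CLUSTER;
	}

	ci->free_slots -= nr;
	if (ci->free_slots >= nr)
		push(&nonfull_clusters[order], ci);	/* still has room */
	return ci;
}

int main(void)
{
	static struct cluster pool[2];

	for (int i = 0; i < 2; i++)
		push(&free_clusters, &pool[i]);

	for (int i = 0; i < 6; i++) {
		struct cluster *ci = alloc_cluster(2);	/* order-2 entries */

		printf("alloc %d: cluster %p, %d slots left\n",
		       i, (void *)ci, ci ? ci->free_slots : -1);
	}
	return 0;
}

The point the sketch makes is the one argued in the thread: once the
empty-cluster list is exhausted, an order-N allocation can still succeed from
a cluster that already serves order N and is not 100% full, rather than
failing outright.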