Kairui Song <ryncsn@xxxxxxxxx> writes:

> On Fri, Aug 16, 2024 at 3:53 PM Chris Li <chrisl@xxxxxxxxxx> wrote:
>>
>> On Thu, Aug 8, 2024 at 1:38 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> >
>> > Chris Li <chrisl@xxxxxxxxxx> writes:
>> >
>> > > On Wed, Aug 7, 2024 at 12:59 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> > >>
>> > >> Hi, Chris,
>> > >>
>> > >> Chris Li <chrisl@xxxxxxxxxx> writes:
>> > >>
>> > >> > This is the short term solutions "swap cluster order" listed
>> > >> > in my "Swap Abstraction" discussion slice 8 in the recent
>> > >> > LSF/MM conference.
>> > >> >
>> > >> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
>> > >> > orders" is introduced, it only allocates the mTHP swap entries
>> > >> > from the new empty cluster list. It has a fragmentation issue
>> > >> > reported by Barry.
>> > >> >
>> > >> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@xxxxxxxxxxxxxx/
>> > >> >
>> > >> > The reason is that all the empty clusters have been exhausted while
>> > >> > there are plenty of free swap entries in the cluster that are
>> > >> > not 100% free.
>> > >> >
>> > >> > Remember the swap allocation order in the cluster.
>> > >> > Keep track of the per order non full cluster list for later allocation.
>> > >> >
>> > >> > This series gives the swap SSD allocation a new separate code path
>> > >> > from the HDD allocation. The new allocator use cluster list only
>> > >> > and do not global scan swap_map[] without lock any more.
>> > >>
>> > >> This sounds good. Can we use SSD allocation method for HDD too?
>> > >> We may not need a swap entry allocator optimized for HDD.
>> > >
>> > > Yes, that is the plan as well. That way we can completely get rid of
>> > > the old scan_swap_map_slots() code.
>> >
>> > Good!
>> >
>> > > However, considering the size of the series, let's focus on the
>> > > cluster allocation path first, get it tested and reviewed.
>> >
>> > OK.
>> >
>> > > For HDD optimization, mostly just the new block allocations portion
>> > > need some separate code path from the new cluster allocator to not do
>> > > the per cpu allocation. Allocating from the non free list doesn't
>> > > need to change too
>> >
>> > I suggest not consider HDD optimization at all. Just use SSD algorithm
>> > to simplify.
>>
>> Adding a global next allocating CI rather than the per CPU next CI
>> pointer is pretty trivial as well. It is just a different way to fetch
>> the next cluster pointer.
>
> Yes, if we enable the new cluster based allocator for HDD, we can
> enable THP and mTHP for HDD too, and use a global cluster_next instead
> of Per-CPU for it.
> It's easy to do with minimal changes, and should actually boost
> performance for HDD SWAP. Currently testing this locally.

I think that it's better to start with SSD algorithm. Then, you can add
HDD specific optimization on top of it with supporting data.

BTW, I don't know why HDD shouldn't use per-CPU cluster. Sequential
writing is more important for HDD.

>> > >>
>> > >> Hi, Hugh,
>> > >>
>> > >> What do you think about this?
>> > >>
>> > >> > This streamline the swap allocation for SSD. The code matches the
>> > >> > execution flow much better.
>> > >> >
>> > >> > User impact: For users that allocate and free mix order mTHP swapping,
>> > >> > It greatly improves the success rate of the mTHP swap allocation after the
>> > >> > initial phase.
>> > >> >
>> > >> > It also performs faster when the swapfile is close to full, because the
>> > >> > allocator can get the non full cluster from a list rather than scanning
>> > >> > a lot of swap_map entries.
>> > >>
>> > >> Do you have some test results to prove this? Or which test below can
>> > >> prove this?
>> > >
>> > > The two zram tests are already proving this. The system time
>> > > improvement is about 2% on my low CPU count machine.
>> > > Kairui has a higher core count machine and the difference is higher
>> > > there. The theory is that higher CPU count has higher contentions.
>> >
>> > I will interpret this as the performance is better in theory. But
>> > there's almost no measurable results so far.
>>
>> I am trying to understand why don't see the performance improvement in
>> the zram setup in my cover letter as a measurable result?
>
> Hi Ying, you can check the test with the 32 cores AMD machine in the
> cover letter, as Chris pointed out the performance gain is higher as
> core number grows. The performance gain is still not much (*yet, based
> on this design thing can go much faster after HDD codes are
> dropped which enables many other optimizations, this series
> is mainly focusing on the fragmentation issue), but I think a
> stable ~4 - 8% improvement with a build linux kernel test
> could be considered measurable?

Is this the test result for "when the swapfile is close to full"?

--
Best Regards,
Huang, Ying
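
As a rough illustration of the "per order non full cluster list" idea quoted
above (remember the order a cluster was allocated at, and keep partially used
clusters on per-order lists so later allocations of the same order can reuse
them instead of demanding a fully empty cluster), a minimal userspace sketch
might look like the following. This is not the actual kernel patch; the
structure names, field names, and sizes (cluster, swap_dev, NR_ORDERS,
CLUSTER_SLOTS) are assumptions made purely for the example.

#include <stdbool.h>
#include <stddef.h>

#define NR_ORDERS      4      /* assumed: mTHP swap orders 0..3 */
#define CLUSTER_SLOTS  512    /* assumed: swap entries per cluster */

struct cluster {
	struct cluster *next;     /* link for the list it currently sits on */
	bool on_list;             /* true while linked on a free/nonfull list */
	int order;                /* allocation order this cluster serves */
	int used;                 /* slots currently allocated */
};

struct swap_dev {
	struct cluster *free_list;            /* completely empty clusters */
	struct cluster *nonfull[NR_ORDERS];   /* partially used, per order */
};

static struct cluster *pop(struct cluster **head)
{
	struct cluster *ci = *head;

	if (ci) {
		*head = ci->next;
		ci->on_list = false;
	}
	return ci;
}

static void push(struct cluster **head, struct cluster *ci)
{
	ci->next = *head;
	ci->on_list = true;
	*head = ci;
}

/* Allocate 2^order contiguous slots; return the cluster used, or NULL. */
struct cluster *cluster_alloc(struct swap_dev *dev, int order)
{
	int nr = 1 << order;
	struct cluster *ci;

	/* Prefer a non-full cluster already serving this order ... */
	ci = pop(&dev->nonfull[order]);
	if (!ci) {
		/* ... otherwise claim an empty cluster for this order. */
		ci = pop(&dev->free_list);
		if (!ci)
			return NULL;   /* no cluster can hold this order */
		ci->order = order;
		ci->used = 0;
	}

	ci->used += nr;
	/* Keep it reachable if another chunk of the same order still fits. */
	if (ci->used + nr <= CLUSTER_SLOTS)
		push(&dev->nonfull[order], ci);
	return ci;
}

/* Free 2^order slots previously allocated from @ci. */
void cluster_free(struct swap_dev *dev, struct cluster *ci, int order)
{
	ci->used -= 1 << order;
	if (ci->on_list)
		return;                    /* already findable on a list */
	if (ci->used == 0)
		push(&dev->free_list, ci); /* fully empty again */
	else
		push(&dev->nonfull[ci->order], ci);
}

The point of the per-order lists is that a mix of order-0 and mTHP
allocations no longer exhausts the empty-cluster list: a high-order request
can fall back to a partially used cluster already dedicated to that order,
which is the fragmentation case Barry reported.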