On Mon, Aug 19, 2024 at 4:31 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
> Kairui Song <ryncsn@xxxxxxxxx> writes:
>
> > On Fri, Aug 16, 2024 at 3:53 PM Chris Li <chrisl@xxxxxxxxxx> wrote:
> >>
> >> On Thu, Aug 8, 2024 at 1:38 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >> >
> >> > Chris Li <chrisl@xxxxxxxxxx> writes:
> >> >
> >> > > On Wed, Aug 7, 2024 at 12:59 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >> > >>
> >> > >> Hi, Chris,
> >> > >>
> >> > >> Chris Li <chrisl@xxxxxxxxxx> writes:
> >> > >>
> >> > >> > This is the short term solution "swap cluster order" listed
> >> > >> > in my "Swap Abstraction" discussion, slide 8, at the recent
> >> > >> > LSF/MM conference.
> >> > >> >
> >> > >> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> >> > >> > orders" was introduced, it only allocated the mTHP swap entries
> >> > >> > from the new empty cluster list. It has a fragmentation issue
> >> > >> > reported by Barry.
> >> > >> >
> >> > >> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@xxxxxxxxxxxxxx/
> >> > >> >
> >> > >> > The reason is that all the empty clusters have been exhausted
> >> > >> > while there are plenty of free swap entries in the clusters
> >> > >> > that are not 100% free.
> >> > >> >
> >> > >> > Remember the swap allocation order in the cluster, and keep
> >> > >> > track of a per-order nonfull cluster list for later allocation.
> >> > >> >
> >> > >> > This series gives the SSD swap allocation a new, separate code
> >> > >> > path from the HDD allocation. The new allocator uses the cluster
> >> > >> > lists only and no longer scans swap_map[] globally without a lock.
> >> > >>
> >> > >> This sounds good. Can we use the SSD allocation method for HDD too?
> >> > >> We may not need a swap entry allocator optimized for HDD.
> >> > >
> >> > > Yes, that is the plan as well. That way we can completely get rid of
> >> > > the old scan_swap_map_slots() code.
> >> >
> >> > Good!
> >> >
> >> > > However, considering the size of the series, let's focus on the
> >> > > cluster allocation path first, and get it tested and reviewed.
> >> >
> >> > OK.
> >> >
> >> > > For HDD optimization, mostly just the new block allocation portion
> >> > > needs some separate code path from the new cluster allocator so as
> >> > > not to do the per-CPU allocation. Allocating from the nonfull list
> >> > > doesn't need to change too much.
> >> >
> >> > I suggest not considering HDD optimization at all. Just use the SSD
> >> > algorithm to simplify.
> >>
> >> Adding a global next-allocating CI rather than the per-CPU next CI
> >> pointer is pretty trivial as well. It is just a different way to fetch
> >> the next cluster pointer.
> >
> > Yes, if we enable the new cluster-based allocator for HDD, we can
> > enable THP and mTHP for HDD too, and use a global cluster_next instead
> > of a per-CPU one for it.
> > It's easy to do with minimal changes, and should actually boost
> > performance for HDD swap. I'm currently testing this locally.
>
> I think that it's better to start with the SSD algorithm. Then, you can
> add HDD-specific optimization on top of it with supporting data.

Yes, we have the same idea.

>
> BTW, I don't know why HDD shouldn't use the per-CPU cluster. Sequential
> writing is more important for HDD.
> >> > >>
> >> > >> Hi, Hugh,
> >> > >>
> >> > >> What do you think about this?
> >> > >>
> >> > >> > This streamlines the swap allocation for SSD. The code matches
> >> > >> > the execution flow much better.
> >> > >> >
> >> > >> > User impact: For users that allocate and free mixed-order mTHP
> >> > >> > swapping, it greatly improves the success rate of the mTHP swap
> >> > >> > allocation after the initial phase.
> >> > >> >
> >> > >> > It also performs faster when the swapfile is close to full,
> >> > >> > because the allocator can get a nonfull cluster from a list
> >> > >> > rather than scanning a lot of swap_map entries.
> >> > >>
> >> > >> Do you have some test results to prove this? Or which test below
> >> > >> can prove this?
> >> > >
> >> > > The two zram tests are already proving this. The system time
> >> > > improvement is about 2% on my low-CPU-count machine.
> >> > > Kairui has a machine with a higher core count, and the difference
> >> > > is higher there. The theory is that a higher CPU count has higher
> >> > > contention.
> >> >
> >> > I will interpret this as the performance being better in theory. But
> >> > there are almost no measurable results so far.
> >>
> >> I am trying to understand why you don't see the performance improvement
> >> in the zram setup in my cover letter as a measurable result?
> >
> > Hi Ying, you can check the test with the 32-core AMD machine in the
> > cover letter; as Chris pointed out, the performance gain is higher as
> > the core count grows. The performance gain is still not much (*yet;
> > based on this design, things can go much faster after the HDD code is
> > dropped, which enables many other optimizations; this series is mainly
> > focusing on the fragmentation issue), but I think a stable ~4-8%
> > improvement with a Linux kernel build test could be considered
> > measurable?
>
> Is this the test result for "when the swapfile is close to full"?

Yes, it's about 60% to 90% full during the whole test process. If ZRAM
is completely full the workload will go OOM, but testing with madvise
showed no performance drop.
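
By the way, in case a quick mental model helps other readers of this
thread, below is a rough userspace toy of the per-order nonfull cluster
list idea discussed above. It is not the kernel code from this series;
the struct, list shapes, and sizes are all made up for illustration. It
only shows the core behavior: an order-N request first reuses a cluster
already serving order N, and falls back to an empty cluster only when no
such cluster is available.

/*
 * Toy model of per-order nonfull cluster lists (illustration only, not
 * the kernel implementation). An order-N allocation first tries the
 * nonfull list for that order, then falls back to an empty cluster.
 * A cluster that can no longer fit another order-N chunk simply stays
 * off the list.
 */
#include <stdio.h>
#include <stdlib.h>

#define TOY_CLUSTER_SLOTS 512   /* entries per cluster, arbitrary */
#define TOY_NR_ORDERS     5     /* orders 0..4, arbitrary */

struct toy_cluster {
	struct toy_cluster *next;   /* singly linked, enough for a toy */
	int used;                   /* allocated slots in this cluster */
	int order;                  /* order this cluster serves */
};

static struct toy_cluster *free_list;                    /* empty clusters */
static struct toy_cluster *nonfull_list[TOY_NR_ORDERS];  /* partially used */

static struct toy_cluster *pop(struct toy_cluster **head)
{
	struct toy_cluster *ci = *head;

	if (ci)
		*head = ci->next;
	return ci;
}

static void push(struct toy_cluster **head, struct toy_cluster *ci)
{
	ci->next = *head;
	*head = ci;
}

/* Allocate 2^order slots; return the cluster used, or NULL if none fit. */
static struct toy_cluster *toy_alloc(int order)
{
	int nr = 1 << order;
	struct toy_cluster *ci;

	/* Prefer a cluster already serving this order. */
	ci = pop(&nonfull_list[order]);
	if (!ci) {
		/* Fall back to an empty cluster and claim it for this order. */
		ci = pop(&free_list);
		if (!ci)
			return NULL;    /* no empty cluster left either */
		ci->order = order;
	}

	ci->used += nr;
	/* Still has room for another chunk? Keep it findable for reuse. */
	if (ci->used + nr <= TOY_CLUSTER_SLOTS)
		push(&nonfull_list[order], ci);
	return ci;
}

int main(void)
{
	/* Seed the pool with a few empty clusters. */
	for (int i = 0; i < 4; i++)
		push(&free_list, calloc(1, sizeof(struct toy_cluster)));

	/* Repeated order-2 requests keep reusing the same nonfull cluster. */
	for (int i = 0; i < 3; i++) {
		struct toy_cluster *ci = toy_alloc(2);

		printf("order-2 alloc from cluster %p, used=%d\n",
		       (void *)ci, ci ? ci->used : 0);
	}
	return 0;
}

The real allocator of course also deals with per-CPU next cluster
pointers and locking as discussed above; the toy deliberately ignores
all of that and only shows the list movement that addresses the
fragmentation issue.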