Chris Li <chrisl@xxxxxxxxxx> writes:

> On Thu, Jul 25, 2024 at 7:07 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> > If the freeing of swap entry is random distribution. You need 16
>> > continuous swap entries free at the same time at aligned 16 base
>> > locations. The total number of order 4 free swap space add up together
>> > is much lower than the order 0 allocatable swap space.
>> > If having one entry free is 50% probability(swapfile half full), then
>> > having 16 swap entries is continually free is (0.5) EXP 16 = 1.5 E-5.
>> > If the swapfile is 80% full, that number drops to 6.5 E -12.
>>
>> This depends on workloads. Quite some workloads will show some degree
>> of spatial locality. For a workload with no spatial locality at all as
>> above, mTHP may be not a good choice at the first place.
>
> The fragmentation comes from the order 0 entry not from the mTHP. mTHP
> have their own valid usage case, and should be separate from how you
> use the order 0 entry. That is why I consider this kind of strategy
> only works on the lucky case. I would much prefer the strategy that
> can guarantee work not depend on luck.

It seems that you have some perfect solution. Will learn it when you
post it.

>> >> - Order-4 pages need to be swapped out, but no enough order-4 non-full
>> >> clusters available.
>> >
>> > Exactly.
>> >
>> >>
>> >> So, we need a way to migrate non-full clusters among orders to adjust to
>> >> the various situations automatically.
>> >
>> > There is no easy way to migrate swap entries to different locations.
>> > That is why I like to have discontiguous swap entries allocation for
>> > mTHP.
>>
>> We suggest to migrate non-full swap clusters among different lists, not
>> swap entries.
>
> Then you have the down side of reducing the number of total high order
> clusters. By chance it is much easier to fragment the cluster than
> anti-fragment a cluster.
> The orders of clusters have a natural
> tendency to move down rather than move up, given long enough time of
> random access. It will likely run out of high order clusters in the
> long run if we don't have any separation of orders.

As my example above, you may have almost 0 high-order clusters forever.
So, your solution only works for very specific use cases. It's not a
general solution.

>>
>> But yes, data is needed for any performance related change.
>>
>> BTW: I think non-full cluster isn't a good name. Partial cluster is
>> much better and follows the same convention as partial slab.
>
> I am not opposed to it. The only reason I hold off on the rename is
> because there are patches from Kairui I am testing depending on it.
> Let's finish up the V5 patch with the swap cache reclaim code path
> then do the renaming as one batch job. We actually have more than one
> list that has the clusters partially full. It helps reduce the repeat
> scan of the cluster that is not full but also not able to allocate
> swap entries for this order. Just the name of one of them as
> "partial" is not precise either. Because the other lists are also
> partially full. We'd better give them precise meaning systematically.

I don't think that it's hard to do a search/replace before the next
version.

--
Best Regards,
Huang, Ying