Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list

Chris Li <chrisl@xxxxxxxxxx> · Thu, 25 Jul 2024 22:09:25 -0700

On Thu, Jul 25, 2024 at 7:13 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >
> > The current proposed order also improves things step by step. The only
> > disagreement here is which patch order we introduce yet another list
> > in addition to the nonfull one. I just feel that it does not make
> > sense to invest into new code if that new code is going to be
> > completely rewrite anyway in the next two patches.
> >
> > Unless you mean is we should not do the patch 3 big rewrite and should
> > continue the scan_swap_map_try_ssd_cluster() way of only doing half of
> > the allocation job and let scan_swap_map_slots() do the complex retry
> > on top of try_ssd(). I feel the overall code is more complex and less
> > maintainable.
>
> I haven't look at [3/3], will wait for your next version for that.  So,
> I cannot say which order is better.  Please consider reviewers' effort
> too.  Small step patch is easier to be understood and reviewed.

That is exactly the reason I don't want to introduce too much new code
depending on the scan_swap_map_slots() behavior, which will be
abandoned in the big rewrite. Their constraints are very different. I
want to make the big rewrite patch 3 as small as possible. Using
incremental follow up patches to improve it.

>
> >> > That is why I want to make this change patch after patch 3. There is
> >> > also the long test cycle after the modification to make sure the swap
> >> > code path is stable. I am not resisting a change of patch orders, it
> >> > is that patch can't directly be removed before patch 3 before the big
> >> > rewrite.
> >> >
> >> >
> >> >>
> >> >> > Your original
> >> >> > suggestion appears like that you want to keep all cluster with order-N
> >> >> > on the nonfull list for order-N always unless the number of free swap
> >> >> > entry is less than 1<<N.
> >> >>
> >> >> Well I think that's certainly one of the conditions for removing it. But agree
> >> >> that if a full scan of the cluster has been performed and no swap entries have
> >> >> been freed since the scan started then it should also be removed from the list.
> >> >
> >> > Yes, in the later patch of patch, beyond patch 3, we have the almost
> >> > full cluster that for the cluster has been scan and not able to
> >> > allocate order N entry.
> >> >
> >> >>
> >> >> >
> >> >> >>> And, I understand that in some situations it may
> >> >> >>> be better to share clusters among CPUs.  So my suggestion is,
> >> >> >>>
> >> >> >>> - Make swap_cluster_info->order more accurate, don't pretend that we
> >> >> >>>   have free swap entries with that order even after we are sure that we
> >> >> >>>   haven't.
> >> >> >>
> >> >> >> Is this patch pretending that today? I don't think so?
> >> >> >
> >> >> > IIUC, in this patch swap_cluster_info->order is still "N" even if we are
> >> >> > sure that there are no order-N free swap entry in the cluster.
> >> >>
> >> >> Oh I see what you mean. I think you and Chris already discussed this? IIRC
> >> >> Chris's point was that if you move that cluster to N-1, eventually all clusters
> >> >> are for order-0 and you have no means of allocating high orders until a whole
> >> >> cluster becomes free. That logic certainly makes sense to me, so think its
> >> >> better for swap_cluster_info->order to remain static while the cluster is
> >> >> allocated. (I only skimmed that conversation so appologies if I got the
> >> >> conclusion wrong!).
> >> >
> >> > Yes, that is the original intent, keep the cluster order as much as possible.
> >> >
> >> >>
> >> >> >
> >> >> >> But I agree that a
> >> >> >> cluster should only be on the per-order nonfull list if we know there are at
> >> >> >> least enough free swap entries in that cluster to cover the order. Of course
> >> >> >> that doesn't tell us for sure because they may not be contiguous.
> >> >> >
> >> >> > We can check that when free swap entry via checking adjacent swap
> >> >> > entries.  IMHO, the performance should be acceptable.
> >> >>
> >> >> Would you then use the result of that scanning to "promote" a cluster's order?
> >> >> e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
> >> >> a separate change on top of what Chris is doing here. For high orders there
> >> >> could be quite a bit of scanning required in the worst case for every page that
> >> >> gets freed.
> >> >
> >> > Right, I feel that is a different set of patches. Even this series is
> >> > hard enough for review. Those order promotion and demotion is heading
> >> > towards a buddy system design. I want to point out that even the buddy
> >> > system is not able to handle the case that swapfile is almost full and
> >> > the recently freed swap entries are not contiguous.
> >> >
> >> > We can invest in the buddy system, which doesn't handle all the
> >> > fragmentation issues. Or I prefer to go directly to the discontiguous
> >> > swap entry. We pay a price for the indirect mapping of swap entries.
> >> > But it will solve the fragmentation issue 100%.
> >>
> >> It's good if we can solve the fragmentation issue 100%.  Just need to
> >> pay attention to the cost.
> >
> > The cost you mean the development cost or the run time cost (memory and cpu)?
>
> I mean runtime cost.

Thanks for the clarification. Agree that we need to pay attention to
the run time cost. That is given.

> >> >> >>> My question is whether it's so important to share the per-cpu cluster
> >> >> >>> among CPUs?
> >> >> >>
> >> >> >> My rationale for sharing is that the preference previously has been to favour
> >> >> >> efficient use of swap space; we don't want to fail a request for allocation of a
> >> >> >> given order if there are actually slots available just because they have been
> >> >> >> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
> >> >> >> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
> >> >> >> actually help improve allocation success, then I'm happy to take the exclusive
> >> >> >> approach.
> >> >> >>
> >> >> >>> I suggest to start with simple design, that is, per-CPU
> >> >> >>> cluster will not be shared among CPUs in most cases.
> >> >> >>
> >> >> >> I'm all for starting simple; I think that's what I already proposed (exclusive
> >> >> >> in this patch, then shared in the "big rewrite"). I'm just objecting to the
> >> >> >> current half-and-half policy in this patch.
> >> >> >
> >> >> > Sounds good to me.  We can start with exclusive solution and evaluate
> >> >> > whether shared solution is good.
> >> >>
> >> >> Yep. And also evaluate the dynamic order inc/dec idea too...
> >> >
> >> > It is not able to avoid fragementation 100% of the time. I prefer the
> >> > discontinued swap entry as the next step, which guarantees forward
> >> > progress, we will not be stuck in a situation where we are not able to
> >> > allocate swap entries due to fragmentation.
> >>
> >> If my understanding were correct, the implementation complexity of the
> >> order promotion/demotion isn't at the same level of that of discontinued
> >> swap entry.
> >
> > Discontinued swap entry has higher complexity but higher payout as
> > well. It can get us to the place where cluster promotion/demotion
> > can't.
> >
> > I also feel that if we implement something towards a buddy system
> > allocator for swap, we should do a proper buddy allocator
> > implementation of data structures.
>
> I don't think that it's easy to implement a real buddy allocator for
> swap entries.  So, I avoid to use buddy in my words.

Then such a mix of cluster order promote/demote lose some benefit of
the buddy system. Because it lacks the proper data structure to
support buddy allocation. The buddy allocator provides more general
migration between orders. For the limited usage case of cluster
promotion/demotion is supported (by luck). We need to evaluate whether
it is worth the additional complexity.

Chris