Chris Li <chrisl@xxxxxxxxxx> writes:

> On Thu, Jul 25, 2024 at 7:13 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> >
>> > The current proposed order also improves things step by step. The only
>> > disagreement here is the patch order in which we introduce yet another
>> > list in addition to the nonfull one. I just feel that it does not make
>> > sense to invest in new code if that new code is going to be completely
>> > rewritten anyway in the next two patches.
>> >
>> > Unless you mean we should not do the patch 3 big rewrite and should
>> > continue the scan_swap_map_try_ssd_cluster() way of only doing half of
>> > the allocation job and let scan_swap_map_slots() do the complex retry
>> > on top of try_ssd(). I feel the overall code is more complex and less
>> > maintainable.
>>
>> I haven't looked at [3/3]; I will wait for your next version for that.
>> So, I cannot say which order is better. Please consider reviewers'
>> effort too. Small-step patches are easier to understand and review.
>
> That is exactly the reason I don't want to introduce too much new code
> depending on the scan_swap_map_slots() behavior, which will be
> abandoned in the big rewrite. Their constraints are very different. I
> want to make the big rewrite patch 3 as small as possible, using
> incremental follow-up patches to improve it.
>
>>
>> >> > That is why I want to make this change in a patch after patch 3.
>> >> > There is also the long test cycle after the modification to make
>> >> > sure the swap code path is stable. I am not resisting a change of
>> >> > patch order; it is that the patch can't directly be moved before
>> >> > patch 3, the big rewrite.
>> >> >
>> >> >
>> >> >>
>> >> >> > Your original suggestion appears to be that you want to keep all
>> >> >> > clusters with order-N on the nonfull list for order-N always,
>> >> >> > unless the number of free swap entries is less than 1<<N.
>> >> >>
>> >> >> Well I think that's certainly one of the conditions for removing it.
>> >> >> But I agree that if a full scan of the cluster has been performed and
>> >> >> no swap entries have been freed since the scan started, then it
>> >> >> should also be removed from the list.
>> >> >
>> >> > Yes, in the later patches beyond patch 3, we have the almost-full
>> >> > cluster list for clusters that have been scanned and were not able
>> >> > to allocate an order-N entry.
>> >> >>
>> >> >> >
>> >> >> >>> And, I understand that in some situations it may
>> >> >> >>> be better to share clusters among CPUs. So my suggestion is,
>> >> >> >>>
>> >> >> >>> - Make swap_cluster_info->order more accurate; don't pretend that
>> >> >> >>>   we have free swap entries with that order even after we are sure
>> >> >> >>>   that we haven't.
>> >> >> >>
>> >> >> >> Is this patch pretending that today? I don't think so?
>> >> >> >
>> >> >> > IIUC, in this patch swap_cluster_info->order is still "N" even if
>> >> >> > we are sure that there are no order-N free swap entries in the
>> >> >> > cluster.
>> >> >>
>> >> >> Oh I see what you mean. I think you and Chris already discussed this?
>> >> >> IIRC Chris's point was that if you move that cluster to N-1,
>> >> >> eventually all clusters are for order-0 and you have no means of
>> >> >> allocating high orders until a whole cluster becomes free. That logic
>> >> >> certainly makes sense to me, so I think it's better for
>> >> >> swap_cluster_info->order to remain static while the cluster is
>> >> >> allocated. (I only skimmed that conversation, so apologies if I got
>> >> >> the conclusion wrong!)
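To pin down the two removal conditions discussed above, here is a
minimal sketch. The names keep_on_nonfull_list() and
cluster_free_count(), and the fields scanned_full and freed_since_scan,
are made up for illustration; they are not from the actual patch.

/* Should this cluster stay on the order-N nonfull list? */
static bool keep_on_nonfull_list(struct swap_cluster_info *ci, int order)
{
        /*
         * Too few free entries left to possibly hold an order-N
         * allocation, contiguous or not.
         */
        if (cluster_free_count(ci) < (1 << order))
                return false;

        /*
         * A full scan of the cluster found no order-N slot, and no
         * entry has been freed since that scan started.
         */
        if (ci->scanned_full && !ci->freed_since_scan)
                return false;

        return true;
}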
>> >> >
>> >> > Yes, that is the original intent: keep the cluster order as much as
>> >> > possible.
>> >> >
>> >> >>
>> >> >> >
>> >> >> >> But I agree that a cluster should only be on the per-order nonfull
>> >> >> >> list if we know there are at least enough free swap entries in that
>> >> >> >> cluster to cover the order. Of course that doesn't tell us for sure
>> >> >> >> because they may not be contiguous.
>> >> >> >
>> >> >> > We can check that when freeing a swap entry, by checking adjacent
>> >> >> > swap entries. IMHO, the performance should be acceptable.
>> >> >>
>> >> >> Would you then use the result of that scanning to "promote" a
>> >> >> cluster's order, e.g. swap_cluster_info->order = N+1? That would be
>> >> >> neat. But this all feels like a separate change on top of what Chris
>> >> >> is doing here. For high orders there could be quite a bit of scanning
>> >> >> required in the worst case for every page that gets freed.
>> >> >
>> >> > Right, I feel that is a different set of patches. Even this series is
>> >> > hard enough to review. Order promotion and demotion is heading
>> >> > towards a buddy-system design. I want to point out that even a buddy
>> >> > system is not able to handle the case where the swapfile is almost
>> >> > full and the recently freed swap entries are not contiguous.
>> >> >
>> >> > We can invest in the buddy system, which doesn't handle all the
>> >> > fragmentation issues. Or, and this is my preference, we can go
>> >> > directly to the discontiguous swap entry. We pay a price for the
>> >> > indirect mapping of swap entries, but it will solve the fragmentation
>> >> > issue 100%.
>> >>
>> >> It's good if we can solve the fragmentation issue 100%. We just need
>> >> to pay attention to the cost.
>> >
>> > By cost, do you mean the development cost or the runtime cost (memory
>> > and CPU)?
>>
>> I mean the runtime cost.
>
> Thanks for the clarification. Agreed that we need to pay attention to
> the runtime cost. That is a given.
>
>> >> >> >>> My question is whether it's so important to share the per-cpu
>> >> >> >>> cluster among CPUs?
>> >> >> >>
>> >> >> >> My rationale for sharing is that the preference previously has been
>> >> >> >> to favour efficient use of swap space; we don't want to fail a
>> >> >> >> request for allocation of a given order if there are actually slots
>> >> >> >> available, just because they have been reserved by another CPU. And
>> >> >> >> I'm still asserting that it should be ~zero cost to do this. If I'm
>> >> >> >> wrong about the zero cost, or in practice the sharing doesn't
>> >> >> >> actually help improve allocation success, then I'm happy to take the
>> >> >> >> exclusive approach.
>> >> >> >>
>> >> >> >>> I suggest starting with a simple design; that is, the per-CPU
>> >> >> >>> cluster will not be shared among CPUs in most cases.
>> >> >> >>
>> >> >> >> I'm all for starting simple; I think that's what I already proposed
>> >> >> >> (exclusive in this patch, then shared in the "big rewrite"). I'm
>> >> >> >> just objecting to the current half-and-half policy in this patch.
>> >> >> >
>> >> >> > Sounds good to me. We can start with the exclusive solution and
>> >> >> > evaluate whether the shared solution is good.
>> >> >>
>> >> >> Yep. And also evaluate the dynamic order inc/dec idea too...
>> >> >
>> >> > It is not able to avoid fragmentation 100% of the time. I prefer the
>> >> > discontiguous swap entry as the next step, which guarantees forward
>> >> > progress; we will not be stuck in a situation where we are not able
>> >> > to allocate swap entries due to fragmentation.
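For concreteness, the indirection being preferred here could look
roughly like the sketch below. The struct and its fields are
hypothetical, not from any posted patch; the extra table and lookup it
implies are exactly the runtime cost discussed above.

/*
 * One swap "handle" for an order-N folio maps to 1 << order physical
 * slots that need not be contiguous, so allocation cannot fail merely
 * because no contiguous run of free slots exists.
 */
struct swap_indirection {
        unsigned int order;
        swp_entry_t slot[];     /* 1 << order slots, from anywhere */
};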
>> >>
>> >> If my understanding is correct, the implementation complexity of
>> >> order promotion/demotion isn't at the same level as that of the
>> >> discontiguous swap entry.
>> >
>> > The discontiguous swap entry has higher complexity but a higher payoff
>> > as well. It can get us to places where cluster promotion/demotion
>> > can't.
>> >
>> > I also feel that if we implement something towards a buddy-system
>> > allocator for swap, we should do a proper buddy-allocator
>> > implementation of the data structures.
>>
>> I don't think that it's easy to implement a real buddy allocator for
>> swap entries. So, I avoided using the word "buddy".
>
> Then such a mix of cluster order promote/demote loses some benefit of
> the buddy system, because it lacks the proper data structures to
> support buddy allocation. The buddy allocator provides more general
> migration between orders; only a limited usage case of cluster
> promotion/demotion is supported (by luck). We need to evaluate whether
> it is worth the additional complexity.

TBH, I believe that the complexity of order promote/demote is quite
low, both for development and at runtime. A real buddy allocator may
need to increase the per-swap-entry memory footprint considerably.
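To put the footprint point in concrete terms: today the allocator
tracks one byte per slot in swap_map[], plus one small
swap_cluster_info per cluster. A classic buddy allocator would want
per-free-block metadata roughly like the sketch below (the struct is
hypothetical, shown only for scale): list linkage plus an order, which
is far more than one byte per slot once free blocks are small.

struct swap_buddy_block {               /* hypothetical, for scale only */
        struct list_head link;          /* 16 bytes on a 64-bit kernel */
        unsigned char order;            /* order of this free block */
};

--
Best Regards,
Huang, Ying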