Ryan Roberts <ryan.roberts@xxxxxxx> writes:

> On 18/07/2024 08:53, Huang, Ying wrote:
>> Chris Li <chrisl@xxxxxxxxxx> writes:
>>
>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>>>>
>>>> On 16/07/2024 23:46, Chris Li wrote:
>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>>>>>>
>>>>>> On 11/07/2024 08:29, Chris Li wrote:

[snip]

>>>>>>> +
>>>>>>> +	if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>>>>>>> +		list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>>>>>>
>>>>>> I find the transitions when you add and remove a cluster from the
>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): it is
>>>>>> added to the list whenever it has at least one free swap entry, if not
>>>>>> already on the list. But you take it off the list when assigning it as the
>>>>>> current cluster for a cpu in scan_swap_map_try_ssd_cluster().
>>>>>>
>>>>>> So you could have this situation:
>>>>>>
>>>>>> - cpuA allocs cluster from free list (exclusive to that cpu)
>>>>>> - cpuA allocs 1 swap entry from current cluster
>>>>>> - swap entry is freed; cluster added to nonfull_clusters
>>>>>> - cpuB "allocs" cluster from nonfull_clusters
>>>>>>
>>>>>> At this point both cpuA and cpuB share the same cluster as their current
>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
>>>>>> allocation time (when removed from free_list) and only remove it from the
>>>>>
>>>>> The big rewrite in patch 3 does that, taking it off the free list and
>>>>> moving it into nonfull.
>>>>
>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>>>> scan_swap_map_slots()", I assumed that was just a refactoring of the code to
>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>>>> refactoring separated from behavioural changes.
>>>
>>> It is not a refactoring. It is a big rewrite of the swap allocator
>>> using the cluster. Behavior change is expected. The goal is to completely
>>> remove the brute-force scanning of the swap_map[] array for cluster swap
>>> allocation.
>>>
>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
>>>> deferring review. But it sounds like it is actually required to realize the
>>>> test results quoted in the cover letter?
>>>
>>> Yes, it is required because it handles the fallout case where try_ssd()
>>> previously failed. This big rewrite has gone through a lot of testing and
>>> bug fixing. It is pretty stable now. The only reason I keep it as RFC is
>>> that it is not feature complete. Currently it does not do swap cache
>>> reclaim. The next version will have swap cache reclaim and remove the RFC.
>>>
>>>>> I am only making the minimal change in this step so the big rewrite can
>>>>> land.
>>>>>
>>>>>> nonfull_clusters list when it is completely full (or at least definitely
>>>>>> doesn't have room for an `order` allocation)? Then you allow "stealing"
>>>>>> always instead of just sometimes. You would likely want to move the
>>>>>> cluster to the end of the nonfull list when selecting it in
>>>>>> scan_swap_map_try_ssd_cluster() to reduce the chances of multiple CPUs
>>>>>> using the same cluster.
>>>>>
>>>>> For nonfull clusters it is less important to avoid multiple CPUs sharing
>>>>> the cluster, because the cluster already has swap entries allocated by a
>>>>> previous CPU.
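Just to keep the two transitions being discussed in front of us, the behaviour
in this version is roughly the following (a simplified paraphrase of the hunk
and the description above, not the literal code from the patch):

	/* on free: a cluster that just gained a free entry goes (back) onto
	 * the per-order nonfull list, as in the hunk quoted above */
	if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
		list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
		ci->flags |= CLUSTER_FLAG_NONFULL;
	}

	/* in scan_swap_map_try_ssd_cluster(): taking a cluster off the
	 * nonfull list as the per-cpu current cluster makes it exclusive
	 * again, but only until the next free re-adds it, which is what
	 * allows the cpuA/cpuB sequence above */
	ci = list_first_entry(&p->nonfull_clusters[order],
			      struct swap_cluster_info, list);
	list_del_init(&ci->list);
	ci->flags &= ~CLUSTER_FLAG_NONFULL;
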
>>>>
>>>> But if 2 CPUs have the same cluster, isn't there a pathological case where
>>>> cpuA could be slightly ahead of cpuB so that cpuA allocates all the free
>>>> pages and
>>>
>>> That already happens with the per-cpu next pointer. When one CPU advances
>>> its next cluster pointer, it can cross the other CPU's next cluster
>>> pointer.
>>
>> No. si->percpu_cluster[cpu].next will stay within the current per-cpu
>> cluster only. If it doesn't do that, we should fix it.
>>
>> I agree with Ryan that we should make the per-cpu cluster correct. A
>> cluster being used as a per-cpu cluster shouldn't be put on the nonfull
>> list. When we scan to the end of a per-cpu cluster, we can put the cluster
>> on the nonfull list if necessary. And we should make it correct in this
>> patch instead of later in the series. I understand that you want to keep
>> the patch itself simple, but it's important to make the code simple to
>> understand too. A consistent design choice will do that.
>
> I think I'm actually arguing for the opposite of what you suggest here.

Sorry, I misunderstood your words.

> As I see it, there are 2 possible approaches; either a cluster is always
> considered exclusive to a single cpu while it is set as a per-cpu cluster, so
> it does not appear on the nonfull list, or a cluster is considered sharable
> in this case, in which case it should be added to the nonfull list.
>
> The code at the moment sort of does both; when a cpu decides to use a cluster
> on the nonfull list, it removes it from that list to make it exclusive. But
> as soon as a single swap entry is freed from that cluster, it is put back on
> the list. This neither-one-policy-nor-the-other seems odd to me.
>
> I think Huang, Ying is arguing to keep it always exclusive while installed as
> a per-cpu cluster.

Yes.

> I was arguing to make it always shared. Perhaps the best approach is to
> implement the exclusive policy in this patch (you'd need a flag to note if
> any pages were freed while in exclusive use, then when exclusive use
> completes, put it back on the nonfull list if the flag was set). Then migrate
> to the shared approach as part of the "big rewrite"?
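For what it's worth, the exclusive policy with such a flag could be quite
small. A rough sketch (the flag and helper names below are made up for
illustration only; this is not code from the patch):

	/* freeing an entry while the cluster is some cpu's current cluster:
	 * leave the nonfull list alone, just remember that space appeared */
	if (ci->flags & CLUSTER_FLAG_EXCLUSIVE) {
		ci->flags |= CLUSTER_FLAG_FREE_PENDING;
	} else if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
		list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
		ci->flags |= CLUSTER_FLAG_NONFULL;
	}

	/* when the cpu moves off the cluster (no room left for this order),
	 * return it to the nonfull list if anything was freed meanwhile */
	static void release_percpu_cluster(struct swap_info_struct *p,
					   struct swap_cluster_info *ci)
	{
		ci->flags &= ~CLUSTER_FLAG_EXCLUSIVE;
		if (ci->flags & CLUSTER_FLAG_FREE_PENDING) {
			ci->flags &= ~CLUSTER_FLAG_FREE_PENDING;
			ci->flags |= CLUSTER_FLAG_NONFULL;
			list_add_tail(&ci->list,
				      &p->nonfull_clusters[ci->order]);
		}
	}
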
>>
>>>> cpuB just ends up scanning and finding nothing to allocate. I think we do
>>>> want to share the cluster when we really need to, but try to avoid it if
>>>> there are other options, and I think moving the cluster to the end of the
>>>> list might be a way to help that?
>>>
>>> Simply moving it to the end of the list can create a possible dead loop
>>> when all clusters have been scanned and no available swap range has been
>>> found.

I also think that the shared approach has a dead-loop issue.

>> This is another reason that we should put the cluster in
>> nonfull_clusters[order--] if there are no free swap entries of "order" in
>> the cluster. It makes the design complex to keep it in
>> nonfull_clusters[order].
>>
>>> We have tried many different approaches, including moving to the end of
>>> the list. It can cause more fragmentation, because each CPU allocates its
>>> swap slot cache (64 entries) from a different cluster.
>>>
>>>>> Those behaviors will be fine-tuned after the patch 3 big rewrite. Try to
>>>>> make this patch simple.
>>>
>>> Again, I want to keep it simple here so patch 3 can land.
>>>
>>>>>> Another potential optimization (which was in my hacked version IIRC) is
>>>>>> to only add/remove from the nonfull list when `total - count` crosses
>>>>>> the (1 << order) boundary rather than when becoming completely full. You
>>>>>> definitely won't be able to allocate order-2 if there are only 3 pages
>>>>>> available, for example.
>>>>>
>>>>> That is in patch 3 as well. This patch is just doing the bare minimum to
>>>>> introduce the nonfull list.
>>>>>

[snip]

--
Best Regards,
Huang, Ying