Hi Ying,

On Fri, Oct 25, 2024 at 1:56 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
> Kairui Song <ryncsn@xxxxxxxxx> writes:
>
> > On Thu, Oct 24, 2024 at 11:29 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >>
> >> Hi, Kairui,
> >>
> >> When reading the new swap cluster allocation code (sorry, I didn't
> >> find time to review your patch in the first place), I found that in
> >> cluster_alloc_swap_entry() the following code path is possible:
> >>
> >>   cluster = this_cpu_ptr(si->percpu_cluster);
> >>   offset = alloc_swap_scan_cluster();
> >>   ...
> >>   spin_unlock(&ci->lock);
> >>   spin_unlock(&si->lock);
> >>   /* migrate to another cpu */
> >>   spin_lock(&ci->lock);
> >>   spin_lock(&si->lock);
> >>   cluster->next[order] = offset;
> >>
> >> That is, the per cpu cluster of a CPU may be changed on another CPU.
> >> I guess that this will not cause a functionality issue. However, it
> >> makes the code harder to reason about. Is it possible to avoid
> >> unlocking before changing the per cpu cluster? Or give up updating
> >> the per cpu cluster if we need to unlock.
> >
> > Hi Ying,
> >
> > Yes, I've noticed this when working on the new si->lock series. It
> > may cause fragmentation, but only if the CPU the current allocation
> > migrated away from requests another swap allocation in the meantime;
> > in that case one per cpu cluster may be wasted.
> >
> > No functionality or leak issue, but it does make things harder to
> > reason about, as you mentioned.
> >
> > In the new series I used a local lock to prevent migration, so the
> > cpu is stable during the whole process. However, that is not doable
> > here unless si->lock is removed from the allocation path, or there
> > will be lock dependency issues.
> >
> > Do we need to fix this separately in the current tree? The
> > fragmentation is overall better with these patches, so I don't think
> > this is causing any issues. Before these patches the per CPU cluster
> > could also be stolen by other CPUs, so it was never guaranteed to be
> > accurate.
> >
> > One easy thing we can do is set cluster->next[order] to 0 before
> > scanning for a new cluster to use, and check whether it is still 0
> > before setting next[order] again.
> >
> > That may still cause fragmentation, as migration will cause extra
> > clusters to be scanned and used, which is hard to avoid without
> > changing too much code (like the things being done in the new
> > si->lock series).
>
> I found another, possibly more serious, problem. After an unlock/lock
> cycle, the state of the swap device may have changed (for example, it
> may have been swapoffed). So in the previous code,
>
>   si->flags += SWP_SCANNING;
>
>   if (!(si->flags & SWP_WRITEOK))
>
> were used for that. I don't see them being checked after the
> unlock/lock cycle. Can you check it?

Thanks for your review. Yes, I think you are right. I did notice the
flags were not updated properly in the latest mm branch, but did not
notice it was a newly introduced issue. So I fixed the issue in
https://lore.kernel.org/linux-mm/20241022192451.38138-8-ryncsn@xxxxxxxxx/

But a quick fix can be posted before that series. The following patch
should be enough, what do you think?
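To make the intent concrete, here is roughly how cluster_alloc_swap()
ends up looking with that change applied. This is just a sketch
assembled from the hunks below plus my reading of the surrounding code
in the mm tree, so the context lines may differ slightly from what is
actually there:

static int cluster_alloc_swap(struct swap_info_struct *si,
                              unsigned char usage, int nr,
                              swp_entry_t slots[], int order)
{
        int n_ret = 0;

        VM_BUG_ON(!si->cluster_info);

        /*
         * SWP_SCANNING is the highest defined bit of si->flags and is
         * used as a count of in-flight scanners rather than a plain bit
         * flag, hence the += / -= pair; swapoff waits for it to drain.
         */
        si->flags += SWP_SCANNING;

        /*
         * Re-checking SWP_WRITEOK on each iteration stops the loop once
         * a concurrent swapoff has cleared it, since the flag may change
         * across the unlock/lock cycles inside cluster_alloc_swap_entry().
         */
        while (n_ret < nr && si->flags & SWP_WRITEOK) {
                unsigned long offset = cluster_alloc_swap_entry(si, order, usage);

                if (!offset)
                        break;
                slots[n_ret++] = swp_entry(si->type, offset);
        }

        si->flags -= SWP_SCANNING;

        return n_ret;
}

The patch itself: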
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 46bd4b1a3c07..6b7d4ac1d4a3 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1041,7 +1041,9 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
 
 	VM_BUG_ON(!si->cluster_info);
 
-	while (n_ret < nr) {
+	si->flags += SWP_SCANNING;
+
+	while (n_ret < nr && si->flags & SWP_WRITEOK) {
 		unsigned long offset = cluster_alloc_swap_entry(si, order, usage);
 
 		if (!offset)
@@ -1049,6 +1051,8 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
 		slots[n_ret++] = swp_entry(si->type, offset);
 	}
 
+	si->flags -= SWP_SCANNING;
+
 	return n_ret;
 }
 

The "+= SWP_SCANNING" looks ugly, but it has to be done this way, just
like in the original allocation code: multiple scanners can update the
flag at the same time, and since SWP_SCANNING is the highest defined
bit in the flags field, it is effectively used as a counter there
rather than a strict bit flag. This will be cleaned up properly in the
patch I mentioned above.

>
> --
> Best Regards,
> Huang, Ying