Kairui Song <ryncsn@xxxxxxxxx> writes:

> Hi Ying,
>
> On Fri, Oct 25, 2024 at 1:56 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>>
>> Kairui Song <ryncsn@xxxxxxxxx> writes:
>>
>> > On Thu, Oct 24, 2024 at 11:29 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> >>
>> >> Hi, Kairui,
>> >>
>> >> While reading the new swap cluster allocation code (sorry, I didn't
>> >> find time to review your patch earlier), I found that the following
>> >> code path is possible in cluster_alloc_swap_entry():
>> >>
>> >> cluster = this_cpu_ptr(si->percpu_cluster);
>> >> offset = alloc_swap_scan_cluster();
>> >> ...
>> >> spin_unlock(&ci->lock);
>> >> spin_unlock(&si->lock);
>> >> /* migrate to another cpu */
>> >> spin_lock(&ci->lock);
>> >> spin_lock(&si->lock);
>> >> cluster->next[order] = offset;
>> >>
>> >> That is, the per-CPU cluster of one CPU may be changed on another
>> >> CPU. I guess this will not cause a functionality issue, but it makes
>> >> the code harder to reason about. Is it possible to avoid unlocking
>> >> before changing the per-CPU cluster? Or to give up updating the
>> >> per-CPU cluster if we need to unlock?
>> >
>> > Hi Ying,
>> >
>> > Yes, I noticed this while working on the new si->lock series. It can
>> > cause fragmentation, but only if the CPU that the current allocation
>> > migrated away from requests another swap allocation; in that case one
>> > per-CPU cluster may be wasted.
>> >
>> > There is no functionality or leak issue, but it does make things
>> > harder to reason about, as you mentioned.
>> >
>> > In the new series I use a local lock to prevent migration, so the CPU
>> > is stable during the whole process. However, that is not doable until
>> > si->lock is removed from the allocation path; otherwise there are
>> > lock dependency issues.
>> >
>> > Do we need to fix this separately in the current tree? Fragmentation
>> > is overall better with these patches, so I don't think this is
>> > causing any issues. Before these patches, a per-CPU cluster could
>> > also be stolen by other CPUs, so it was never guaranteed to be
>> > accurate.
>> >
>> > One easy fix is to set cluster->next[order] to 0 before scanning for
>> > a new cluster to use, and to check whether it is still 0 before
>> > setting next[order] again.
>> >
>> > That may still cause fragmentation, as migration will cause extra
>> > clusters to be scanned and used; this is hard to avoid without
>> > changing too much code (like the things being done in the new
>> > si->lock series).
>>
>> I found another, possibly more serious, problem. After an unlock/lock
>> cycle, the state of the swap device may have changed (for example, it
>> may have been swapped off). In the previous code,
>>
>> si->flags += SWP_SCANNING;
>>
>> if (!(si->flags & SWP_WRITEOK))
>>
>> were used for that. I don't see them being used after the unlock/lock
>> cycle. Can you check?
>
> Thanks for your review.
>
> Yes, I think you are right. I did notice the flags are not updated
> properly in the latest mm branch, but did not notice it was a newly
> introduced issue.
>
> So I fixed the issue in
> https://lore.kernel.org/linux-mm/20241022192451.38138-8-ryncsn@xxxxxxxxx/
>
> But a quick fix can be posted before that series. The following patch
> should be enough; what do you think?
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 46bd4b1a3c07..6b7d4ac1d4a3 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1041,7 +1041,9 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
>
>  	VM_BUG_ON(!si->cluster_info);
>
> -	while (n_ret < nr) {
> +	si->flags += SWP_SCANNING;
> +	while (n_ret < nr && si->flags & SWP_WRITEOK) {
>  		unsigned long offset = cluster_alloc_swap_entry(si,
>  				order, usage);
>
>  		if (!offset)
> @@ -1049,6 +1051,8 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
>  		slots[n_ret++] = swp_entry(si->type, offset);
>  	}
>
> +	si->flags -= SWP_SCANNING;
> +
>  	return n_ret;
>  }
>
> The "+= SWP_SCANNING" looks ugly, but it has to be done this way, just
> like in the original allocation code: multiple scanners can update the
> flags at the same time, and SCANNING is the highest defined bit in the
> flags field, so here it acts as a counter rather than strictly as a
> bit flag. This will be cleaned up properly in the patch I just
> mentioned.
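(Editor's note: as Kairui explains above, SWP_SCANNING is used as a
counter in the top bits of si->flags rather than as a plain bit flag.
Here is a minimal user-space sketch of that trick; the flag values are
assumptions modeled on include/linux/swap.h, and nothing below is code
from the patch.)

#include <assert.h>
#include <stdio.h>

#define SWP_WRITEOK	0x2	/* assumed: an ordinary bit flag */
#define SWP_SCANNING	0x4000	/* assumed: the highest defined flag bit */

int main(void)
{
	unsigned long flags = SWP_WRITEOK;

	/* Two scanners enter; each adds SWP_SCANNING.  Because SCANNING
	 * is the highest defined bit, the additions only carry upward
	 * and never disturb the ordinary flag bits below it. */
	flags += SWP_SCANNING;
	flags += SWP_SCANNING;

	assert(flags & SWP_WRITEOK);	/* lower bits are intact */
	printf("scanners in flight: %lu\n", flags / SWP_SCANNING);

	/* A swapoff-style wait condition: any scanner still active? */
	printf("still scanning: %s\n", flags >= SWP_SCANNING ? "yes" : "no");

	/* Both scanners leave; the counter drops back to zero. */
	flags -= SWP_SCANNING;
	flags -= SWP_SCANNING;
	assert(flags == SWP_WRITEOK);
	printf("still scanning: %s\n", flags >= SWP_SCANNING ? "yes" : "no");
	return 0;
}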
I think the logic of SWP_SCANNING above is correct. As for SWP_WRITEOK,
we should stop allocating swap entries as soon as another task clears
it. So the above check may be too late, because by then we have already
allocated the swap entry. I think we can check it in
cluster_reclaim_range() instead and refuse to allocate the swap entry
there.

Combined with the per-CPU cluster (cluster->next[]) changing issue we
discussed in another email in this thread, I think the long-term
solution could be to separate swap entry allocation and reclaim a bit
more. For example, we could first try to allocate some swap entries
with si->lock held, record possible reclaim offsets if any, and change
cluster->next[]; then unlock, reclaim, and lock again.

--
Best Regards,
Huang, Ying
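(Editor's note: to make the separation Ying sketches above concrete,
here is a rough user-space mock-up of the proposed ordering: scan and
update the next[] hint with the lock held, drop the lock only for the
expensive reclaim, then relock and re-check SWP_WRITEOK before
committing. Everything here, including struct swap_info, scan_free(),
scan_reclaimable() and do_reclaim(), is a hypothetical stand-in rather
than kernel code; only SWP_WRITEOK, cluster->next[] and the
lock/unlock choreography come from the thread.)

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define SWP_WRITEOK	0x2
#define NR_ORDERS	10

struct swap_info {
	pthread_mutex_t lock;
	unsigned long flags;
	unsigned long next[NR_ORDERS];	/* per-order allocation hints */
};

/* Toy stand-ins for the real cluster scan and reclaim helpers. */
static unsigned long scan_free(struct swap_info *si, int order)
{
	(void)si; (void)order;
	return 0;	/* pretend no free range was found */
}

static unsigned long scan_reclaimable(struct swap_info *si, int order)
{
	(void)si; (void)order;
	return 512;	/* pretend offset 512 looks reclaimable */
}

static bool do_reclaim(unsigned long offset)
{
	(void)offset;
	return true;	/* pretend reclaim succeeded */
}

static unsigned long alloc_swap_entry(struct swap_info *si, int order)
{
	unsigned long offset, reclaim_off = 0;

	pthread_mutex_lock(&si->lock);
	offset = scan_free(si, order);
	if (!offset)
		reclaim_off = scan_reclaimable(si, order);
	/* Update the hint BEFORE dropping the lock, so migration can
	 * never let another CPU rewrite it behind our back. */
	if (offset || reclaim_off)
		si->next[order] = (offset ? offset : reclaim_off) +
				  (1UL << order);
	pthread_mutex_unlock(&si->lock);

	if (offset)
		return offset;
	if (!reclaim_off || !do_reclaim(reclaim_off))
		return 0;

	pthread_mutex_lock(&si->lock);
	/* Re-check after relocking: the device may have been swapped
	 * off while the lock was dropped for reclaim. */
	if (!(si->flags & SWP_WRITEOK))
		reclaim_off = 0;
	pthread_mutex_unlock(&si->lock);
	return reclaim_off;
}

int main(void)
{
	struct swap_info si = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.flags = SWP_WRITEOK,
	};

	printf("allocated offset: %lu\n", alloc_swap_entry(&si, 0));
	printf("next[0] hint:     %lu\n", si.next[0]);
	return 0;
}

The point is only the lock choreography; a real series would of course
keep this inside the existing cluster locking scheme.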