Re: cluster_alloc_swap_entry() may change per-cpu cluster of another CPU randomly

Hi Ying,

On Fri, Oct 25, 2024 at 1:56 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
> Kairui Song <ryncsn@xxxxxxxxx> writes:
>
> > On Thu, Oct 24, 2024 at 11:29 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> >>
> >> Hi, Kairui,
> >>
> >> When reading new swap cluster allocation code.  Sorry, I didn't find
> >> time to review your patch at the first place.  I found that in
> >> cluster_alloc_swap_entry(), the following code path is possible,
> >>
> >>   cluster = this_cpu_ptr(si->percpu_cluster);
> >>   offset = alloc_swap_scan_cluster();
> >>     ...
> >>       spin_unlock(&ci->lock);
> >>       spin_unlock(&si->lock);
> >>       /* migrate to another cpu */
> >>       spin_lock(&ci->lock);
> >>       spin_lock(&si->lock);
> >>   cluster->next[order] = offset;
> >>
> >> That is, the per cpu cluster of a CPU may be changed on another CPU.  I
> >> guess that this will not cause a functionality issue.  However, it
> >> makes the code harder to reason about.  Is it possible to avoid
> >> unlocking before changing the per cpu cluster?  Or, give up updating
> >> the per cpu cluster if we need to unlock.
> >
> > Hi Ying
> >
> > Yes, I noticed this when working on the new si->lock series. It may
> > cause fragmentation, but only if the CPU that the current allocation
> > migrated from requests another swap allocation; in that case it may
> > waste one per-CPU cluster.
> >
> > No functionality or leak issue, but it does make things harder to
> > reason about, as you mentioned.
> >
> > In the new series I used a local lock to prevent migration, so the CPU
> > is stable during the whole process. However, that is not doable here
> > unless si->lock is removed from the allocation path; otherwise there
> > are lock dependency issues.
> >
> > Do we need to fix this separately in the current tree? The
> > fragmentation is overall better with these patches, so I don't think
> > this is causing any issues. Before these patches, a per-CPU cluster
> > could also be stolen by other CPUs, so it was never guaranteed to be
> > accurate.
> >
> > One easy fix is to set cluster->next[order] to 0 before scanning for
> > a new cluster to use, and check whether it is still 0 before setting
> > next[order] again.
> >
> > That may still cause fragmentation as migration will cause extra
> > clusters to be scanned and used, which is hard to avoid without
> > changing too much code (like the things being done in the new si->lock
> > series).
>
> I found another possibly more serious problem.  After an unlock/lock
> cycle, the state of the swap device may have changed (for example, by
> swapoff).  So the previous code used
>
>         si->flags += SWP_SCANNING;
>
>         if (!(si->flags & SWP_WRITEOK))
>
> for that.  I don't see them being used after the unlock/lock cycle.
> Can you check it?

Thanks for your review.

Yes, I think you are right. I did notice the flags are not updated
properly in the latest mm branch, but did not notice it was a newly
introduced issue.

So I fixed the issue in
https://lore.kernel.org/linux-mm/20241022192451.38138-8-ryncsn@xxxxxxxxx/

But a quick fix can be posted before the series. The following patch
should be enough, what do you think?

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 46bd4b1a3c07..6b7d4ac1d4a3 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1041,7 +1041,9 @@ static int cluster_alloc_swap(struct swap_info_struct *si,

        VM_BUG_ON(!si->cluster_info);

-       while (n_ret < nr) {
+       si->flags += SWP_SCANNING;
+       while (n_ret < nr && si->flags & SWP_WRITEOK) {
                unsigned long offset = cluster_alloc_swap_entry(si, order, usage);

                if (!offset)
@@ -1049,6 +1051,8 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
                slots[n_ret++] = swp_entry(si->type, offset);
        }

+       si->flags -= SWP_SCANNING;
+
        return n_ret;
 }

The "+= SWP_SCANNING" looks ugly, but it has to be done this way, just
like in the original allocation code. Multiple scanners can update the
field concurrently, and since SCANNING is the highest defined bit in the
flags field, it effectively works as a counter rather than a strict bit
flag. This will be cleaned up properly in the patch I just mentioned.


>
> --
> Best Regards,
> Huang, Ying




