Kairui Song <ryncsn@xxxxxxxxx> writes:

> Hi Ying,
>
> On Fri, Oct 25, 2024 at 1:56 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>>
>> Kairui Song <ryncsn@xxxxxxxxx> writes:
>>
>> > On Thu, Oct 24, 2024 at 11:29 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>> >>
>> >> Hi, Kairui,
>> >>
>> >> While reading the new swap cluster allocation code (sorry, I didn't
>> >> find time to review your patch earlier), I found that the following
>> >> code path is possible in cluster_alloc_swap_entry():
>> >>
>> >> cluster = this_cpu_ptr(si->percpu_cluster);
>> >> offset = alloc_swap_scan_cluster();
>> >> ...
>> >> spin_unlock(&ci->lock);
>> >> spin_unlock(&si->lock);
>> >> /* migrate to another cpu */
>> >> spin_lock(&ci->lock);
>> >> spin_lock(&si->lock);
>> >> cluster->next[order] = offset;
>> >>
>> >> That is, the per-CPU cluster of one CPU may be changed on another
>> >> CPU. I guess this will not cause a functionality issue, but it makes
>> >> the code harder to reason about. Is it possible to avoid unlocking
>> >> before changing the per-CPU cluster? Or to give up updating the
>> >> per-CPU cluster if we need to unlock?
>> >
>> > Hi Ying,
>> >
>> > Yes, I noticed this while working on the new si->lock series. It can
>> > cause fragmentation, but only if the CPU that the current allocation
>> > migrated away from requests another swap allocation; in that case one
>> > per-CPU cluster may be wasted.
>> >
>> > There is no functionality or leak issue, but it does make things
>> > harder to reason about, as you mentioned.
>> >
>> > In the new series I use a local lock to prevent migration, so the CPU
>> > is stable during the whole process. However, that is not doable until
>> > si->lock is removed from the allocation path; otherwise there are
>> > lock dependency issues.
>> >
>> > Do we need to fix this separately in the current tree? Fragmentation
>> > is overall better with these patches, so I don't think this is
>> > causing any issues. Before these patches, a per-CPU cluster could
>> > also be stolen by other CPUs, so it was never guaranteed to be
>> > accurate.
>> >
>> > One easy fix is to set cluster->next[order] to 0 before scanning for
>> > a new cluster to use, and to check whether it is still 0 before
>> > setting next[order] again.
>> >
>> > That may still cause fragmentation, as migration will cause extra
>> > clusters to be scanned and used; this is hard to avoid without
>> > changing too much code (like the things being done in the new
>> > si->lock series).
>>
>> I found another, possibly more serious, problem. After an unlock/lock
>> cycle, the state of the swap device may have changed (for example, it
>> may have been swapped off). In the previous code,
>>
>> si->flags += SWP_SCANNING;
>>
>> if (!(si->flags & SWP_WRITEOK))
>>
>> were used for that. I don't see them being used after the unlock/lock
>> cycle. Can you check?
>
> Thanks for your review.
>
> Yes, I think you are right. I did notice the flags are not updated
> properly in the latest mm branch, but did not notice it was a newly
> introduced issue.
>
> So I fixed the issue in
> https://lore.kernel.org/linux-mm/20241022192451.38138-8-ryncsn@xxxxxxxxx/
>
> But a quick fix can be posted before that series. The following patch
> should be enough; what do you think?
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 46bd4b1a3c07..6b7d4ac1d4a3 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1041,7 +1041,9 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
>
>  	VM_BUG_ON(!si->cluster_info);
>
> -	while (n_ret < nr) {
> +	si->flags += SWP_SCANNING;
> +	while (n_ret < nr && si->flags & SWP_WRITEOK) {
>  		unsigned long offset = cluster_alloc_swap_entry(si,
>  				order, usage);
>
>  		if (!offset)
> @@ -1049,6 +1051,8 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
>  		slots[n_ret++] = swp_entry(si->type, offset);
>  	}
>
> +	si->flags -= SWP_SCANNING;
> +
>  	return n_ret;
>  }
>
> The "+= SWP_SCANNING" looks ugly, but it has to be done this way, just
> like in the original allocation code: multiple scanners can update the
> flags at the same time, and SCANNING is the highest defined bit in the
> flags field, so here it acts as a counter rather than strictly as a
> bit flag. This will be cleaned up properly in the patch I just
> mentioned.
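(Editor's note: as Kairui explains above, SWP_SCANNING is used as a
counter in the top bits of si->flags rather than as a plain bit flag.
Here is a minimal user-space sketch of that trick; the flag values are
assumptions modeled on include/linux/swap.h, and nothing below is code
from the patch.)

#include <assert.h>
#include <stdio.h>

#define SWP_WRITEOK	0x2	/* assumed: an ordinary bit flag */
#define SWP_SCANNING	0x4000	/* assumed: the highest defined flag bit */

int main(void)
{
	unsigned long flags = SWP_WRITEOK;

	/* Two scanners enter; each adds SWP_SCANNING.  Because SCANNING
	 * is the highest defined bit, the additions only carry upward
	 * and never disturb the ordinary flag bits below it. */
	flags += SWP_SCANNING;
	flags += SWP_SCANNING;

	assert(flags & SWP_WRITEOK);	/* lower bits are intact */
	printf("scanners in flight: %lu\n", flags / SWP_SCANNING);

	/* A swapoff-style wait condition: any scanner still active? */
	printf("still scanning: %s\n", flags >= SWP_SCANNING ? "yes" : "no");

	/* Both scanners leave; the counter drops back to zero. */
	flags -= SWP_SCANNING;
	flags -= SWP_SCANNING;
	assert(flags == SWP_WRITEOK);
	printf("still scanning: %s\n", flags >= SWP_SCANNING ? "yes" : "no");
	return 0;
}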
I think the logic of SWP_SCANNING above is correct. As for SWP_WRITEOK,
we should stop allocating swap entries as soon as another task clears
it. So the above check may be too late, because by then we have already
allocated the swap entry. I think we can check it in
cluster_reclaim_range() instead and refuse to allocate the swap entry
there.

Combined with the per-CPU cluster (cluster->next[]) changing issue we
discussed in another email in this thread, I think the long-term
solution could be to separate swap entry allocation and reclaim a bit
more. For example, we could first try to allocate some swap entries
with si->lock held, record possible reclaim offsets if any, and change
cluster->next[]; then unlock, reclaim, and lock again.

--
Best Regards,
Huang, Ying
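(Editor's note: to make the separation Ying sketches above concrete,
here is a rough user-space mock-up of the proposed ordering: scan and
update the next[] hint with the lock held, drop the lock only for the
expensive reclaim, then relock and re-check SWP_WRITEOK before
committing. Everything here, including struct swap_info, scan_free(),
scan_reclaimable() and do_reclaim(), is a hypothetical stand-in rather
than kernel code; only SWP_WRITEOK, cluster->next[] and the
lock/unlock choreography come from the thread.)

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define SWP_WRITEOK	0x2
#define NR_ORDERS	10

struct swap_info {
	pthread_mutex_t lock;
	unsigned long flags;
	unsigned long next[NR_ORDERS];	/* per-order allocation hints */
};

/* Toy stand-ins for the real cluster scan and reclaim helpers. */
static unsigned long scan_free(struct swap_info *si, int order)
{
	(void)si; (void)order;
	return 0;	/* pretend no free range was found */
}

static unsigned long scan_reclaimable(struct swap_info *si, int order)
{
	(void)si; (void)order;
	return 512;	/* pretend offset 512 looks reclaimable */
}

static bool do_reclaim(unsigned long offset)
{
	(void)offset;
	return true;	/* pretend reclaim succeeded */
}

static unsigned long alloc_swap_entry(struct swap_info *si, int order)
{
	unsigned long offset, reclaim_off = 0;

	pthread_mutex_lock(&si->lock);
	offset = scan_free(si, order);
	if (!offset)
		reclaim_off = scan_reclaimable(si, order);
	/* Update the hint BEFORE dropping the lock, so migration can
	 * never let another CPU rewrite it behind our back. */
	if (offset || reclaim_off)
		si->next[order] = (offset ? offset : reclaim_off) +
				  (1UL << order);
	pthread_mutex_unlock(&si->lock);

	if (offset)
		return offset;
	if (!reclaim_off || !do_reclaim(reclaim_off))
		return 0;

	pthread_mutex_lock(&si->lock);
	/* Re-check after relocking: the device may have been swapped
	 * off while the lock was dropped for reclaim. */
	if (!(si->flags & SWP_WRITEOK))
		reclaim_off = 0;
	pthread_mutex_unlock(&si->lock);
	return reclaim_off;
}

int main(void)
{
	struct swap_info si = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.flags = SWP_WRITEOK,
	};

	printf("allocated offset: %lu\n", alloc_swap_entry(&si, 0));
	printf("next[0] hint:     %lu\n", si.next[0]);
	return 0;
}

The point is only the lock choreography; a real series would of course
keep this inside the existing cluster locking scheme.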