Kairui Song <ryncsn@xxxxxxxxx> writes:

> On Thu, Oct 24, 2024 at 11:29 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>>
>> Hi, Kairui,
>>
>> While reading the new swap cluster allocation code (sorry, I didn't
>> find time to review your patch in the first place), I found that the
>> following code path is possible in cluster_alloc_swap_entry():
>>
>>   cluster = this_cpu_ptr(si->percpu_cluster);
>>   offset = alloc_swap_scan_cluster();
>>   ...
>>   spin_unlock(&ci->lock);
>>   spin_unlock(&si->lock);
>>   /* migrate to another cpu */
>>   spin_lock(&ci->lock);
>>   spin_lock(&si->lock);
>>   cluster->next[order] = offset;
>>
>> That is, the per-CPU cluster of one CPU may be changed on another CPU.
>> I guess that this will not cause a functionality issue. However, it
>> makes the code harder to reason about. Is it possible to avoid
>> unlocking before changing the per-CPU cluster? Or to give up updating
>> the per-CPU cluster if we need to unlock?
>
> Hi Ying,
>
> Yes, I noticed this when working on the new si->lock series. It may
> cause fragmentation, but only if the CPU that the current allocation
> migrated away from requests another swap allocation; in that case it
> may waste one per-CPU cluster.
>
> There is no functionality or leak issue, but it does make things
> harder to reason about, as you mentioned.
>
> In the new series I used a local lock to prevent migration, so the CPU
> is stable during the whole process. However, that is not doable here
> unless si->lock is removed from the allocation path; otherwise there
> are lock dependency issues.
>
> Do we need to fix this separately in the current tree? Fragmentation
> is overall better with these patches, so I don't think this is causing
> any issues. Before these patches, a per-CPU cluster could also be
> stolen by other CPUs, so it was never guaranteed to be accurate.
>
> One easy fix is to set cluster->next[order] to 0 before scanning for a
> new cluster to use, and to check whether it is still 0 before setting
> next[order] again.
>
> That may still cause fragmentation, as migration will cause extra
> clusters to be scanned and used, which is hard to avoid without
> changing too much code (like the things being done in the new si->lock
> series).

I found another, possibly more serious, problem. After an unlock/lock
cycle, the state of the swap device may have changed (for example, it
may have been swapoff'd). The previous code used

  si->flags += SWP_SCANNING;

and

  if (!(si->flags & SWP_WRITEOK))

to deal with that. I don't find them used after the unlock/lock cycle
in the new code. Can you check it? (A rough sketch of the kind of
recheck I mean is appended below.)

--
Best Regards,
Huang, Ying
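
A minimal sketch of that recheck, in the context of mm/swapfile.c.
The names are simplified and partly hypothetical: cluster_alloc_checked()
does not exist in the tree, and the real alloc_swap_scan_cluster() takes
more arguments. The only point is to show SWP_SCANNING being held across
the window where si->lock may be dropped, and SWP_WRITEOK being rechecked
afterwards, before the per-CPU cluster is touched:

/*
 * Sketch only: si->lock is held on entry and exit, and
 * alloc_swap_scan_cluster() is assumed to drop and retake it
 * internally, which is the window where swapoff (or migration)
 * can run.
 */
static unsigned long cluster_alloc_checked(struct swap_info_struct *si,
					   int order)
{
	struct percpu_cluster *cluster;
	unsigned long offset;

	/* swapoff waits for SWP_SCANNING to drop before tearing down si. */
	si->flags += SWP_SCANNING;

	/* May unlock si->lock internally, allowing the state to change. */
	offset = alloc_swap_scan_cluster(si, order);

	/* Revalidate: the device may have been swapoff'd while unlocked. */
	if (!(si->flags & SWP_WRITEOK)) {
		offset = 0;		/* fail the allocation cleanly */
		goto out;
	}

	/* Only now publish the next offset for this CPU's cluster. */
	cluster = this_cpu_ptr(si->percpu_cluster);
	cluster->next[order] = offset;
out:
	si->flags -= SWP_SCANNING;
	return offset;
}

Note that SWP_SCANNING is incremented and decremented rather than OR'd
in, matching how the old scan_swap_map_slots() used it as a scanner
refcount that swapoff waits on.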