Kairui Song <ryncsn@xxxxxxxxx> writes:

> On Thu, Oct 24, 2024 at 11:29 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>>
>> Hi, Kairui,
>>
>> While reading the new swap cluster allocation code (sorry, I didn't
>> find time to review your patch in the first place), I found that the
>> following code path is possible in cluster_alloc_swap_entry():
>>
>>   cluster = this_cpu_ptr(si->percpu_cluster);
>>   offset = alloc_swap_scan_cluster();
>>   ...
>>   spin_unlock(&ci->lock);
>>   spin_unlock(&si->lock);
>>   /* migrate to another cpu */
>>   spin_lock(&ci->lock);
>>   spin_lock(&si->lock);
>>   cluster->next[order] = offset;
>>
>> That is, the per-CPU cluster of one CPU may be changed on another CPU.
>> I guess that this will not cause a functionality issue. However, it
>> makes the code harder to reason about. Is it possible to avoid
>> unlocking before changing the per-CPU cluster? Or to give up updating
>> the per-CPU cluster if we need to unlock?
>
> Hi Ying,
>
> Yes, I noticed this when working on the new si->lock series. It may
> cause fragmentation, but only if the CPU that the current allocation
> migrated away from requests another swap allocation; in that case it
> may waste one per-CPU cluster.
>
> There is no functionality or leak issue, but it does make things
> harder to reason about, as you mentioned.
>
> In the new series I used a local lock to prevent migration, so the CPU
> is stable during the whole process. However, that is not doable here
> unless si->lock is removed from the allocation path; otherwise there
> are lock dependency issues.
>
> Do we need to fix this separately in the current tree? Fragmentation
> is overall better with these patches, so I don't think this is causing
> any issues. Before these patches, a per-CPU cluster could also be
> stolen by other CPUs, so it was never guaranteed to be accurate.
>
> One easy fix is to set cluster->next[order] to 0 before scanning for a
> new cluster to use, and to check whether it is still 0 before setting
> next[order] again.
>
> That may still cause fragmentation, as migration will cause extra
> clusters to be scanned and used, which is hard to avoid without
> changing too much code (like the things being done in the new si->lock
> series).

I found another, possibly more serious, problem. After an unlock/lock
cycle, the state of the swap device may have changed (for example, it
may have been swapoff'd). The previous code used

  si->flags += SWP_SCANNING;

and

  if (!(si->flags & SWP_WRITEOK))

to deal with that. I don't find them used after the unlock/lock cycle
in the new code. Can you check it? (A rough sketch of the kind of
recheck I mean is appended below.)

--
Best Regards,
Huang, Ying
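
A minimal sketch of that recheck, in the context of mm/swapfile.c.
The names are simplified and partly hypothetical: cluster_alloc_checked()
does not exist in the tree, and the real alloc_swap_scan_cluster() takes
more arguments. The only point is to show SWP_SCANNING being held across
the window where si->lock may be dropped, and SWP_WRITEOK being rechecked
afterwards, before the per-CPU cluster is touched:

/*
 * Sketch only: si->lock is held on entry and exit, and
 * alloc_swap_scan_cluster() is assumed to drop and retake it
 * internally, which is the window where swapoff (or migration)
 * can run.
 */
static unsigned long cluster_alloc_checked(struct swap_info_struct *si,
					   int order)
{
	struct percpu_cluster *cluster;
	unsigned long offset;

	/* swapoff waits for SWP_SCANNING to drop before tearing down si. */
	si->flags += SWP_SCANNING;

	/* May unlock si->lock internally, allowing the state to change. */
	offset = alloc_swap_scan_cluster(si, order);

	/* Revalidate: the device may have been swapoff'd while unlocked. */
	if (!(si->flags & SWP_WRITEOK)) {
		offset = 0;		/* fail the allocation cleanly */
		goto out;
	}

	/* Only now publish the next offset for this CPU's cluster. */
	cluster = this_cpu_ptr(si->percpu_cluster);
	cluster->next[order] = offset;
out:
	si->flags -= SWP_SCANNING;
	return offset;
}

Note that SWP_SCANNING is incremented and decremented rather than OR'd
in, matching how the old scan_swap_map_slots() used it as a scanner
refcount that swapoff waits on.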