Kairui Song <ryncsn@xxxxxxxxx> writes:

> On Mon, Nov 4, 2024 at 11:06 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>>
>> Kairui Song <ryncsn@xxxxxxxxx> writes:
>>
>> I think that the logic of SWP_SCANNING is correct.
>>
>> As for SWP_WRITEOK, we should avoid allocating swap entries as soon as
>> another task clears it. So, the above check may be too late, because we
>> have already allocated the swap entry by then. I think that we can check
>> it in cluster_reclaim_range() and refuse to allocate the swap entry
>> there.
>
> Thanks for the feedback.
>
> I am not sure if this is a big issue. With the above patch, SWP_WRITEOK
> will prevent any new allocations from using the device. Already ongoing
> allocations may indeed still use the device, but in the end swapoff will
> still succeed, perhaps very slightly slower since swapoff may have to
> unuse a few more entries. I can add another SWP_WRITEOK check to make it
> better, and will post a patch.

IIUC, this is more serious than just delaying swapoff. try_to_unuse()
checks si->inuse_pages; if it becomes 0, swapoff() will go ahead without
checking it again. However, the current kernel may allocate a swap entry
even after si->inuse_pages has reached 0 and SWP_WRITEOK has been
cleared.

>> Combined with the per-CPU cluster (cluster->next[]) changing issue we
>> discussed in another email of this thread, I think the long-term
>> solution could be to separate swap entry allocation and reclaim a bit
>> more. For example, we could first try to allocate some swap entries
>> with si->lock held, record a possible reclaim offset if any, and change
>> cluster->next[]; then unlock, reclaim, and lock again.

--
Best Regards,
Huang, Ying