On Tue, Jan 23, 2024 at 1:01 AM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
>
> Yosry Ahmed <yosryahmed@xxxxxxxxxx> writes:
>
> > In swap_range_free(), we update inuse_pages then do some cleanups (arch
> > invalidation, zswap invalidation, swap cache cleanups, etc). During
> > swapoff, try_to_unuse() uses inuse_pages to make sure all swap entries
> > are freed. Make sure we only update inuse_pages after we are done with
> > the cleanups.
> >
> > In practice, this shouldn't matter, because swap_range_free() is called
> > with the swap info lock held, and the swapoff code will spin for that
> > lock after try_to_unuse() anyway.
> >
> > The goal is to make it obvious and more future proof that once
> > try_to_unuse() returns, all cleanups are done.
>
> Defines "all cleanups". Apparently, some other operations are still
> to be done after try_to_unuse() in swap_off().

I am referring to the cleanups in swap_range_free() that I mentioned
above.

How about s/all the cleanups/all the cleanups in swap_range_free()?

>
> > This also facilitates a
> > following zswap cleanup patch which uses this fact to simplify
> > zswap_swapoff().
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
> > ---
> > mm/swapfile.c | 4 ++--
> > 1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 556ff7347d5f0..2fedb148b9404 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -737,8 +737,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
> >                 if (was_full && (si->flags & SWP_WRITEOK))
> >                         add_to_avail_list(si);
> >         }
> > -       atomic_long_add(nr_entries, &nr_swap_pages);
> > -       WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
> >         if (si->flags & SWP_BLKDEV)
> >                 swap_slot_free_notify =
> >                         si->bdev->bd_disk->fops->swap_slot_free_notify;
> > @@ -752,6 +750,8 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
> >                 offset++;
> >         }
> >         clear_shadow_from_swap_cache(si->type, begin, end);
> > +       atomic_long_add(nr_entries, &nr_swap_pages);
> > +       WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
>
> This isn't enough. You need to use smp_wmb() here and smp_rmb() in
> somewhere reading si->inuse_pages.

Hmm, good point. Although as I mentioned in the commit message, this
shouldn't matter today as swap_range_free() executes with the lock
held, and we spin on the lock after try_to_unuse() returns.

It may still be more future-proof to add the memory barriers.

In swap_range_free(), we want to make sure that the write to
si->inuse_pages happens *after* the cleanups (specifically
zswap_invalidate() in this case).

In the swapoff path, we want to make sure that the cleanups following
try_to_unuse() (e.g. zswap_swapoff()) happen *after* reading
si->inuse_pages == 0 in try_to_unuse().

So I think we want smp_wmb() in swap_range_free() and smp_mb() in
try_to_unuse(). Does the below look correct to you?

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2fedb148b9404..a2fa2f65a8ddd 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -750,6 +750,12 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
                offset++;
        }
        clear_shadow_from_swap_cache(si->type, begin, end);
+
+       /*
+        * Make sure that try_to_unuse() observes si->inuse_pages reaching 0
+        * only after the above cleanups are done.
+        */
+       smp_wmb();
        atomic_long_add(nr_entries, &nr_swap_pages);
        WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
 }
@@ -2130,6 +2136,11 @@ static int try_to_unuse(unsigned int type)
                return -EINTR;
        }

+       /*
+        * Make sure that further cleanups after try_to_unuse() returns happen
+        * after swap_range_free() reduces si->inuse_pages to 0.
+        */
+       smp_mb();
        return 0;
 }

Alternatively, we may just hold the spinlock in try_to_unuse() when we
check si->inuse_pages at the end. This will also ensure that any calls
to swap_range_free() have completed.

Let me know what you prefer.
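
For illustration only, the barrier pairing discussed above can be
sketched in user space with C11 atomics and pthreads. This is an
analogy under stated assumptions, not kernel code: the names
inuse_pages, cleanup_done, freeing_thread(), wait_until_unused() and
run_demo() are hypothetical stand-ins, the release fence only
approximates smp_wmb() (it is somewhat stronger), and the seq_cst fence
approximates smp_mb().

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration, not kernel symbols. */
static atomic_long inuse_pages;	/* plays the role of si->inuse_pages */
static bool cleanup_done;	/* plain data written by the "cleanups" */

/* Plays the role of swap_range_free(): cleanups first, then publish. */
static void *freeing_thread(void *arg)
{
	(void)arg;
	cleanup_done = true;				/* the "cleanups" */
	atomic_thread_fence(memory_order_release);	/* roughly smp_wmb() */
	/* roughly WRITE_ONCE(si->inuse_pages, 0) */
	atomic_store_explicit(&inuse_pages, 0, memory_order_relaxed);
	return NULL;
}

/* Plays the role of try_to_unuse(): wait for 0, then order later work. */
static bool wait_until_unused(void)
{
	/* roughly READ_ONCE(si->inuse_pages) in a retry loop */
	while (atomic_load_explicit(&inuse_pages, memory_order_relaxed) != 0)
		;
	atomic_thread_fence(memory_order_seq_cst);	/* roughly smp_mb() */
	/* Anything after the fence sees the cleanups as completed. */
	return cleanup_done;
}

/* Runs one writer against one waiter; returns what the waiter saw. */
bool run_demo(void)
{
	pthread_t t;
	bool seen;

	cleanup_done = false;
	atomic_store(&inuse_pages, 1);
	if (pthread_create(&t, NULL, freeing_thread, NULL) != 0)
		return false;
	seen = wait_until_unused();
	pthread_join(t, NULL);
	return seen;
}
```

With the two fences in place, the waiter reading 0 from inuse_pages
implies that it also sees cleanup_done set, so run_demo() returns true.
Dropping either fence would turn the plain access to cleanup_done into
a data race in the C11 model, which mirrors the ordering concern raised
in the review.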