On Mon, Jan 13, 2025 at 2:33 PM Kairui Song <ryncsn@xxxxxxxxx> wrote:
>
> On Fri, Jan 10, 2025 at 7:24 PM Baoquan He <bhe@xxxxxxxxxx> wrote:
> >
> > On 01/09/25 at 10:15am, Kairui Song wrote:
> > > On Wed, Jan 8, 2025 at 7:10 PM Baoquan He <bhe@xxxxxxxxxx> wrote:
> > > >
> > >
> > > Thanks for the very detailed review!
> > >
> > > > On 12/31/24 at 01:46am, Kairui Song wrote:
> > > > ......snip.....
> > > > > ---
> > > > >  include/linux/swap.h |   3 +-
> > > > >  mm/swapfile.c        | 435 ++++++++++++++++++++++++-------------------
> > > > >  2 files changed, 246 insertions(+), 192 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > > > index 339d7f0192ff..c4ff31cb6bde 100644
> > > > > --- a/include/linux/swap.h
> > > > > +++ b/include/linux/swap.h
> > > > > @@ -291,6 +291,7 @@ enum swap_cluster_flags {
> > > > >   * throughput.
> > > > >   */
> > > > >  struct percpu_cluster {
> > > > > +	local_lock_t lock; /* Protect the percpu_cluster above */
> > > > >  	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> > > > >  };
> > > > >
> > > > > @@ -313,7 +314,7 @@ struct swap_info_struct {
> > > > >  	/* list of cluster that contains at least one free slot */
> > > > >  	struct list_head frag_clusters[SWAP_NR_ORDERS];
> > > > >  	/* list of cluster that are fragmented or contented */
> > > > > -	unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
> > > > > +	atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
> > > > >  	unsigned int pages;	/* total of usable pages of swap */
> > > > >  	atomic_long_t inuse_pages;	/* number of those currently in use */
> > > > >  	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
> > > > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > > > index 7795a3d27273..dadd4fead689 100644
> > > > > --- a/mm/swapfile.c
> > > > > +++ b/mm/swapfile.c
> > ...snip...
> > > > > @@ -475,39 +488,90 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> > > > >
> > > > >  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> > > > >  {
> > > > > -	lockdep_assert_held(&si->lock);
> > > > >  	lockdep_assert_held(&ci->lock);
> > > > >  	cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
> > > > >  	ci->order = 0;
> > > > >  }
> > > > >
> > > > > +/*
> > > > > + * Isolate and lock the first cluster that is not contended on a list,
> > > > > + * and clear its flag before taking it off-list. Cluster flags must be
> > > > > + * in sync with list status, so cluster updaters can always know the
> > > > > + * cluster list status without touching the si lock.
> > > > > + *
> > > > > + * Note it's possible that all clusters on a list are contended so
> > > > > + * this returns NULL for a non-empty list.
> > > > > + */
> > > > > +static struct swap_cluster_info *cluster_isolate_lock(
> > > > > +		struct swap_info_struct *si, struct list_head *list)
> > > > > +{
> > > > > +	struct swap_cluster_info *ci, *ret = NULL;
> > > > > +
> > > > > +	spin_lock(&si->lock);
> > > > > +
> > > > > +	if (unlikely(!(si->flags & SWP_WRITEOK)))
> > > > > +		goto out;
> > > > > +
> > > > > +	list_for_each_entry(ci, list, list) {
> > > > > +		if (!spin_trylock(&ci->lock))
> > > > > +			continue;
> > > > > +
> > > > > +		/* We may only isolate and clear flags of the following lists */
> > > > > +		VM_BUG_ON(!ci->flags);
> > > > > +		VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE &&
> > > > > +			  ci->flags != CLUSTER_FLAG_FULL);
> > > > > +
> > > > > +		list_del(&ci->list);
> > > > > +		ci->flags = CLUSTER_FLAG_NONE;
> > > > > +		ret = ci;
> > > > > +		break;
> > > > > +	}
> > > > > +out:
> > > > > +	spin_unlock(&si->lock);
> > > > > +
> > > > > +	return ret;
> > > > > +}
> > > > > +
> > > > >  /*
> > > > >   * Doing discard actually. After a cluster discard is finished, the cluster
> > > > > - * will be added to free cluster list. caller should hold si->lock.
> > > > > -*/
> > > > > -static void swap_do_scheduled_discard(struct swap_info_struct *si)
> > > > > + * will be added to the free cluster list. Discard clusters are a bit
> > > > > + * special as they don't participate in allocation or reclaim, so clusters
> > > > > + * marked as CLUSTER_FLAG_DISCARD must remain off-list or on the discard list.
> > > > > + */
> > > > > +static bool swap_do_scheduled_discard(struct swap_info_struct *si)
> > > > >  {
> > > > >  	struct swap_cluster_info *ci;
> > > > > +	bool ret = false;
> > > > >  	unsigned int idx;
> > > > >
> > > > > +	spin_lock(&si->lock);
> > > > >  	while (!list_empty(&si->discard_clusters)) {
> > > > >  		ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> > > > > +		/*
> > > > > +		 * Delete the cluster from the list but don't clear its flags until
> > > > > +		 * discard is done, so isolation and relocation will skip it.
> > > > > +		 */
> > > > >  		list_del(&ci->list);
> > > >
> > > > I don't understand the above comment. ci has been taken off the list,
> > > > while allocation needs to isolate from a usable list. Even though we
> > > > clear ci->flags now, how come isolation and relocation will touch it?
> > > > I may be missing something here.
> > >
> > > There are many cases. One possible and common situation is that the
> > > percpu cluster (si->percpu_cluster of another CPU) is still pointing
> > > to it.
> > >
> > > Also, this commit removed the protection of the si lock on allocation,
> > > and the allocation path may also drop the ci lock to call reclaim,
> > > which means a cluster could be used or freed by anyone before the
> > > allocator reacquires the ci lock. In that case, the allocator could
> > > see a discard cluster.
> > >
> > > So we don't clear the discard flag, in case anyone misuses it.
> > >
> > > I can add more inline comments on this; there are already some related
> > > comments above the function relocate_cluster, and I could add some
> > > more referencing those.

Hi Baoquan,

>
> >
> > Thanks for your great explanation. I understand that si->percpu_cluster
> > could point to a discarded ci, and a ci could be taken from the
> > non-full or frag lists but later become discarded if that ci is freed
> > on another cpu during a cluster_reclaim_range() invocation. I still
> > haven't got how isolation could see a discarded ci in
> > cluster_isolate_lock(). Could you give an example of how that can
> > happen?
>
> cluster_isolate_lock shouldn't see a discard cluster, and there is a
> VM_BUG_ON for that.

Oh, now I realize what you mean: the comment in swap_do_scheduled_discard
says cluster_isolate_lock may see a discard cluster. That is not true; it
was added in an early version of this series and I forgot to update the
comment. I'll just drop that part.

>
> >
> > Surely, I understand keeping the discarded flag is very necessary so
> > that checks like cluster_is_usable() will return the expected value.
> >
> > And by the way, I haven't got when the 'if (!ci->count)' case could
> > happen in relocate_cluster(), since we have filtered away discarded ci
> > with the 'if (cluster_is_discard(ci))' check. I asked in another
> > thread; could you help explain it?
>
> Many swap devices don't need discard, so the cluster can be freed
> directly. And actually the ci->count check in relocate_cluster is not
> necessarily related to that.
>
> The caller of relocate_cluster may fail an allocation (eg. racing with
> swapoff), and that could end up calling relocate_cluster with an empty
> cluster; such a cluster should go to the free list (swapoff might fail
> too).
>
> The swapoff case is extremely rare, but let's just be more robust here:
> covering the free cluster case has almost no overhead but saves a lot
> of effort. I can add some comments on this.
> >
> > static void relocate_cluster(struct swap_info_struct *si,
> > 			     struct swap_cluster_info *ci)
> > {
> > 	lockdep_assert_held(&ci->lock);
> >
> > 	/* Discard cluster must remain off-list or on discard list */
> > 	if (cluster_is_discard(ci))
> > 		return;
> >
> > 	if (!ci->count) {
> > 		free_cluster(si, ci);
> > 	...
> > }
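
To make that concrete, the commented version could read roughly like
this. It is only a sketch against this series (the comment wording is
mine and not final, and the rest of the function body is left out):

static void relocate_cluster(struct swap_info_struct *si,
			     struct swap_cluster_info *ci)
{
	lockdep_assert_held(&ci->lock);

	/* Discard cluster must remain off-list or on discard list */
	if (cluster_is_discard(ci))
		return;

	/*
	 * An empty cluster may be seen here: the device may not need
	 * discard so a drained cluster is freed directly, or the
	 * allocator may have failed (eg. raced with swapoff) and bailed
	 * out with an empty cluster. Either way it belongs on the
	 * free list.
	 */
	if (!ci->count) {
		free_cluster(si, ci);
	...
	}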
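
And one more note on why the discard flag must survive: whoever re-takes
ci->lock after it was dropped (eg. through a stale si->percpu_cluster
pointer, or the allocation path coming back from reclaim) is expected to
re-check the cluster first, along these lines. Again just a rough sketch
and not the exact code in this series; I'm assuming the cluster_is_usable()
helper you mentioned takes the cluster and the allocation order:

	/*
	 * The cluster may have been freed, reused, or queued for discard
	 * by others while ci->lock was dropped, so its state must be
	 * re-checked; a cluster still flagged CLUSTER_FLAG_DISCARD is
	 * rejected here instead of being misused.
	 */
	spin_lock(&ci->lock);
	if (!cluster_is_usable(ci, order)) {
		spin_unlock(&ci->lock);
		return false;
	}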