On Fri, 15 Mar 2024 at 18:57, Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
>
> On 15/03/2024 08:34, Chuanhua Han wrote:
> > On Thu, 14 Mar 2024 at 21:43, Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >>
> >> On 14/03/2024 13:12, Chuanhua Han wrote:
> >>> On Tue, 12 Mar 2024 at 02:51, Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >>>>
> >>>> On 04/03/2024 08:13, Barry Song wrote:
> >>>>> From: Chuanhua Han <hanchuanhua@xxxxxxxx>
> >>>>>
> >>>>> While swapping in a large folio, we need to free swaps related to the whole
> >>>>> folio. To avoid frequently acquiring and releasing swap locks, it is better
> >>>>> to introduce an API for batched free.
> >>>>>
> >>>>> Signed-off-by: Chuanhua Han <hanchuanhua@xxxxxxxx>
> >>>>> Co-developed-by: Barry Song <v-songbaohua@xxxxxxxx>
> >>>>> Signed-off-by: Barry Song <v-songbaohua@xxxxxxxx>
> >>>>> ---
> >>>>>  include/linux/swap.h |  6 ++++++
> >>>>>  mm/swapfile.c        | 35 +++++++++++++++++++++++++++++++++++
> >>>>>  2 files changed, 41 insertions(+)
> >>>>>
> >>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
> >>>>> index 2955f7a78d8d..d6ab27929458 100644
> >>>>> --- a/include/linux/swap.h
> >>>>> +++ b/include/linux/swap.h
> >>>>> @@ -481,6 +481,7 @@ extern void swap_shmem_alloc(swp_entry_t);
> >>>>>  extern int swap_duplicate(swp_entry_t);
> >>>>>  extern int swapcache_prepare(swp_entry_t);
> >>>>>  extern void swap_free(swp_entry_t);
> >>>>> +extern void swap_nr_free(swp_entry_t entry, int nr_pages);
> >>>>
> >>>> nit: In my swap-out v4 series, I've created a batched version of
> >>>> free_swap_and_cache() and called it free_swap_and_cache_nr(). Perhaps it is
> >>>> preferable to align the naming schemes - i.e. call this swap_free_nr(). Your
> >>>> scheme doesn't really work when applied to free_swap_and_cache().
> >>> Thanks for your suggestion; for the next version we'll see which naming is
> >>> more appropriate!
> >>>>
> >>>>>  extern void swapcache_free_entries(swp_entry_t *entries, int n);
> >>>>>  extern int free_swap_and_cache(swp_entry_t);
> >>>>>  int swap_type_of(dev_t device, sector_t offset);
> >>>>> @@ -561,6 +562,11 @@ static inline void swap_free(swp_entry_t swp)
> >>>>>  {
> >>>>>  }
> >>>>>
> >>>>> +void swap_nr_free(swp_entry_t entry, int nr_pages)
> >>>>> +{
> >>>>> +
> >>>>> +}
> >>>>> +
> >>>>>  static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> >>>>>  {
> >>>>>  }
> >>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >>>>> index 3f594be83b58..244106998a69 100644
> >>>>> --- a/mm/swapfile.c
> >>>>> +++ b/mm/swapfile.c
> >>>>> @@ -1341,6 +1341,41 @@ void swap_free(swp_entry_t entry)
> >>>>>               __swap_entry_free(p, entry);
> >>>>>  }
> >>>>>
> >>>>> +/*
> >>>>> + * Called after swapping in a large folio, batched free swap entries
> >>>>> + * for this large folio, entry should be for the first subpage and
> >>>>> + * its offset is aligned with nr_pages
> >>>>> + */
> >>>>> +void swap_nr_free(swp_entry_t entry, int nr_pages)
> >>>>> +{
> >>>>> +     int i;
> >>>>> +     struct swap_cluster_info *ci;
> >>>>> +     struct swap_info_struct *p;
> >>>>> +     unsigned type = swp_type(entry);
> >>>>
> >>>> nit: checkpatch.py will complain about bare "unsigned", preferring "unsigned
> >>>> int" - or at least it did for me when I did something similar in my swap-out
> >>>> patch set.
> >>> Gee, thanks for pointing that out!
> >>>>
> >>>>> +     unsigned long offset = swp_offset(entry);
> >>>>> +     DECLARE_BITMAP(usage, SWAPFILE_CLUSTER) = { 0 };
> >>>>
> >>>> I don't love this, as it could blow the stack if SWAPFILE_CLUSTER ever
> >>>> increases. But the only other way I can think of is to explicitly loop over
> >>>> fixed size chunks, and that's not much better.
> >>> Is it possible to save kernel stack by using a bitmap here? If
> >>> SWAPFILE_CLUSTER=512, we consume only (512/64)*8 = 64 bytes.
> >>
> >> I'm not sure I've understood what you are saying? You're already using
> >> DECLARE_BITMAP(), so it's already consuming 64 bytes if SWAPFILE_CLUSTER=512, no?
> >>
> >> I actually did a bad job of trying to express a couple of different points:
> >>
> >> - Are there any configurations today where SWAPFILE_CLUSTER > 512? I'm not sure.
> >> Certainly not for arm64, but not sure about other architectures. For example if
> >> an arch had 64K pages with 8192 entries per THP and supports SWAP_THP, that's 1K
> >> for the bitmap, which is now looking pretty big for the stack.
> > I agree with you. The current bitmap grows linearly with SWAPFILE_CLUSTER,
> > which may cause the kernel stack to swell. I need to think of a way to save
> > more memory.
> >>
> >> - Would it be better to decouple stack usage from SWAPFILE_CLUSTER and instead
> >> define a fixed stack size (e.g. 64 bytes -> 512 entries)? Then free the range of
> >> entries in batches no bigger than this size. This approach could also allow
> >> removing the constraint that the range has to be aligned and fit in a single
> >> cluster. Personally I think an approach like this would be much more robust, in
> >> return for a tiny bit more complexity.
> > Because we cannot determine how many swap entries a cluster has for a given
> > architecture or configuration, we do not know how large the variable needs
> > to be?
>
> My point is that we could define a fixed size, then loop through the passed-in
> range, operating on batches of that fixed size. You could even take the cluster
> boundaries into consideration so that you take the correct lock for every batch,
> and can drop the "must be naturally aligned, must be no bigger than cluster
> size" constraint.

Thank you, I understand it now!

> >>
> >>>>
> >>>>> +
> >>>>> +     /* all swap entries are within a cluster for mTHP */
> >>>>> +     VM_BUG_ON(offset % SWAPFILE_CLUSTER + nr_pages > SWAPFILE_CLUSTER);
> >>>>> +
> >>>>> +     if (nr_pages == 1) {
> >>>>> +             swap_free(entry);
> >>>>> +             return;
> >>>>> +     }
> >>>>> +
> >>>>> +     p = _swap_info_get(entry);
> >>>>
> >>>> You need to handle this returning NULL, like swap_free() does.
> >>> Yes, you're right! We did forget to check for NULL here.
> >>>>
> >>>>> +
> >>>>> +     ci = lock_cluster(p, offset);
> >>>>
> >>>> The existing swap_free() calls lock_cluster_or_swap_info(). So if swap is backed
> >>>> by rotating media, and clusters are not in use, it will lock the whole swap
> >>>> info. But your new version only calls lock_cluster(), which won't lock anything
> >>>> if clusters are not in use. So I think this is a locking bug.
> >>> Again, you're right, it's a bug!
> >>>>
> >>>>> +     for (i = 0; i < nr_pages; i++) {
> >>>>> +             if (__swap_entry_free_locked(p, offset + i, 1))
> >>>>> +                     __bitmap_set(usage, i, 1);
> >>>>> +     }
> >>>>> +     unlock_cluster(ci);
> >>>>> +
> >>>>> +     for_each_clear_bit(i, usage, nr_pages)
> >>>>> +             free_swap_slot(swp_entry(type, offset + i));
> >>>>> +}
> >>>>> +
> >>>>>  /*
> >>>>>   * Called after dropping swapcache to decrease refcnt to swap entries.
> >>>>>   */
> >>>>
> >>>> Thanks,
> >>>> Ryan
>
--
Thanks,
Chuanhua
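
[Editor's note: for reference, below is a rough, untested sketch of the fixed-size
batching Ryan describes above, with the NULL check and the lock_cluster_or_swap_info()
locking folded in. The name swap_free_nr() follows Ryan's naming suggestion and
SWAP_BATCH_NR is a made-up illustrative constant; the helpers (_swap_info_get(),
__swap_entry_free_locked(), free_swap_slot(), ...) are the ones the posted patch and
swap_free() already use in mm/swapfile.c. Treat it as a sketch of the idea, not a
drop-in patch.]

/*
 * Illustrative sketch only - would live in mm/swapfile.c next to swap_free().
 * SWAP_BATCH_NR is a hypothetical constant chosen purely for illustration.
 */
#define SWAP_BATCH_NR	512	/* 512 bits -> 64 bytes of stack, independent of SWAPFILE_CLUSTER */

void swap_free_nr(swp_entry_t entry, int nr_pages)
{
	DECLARE_BITMAP(usage, SWAP_BATCH_NR);
	unsigned long offset = swp_offset(entry);
	unsigned int type = swp_type(entry);
	struct swap_cluster_info *ci;
	struct swap_info_struct *p;
	int batch, i;

	p = _swap_info_get(entry);
	if (!p)		/* mirror swap_free(): ignore bad entries */
		return;

	while (nr_pages > 0) {
		/* never cross a cluster boundary, so one lock covers the batch */
		batch = min_t(int, nr_pages, SWAP_BATCH_NR);
		batch = min_t(int, batch,
			      SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);

		bitmap_zero(usage, SWAP_BATCH_NR);

		/* falls back to the whole swap_info lock when clusters are not in use */
		ci = lock_cluster_or_swap_info(p, offset);
		for (i = 0; i < batch; i++) {
			if (__swap_entry_free_locked(p, offset + i, 1))
				__bitmap_set(usage, i, 1);
		}
		unlock_cluster_or_swap_info(p, ci);

		/* entries whose usage dropped to zero go to the swap-slots cache */
		for_each_clear_bit(i, usage, batch)
			free_swap_slot(swp_entry(type, offset + i));

		offset += batch;
		nr_pages -= batch;
	}
}

Capping each batch at both SWAP_BATCH_NR and the next cluster boundary keeps the
on-stack bitmap at a fixed 64 bytes regardless of SWAPFILE_CLUSTER, while still taking
the correct lock for every batch - which is what would allow dropping the "naturally
aligned, within one cluster" constraint and the VM_BUG_ON in the posted patch.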