On Sun, Apr 28, 2024 at 10:37 AM Kairui Song <ryncsn@xxxxxxxxx> wrote:
>
> On Sat, Apr 27, 2024 at 7:16 AM Chris Li <chrisl@xxxxxxxxxx> wrote:
> >
> > Hi Ying,
> >
> > On Tue, Apr 23, 2024 at 7:26 PM Huang, Ying <ying.huang@xxxxxxxxx> wrote:
> > >
> > > Hi, Matthew,
> > >
> > > Matthew Wilcox <willy@xxxxxxxxxxxxx> writes:
> > >
> > > > On Mon, Apr 22, 2024 at 03:54:58PM +0800, Huang, Ying wrote:
> > > >> Is it possible to add "start_offset" support in xarray, so "index"
> > > >> will subtract "start_offset" before looking up / inserting?
> > > >
> > > > We kind of have that with XA_FLAGS_ZERO_BUSY which is used for
> > > > XA_FLAGS_ALLOC1. But that's just one bit for the entry at 0. We
> > > > could generalise it, but then we'd have to store that somewhere and
> > > > there's no obvious good place to store it that wouldn't enlarge
> > > > struct xarray, which I'd be reluctant to do.
> > > >
> > > >> Is it possible to use multiple range locks to protect one xarray to
> > > >> improve the lock scalability? This is why we have multiple "struct
> > > >> address_space" for one swap device. And, we may have same lock
> > > >> contention issue for large files too.
> > > >
> > > > It's something I've considered. The issue is search marks. If we
> > > > delete an entry, we may have to walk all the way up the xarray
> > > > clearing bits as we go and I'd rather not grab a lock at each level.
> > > > There's a convenient 4 byte hole between nr_values and parent where
> > > > we could put it.
> > > >
> > > > Oh, another issue is that we use i_pages.xa_lock to synchronise
> > > > address_space.nrpages, so I'm not sure that a per-node lock will help.
> > >
> > > Thanks for looking at this.
> > >
> > > > But I'm conscious that there are workloads which show contention on
> > > > xa_lock as their limiting factor, so I'm open to ideas to improve all
> > > > these things.
> > >
> > > I have no idea so far because of my very limited knowledge about xarray.
> >
> > For the swap file usage, I have been considering an idea to remove the
> > index part of the xarray from the swap cache. The swap cache is
> > different from the file cache in a few aspects. For one, if we want to
> > have a folio equivalent of a "large swap entry", then the natural
> > alignment of those swap offsets does not make sense. Ideally we should
> > be able to write the folio to unaligned swap file locations.
> >
>
> Hi Chris,
>
> This sounds interesting, I have a few questions though...
>
> Are you suggesting we handle swap on file and swap on device
> differently? Swap on file is much less frequently used than swap on
> device, I think.

That is not what I have in mind. The swap struct idea does not
distinguish between swap file and swap device. BTW, I sometimes use
swap on file because I did not allocate a swap partition in advance.

> And what do you mean by "index part of the xarray"? If we need a cache,
> xarray still seems one of the best choices to hold the content.

We still need to look up swap file offset -> folio. However, if we
allocate a "struct swap" for each swap offset, then the folio lookup
can be as simple as getting the swap struct by offset, then doing an
atomic read of swap_struct->folio.

I am not sure how you come to the conclusion of "best choices". It is
one choice, but it has its drawbacks. The natural alignment requirement
of the xarray (e.g. a 2M large swap entry needs to be written at a
2M-aligned offset) is an unnecessary restriction. If we allocate the
"struct swap" ourselves, we have more flexibility.
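To make that a bit more concrete, here is a very rough sketch of the
lookup side. All the names (struct swp_desc, swp_desc_table,
swap_cache_lookup_sketch) are made up for illustration only; nothing
like this exists in the tree today:

    /* Hypothetical per-offset descriptor, one per swap slot. */
    struct swp_desc {
            struct folio *folio;    /* folio in swap cache, or NULL */
    };

    /* Flat array per swap device, indexed directly by swap offset. */
    static struct swp_desc *swp_desc_table;

    /*
     * The swap cache lookup becomes plain array indexing plus one
     * atomic read; no xarray walk and no xa_lock on the read side.
     */
    static struct folio *swap_cache_lookup_sketch(pgoff_t offset)
    {
            return READ_ONCE(swp_desc_table[offset].folio);
    }

Insertion and removal would then be an atomic update of the same field,
and a large folio could occupy any run of free slots, aligned or not.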
> > The other aspect of swap files is that we already have different data
> > structures organized around the swap offset: swap_map and swap_cgroup.
> > If we group the swap related data structures together, we can add a
> > pointer to a union of a folio or a shadow swap entry. We can use
> > atomic updates on the swap struct members, or break down the access
> > lock by ranges just like the swap cluster does.
> >
> > I want to discuss those ideas in the upcoming LSF/MM meet up as well.
>
> Looking forward to it!

Thanks, I will post more when I get more progress on that. (A very
rough sketch of the grouped layout I have in mind is at the end of
this mail.)

> Oh, and BTW I'm also trying to break down the swap address space range
> (from 64M to 16M, SWAP_ADDRESS_SPACE_SHIFT from 14 to 12). It's a
> simple approach, but the coupling and increased memory usage of the
> address_space structure makes the performance go into regression
> (about -2% for the worst real-world workload). I found this

Yes, that sounds plausible.

> part very performance sensitive, so basically I'm not making much
> progress on the future items I mentioned in this cover letter. New
> ideas could be very helpful!
>

The swap_struct idea is very different from what you are trying to do
in this series. It is more related to my LSF/MM topic on the swap back
end overhaul, which is a longer-term and bigger undertaking.

Chris
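P.S. For the "group the swap metadata together" part quoted above,
here is the purely illustrative sketch I mentioned, extending the
struct swp_desc example from earlier in this mail. Again, all names,
field sizes and the block granularity are made up; today swap_map is a
separate byte array, swap_cgroup is its own array, and the swap cache
is the xarray:

    /* Hypothetical grouped per-offset metadata. */
    struct swp_desc {
            union {
                    struct folio *folio;    /* folio in swap cache */
                    void *shadow;           /* workingset shadow entry */
            };
            atomic_t map_count;             /* role of today's swap_map byte */
            unsigned short cgroup_id;       /* role of today's swap_cgroup */
    };

    /*
     * Access lock broken down by fixed-size ranges, similar to how
     * swap clusters split up their locking today.
     */
    #define SWP_DESCS_PER_BLOCK     256     /* made-up granularity */

    struct swp_desc_block {
            spinlock_t lock;
            struct swp_desc desc[SWP_DESCS_PER_BLOCK];
    };

Reads of the folio/shadow field could stay lockless; the per-block lock
would only be needed for updates that must be consistent across fields.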