On Sun, Nov 17, 2024 at 8:22 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Mon, Nov 18, 2024 at 05:14:14PM +1300, Barry Song wrote:
> > On Mon, Nov 18, 2024 at 5:03 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > >
> > > On Sat, Nov 16, 2024 at 09:16:58AM +0000, Chen Ridong wrote:
> > > > 2. In the shrink_page_list function, if folioN is a THP (2M), it may be
> > > > split and added to the swap cache folio by folio. After adding to the
> > > > swap cache, it will submit IO to write the folio back to swap, which is
> > > > asynchronous. When shrink_page_list is finished, the isolated folios
> > > > list will be moved back to the head of the inactive lru. The inactive
> > > > lru may then look like this, with 512 folios having been moved to the
> > > > head of the inactive lru.
> > >
> > > I was hoping that we'd be able to stop splitting the folio when adding
> > > to the swap cache. Ideally, we'd add the whole 2MB and write it back
> > > as a single unit.
> >
> > This is already the case: adding to the swapcache doesn’t require splitting
> > THPs, but failing to allocate 2MB of contiguous swap slots will.
>
> Agreed, we need to understand why this is happening. As I've said a few
> times now, we need to stop requiring contiguity. Real filesystems don't
> need the contiguity (they become less efficient, but they can scatter a
> single 2MB folio to multiple places).
>
> Maybe Chris has a solution to this in the works?

Hi Matthew and Chen Ridong,

Sorry for the late reply. I don't have a working solution yet, just some
ideas.

One of the big challenges is what to do with the swap cache. Currently,
when a folio is added to the swap cache, it is assumed to occupy
contiguous swap entries. Breaking that assumption would add a lot of
complexity. To make things worse, discontiguous swap entries might
belong to different xarrays due to the 64M swap address space sharding.

One idea is that we can have a special kind of swap device that does
swap entry redirecting.

For the swap-out path, let's say the real swapfile A is almost full and
we want to allocate 4 (order-2) swap entries for folio F. If there are
contiguous swap entries in A, the swap allocator just returns entries
[A9..A12], with A9 as the head swap entry. That is the same as the
normal path we have now.

On the other hand, suppose there are no contiguous swap entries in A,
only the non-contiguous entries A1, A3, A5, A7. In that case we instead
allocate R1, R2, R3, R4 from a special redirecting swap device R,
together with an IO redirecting array [R1, A1, A3, A5, A7]. Swap device
R is virtual; there is no real file backing it, so the swap file size
on R can grow or shrink as needed.

In add_to_swap_cache(), we set folio F->swap = R1 and add F into swap
cache S with entries [R1..R4] pointing to folio F. In other words,
S[R1..R4] = F. We also add the lookup xarray L[R1..R4] =
[R1, A1, A3, A5, A7]. For the rest of the code, R1 is passed around as
the contiguous head swap entry for folio F.

swap_writepage_bdev_async() will recognize R as a special device. It
will look up L[R1] to get [R1, A1, A3, A5, A7] and use that entry list
to build the bio with 4 bio vecs instead of 1, filling [A1, A3, A5, A7]
into the bio vecs. That is the swap write path.

For swap-in, the page fault handler gets a fault at address X and finds
a pte containing swap entry R3. It looks up the swap cache at S[R3] and
gets nothing: folio F is not in the swap cache. It recognizes that R is
a remapping device, so the swap core looks up L[R3] =
[R1, A1, A3, A5, A7]. If we want to swap in an order-2 folio, we then
construct the swap_read_folio_bdev_async() IO with [A1, A3, A5, A7]. If
we just want to swap in a single 4K page, we can use [A5] alone, given
that R3 is at offset 2 from the head entry R1. That is the read path.
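To make the bookkeeping a bit more concrete, here is a rough, untested
sketch of what the redirect record and the lookup tree L might look
like. All of the names below (struct swap_redirect, swap_redirect_tree,
swap_redirect_add) are made up for illustration; only swp_entry_t,
swp_offset() and the xarray API are existing kernel interfaces, and the
locking and GFP details are hand-waved:

/* Untested sketch -- the identifiers here are hypothetical. */
#include <linux/xarray.h>
#include <linux/slab.h>
#include <linux/overflow.h>
#include <linux/swap.h>
#include <linux/swapops.h>

/*
 * One record per folio that could not get contiguous slots: the head
 * entry R1 on the virtual device R plus the scattered backing slots
 * [A1, A3, A5, A7] on the real swapfile A.
 */
struct swap_redirect {
        swp_entry_t head;               /* R1 */
        unsigned int nr;                /* number of backing slots */
        swp_entry_t backing[];          /* A1, A3, A5, A7, ... */
};

/* L: indexed by the offsets of the virtual entries R1..R4 within R. */
static DEFINE_XARRAY(swap_redirect_tree);

/* Record that virtual entries [R1 .. R1+nr) redirect to @backing[]. */
static int swap_redirect_add(swp_entry_t r1, const swp_entry_t *backing,
                             unsigned int nr)
{
        struct swap_redirect *rd;
        unsigned int i;

        rd = kmalloc(struct_size(rd, backing, nr), GFP_KERNEL);
        if (!rd)
                return -ENOMEM;

        rd->head = r1;
        rd->nr = nr;
        for (i = 0; i < nr; i++)
                rd->backing[i] = backing[i];

        /* Every Rn maps to the same record, so a fault on R3 finds it too. */
        for (i = 0; i < nr; i++) {
                void *old = xa_store(&swap_redirect_tree,
                                     swp_offset(r1) + i, rd, GFP_KERNEL);
                if (xa_is_err(old)) {
                        /* error unwind of earlier slots omitted */
                        kfree(rd);
                        return xa_err(old);
                }
        }
        return 0;
}

swap_writepage_bdev_async() (or a new helper next to it) would consult
the same record at write time. One detail I'm glossing over: a bio
describes a single contiguous range on the backing device, so scattered
slots like [A1, A3, A5, A7] would presumably end up as one bio per
discontiguous run (or a chain of bios sharing a completion) rather than
literally one bio carrying four vecs.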
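On the swap-in side, translating the faulting entry back is just offset
arithmetic against the same record. Again a rough sketch with an
invented name (swap_redirect_lookup); the actual read would still go
through something like swap_read_folio_bdev_async() against the real
device A:

/* Untested sketch, continuing from swap_redirect_add() above. */

/*
 * Translate a faulting virtual entry (e.g. R3) into the real backing
 * slot on swapfile A (A5 in the running example), so the caller can
 * issue the read against the real device. The caller has already
 * checked that swp_type(entry) is the redirect device R.
 */
static swp_entry_t swap_redirect_lookup(swp_entry_t entry)
{
        struct swap_redirect *rd;
        unsigned long idx;

        rd = xa_load(&swap_redirect_tree, swp_offset(entry));
        if (!rd)
                return entry;           /* not redirected, use as-is */

        idx = swp_offset(entry) - swp_offset(rd->head);  /* R3 - R1 = 2 */
        if (WARN_ON_ONCE(idx >= rd->nr))
                return entry;

        return rd->backing[idx];        /* A5 */
}

A larger-order swap-in would walk backing[0..nr) the same way, reading
each discontiguous slot (or merging adjacent ones) into the right
subpages of the folio.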
For simplicity, a lot of detail has been omitted from this description.
Also, on the implementation side there are a lot of optimizations we
might be able to do, e.g. using a direct pointer lookup for R1 instead
of an xarray, or using a struct to hold R1 and [A1, A3, A5, A7], etc.

This approach avoids a lot of the complexity of breaking the contiguity
assumption for swap cache entries, at the cost of the additional swap
cache address space R. The lookup mapping L[R1..R4] =
[R1, A1, A3, A5, A7] is the minimal data structure needed to track the
IO remapping; I think that is unavoidable.

Please let me know if you see any problems with the above approach. As
always, feedback is welcome.

Thanks

Chris