Re: [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"

Chris Li <chrisl@xxxxxxxxxx> · Wed, 6 Mar 2024 09:56:41 -0800

On Tue, Mar 5, 2024 at 10:05 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
>
> On Wed, Mar 6, 2024 at 4:00 PM Chris Li <chrisl@xxxxxxxxxx> wrote:
> >
> > On Tue, Mar 5, 2024 at 5:15 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> > > > Another limitation I would like to address is that swap_writepage can
> > > > only write out IO in one contiguous chunk, not able to perform
> > > > non-continuous IO. When the swapfile is close to full, it is likely
> > > > the unused entry will spread across different locations. It would be
> > > > nice to be able to read and write large folio using discontiguous disk
> > > > IO locations.
> > >
> > > I don't find it will be too difficult for swap_writepage to only write
> > > out a large folio which has discontiguous swap offsets. taking
> > > zRAM as an example, as long as bio can be organized correctly,
> > > zram should be able to write a large folio one by one for its all
> > > subpages.
> >
> > Yes.
> >
> > >
> > > static void zram_bio_write(struct zram *zram, struct bio *bio)
> > > {
> > >         unsigned long start_time = bio_start_io_acct(bio);
> > >         struct bvec_iter iter = bio->bi_iter;
> > >
> > >         do {
> > >                 u32 index = iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
> > >                 u32 offset = (iter.bi_sector & (SECTORS_PER_PAGE - 1)) <<
> > >                                 SECTOR_SHIFT;
> > >                 struct bio_vec bv = bio_iter_iovec(bio, iter);
> > >
> > >                 bv.bv_len = min_t(u32, bv.bv_len, PAGE_SIZE - offset);
> > >
> > >                 if (zram_bvec_write(zram, &bv, index, offset, bio) < 0) {
> > >                         atomic64_inc(&zram->stats.failed_writes);
> > >                         bio->bi_status = BLK_STS_IOERR;
> > >                         break;
> > >                 }
> > >
> > >                 zram_slot_lock(zram, index);
> > >                 zram_accessed(zram, index);
> > >                 zram_slot_unlock(zram, index);
> > >
> > >                 bio_advance_iter_single(bio, &iter, bv.bv_len);
> > >         } while (iter.bi_size);
> > >
> > >         bio_end_io_acct(bio, start_time);
> > >         bio_endio(bio);
> > > }
> > >
> > > right now, add_to_swap() is lacking a way to record discontiguous
> > > offset for each subpage, alternatively, we have a folio->swap.
> > >
> > > I wonder if we can somehow make it page granularity, for each
> > > subpage, it can have its own offset somehow like page->swap,
> > > then in swap_writepage(), we can make a bio with multiple
> > > discontiguous I/O index. then we allow add_to_swap() to get
> > > nr_pages different swap offsets, and fill into each subpage.
> >
> > The key is where to store the subpage offset. It can't be stored on
> > the tail page's page->swap because some tail page's page struct are
> > just mapping of the head page's page struct. I am afraid this mapping
> > relationship has to be stored on the swap back end. That is the idea,
> > have swap backend keep track of an array of subpage's swap location.
> > This array is looked up by the head swap offset.
>
> I assume "some tail page's page struct are just mapping of the head
> page's page struct" is only true of hugeTLB larger than PMD-mapped
> hugeTLB (for example 2MB) for this moment? more widely mTHP
> less than PMD-mapped size will still have all tail page struct?

That is HVO (HugeTLB Vmemmap Optimization) for huge pages. Yes, I
consider using the tail page struct to store the swap entry a step back
from the folio. The point of the folio is that all these 4k pages share
the same properties and can be treated as one big page. If we move to
the memdesc world, those tail page structs will not exist at all. It is
doable in some situations, I am just not sure it aligns with our future
goal.

>
> "Having swap backend keep track of an array of subpage's swap
> location" means we will save this metadata on swapfile?  will we
> have more I/O especially if a large folio's mapping area might be
> partially unmap, for example, by MADV_DONTNEED even after
> the large folio is swapped-out, then we have to update the
> metadata? right now, we only need to change PTE entries
> and swap_map[] for the same case. do we have some way to keep
> those data in memory instead?

I am actually considering keeping those arrays in memory, indexed by an
xarray and looked up by the head swap entry offset.
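
As a rough userspace sketch of that idea (not kernel code): each large
swap entry owns an array of per-subpage swap offsets, and the array is
found by the head swap offset. The kernel side would presumably use an
xarray; a small open-addressing table stands in for it here. All names
(struct large_swap_entry, swap_index_store, swap_index_lookup) are
hypothetical, and removal and locking are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical: one record per swapped-out large folio. */
struct large_swap_entry {
	unsigned long head;	/* head swap offset, the lookup key */
	unsigned int nr;	/* number of subpages */
	unsigned long *offsets;	/* offsets[i]: swap offset of subpage i */
};

#define TABLE_SLOTS 64		/* stand-in for the xarray; power of two */

struct swap_index {
	struct large_swap_entry *slot[TABLE_SLOTS];
};

static size_t slot_of(unsigned long head, size_t probe)
{
	return (head + probe) & (TABLE_SLOTS - 1);
}

/* Copy the offsets and file them under the head offset. 0 on success. */
int swap_index_store(struct swap_index *idx, unsigned long head,
		     const unsigned long *offsets, unsigned int nr)
{
	assert(nr > 0);
	for (size_t probe = 0; probe < TABLE_SLOTS; probe++) {
		size_t s = slot_of(head, probe);
		struct large_swap_entry *e = idx->slot[s];

		if (e) {
			if (e->head == head)
				return -1;	/* already tracked */
			continue;		/* occupied, keep probing */
		}
		e = malloc(sizeof(*e));
		if (!e)
			return -1;
		e->offsets = malloc(nr * sizeof(*offsets));
		if (!e->offsets) {
			free(e);
			return -1;
		}
		memcpy(e->offsets, offsets, nr * sizeof(*offsets));
		e->head = head;
		e->nr = nr;
		idx->slot[s] = e;
		return 0;
	}
	return -1;				/* table full */
}

/* Find the subpage offset array by head swap offset, or NULL. */
const struct large_swap_entry *
swap_index_lookup(const struct swap_index *idx, unsigned long head)
{
	for (size_t probe = 0; probe < TABLE_SLOTS; probe++) {
		const struct large_swap_entry *e = idx->slot[slot_of(head, probe)];

		if (!e)
			return NULL;
		if (e->head == head)
			return e;
	}
	return NULL;
}
```

From the folio's side only the head offset is visible; the write-out
path would walk offsets[0..nr-1] to build the discontiguous IO.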

>
> >
> > > But will this be a step back for folio?
> >
> > I think this should be separate from the folio. It is on the swap
> > backend. From folio's point of view, it is just writing out a folio.
> > The swap back end knows how to write out into subpage locations. From
> > folio's point of view. It is just one swap page write.
> >
> > > > Some possible ideas for the fragmentation issue.
> > > >
> > > > a) buddy allocator for swap entries. Similar to the buddy allocator
> > > > in memory. We can use a buddy allocator system for the swap entry to
> > > > avoid the low order swap entry fragment too much of the high order
> > > > swap entry. It should greatly reduce the fragmentation caused by
> > > > allocate and free of the swap entry of different sizes. However the
> > > > buddy allocator has its own limit as well. Unlike system memory, we
> > > > can move and compact the memory. There is no rmap for swap entry, it
> > > > is much harder to move a swap entry to another disk location. So the
> > > > buddy allocator for swap will help, but not solve all the
> > > > fragmentation issues.
> > >
> > > I agree buddy will help. Meanwhile, we might need some way similar
> > > with MOVABLE, UNMOVABLE migratetype. For example, try to gather
> > > swap applications for small folios together and don't let them spread
> > > throughout the whole swapfile.
> > > we might be able to dynamically classify swap clusters to be for small
> > > folios, for large folios, and avoid small folios to spread all clusters.
> >
> > This really depends on the swap entries allocation and free cycle. In
> > this extreme case, all swap entries have been allocated full.
> > Then it frees some of the 4K entries at discontiguous locations. Buddy
> > allocator or cluster allocator are not going to save you from ending
> > up with fragmented swap entries.  That is why I think we still need
> > b).
>
> I agree. I believe that classifying clusters has the potential to alleviate
> fragmentation to some degree while it can not resolve it. We can
> to some extent prevent the spread of small swaps' applications.

Yes, as I stated earlier, it will help but not solve it completely.
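
The extreme case from earlier in the thread can be shown with a toy
model. This is only a sketch: swap_map is simplified to one byte per
entry (nonzero means in use), and has_free_block() and count_free() are
hypothetical helpers, not kernel functions.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if some naturally aligned run of (1 << order) entries is
 * entirely free. This is what a buddy-style allocator needs in order to
 * hand out a large swap entry of that order.
 */
bool has_free_block(const unsigned char *swap_map, size_t nr_entries,
		    unsigned int order)
{
	size_t span = (size_t)1 << order;
	size_t base, i;

	for (base = 0; base + span <= nr_entries; base += span) {
		for (i = 0; i < span; i++)
			if (swap_map[base + i])
				break;
		if (i == span)
			return true;
	}
	return false;
}

/* Count free 4K entries regardless of where they are. */
size_t count_free(const unsigned char *swap_map, size_t nr_entries)
{
	size_t i, nr_free = 0;

	for (i = 0; i < nr_entries; i++)
		if (!swap_map[i])
			nr_free++;
	return nr_free;
}
```

Fill a 64-entry map, then free every fourth entry: count_free() reports
16 free entries, yet has_free_block() fails for every order above 0. No
allocation policy can undo that after the fact, since the entries cannot
be moved.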

>
> >
> > > > b) Large swap entries. Take file as an example, a file on the file
> > > > system can write to a discontinuous disk location. The file system
> > > > responsible for tracking how to map the file offset into disk
> > > > location. A large swap entry can have a similar indirection array map
> > > > out the disk location for different subpages within a folio.  This
> > > > allows a large folio to write out discontiguous swap entries on the
> > > > swap file. The array will need to be stored somewhere as part of the
> > > > overhead. When allocating swap entries for the folio, we can allocate a
> > > > batch of smaller 4k swap entries into an array. Use this array to
> > > > read/write the large folio. There will be a lot of plumbing work to
> > > > get it to work.
> > >
> > > we already have page struct, i wonder if we can record the offset
> > > there if this is not a step back to folio. on the other hand, while
> >
> > No for the tail pages. Because some of the tail page "struct page" are
> > just remapping of the head page "struct page".
> >
> > > swap-in, we can also allow large folios to be swapped in from
> > > discontiguous places, and those offsets are actually also in PTE
> > > entries.
> >
> > These discontiguous subpage locations need to be stored outside of the folio.
> > Keep in mind that you can have more than one PTE in different
> > processes. Those PTE on different processes might not agree with each
> > other. BTW, shmem stores the swap entry in the page cache, not the PTE.
>
> I don't quite understand what you mean by "Those PTE on different
> processes might not agree with each other". Can we have a concrete
> example?

Process A allocates memory backed by a large folio, then A forks
process B. Both A and B swap out the large folio. Then B uses madvise()
(e.g. MADV_DONTNEED) to zap some PTEs from the large folio (maybe the
zap happens before the swap out), while A does not change the large
folio at all.
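
A toy model of that state, as a sketch only: each process is reduced to
an array of swap offsets standing in for its PTEs, and swap_map is a
plain per-entry reference count. struct toy_mm and the toy_* helpers are
made up for illustration, and offset 0 is reserved to mean "no entry".

```c
#include <stddef.h>

#define NR_SUBPAGES 4
#define PTE_NONE 0UL	/* toy encoding: 0 means "no swap entry here" */

struct toy_mm {
	unsigned long pte[NR_SUBPAGES];	/* swap offset per subpage */
};

/*
 * Point every subpage PTE at consecutive swap offsets starting at
 * 'first' (must be nonzero) and take one reference on each entry.
 */
void toy_map_swapped(struct toy_mm *mm, unsigned long first,
		     unsigned char *swap_map)
{
	for (size_t i = 0; i < NR_SUBPAGES; i++) {
		mm->pte[i] = first + i;
		swap_map[first + i]++;
	}
}

/* Zap 'nr' PTEs starting at subpage 'start', dropping their references. */
void toy_zap(struct toy_mm *mm, size_t start, size_t nr,
	     unsigned char *swap_map)
{
	for (size_t i = start; i < start + nr && i < NR_SUBPAGES; i++) {
		if (mm->pte[i] == PTE_NONE)
			continue;
		swap_map[mm->pte[i]]--;
		mm->pte[i] = PTE_NONE;
	}
}

/* How many subpages this process still maps to swap. */
size_t toy_nr_mapped(const struct toy_mm *mm)
{
	size_t n = 0;

	for (size_t i = 0; i < NR_SUBPAGES; i++)
		if (mm->pte[i] != PTE_NONE)
			n++;
	return n;
}
```

Map the same four entries in A and B, then zap subpages 1 and 2 in B
only: A still sees all four subpages, B sees two, and the per-entry
counts are 2, 1, 1, 2. The two sets of PTEs no longer describe the same
large folio.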

> I assume this is also true for small folios but it won't be a problem
> as the process which is doing swap-in only cares about its own
> PTE entries?

It will be a challenge if we maintain a large swap entry with its
internal array mapping to different swap device offsets. You get
different partial mappings of the same large folio. That is a problem
we need to solve, I don't have all the answers yet.

Chris

>
> > >
> > > I feel we have "page" to record offset before pageout() is done
> > > and we have PTE entries to record offset after pageout() is
> > > done.
> > >
> > > But still (a) is needed as we really hope large folios can be put
> > > in contiguous offsets, with this, we might have other benefit
> > > like saving the whole compressed large folio as one object rather than
> > > nr_pages objects in zsmalloc and decompressing them together
> > > while swapping  in (a patchset is coming in a couple of days for this).
> > > when a large folio is put in nr_pages different places, hardly can we do
> > > this in zsmalloc. But at least, we can still swap-out large folios
> > > without splitting and swap-in large folios though we read it
> > > back from nr_pages different objects.
> >
> > Exactly.
> >
> > Chris
>
> Thanks
> Barry
>