Re: [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"

Barry Song <21cnbao@xxxxxxxxx> · Wed, 6 Mar 2024 19:05:06 +1300

On Wed, Mar 6, 2024 at 4:00 PM Chris Li <chrisl@xxxxxxxxxx> wrote:
>
> On Tue, Mar 5, 2024 at 5:15 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
> > > Another limitation I would like to address is that swap_writepage can
> > > only write out IO in one contiguous chunk, not able to perform
> > > non-continuous IO. When the swapfile is close to full, it is likely
> > > the unused entry will spread across different locations. It would be
> > > nice to be able to read and write large folio using discontiguous disk
> > > IO locations.
> >
> > I don't think it would be too difficult for swap_writepage to write
> > out a large folio which has discontiguous swap offsets. Taking
> > zRAM as an example, as long as the bio can be organized correctly,
> > zram should be able to write a large folio's subpages one by
> > one.
>
> Yes.
>
> >
> > static void zram_bio_write(struct zram *zram, struct bio *bio)
> > {
> >         unsigned long start_time = bio_start_io_acct(bio);
> >         struct bvec_iter iter = bio->bi_iter;
> >
> >         do {
> >                 u32 index = iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
> >                 u32 offset = (iter.bi_sector & (SECTORS_PER_PAGE - 1)) <<
> >                                 SECTOR_SHIFT;
> >                 struct bio_vec bv = bio_iter_iovec(bio, iter);
> >
> >                 bv.bv_len = min_t(u32, bv.bv_len, PAGE_SIZE - offset);
> >
> >                 if (zram_bvec_write(zram, &bv, index, offset, bio) < 0) {
> >                         atomic64_inc(&zram->stats.failed_writes);
> >                         bio->bi_status = BLK_STS_IOERR;
> >                         break;
> >                 }
> >
> >                 zram_slot_lock(zram, index);
> >                 zram_accessed(zram, index);
> >                 zram_slot_unlock(zram, index);
> >
> >                 bio_advance_iter_single(bio, &iter, bv.bv_len);
> >         } while (iter.bi_size);
> >
> >         bio_end_io_acct(bio, start_time);
> >         bio_endio(bio);
> > }
> >
> > Right now, add_to_swap() lacks a way to record a discontiguous
> > offset for each subpage; instead, we only have folio->swap.
> >
> > I wonder if we can somehow make it page granularity, for each
> > subpage, it can have its own offset somehow like page->swap,
> > then in swap_writepage(), we can make a bio with multiple
> > discontiguous I/O index. then we allow add_to_swap() to get
> > nr_pages different swap offsets, and fill into each subpage.
>
> The key is where to store the subpage offset. It can't be stored on
> the tail page's page->swap because some tail page's page struct are
> just mapping of the head page's page struct. I am afraid this mapping
> relationship has to be stored on the swap back end. That is the idea,
> have swap backend keep track of an array of subpage's swap location.
> This array is looked up by the head swap offset.

I assume "some tail page's page struct are just mapping of the head
page's page struct" is, at the moment, only true of hugeTLB larger
than PMD-mapped hugeTLB (for example 2MB)? The more widely used mTHP,
smaller than the PMD-mapped size, will still have all of its tail page
structs?

"Having swap backend keep track of an array of subpage's swap
location" means we will save this metadata in the swapfile? Will that
add more I/O, especially if a large folio's mapped area is partially
unmapped, for example by MADV_DONTNEED, after the large folio has been
swapped out, so that we then have to update the metadata? Right now, we
only need to change the PTE entries and swap_map[] in that case. Do we
have some way to keep this data in memory instead?
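
To make the question concrete, here is a rough userspace sketch of what
an in-memory version of that array could look like. None of these names
exist in the kernel; struct large_swap_entry, large_swap_alloc(),
large_swap_lookup() and large_swap_drop() are hypothetical, and the
sketch ignores locking and how the entry is found from the head swap
offset. The point is only that a partial unmap touches memory and needs
no extra I/O.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define SWAP_SLOT_NONE ((unsigned long)-1)

/* Hypothetical in-memory descriptor for one swapped-out large folio. */
struct large_swap_entry {
	unsigned long head;	/* head swap offset, the lookup key */
	unsigned int nr;	/* number of subpages in the folio */
	unsigned int nr_live;	/* subpages whose slot is still in use */
	unsigned long slot[];	/* swap offset of each subpage */
};

/* Record nr (possibly discontiguous) swap offsets for one folio. */
static struct large_swap_entry *
large_swap_alloc(const unsigned long *offsets, unsigned int nr)
{
	struct large_swap_entry *e;

	if (nr == 0)
		return NULL;
	e = malloc(sizeof(*e) + nr * sizeof(e->slot[0]));
	if (!e)
		return NULL;
	e->head = offsets[0];
	e->nr = nr;
	e->nr_live = nr;
	memcpy(e->slot, offsets, nr * sizeof(e->slot[0]));
	return e;
}

/* Translate a subpage index into its swap offset. */
static unsigned long
large_swap_lookup(const struct large_swap_entry *e, unsigned int idx)
{
	return idx < e->nr ? e->slot[idx] : SWAP_SLOT_NONE;
}

/*
 * Partial unmap (e.g. MADV_DONTNEED on one subpage): only the in-memory
 * array changes. Returns how many subpages are still live.
 */
static unsigned int
large_swap_drop(struct large_swap_entry *e, unsigned int idx)
{
	if (idx < e->nr && e->slot[idx] != SWAP_SLOT_NONE) {
		e->slot[idx] = SWAP_SLOT_NONE;
		e->nr_live--;
	}
	return e->nr_live;
}
```

The cost is one unsigned long per subpage held in memory for as long as
the folio stays swapped out, which is the overhead I would like to
understand better.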

>
> > But will this be a step back for folio?
>
> I think this should be separate from the folio. It is on the swap
> backend. From folio's point of view, it is just writing out a folio.
> The swap back end knows how to write out into subpage locations. From
> folio's point of view. It is just one swap page write.
>
> > > Some possible ideas for the fragmentation issue.
> > >
> > > a) buddy allocator for swap entities. Similar to the buddy allocator
> > > in memory. We can use a buddy allocator system for the swap entry to
> > > avoid the low order swap entry fragment too much of the high order
> > > swap entry. It should greatly reduce the fragmentation caused by
> > > allocate and free of the swap entry of different sizes. However the
> > > buddy allocator has its own limit as well. Unlike system memory, we
> > > can move and compact the memory. There is no rmap for swap entry, it
> > > is much harder to move a swap entry to another disk location. So the
> > > buddy allocator for swap will help, but not solve all the
> > > fragmentation issues.
> >
> > I agree buddy will help. Meanwhile, we might need some way similar
> > with MOVABLE, UNMOVABLE migratetype. For example, try to gather
> > swap applications for small folios together and don't let them spread
> > throughout the whole swapfile.
> > we might be able to dynamically classify swap clusters to be for small
> > folios, for large folios, and avoid small folios to spread all clusters.
>
> This really depends on the swap entry allocation and free cycle. In
> the extreme case, all swap entries have been allocated.
> Then some of the 4K entries are freed at discontiguous locations. The
> buddy allocator or cluster allocator is not going to save you from
> ending up with fragmented swap entries.  That is why I think we still need
> b).

I agree. I believe that classifying clusters can alleviate
fragmentation to some degree, though it cannot resolve it. We can
to some extent prevent swap allocations for small folios from
spreading across all clusters.
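
A rough userspace sketch of the classification idea, under heavy
simplifying assumptions: each cluster only tracks a usage count (not
which slots are contiguous), and the names enum cluster_kind,
struct swap_cluster, cluster_alloc() and cluster_free() are all made up
for illustration. Small allocations prefer clusters already used for
small folios, and only spill into other clusters as a last resort.

```c
#include <assert.h>

#define NR_CLUSTERS		4
#define SLOTS_PER_CLUSTER	8

enum cluster_kind { CLUSTER_FREE, CLUSTER_SMALL, CLUSTER_LARGE };

/* Hypothetical per-cluster state: what it is used for and how full it is. */
struct swap_cluster {
	enum cluster_kind kind;
	unsigned int used;
};

/* Returns the index of the cluster that took the allocation, or -1. */
static int cluster_alloc(struct swap_cluster *c, enum cluster_kind want,
			 unsigned int nr)
{
	int i;

	/* 1. Prefer a cluster already dedicated to this kind. */
	for (i = 0; i < NR_CLUSTERS; i++)
		if (c[i].kind == want && c[i].used + nr <= SLOTS_PER_CLUSTER)
			goto found;

	/* 2. Otherwise claim a completely free cluster for this kind. */
	for (i = 0; i < NR_CLUSTERS; i++)
		if (c[i].kind == CLUSTER_FREE) {
			c[i].kind = want;
			goto found;
		}

	/*
	 * 3. Last resort, small allocations only: spill into any cluster
	 * with room. This is where classification stops helping.
	 */
	if (want == CLUSTER_SMALL)
		for (i = 0; i < NR_CLUSTERS; i++)
			if (c[i].used + nr <= SLOTS_PER_CLUSTER)
				goto found;

	return -1;
found:
	c[i].used += nr;
	return i;
}

/* An emptied cluster goes back to the free pool and can be reclassified. */
static void cluster_free(struct swap_cluster *c, int i, unsigned int nr)
{
	c[i].used -= nr;
	if (c[i].used == 0)
		c[i].kind = CLUSTER_FREE;
}
```

Step 3 is exactly the case Chris describes: once everything is nearly
full, classification alone cannot keep small entries out of the large
clusters.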

>
> > > b) Large swap entries. Take file as an example, a file on the file
> > > system can write to a discontinuous disk location. The file system
> > > responsible for tracking how to map the file offset into disk
> > > location. A large swap entry can have a similar indirection array map
> > > out the disk location for different subpages within a folio.  This
> > > allows a large folio to write out discontiguous swap entries on the
> > > swap file. The array will need to be stored somewhere as part of the
> > > overhead. When allocating swap entries for the folio, we can allocate a
> > > batch of smaller 4k swap entries into an array. Use this array to
> > > read/write the large folio. There will be a lot of plumbing work to
> > > get it to work.
> >
> > we already have page struct, i wonder if we can record the offset
> > there if this is not a step back to folio. on the other hand, while
>
> No for the tail pages. Because some of the tail page "struct page" are
> just remapping of the head page "struct page".
>
> > swap-in, we can also allow large folios to be swapped in from
> > discontiguous places and those offsets are actually also in PTE
> > entries.
>
> This discontiguous subpage location needs to be stored outside of the
> folio. Keep in mind that you can have more than one PTE in different
> processes. Those PTE on different processes might not agree with each
> other. BTW, shmem store the swap entry in page cache not PTE.

I don't quite understand what you mean by "Those PTE on different
processes might not agree with each other". Can we have a concrete
example?
I assume this is also true for small folios but it won't be a problem
as the process which is doing swap-in only cares about its own
PTE entries?

> >
> > I feel we have "page" to record offset before pageout() is done
> > and we have PTE entries to record offset after pageout() is
> > done.
> >
> > But still (a) is needed as we really hope large folios can be put
> > in contiguous offsets, with this, we might have other benefit
> > like saving the whole compressed large folio as one object rather than
> > nr_pages objects in zsmalloc and decompressing them together
> > while swapping  in (a patchset is coming in a couple of days for this).
> > when a large folio is put in nr_pages different places, hardly can we do
> > this in zsmalloc. But at least, we can still swap-out large folios
> > without splitting and swap-in large folios though we read it
> > back from nr_pages different objects.
>
> Exactly.
>
> Chris

Thanks
Barry