Re: [LSF/MM/BPF TOPIC] Swap Abstraction "the pony"

On 06.03.24 07:05, Barry Song wrote:
On Wed, Mar 6, 2024 at 4:00 PM Chris Li <chrisl@xxxxxxxxxx> wrote:

On Tue, Mar 5, 2024 at 5:15 PM Barry Song <21cnbao@xxxxxxxxx> wrote:
Another limitation I would like to address is that swap_writepage() can
only write out IO in one contiguous chunk and is not able to perform
non-contiguous IO. When the swapfile is close to full, the unused entries
are likely to be spread across different locations. It would be nice to
be able to read and write a large folio using discontiguous disk IO
locations.
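
For reference, the current bdev write-out path boils down to roughly the
following (heavily paraphrased, not the exact mm/page_io.c code): a single
bio, a single starting sector derived from the folio's swap entry, and the
whole folio attached to it, which is why the swap slots backing a large
folio have to be contiguous today:

        struct bio *bio;

        bio = bio_alloc(sis->bdev, 1,
                        REQ_OP_WRITE | REQ_SWAP | wbc_to_write_flags(wbc),
                        GFP_NOIO);
        /* one starting sector for the whole folio */
        bio->bi_iter.bi_sector = swap_page_sector(&folio->page);
        bio_add_folio(bio, folio, folio_size(folio), 0);
        folio_start_writeback(folio);
        folio_unlock(folio);
        submit_bio(bio);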

I don't think it will be too difficult for swap_writepage() to write
out a large folio that has discontiguous swap offsets. Taking zRAM as
an example, as long as the bio is organized correctly, zram should be
able to write out a large folio one subpage at a time.

Yes.


static void zram_bio_write(struct zram *zram, struct bio *bio)
{
         unsigned long start_time = bio_start_io_acct(bio);
         struct bvec_iter iter = bio->bi_iter;

         do {
                 u32 index = iter.bi_sector >> SECTORS_PER_PAGE_SHIFT;
                 u32 offset = (iter.bi_sector & (SECTORS_PER_PAGE - 1)) <<
                                 SECTOR_SHIFT;
                 struct bio_vec bv = bio_iter_iovec(bio, iter);

                 bv.bv_len = min_t(u32, bv.bv_len, PAGE_SIZE - offset);

                 if (zram_bvec_write(zram, &bv, index, offset, bio) < 0) {
                         atomic64_inc(&zram->stats.failed_writes);
                         bio->bi_status = BLK_STS_IOERR;
                         break;
                 }

                 zram_slot_lock(zram, index);
                 zram_accessed(zram, index);
                 zram_slot_unlock(zram, index);

                 bio_advance_iter_single(bio, &iter, bv.bv_len);
         } while (iter.bi_size);

         bio_end_io_acct(bio, start_time);
         bio_endio(bio);
}

Right now, add_to_swap() lacks a way to record a discontiguous offset
for each subpage; all we have is folio->swap.

I wonder if we can somehow make this page granularity: each subpage
could carry its own offset, something like a per-page page->swap. Then
swap_writepage() could issue IO to multiple discontiguous offsets, and
add_to_swap() would be allowed to obtain nr_pages different swap
offsets and fill one into each subpage.
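
Just to make the idea concrete, here is a very rough sketch, assuming a
hypothetical swap_entry_for_subpage() helper that returns the (possibly
discontiguous) entry allocated for subpage i; it ignores swap extents and
error handling, and submits one bio per subpage because a single bio can
only describe one contiguous range of sectors:

static void swap_write_folio_discontig(struct folio *folio,
                                       struct writeback_control *wbc)
{
        struct swap_info_struct *sis = swp_swap_info(folio->swap);
        long i;

        for (i = 0; i < folio_nr_pages(folio); i++) {
                /* hypothetical helper, does not exist today */
                swp_entry_t entry = swap_entry_for_subpage(folio, i);
                struct bio *bio;

                bio = bio_alloc(sis->bdev, 1,
                                REQ_OP_WRITE | REQ_SWAP | wbc_to_write_flags(wbc),
                                GFP_NOIO);
                /* simplified: ignores the swap extent / swap-over-file mapping */
                bio->bi_iter.bi_sector = swp_offset(entry) <<
                                         (PAGE_SHIFT - SECTOR_SHIFT);
                bio_add_folio(bio, folio, PAGE_SIZE, i * PAGE_SIZE);
                /* reuse the existing write completion in mm/page_io.c */
                bio->bi_end_io = end_swap_bio_write;
                submit_bio(bio);
        }
}

Subpages whose offsets happen to be adjacent could of course be merged
into one larger bio instead of getting one bio each.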

The key question is where to store the subpage offsets. They can't be
stored in the tail pages' page->swap, because some tail pages' page
structs are just a mapping of the head page's page struct. I am afraid
this mapping has to be stored in the swap backend. That is the idea:
have the swap backend keep track of an array of each subpage's swap
location, looked up by the head swap offset.
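
Something along these lines, purely as an illustration (nothing like
this exists in the swap code today): the backend would keep, per large
folio, the real per-subpage offsets, keyed by the head offset that
folio->swap already records:

struct swap_subpage_map {
        pgoff_t         head_offset;      /* offset stored in folio->swap */
        unsigned int    nr_pages;         /* subpages in the large folio */
        pgoff_t         subpage_offset[]; /* real, possibly discontiguous offsets */
};

swap_writepage() would look the array up by head_offset when building
the IO, and the array would be freed once all of the entries are
released.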

I assume "some tail page's page struct are just mapping of the head
page's page struct" is only true of hugeTLB larger than PMD-mapped
hugeTLB (for example 2MB) for this moment? more widely mTHP
less than PMD-mapped size will still have all tail page struct?

We just successfully stopped using subpages to store swap offsets, and
even accidentally fixed a bug that had been lurking for years. I am
confident that we don't want to go back. The current direction is to
move as much information as we can out of the subpages, so if we can
find ways to avoid messing with subpages, that would be great.

--
Cheers,

David / dhildenb




