Re: Swap Min Order

Chris Li <chrisl@xxxxxxxxxx> · Wed, 8 Jan 2025 13:09:12 -0800

On Tue, Jan 7, 2025 at 8:41 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 07.01.25 13:29, Daniel Gomez wrote:
> > On Tue, Jan 07, 2025 at 11:31:05AM +0100, David Hildenbrand wrote:
> >> On 07.01.25 10:43, Daniel Gomez wrote:
> >>> Hi,
> >>
> >> Hi,
> >>
> >>>
> >>> High-capacity SSDs require writes to be aligned with the drive's
> >>> indirection unit (IU), which is typically >4 KiB, to avoid RMW. To
> >>> support swap on these devices, we need to ensure that writes do not
> >>> cross IU boundaries. So, I think this may require increasing the minimum
> >>> allocation size for swap users.
> >>
> >> How would we handle swapout/swapin when we have smaller pages (just imagine
> >> someone does a mmap(4KiB))?
> >
> > Swapout would need to be aligned to the IU. An mmap of 4 KiB would
> > have to perform an IU-sized write, e.g. 16 KiB or 32 KiB, to avoid any
> > potential RMW penalty. So, I think aligning the mmap allocation to the
> > IU would guarantee a write of the required granularity and alignment.
>
> We must be prepared to handle any VMA layout with single-page VMAs,
> single-page holes etc ... :/ IMHO we should try to handle this
> transparently to the application.
>
> > But let's also look at your suggestion below with swapcache.
> >
> > Swapin can still be performed at LBA format levels (e.g. 4 KiB) without
> > the same write penalty implications, and only affecting performance
> > if I/Os are not conformant to these boundaries. So, reading at IU
> > boundaries is preferred to get optimal performance, not a 'requirement'.
> >
> >>
> >> Could this be something that gets abstracted/handled by the swap
> >> implementation? (i.e., multiple small folios get added to the swapcache but
> >> get written out / read in as a single unit?).
> >
> > Do you mean merging like in the block layer? I'm not entirely sure
> > this could guarantee the I/O boundaries as deterministically as min
> > order large folio allocations do in the page cache. But I guess it
> > is worth exploring as an optimization.
>
> Maybe the swapcache could somehow abstract that? We currently have the
> swap slot allocator, that assigns slots to pages.
>
> Assuming we have a 16 KiB BS but a 4 KiB page, we might have various
> options to explore.
>
> For example, we could size swap slots 16 KiB, and assign even 4 KiB
> pages a single slot. This would waste swap space with small folios, that
> would go away with large folios.

We can group multiple 4K swap entries into one 16K write unit.
There will be no waste of the SSD.
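
As a rough illustration of the grouping only (a userspace sketch, not
kernel code and not the "virtual swapfile" design): assuming 4 KiB swap
slots and a 16 KiB write unit, slots can be bucketed by the write unit
they fall in, and a unit is written once all of its slots are filled.
The names write_unit_of, group_slots and full_units are hypothetical and
exist only for this example.

```python
PAGE_SIZE = 4 * 1024       # assumed swap slot size (4 KiB)
WRITE_UNIT = 16 * 1024     # assumed IU / block size (16 KiB example from the thread)
SLOTS_PER_UNIT = WRITE_UNIT // PAGE_SIZE


def write_unit_of(slot):
    """Return the index of the write unit that contains this 4 KiB slot."""
    return slot // SLOTS_PER_UNIT


def group_slots(slots):
    """Map write-unit index -> sorted list of the given slots inside it."""
    units = {}
    for slot in sorted(set(slots)):
        units.setdefault(write_unit_of(slot), []).append(slot)
    return units


def full_units(slots):
    """Write units whose slots are all present, so one aligned write covers them."""
    groups = group_slots(slots)
    return sorted(u for u, members in groups.items()
                  if len(members) == SLOTS_PER_UNIT)
```

With slots 0-3 and 5 filled, unit 0 is complete and can go out as one
aligned 16 KiB write, while unit 1 still holds a single 4 KiB entry and
would have to wait or be padded. How partially filled units are handled
is the open design question; the sketch does not answer it.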

>
> If we stick to 4 KiB swap slots, maybe pageout() could be taught to
> effectively writeback "everything" residing in the relevant swap slots
> that span a BS?
>
> I recall there was a discussion about atomic writes involving multiple
> pages, and how it is hard. Maybe with swapping it is "easier"? Absolutely
> no expert on that, unfortunately. Hoping Chris has some ideas.

Yes, see my other email about the "virtual swapfile" idea. More
detailed write up coming next week.

Chris

>
>
> >
> >>
> >> I recall that we have been talking about a better swap abstraction for years
> >> :)
> >
> > Adding Chris Li to the cc list in case he has more input.
> >
> >>
> >> Might be a good topic for LSF/MM (might or might not be a better place than
> >> the MM alignment session).
> >
> > Both options work for me. LSF/MM is in 12 weeks so, having a previous
> > session would be great.
>
> Both work for me.
>
> --
> Cheers,
>
> David / dhildenb
>