On Tue, Jan 7, 2025 at 8:41 AM David Hildenbrand <david@xxxxxxxxxx> wrote:
>
> On 07.01.25 13:29, Daniel Gomez wrote:
> > On Tue, Jan 07, 2025 at 11:31:05AM +0100, David Hildenbrand wrote:
> >> On 07.01.25 10:43, Daniel Gomez wrote:
> >>> Hi,
> >>
> >> Hi,
> >>
> >>>
> >>> High-capacity SSDs require writes to be aligned with the drive's
> >>> indirection unit (IU), which is typically >4 KiB, to avoid RMW. To
> >>> support swap on these devices, we need to ensure that writes do not
> >>> cross IU boundaries. So, I think this may require increasing the minimum
> >>> allocation size for swap users.
> >>
> >> How would we handle swapout/swapin when we have smaller pages (just imagine
> >> someone does a mmap(4KiB))?
> >
> > Swapout would need to be aligned to the IU. An mmap of 4 KiB would
> > have to perform an IU-sized write, e.g. 16 KiB or 32 KiB, to avoid any
> > potential RMW penalty. So, I think aligning the mmap allocation to the
> > IU would guarantee a write of the required granularity and alignment.
>
> We must be prepared to handle any VMA layout with single-page VMAs,
> single-page holes etc ... :/ IMHO we should try to handle this
> transparently to the application.
>
> > But let's also look at your suggestion below with swapcache.
> >
> > Swapin can still be performed at LBA format levels (e.g. 4 KiB) without
> > the same write penalty implications, and only affecting performance
> > if I/Os are not conformant to these boundaries. So, reading at IU
> > boundaries is preferred to get optimal performance, not a 'requirement'.
> >
> >>
> >> Could this be something that gets abstracted/handled by the swap
> >> implementation? (i.e., multiple small folios get added to the swapcache but
> >> get written out / read in as a single unit?).
> >
> > Do you mean merging like in the block layer? I'm not entirely sure if
> > this could guarantee deterministically the I/O boundaries the same way
> > it does min order large folio allocations in the page cache. But I guess
> > it is worth exploring as an optimization.
>
> Maybe the swapcache could somehow abstract that? We currently have the
> swap slot allocator, that assigns slots to pages.
>
> Assuming we have a 16 KiB BS but a 4 KiB page, we might have various
> options to explore.
>
> For example, we could size swap slots 16 KiB, and assign even 4 KiB
> pages a single slot. This would waste swap space with small folios, that
> would go away with large folios.

We can group multiple 4K swap entries into one 16K write unit.
There will be no waste of the SSD.

>
> If we stick to 4 KiB swap slots, maybe pageout() could be taught to
> effectively writeback "everything" residing in the relevant swap slots
> that span a BS?
>
> I recall there was a discussion about atomic writes involving multiple
> pages, and how it is hard. Maybe with swapping it is "easier"? Absolutely
> no expert on that, unfortunately. Hoping Chris has some ideas.

Yes, see my other email about the "virtual swapfile" idea.
More detailed write up coming next week.

Chris

>
> >
> >
> >>
> >> I recall that we have been talking about a better swap abstraction for years
> >> :)
> >
> > Adding Chris Li to the cc list in case he has more input.
> >
> >>
> >> Might be a good topic for LSF/MM (might or might not be a better place than
> >> the MM alignment session).
> >
> > Both options work for me. LSF/MM is in 12 weeks so, having a previous
> > session would be great.
>
> Both work for me.
>
> --
> Cheers,
>
> David / dhildenb
>
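
For illustration only, the idea of grouping several 4 KiB swap entries into
one 16 KiB write unit can be sketched as below. This is plain Python doing
the slot arithmetic, not kernel code: the 16 KiB IU, the constants and every
function name here are assumptions made up for the example, and nothing in it
reflects how the swap slot allocator or the "virtual swapfile" idea is
actually implemented.

```python
# Hypothetical sizes: 4 KiB swap slots on a device with a 16 KiB IU.
SLOT_SIZE = 4 * 1024
IU_SIZE = 16 * 1024
SLOTS_PER_IU = IU_SIZE // SLOT_SIZE  # 4 slots per write unit


def write_unit(slot):
    """Index of the IU-sized, IU-aligned write unit that contains a slot."""
    return slot // SLOTS_PER_IU


def group_slots(slots):
    """Group 4 KiB slot numbers by the write unit they fall into."""
    groups = {}
    for slot in sorted(set(slots)):
        groups.setdefault(write_unit(slot), []).append(slot)
    return groups


def full_units(slots):
    """Write units whose slots are all present, so the device sees one
    IU-aligned write with no read-modify-write."""
    return [unit for unit, members in group_slots(slots).items()
            if len(members) == SLOTS_PER_IU]
```

With slots 0-3, 5 and 9 pending, only unit 0 is complete; units 1 and 2 would
either have to wait for more entries or be padded out to the IU, which is the
trade-off being discussed in the thread.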