On Tue, Jan 7, 2025 at 4:29 AM Daniel Gomez <da.gomez@xxxxxxxxxxx> wrote:
>
> On Tue, Jan 07, 2025 at 11:31:05AM +0100, David Hildenbrand wrote:
> > On 07.01.25 10:43, Daniel Gomez wrote:
> > > Hi,
> >
> > Hi,
> >
> > >
> > > High-capacity SSDs require writes to be aligned with the drive's
> > > indirection unit (IU), which is typically >4 KiB, to avoid RMW. To
> > > support swap on these devices, we need to ensure that writes do not
> > > cross IU boundaries. So, I think this may require increasing the minimum
> > > allocation size for swap users.
> >
> > How would we handle swapout/swapin when we have smaller pages (just imagine
> > someone does a mmap(4KiB))?
>
> Swapout would require to be aligned to the IU. An mmap of 4 KiB would
> have to perform an IU KiB write, e.g. 16 KiB or 32 KiB, to avoid any
> potential RMW penalty. So, I think aligning the mmap allocation to the
> IU would guarantee a write of the required granularity and alignment.
> But let's also look at your suggestion below with swapcache.

I think only the writer needs to be grouped by IU size. Ideally the
swap front end doesn't have to know about the IU size. There are many
reasons why forcing the swap entry size on the swap cache would be
tricky. E.g., if the folio is 4K, it is tricky to force it to be 16K:
perhaps only one 4K page is cold while the other nearby pages are hot,
etc.

>
> Swapin can still be performed at LBA format levels (e.g. 4 KiB) without
> the same write penalty implications, and only affecting performance
> if I/Os are not conformant to these boundaries. So, reading at IU
> boundaries is preferred to get optimal performance, not a 'requirement'.
>
> >
> > Could this be something that gets abstracted/handled by the swap
> > implementation? (i.e., multiple small folios get added to the swapcache but
> > get written out / read in as a single unit?).

Yes.

>
> Do you mean merging like in the block layer? I'm not entirely sure if
> this could guarantee deterministically the I/O boundaries the same way
> it does min order large folio allocations in the page cache. But I guess
> is worth exploring as optimization.
>
> >
> > I recall that we have been talking about a better swap abstraction for years
> > :)
>
> Adding Chris Li to the cc list in case he has more input.

Sorry, I'm a bit late to the party. Yes, I do have some ideas I want to
propose as LSF/MM topics, maybe early next week. Here are some
highlights.

I think we need a separation between the swap cache and the backing IO
of the swap file. I call it the "virtual swapfile".
It is virtual in two aspects:
1) There is an up-front size at swapon time, but no up-front allocation
of the vmalloc array. The array grows as needed.
2) There is a virtual-to-physical swap entry mapping. The cost is 4
bytes per swap entry, but it will solve a lot of problems all
together.

IU size write grouping would be a good user of this virtual layer.
Another use case: if we want to write a compressed zswap/zram entry
into the SSD, we might actually encounter the size problem in the other
direction, e.g. writing swap entries smaller than 4K.

I am still working on the write-up. More details will come.

Chris

>
> >
> > Might be a good topic for LSF/MM (might or might not be a better place than
> > the MM alignment session).
>
> Both options work for me. LSF/MM is in 12 weeks so, having a previous
> session would be great.
>
> Daniel
>
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >
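
For readers following the thread, here is a rough userspace sketch of the two ideas discussed above: checking and rounding a write to IU boundaries, and a virtual-to-physical swap entry table at 4 bytes per entry that lets a writer pack unrelated 4K entries into one IU-sized physical group. Everything in it is an assumption made for illustration: the names (`iu_align_down`, `iu_align_up`, `iu_write_is_aligned`, `vswap_assign_iu_group`, `VSWAP_UNMAPPED`) are hypothetical, none of this is kernel API, and it is not the design proposed in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker for a virtual entry with no physical slot yet. */
#define VSWAP_UNMAPPED UINT32_MAX

/* All helpers assume the IU size is a non-zero power of two. */
static bool is_pow2(uint64_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Round a byte offset down to the start of its IU. */
static uint64_t iu_align_down(uint64_t off, uint64_t iu)
{
    assert(is_pow2(iu));
    return off & ~(iu - 1);
}

/* Round a byte offset up to the next IU boundary. */
static uint64_t iu_align_up(uint64_t off, uint64_t iu)
{
    assert(is_pow2(iu));
    return (off + iu - 1) & ~(iu - 1);
}

/* True when [off, off + len) covers only whole IUs (no RMW needed). */
static bool iu_write_is_aligned(uint64_t off, uint64_t len, uint64_t iu)
{
    assert(is_pow2(iu));
    return len != 0 && (off & (iu - 1)) == 0 && (len & (iu - 1)) == 0;
}

/*
 * Map a batch of virtual swap entries onto consecutive physical slots
 * that together form exactly one IU. v2p is the virtual-to-physical
 * table (one uint32_t, i.e. 4 bytes, per virtual entry). The virtual
 * entries need not be adjacent; only the physical placement is grouped.
 * Returns false if the batch does not fill exactly one aligned IU.
 */
static bool vswap_assign_iu_group(uint32_t *v2p, size_t nr_virt,
                                  const uint32_t *batch, uint32_t n,
                                  uint32_t first_phys_slot,
                                  uint32_t slots_per_iu)
{
    if (slots_per_iu == 0 || n != slots_per_iu)
        return false;
    if (first_phys_slot % slots_per_iu != 0)
        return false;
    for (uint32_t i = 0; i < n; i++)
        if (batch[i] >= nr_virt)
            return false;
    for (uint32_t i = 0; i < n; i++)
        v2p[batch[i]] = first_phys_slot + i;
    return true;
}
```

With a 16 KiB IU, a 4 KiB write at byte offset 20 KiB is not IU-aligned; rounding with the helpers above widens it to the range [16 KiB, 32 KiB). With 4 KiB slots, one 16 KiB IU holds four slots, so a writer could collect any four cold virtual entries and place them in physical slots 8..11 as a single IU-sized write, without the swap cache side knowing about the IU.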