On Feb 22, 2024, at 3:45 PM, Chris Li <chrisl@xxxxxxxxxx> wrote:
>
> Hi David,
>
> On Fri, Feb 2, 2024 at 1:10 AM David Howells <dhowells@xxxxxxxxxx> wrote:
>>
>> Hi,
>>
>> The topic came up in a recent discussion about how to deal with large folios
>> when it comes to swap as a swap device is normally considered a simple array
>> of PAGE_SIZE-sized elements that can be indexed by a single integer.
>
> Sorry for being late for the party. I think I was the one that brought
> this topic up in the online discussion with Will and You. Let me know
> if you are referring to a different discussion.
>
>>
>> With the advent of large folios, however, we might need to change this in
>> order to be better able to swap out a compound page efficiently. Swap
>> fragmentation raises its head, as does the need to potentially save multiple
>> indices per folio. Does swap need to grow more filesystem features?
>
> Yes, with a large folio, it is harder to allocate continuous swap
> entries where 4K swap entries are allocated and free all the time. The
> fragmentation will likely make the swap file have very little
> continuous swap entries.

One option would be to reuse the multi-block allocator (mballoc) from
ext4, which has quite efficient power-of-two buddy allocation. That
would naturally aggregate contiguous pages as they are freed. Since the
swap partition does not contain anything useful across a remount, there
is no need to save allocation bitmaps persistently.

Cheers, Andreas

> We can change that assumption, allow large folio reading and writing
> of discontinued blocks on the block device level. We will likely need
> a file system like kind of the indirection layer to store the location
> of those blocks. In other words, the folio needs to read/write a list
> of io vectors, not just one block.
>
>>
>> Further to this, we have at least two ways to cache data on disk/flash/etc. -
>> swap and fscache - and both want to set aside disk space for their operation.
>> Might it be possible to combine the two?
>>
>> One thing I want to look at for fscache is the possibility of switching from a
>> file-per-object-based approach to a tagged cache more akin to the way OpenAFS
>> does things. In OpenAFS, you have a whole bunch of small files, each
>> containing a single block (e.g. 256K) of data, and an index that maps a
>> particular {volume,file,version,block} to one of these files in the cache.
>>
>> Now, I could also consider holding all the data blocks in a single file (or
>> blockdev) - and this might work for swap. For fscache, I do, however, need to
>> have some sort of integrity across reboots that swap does not require.
>
> The main trade off is the memory usage for the meta data and latency
> of reading and writing.
> The file system has typically a different IO pattern than swap, e.g.
> file reads can be batched and have good locality.
> Where swap is a lot of random location read/write.
>
> Current swap using array like swap entry, one of the pros of that is
> just one IO required for one folio.
> The performance gets worse when swap needs to read the metadata first
> to locate the block, then read the block of data in.
> Page fault latency will get longer. That is one of the trade-offs we
> need to consider.
>
> Chris
>

Cheers, Andreas
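
To illustrate the power-of-two buddy idea mentioned above for swap slots, here is a toy sketch in Python. It is not mballoc and not kernel code: the class name `BuddySwapAllocator`, its methods, and the slot counts are hypothetical and exist only for this example. It assumes the free lists live purely in memory, since nothing in swap has to survive a reboot.

```python
class BuddySwapAllocator:
    """Toy buddy allocator over swap slot indices (illustrative only)."""

    def __init__(self, total_slots, max_order):
        block = 1 << max_order
        if total_slots % block:
            raise ValueError("total_slots must be a multiple of 2**max_order")
        self.max_order = max_order
        # free[order] holds the first slot of each free run of 2**order slots.
        # Nothing here is persisted; the state is rebuilt at swapon time.
        self.free = [set() for _ in range(max_order + 1)]
        self.free[max_order].update(range(0, total_slots, block))

    def alloc(self, order):
        """Return the first slot of an aligned run of 2**order slots, or None."""
        for o in range(order, self.max_order + 1):
            if self.free[o]:
                start = min(self.free[o])
                self.free[o].remove(start)
                # Split the larger run, handing the upper halves back.
                while o > order:
                    o -= 1
                    self.free[o].add(start + (1 << o))
                return start
        return None  # no contiguous run of that size is available

    def release(self, start, order):
        """Free a run and merge it with its buddy for as long as possible."""
        while order < self.max_order:
            buddy = start ^ (1 << order)
            if buddy not in self.free[order]:
                break
            self.free[order].remove(buddy)
            start = min(start, buddy)
            order += 1
        self.free[order].add(start)
```

With 16 slots all handed out as single 4K entries, freeing only the odd-numbered slots leaves no order-1 run available, which is the fragmentation case Chris describes. Once the even-numbered slots are freed as well, the buddies merge back into one order-4 run without any explicit defragmentation pass.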