On 14 Mar 2024, at 5:03, Jan Kara wrote:

> On Fri 08-03-24 05:17:46, Barry Song wrote:
>> On Fri, Mar 8, 2024 at 5:06 AM Jared Hulbert <jaredeh@xxxxxxxxx> wrote:
>>>
>>> On Thu, Mar 7, 2024 at 9:35 AM Jan Kara <jack@xxxxxxx> wrote:
>>>>
>>>> Well, but then if you fill in space of a particular order and need to swap
>>>> out a page of that order what do you do? Return ENOSPC prematurely?
>>>>
>>>> Frankly, as I'm reading the discussions here, it seems to me you are trying
>>>> to reinvent a lot of things from the filesystem space :) Like block
>>>> allocation with reasonably efficient fragmentation prevention, transparent
>>>> data compression (zswap), hierarchical storage management (i.e., moving
>>>> data between different backing stores), an efficient way to get from
>>>> VMA+offset to the place on disk where the content is stored. Sure, you
>>>> still don't need a lot of things modern filesystems do, like permissions,
>>>> directory structure (or even more complex namespacing stuff), all the
>>>> stuff achieving fs consistency after a crash, etc. But still, what you
>>>> need is a notable portion of what filesystems do.
>>>>
>>>> So maybe it would be time to implement swap as a proper filesystem? Or,
>>>> even better, we could think about factoring these bits out of some
>>>> existing filesystem to share code?
>>>
>>> Yes. Thank you. I've been struggling to communicate this.
>>>
>>> I'm thinking you can just use existing filesystems as a first step
>>> with a modest glue layer. See the branch of this thread where I'm
>>> babbling on to Chris about this.
>>>
>>> "efficient way to get from VMA+offset to the place on disk where the
>>> content is stored"
>>> You mean treat swapped pages as if they were mmap'ed files and use the
>>> same code paths? How big a project is that? It seems either
>>> deceptively easy or really hard... I've been away too long and was
>>> never really good enough to have a clear vision of the scale.
>>
>> I don't understand why we need this level of complexity. All we need to
>> know are the offsets during pageout. After that, the large folio is
>> destroyed, and all offsets are stored in page table entries (PTEs) or the
>> xarray. Swap-in doesn't depend on a complex filesystem; it can make its
>> own decision on how to swap in based on the values it reads from PTEs.
>
> Well, but once compression chimes in (like with zswap) or if you need to
> perform compaction on swap space and move swapped-out data, things aren't
> that simple anymore, are they? So as I was reading this thread I had the
> impression that swap complexity is coming close to the complexity of a
> (relatively simple) filesystem, so I was brainstorming about the
> possibility of sharing some code between filesystems and swap...

I think all the complexity comes from the fact that we want to preserve
folios as a whole and thus need to handle fragmentation issues. Barry's
approach is trying to get us away from that. The downside is what you
mentioned about compression, since 64KB should give a better compression
ratio than 4KB.

For swap without compression, we can probably use Barry's approach to keep
everything simple and just split all folios when they go into swap, though I
am not sure whether there is a disk throughput loss. For zswap, there will be
a design tradeoff between better compression ratio and complexity.

Best Regards,
Yan, Zi
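The compression-ratio tradeoff mentioned above can be sketched with a quick userspace experiment. This is only an illustration, not kernel code: plain zlib stands in for whatever compressor zswap is configured with, and the sample data is made up; real ratios depend on the compressor and the actual page contents. The point it demonstrates is that sixteen independently compressed 4KB units each pay their own header overhead and cannot share match history, so their combined size is typically larger than one 64KB unit compressed whole.

```python
import zlib

# Hypothetical illustration: compress a 64KB region once (as zswap could
# for a whole 64KB folio) vs. as sixteen independent 4KB units (as it
# would after the folio is split into individual pages).
data = (b"swap out this page; swap in that page; " * 2048)[:64 * 1024]

whole = len(zlib.compress(data))
chunked = sum(len(zlib.compress(data[i:i + 4096]))
              for i in range(0, len(data), 4096))

print(f"64KB compressed whole:     {whole} bytes")
print(f"64KB compressed as 16x4KB: {chunked} bytes")
# Each 4KB unit carries its own stream overhead and cannot reference
# matches in the other units, so the chunked total comes out larger.
```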