Re: Swap Min Order

David Hildenbrand <david@xxxxxxxxxx> · Mon, 20 Jan 2025 13:02:24 +0100

On 16.01.25 09:38, Chris Li wrote:
On Wed, Jan 8, 2025 at 1:24 PM David Hildenbrand <david@xxxxxxxxxx> wrote:

On 08.01.25 22:19, Chris Li wrote:
On Wed, Jan 8, 2025 at 12:36 PM David Hildenbrand <david@xxxxxxxxxx> wrote:

Maybe the swapcache could somehow abstract that? We currently have the swap
slot allocator, that assigns slots to pages.

Assuming we have a 16 KiB BS but a 4 KiB page, we might have various options
to explore.

For example, we could size swap slots at 16 KiB, and assign even 4 KiB
pages a single slot. This would waste swap space with small folios, but
that waste would go away with large folios.

So batching order-0 folios in bigger slots that match the FS BS (e.g. 16
KiB) to perform disk writes, right?

Batching might be one idea, but the first idea I raised here would be
that the swap slot size will match the BS (e.g., 16 KiB) and contain at
most one folio.

So an order-0 folio would get a single slot assigned and effectively
"waste" 12 KiB of disk space.
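
The arithmetic behind the "12 KiB" figure can be sketched as follows. This
is only an illustration: slot_waste() and PAGE_SIZE_BYTES are hypothetical
names, not kernel symbols, and the sketch assumes a 4 KiB base page and
that a folio larger than one slot simply takes a whole number of slots.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_BYTES 4096u /* assumed 4 KiB base page */

/*
 * Bytes left unused on disk when a folio of the given order is stored
 * in whole slots of slot_size bytes (hypothetical helper).
 */
static size_t slot_waste(size_t slot_size, unsigned int folio_order)
{
	size_t folio_size = (size_t)PAGE_SIZE_BYTES << folio_order;
	size_t slots;

	assert(slot_size != 0);
	slots = (folio_size + slot_size - 1) / slot_size; /* round up */
	return slots * slot_size - folio_size;
}
```

With a 16 KiB slot, an order-0 folio leaves 12 KiB unused, an order-1
folio leaves 8 KiB, and an order-2 folio fills the slot exactly.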

I prefer not to "waste" that. It will be wasted on the write
amplification as well.

If it can be implemented fairly easily, sure! :)

Looking forward to hearing about the proposal!

Hi David,

Hi!

Sorry, I have been pretty busy with other work-related stuff recently.

I'm in a similar situation :D

I did not have a chance to do the write up yet.
I might not be able to make the next Wednesday upstream alignment
meeting for this topic.

Adding Kairui to the CC list; I have been collaborating with him on the
swap-related changes.

Is this similar to

https://lkml.kernel.org/r/20250116092254.204549-1-nphamcs@xxxxxxxxx

?

I do see it is beneficial to separate the swap cache part of the
swap entries (virtual) from the block layer write locations (physical).
So the current swap allocator allocates the virtual swap entry and
still keeps swap entries contiguous within a folio. The
virtual swap entry also owns the current swap count and swap cache
reclaim.

Right.

We would have a lookup array to translate the virtual entry to the
physical location. The physical location also needs an allocator, but a
much simpler one. Physical location allocation does not participate in
swap cache reclaim; that happens at the virtual entry. Nor does it carry
a swap count, only 1 bit of information: used or not. The physical entry
allocation does not need to be contiguous within the folio either.

Agreed.
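
A minimal userspace sketch of the lookup array plus 1-bit physical
allocator described above might look like this. None of these names
(swap_redirect, redirect_map, phys_alloc, NR_VIRT, NR_PHYS, PHYS_NONE)
exist in the kernel; they are made up for illustration, the table sizes
are arbitrary, and locking, swap counts and reclaim are left out entirely.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NR_VIRT   64u        /* hypothetical number of virtual swap entries */
#define NR_PHYS   64u        /* hypothetical number of physical locations */
#define PHYS_NONE UINT32_MAX /* "no physical location assigned" */

struct swap_redirect {
	uint32_t virt_to_phys[NR_VIRT];  /* 4 bytes per virtual entry */
	uint8_t  phys_used[NR_PHYS / 8]; /* 1 bit per physical location */
};

static void redirect_init(struct swap_redirect *r)
{
	for (uint32_t v = 0; v < NR_VIRT; v++)
		r->virt_to_phys[v] = PHYS_NONE;
	memset(r->phys_used, 0, sizeof(r->phys_used));
}

/* First-fit physical allocator: only tracks used / not used. */
static uint32_t phys_alloc(struct swap_redirect *r)
{
	for (uint32_t p = 0; p < NR_PHYS; p++) {
		uint8_t mask = (uint8_t)(1u << (p % 8));

		if (!(r->phys_used[p / 8] & mask)) {
			r->phys_used[p / 8] |= mask;
			return p;
		}
	}
	return PHYS_NONE;
}

static void phys_free(struct swap_redirect *r, uint32_t p)
{
	assert(p < NR_PHYS);
	r->phys_used[p / 8] &= (uint8_t)~(1u << (p % 8));
}

/* Give one virtual entry a physical location; 0 on success, -1 if full. */
static int redirect_map(struct swap_redirect *r, uint32_t virt)
{
	uint32_t p;

	assert(virt < NR_VIRT);
	p = phys_alloc(r);
	if (p == PHYS_NONE)
		return -1;
	r->virt_to_phys[virt] = p;
	return 0;
}
```

Mapping four contiguous virtual entries after a hole has opened up in the
physical space yields physical locations that are not contiguous, which
is the property the paragraph above relies on.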

This redirection layer will provide the flexibility to do more, e.g.
bridge the block size gap between the virtual entry and the
physical entry. It can provide the IO batching layer to merge more
than one virtual swap entry into a larger physical writing block.
Similarly it can allow swap to write out compressed zswap/zram into
the SSD, using similar IO batching.
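
One way to picture the batching idea is the sketch below. It is not taken
from any posted patch: write_batch, batch_add and ENTRIES_PER_BLOCK are
hypothetical names, and the sketch assumes a 16 KiB physical block made up
of four 4 KiB virtual entries. The actual IO submission is only marked by
a comment.

```c
#include <assert.h>
#include <stdint.h>

#define ENTRIES_PER_BLOCK 4u /* assumed: 16 KiB block / 4 KiB entry */

struct write_batch {
	uint32_t virt[ENTRIES_PER_BLOCK]; /* virtual entries waiting for IO */
	uint32_t count;                   /* how many are queued */
	uint32_t next_block;              /* next physical block to write */
};

/*
 * Queue one virtual entry. Returns the physical block number used once
 * the batch fills up, or -1 while the batch is still pending.
 */
static int64_t batch_add(struct write_batch *b, uint32_t virt)
{
	assert(b->count < ENTRIES_PER_BLOCK);
	b->virt[b->count++] = virt;
	if (b->count < ENTRIES_PER_BLOCK)
		return -1;

	/* A real implementation would submit one block-sized write here. */
	b->count = 0;
	return b->next_block++;
}
```

In this sketch four order-0 entries end up sharing one block-sized write,
rather than each entry consuming a full block on its own.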

The memory overhead is 4 bytes per swap entry for the lookup table,
plus maybe 1 bit per physical entry to record whether that location is
used or not.
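
To put rough numbers on that overhead, here is a small calculation sketch.
redirect_overhead() is a hypothetical helper, and it assumes the number of
physical entries equals the number of virtual entries, which the proposal
does not require.

```c
#include <assert.h>
#include <stdint.h>

struct overhead {
	uint64_t table_bytes;  /* lookup table: 4 bytes per swap entry */
	uint64_t bitmap_bytes; /* used bitmap: 1 bit per physical entry */
};

/* Metadata size for a swap area of swap_bytes split into entry_bytes units. */
static struct overhead redirect_overhead(uint64_t swap_bytes,
					 uint64_t entry_bytes)
{
	struct overhead o;
	uint64_t entries;

	assert(entry_bytes != 0);
	entries = swap_bytes / entry_bytes;
	o.table_bytes = entries * 4;
	o.bitmap_bytes = (entries + 7) / 8; /* round up to whole bytes */
	return o;
}
```

For 1 GiB of swap with 4 KiB entries this gives a 1 MiB lookup table and a
32 KiB bitmap, i.e. roughly 0.1% of the swap size.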

That is the key part of the idea.

Okay, rings a bell, I think that was raised in some form in the past.

Other ideas, such as dynamically growing the vmalloc array pages, can
be viewed as incremental local improvements; they do not change the
core data structure of swap much.

Interesting, thanks!

--
Cheers,

David / dhildenb