Re: Project idea: Swap to zoned block devices

Hannes Reinecke <hare@xxxxxxx> · Tue, 15 Oct 2019 17:22:34 +0200



On 10/15/19 5:09 PM, Matthew Wilcox wrote:
> On Tue, Oct 15, 2019 at 03:48:47PM +0200, Hannes Reinecke wrote:
>> On 10/15/19 1:35 PM, Matthew Wilcox wrote:
>>> On Tue, Oct 15, 2019 at 01:38:27PM +0900, Naohiro Aota wrote:
>>>> A zoned block device consists of a number of zones. Zones are
>>>> either conventional and accepting random writes or sequential and
>>>> requiring that writes be issued in LBA order from each zone write
>>>> pointer position. For the write restriction, zoned block devices are
>>>> not suitable for a swap device. Disallow swapon on them.
>>>
>>> That's unfortunate.  I wonder what it would take to make the swap code be
>>> suitable for zoned devices.  It might even perform better on conventional
>>> drives since swapout would be a large linear write.  Swapin would be a
>>> fragmented, seeky set of reads, but this would seem like an excellent
>>> university project.
>>
>> The main problem I'm seeing is the eviction of pages from swap.
>> While swapin is easy (as you can do random access on reads), evicting
>> pages from cache becomes extremely tricky as you can only delete entire
>> zones. So how do we mark pages within zones as being stale?
>> Or can we modify the swapin code to always swap in an entire zone and
>> discard it immediately?
> 
> I thought zones were too big to swap in all at once?  What's a typical
> zone size these days?  (the answer looks very different if a zone is 1MB
> or if it's 1GB)
> 
Currently things have settled at 256MB, though that might be increased
for ZNS. But I'd assume the GB range would be the upper limit.
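
For a sense of scale, here is a quick back-of-the-envelope calculation.
It assumes 4 KiB pages (my assumption; adjust for your architecture) and
only shows how many swap slots a single zone would hold:

```python
# Back-of-the-envelope: how many swap pages fit in one zone?
PAGE_SIZE = 4 * 1024   # assumption: 4 KiB pages
MiB = 1024 * 1024


def pages_per_zone(zone_bytes, page_size=PAGE_SIZE):
    """Number of whole pages that fit in a zone of zone_bytes bytes."""
    return zone_bytes // page_size


for label, size in (("1MB", 1 * MiB), ("256MB", 256 * MiB), ("1GB", 1024 * MiB)):
    print(f"{label:>6} zone: {pages_per_zone(size):>7} pages")
```

With these assumptions a 256MB zone holds 65536 pages, so swapping in a
whole zone at once means touching that many page frames in one go.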

> Fundamentally an allocated anonymous page has 5 states:
> 
> A: In memory, not written to swap (allocated)
> B: In memory, dirty, not written to swap (app modifies page)
> C: In memory, clean, written to swap (kernel decides to write it)
> D: Not in memory, written to swap (kernel decides to reuse the memory)
> E: In memory, clean, written to swap (app faults it back in for read)
> 
> We currently have a sixth state which is a page that has previously been
> written to swap but has been redirtied by the app.  It will be written
> back to the allocated location the next time it's targeted for writeout.
> 
> That would have to change; since we can't do random writes, pages would
> transition from states D or E back to B.  Swapping out a page that has
> previously been swapped will now mean appending to the tail of the swap,
> not writing in place.
> 
> So the swap code will now need to keep track of which pages are still
> in use in storage and will need to be relocated once we decide to reuse
> the zone.  Not an insurmountable task, but not entirely trivial.
> 
Precisely my worries.
However, clearing a zone is _really_ fast (you just have to reset the
write pointer, which is kept in the device's NVRAM). Which might help a bit.
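
To make the bookkeeping concrete, here is a rough sketch in Python. It is
a toy model, not kernel code: the class and method names are made up for
illustration, and it assumes at least one empty zone is always kept in
reserve so that relocation has somewhere to go. It shows appending at the
write pointer, marking the old copy stale on re-swapout, and relocating
the remaining live pages before a zone reset:

```python
class ZonedSwap:
    """Toy model of append-only swap on sequential zones (hypothetical names)."""

    def __init__(self, nr_zones, zone_pages):
        self.zone_pages = zone_pages
        self.wp = [0] * nr_zones                      # per-zone write pointer
        self.live = [set() for _ in range(nr_zones)]  # live slot offsets per zone
        self.where = {}                               # page id -> (zone, offset)
        self.cur = 0                                  # zone currently being filled

    def _pick_empty_zone(self):
        # An empty zone is one whose write pointer sits at the zone start.
        for zone, wp in enumerate(self.wp):
            if wp == 0:
                return zone
        raise RuntimeError("no empty zone left; reclaim one first")

    def invalidate(self, page):
        """Mark the on-device copy of a page stale (e.g. the page was redirtied)."""
        loc = self.where.pop(page, None)
        if loc is not None:
            zone, off = loc
            self.live[zone].discard(off)

    def swap_out(self, page):
        """Append the page at the write pointer; never rewrite in place."""
        if self.wp[self.cur] == self.zone_pages:
            self.cur = self._pick_empty_zone()
        self.invalidate(page)          # any older copy is now stale
        zone, off = self.cur, self.wp[self.cur]
        self.wp[zone] += 1
        self.live[zone].add(off)
        self.where[page] = (zone, off)
        return zone, off

    def reclaim(self, zone):
        """Relocate live pages out of a zone, then reset its write pointer."""
        if zone == self.cur:
            raise ValueError("cannot reclaim the zone being written")
        victims = [p for p, (z, _) in self.where.items() if z == zone]
        for page in victims:
            self.swap_out(page)        # re-append elsewhere
        self.live[zone].clear()
        self.wp[zone] = 0              # models the (fast) zone reset
        return len(victims)
```

The cost of reusing a zone is then the number of live pages returned by
reclaim(), while the reset itself is just the pointer update.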

> There'd be some other gunk to deal with around handling badblocks.
> Those are currently stored in page 1, so adding new ones would be
> a rewrite of that block.
> 
Bah. Can't we make that optional?
We really only need badblocks when writing to crappy media (or NV-DIMM
:-). Zoned devices _will_ have proper error recovery in place, so the
only time where badblocks might be used is when the device is
essentially dead ;-)

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      Teamlead Storage & Networking
hare@xxxxxxx			                  +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 247165 (AG München), GF: Felix Imendörffer