On 10/15/19 5:09 PM, Matthew Wilcox wrote:
> On Tue, Oct 15, 2019 at 03:48:47PM +0200, Hannes Reinecke wrote:
>> On 10/15/19 1:35 PM, Matthew Wilcox wrote:
>>> On Tue, Oct 15, 2019 at 01:38:27PM +0900, Naohiro Aota wrote:
>>>> A zoned block device consists of a number of zones. Zones are
>>>> either conventional, accepting random writes, or sequential,
>>>> requiring that writes be issued in LBA order from each zone's write
>>>> pointer position. Due to this write restriction, zoned block devices
>>>> are not suitable for use as swap devices. Disallow swapon on them.
>>>
>>> That's unfortunate. I wonder what it would take to make the swap code
>>> suitable for zoned devices. It might even perform better on conventional
>>> drives since swapout would be a large linear write. Swapin would be a
>>> fragmented, seeky set of reads, but this would seem like an excellent
>>> university project.
>>
>> The main problem I'm seeing is the eviction of pages from swap.
>> While swapin is easy (as you can do random access on reads), evicting
>> pages from the cache becomes extremely tricky, as you can only delete
>> entire zones.
>> So how do we mark pages within zones as being stale?
>> Or can we modify the swapin code to always swap in an entire zone and
>> discard it immediately?
>
> I thought zones were too big to swap in all at once? What's a typical
> zone size these days? (the answer looks very different if a zone is 1MB
> or if it's 1GB)
>
Currently things have settled at 256MB; that might be increased for ZNS.
But 1GB would be the upper limit, I'd assume.

> Fundamentally an allocated anonymous page has 5 states:
>
> A: In memory, not written to swap (allocated)
> B: In memory, dirty, not written to swap (app modifies page)
> C: In memory, clean, written to swap (kernel decides to write it)
> D: Not in memory, written to swap (kernel decides to reuse the memory)
> E: In memory, clean, written to swap (app faults it back in for read)
>
> We currently have a sixth state, which is a page that has previously been
> written to swap but has been redirtied by the app. It will be written
> back to the allocated location the next time it's targeted for writeout.
>
> That would have to change; since we can't do random writes, pages would
> transition from states D or E back to B. Swapping out a page that has
> previously been swapped will now mean appending to the tail of the swap,
> not writing in place.
>
> So the swap code will now need to keep track of which pages are still
> in use in storage and will need to relocate them once we decide to reuse
> the zone. Not an insurmountable task, but not entirely trivial.
>
Precisely my worries.
However, clearing stuff is _really_ fast (you just have to reset the
write pointer, which is kept in NVRAM on the device). Which might help
a bit.

> There'd be some other gunk to deal with around handling badblocks.
> Those are currently stored in page 1, so adding new ones would be
> a rewrite of that block.
>
Bah. Can't we make that optional?
We really only need badblocks when writing to crappy media (or NV-DIMMs
:-). Zoned devices _will_ have proper error recovery in place, so the
only time badblocks might be used is when the device is essentially
dead ;-)

Cheers,

Hannes
--
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@xxxxxxx                   +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 247165 (AG München), GF: Felix Imendörffer
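
[Editor's note: to make the log-structured swap idea discussed above more
concrete, here is a minimal user-space toy model. It is not kernel code and
none of the names (zswap_append, zswap_invalidate, the zone geometry) come
from the thread or the kernel; they are invented purely for illustration.
It tries to capture two points from the discussion: a redirtied page (state
D or E going back to B) gets a new slot appended at the current write
pointer instead of overwriting its old slot, and each zone tracks how many
slots still back a live swapped page so the whole zone can be reset and
reused once that count drops to zero.]

#include <assert.h>
#include <stdio.h>

#define NR_ZONES        4
#define SLOTS_PER_ZONE  8       /* stand-in for (zone size / page size) */

struct zone {
        unsigned int wp;        /* write pointer, in slots from zone start */
        unsigned int live;      /* slots whose data is still needed */
};

static struct zone zones[NR_ZONES];
static unsigned int cur_zone;   /* zone currently being appended to */

/* Allocate the next sequential slot; returns a global slot number or -1. */
static int zswap_append(void)
{
        struct zone *z = &zones[cur_zone];

        if (z->wp == SLOTS_PER_ZONE) {
                /* Zone full: move on.  A real implementation would pick a
                 * zone, relocate any live slots and reset it; the toy just
                 * advances and gives up if the next zone isn't empty. */
                cur_zone = (cur_zone + 1) % NR_ZONES;
                z = &zones[cur_zone];
                if (z->live || z->wp)
                        return -1;
        }
        z->live++;
        return cur_zone * SLOTS_PER_ZONE + z->wp++;
}

/* Called when a swapped page is freed or redirtied: its old copy is stale. */
static void zswap_invalidate(int slot)
{
        struct zone *z = &zones[slot / SLOTS_PER_ZONE];

        assert(z->live > 0);
        if (--z->live == 0 && z->wp == SLOTS_PER_ZONE) {
                /* Every slot in the zone is stale: reset and reuse it.  On a
                 * real device this is a zone reset (write pointer rewind),
                 * which the thread notes is very cheap. */
                z->wp = 0;
                printf("zone %d reset\n", slot / SLOTS_PER_ZONE);
        }
}

int main(void)
{
        /* Page written out (C/D), then redirtied (back to B), then written
         * out again: the second write-out appends a new slot and the first
         * copy is invalidated rather than overwritten in place. */
        int first = zswap_append();
        zswap_invalidate(first);        /* app redirtied the page */
        int second = zswap_append();
        printf("slots used: %d then %d\n", first, second);
        return 0;
}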
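
[Editor's note: for reference on the badblocks point, the on-disk swap
header keeps the bad-page list inline in the header page, roughly as below
(paraphrased from include/linux/swap.h; check the tree for the exact layout
in any given kernel version). Because badpages[] lives inside that page,
recording a new bad block means rewriting the page in place, which a
sequential zone does not allow.]

union swap_header {
        struct {
                char reserved[PAGE_SIZE - 10];
                char magic[10];                 /* SWAP-SPACE or SWAPSPACE2 */
        } magic;
        struct {
                char            bootbits[1024]; /* Space for disklabel etc. */
                __u32           version;
                __u32           last_page;
                __u32           nr_badpages;
                unsigned char   sws_uuid[16];
                unsigned char   sws_volume[16];
                __u32           padding[117];
                __u32           badpages[1];
        } info;
};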