Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations

Hannes Reinecke <hare@xxxxxxx> · Sun, 5 Mar 2023 12:22:15 +0100

On 3/4/23 18:54, Matthew Wilcox wrote:
On Sat, Mar 04, 2023 at 06:17:35PM +0100, Hannes Reinecke wrote:
On 3/4/23 17:47, Matthew Wilcox wrote:
On Sat, Mar 04, 2023 at 12:08:36PM +0100, Hannes Reinecke wrote:
We could implement a (virtual) zoned device, and expose each zone as a
block. That gives us the required large block characteristics, and with
a bit of luck we might be able to dial up to really large block sizes
like the 256M sizes on current SMR drives.
ublk might be a good starting point.

Ummmm.  Is supporting 256MB block sizes really a desired goal?  I suggest
that is far past the knee of the curve; if we can only write 256MB chunks
as a single entity, we're looking more at a filesystem redesign than we
are at making filesystems and the MM support 256MB size blocks.

Naa, not really. It _would_ be cool as we could get rid of all the kludges
which we have nowadays re sequential writes.
And, remember, 256M is just a number someone thought to be a good
compromise. If we end up with a lower number (16M?) we might be able
to convince the powers that be to change their zone size.
Heck, with 16M block size there wouldn't be a _need_ for zones in
the first place.

But yeah, 256M is excessive. Initially I would shoot for something
like 2M.

I think we're talking about different things (probably different storage
vendors want different things, or even different people at the same
storage vendor want different things).

Luis and I are talking about larger LBA sizes.  That is, the minimum
read/write size from the block device is 16kB or 64kB or whatever.
In this scenario, the minimum amount of space occupied by a file goes
up from 512 bytes or 4kB to 64kB.  That's doable, even if somewhat
suboptimal.
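
To put numbers on the minimum-footprint point above, here is a minimal sketch. It assumes every non-empty file occupies a whole number of blocks and ignores filesystem features such as inline data or tail packing; the function name `allocated_bytes` is made up for illustration.

```python
def allocated_bytes(file_size: int, block_size: int) -> int:
    """Space a file occupies when rounded up to whole blocks.

    Assumption: a file always takes at least one block.
    """
    blocks = max(1, -(-file_size // block_size))  # ceiling division
    return blocks * block_size


# A 100-byte file under the three block sizes mentioned in the thread.
for block_size in (512, 4096, 65536):
    print(block_size, allocated_bytes(100, block_size))
```

Under these assumptions a 100-byte file costs 512 bytes, 4kB or 64kB depending only on the LBA size.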

And so do I. One can view zones as really large LBAs.

Indeed it might be suboptimal from the OS point of view.
But from the device point of view it won't be.
And, in fact, with devices becoming faster and faster the question is
whether sticking with relatively small sectors won't become a limiting
factor eventually.

Your concern seems to be more around shingled devices (or their equivalent
in SSD terms) where there are large zones which are append-only, but
you can still random-read 512 byte LBAs.  I think there are different
solutions to these problems, and people are working on both of these
problems.

My point being that zones are just there because the I/O stack can only
deal with sectors up to 4k. If the I/O stack were capable of dealing
with larger LBAs one could identify a zone with an LBA, and the entire
issue of append-only and sequential writes would be moot.
Even the entire concept of zones becomes irrelevant as the OS would
trivially only write entire zones.
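
As a rough illustration of the gap between today's sectors and a zone-sized LBA, the sketch below counts how many sectors make up one zone. The sizes are the ones floated in this thread, not values read from any device, and `sectors_per_zone` is a hypothetical helper.

```python
MiB = 1024 * 1024


def sectors_per_zone(zone_size: int, sector_size: int) -> int:
    """Number of sectors that one zone spans (sizes in bytes)."""
    if zone_size % sector_size:
        raise ValueError("zone size must be a multiple of the sector size")
    return zone_size // sector_size


# Zone sizes discussed in the thread, against a 4k sector.
for zone_mib in (2, 16, 256):
    print(f"{zone_mib}M zone: {sectors_per_zone(zone_mib * MiB, 4096)} sectors of 4k")
```

If a zone were addressed as a single LBA, each of those counts would collapse to one.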

But if storage vendors are really pushing for 256MB LBAs, then that's
going to need a third kind of solution, and I'm not aware of anyone
working on that.

What I was saying is that 256M is not set in stone. It's just a
compromise vendors used. Even if in the course of development we arrive
at a lower maximum LBA size we can handle (say, 2MB) I am pretty
sure vendors will be quite interested in that.

Cheers,

Hannes
--
Dr. Hannes Reinecke                Kernel Storage Architect
hare@xxxxxxx                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Ivo Totev, Andrew
Myers, Andrew McDonald, Martje Boudien Moerman