On Sat, Mar 04, 2023 at 05:54:38PM +0000, Matthew Wilcox wrote:
> On Sat, Mar 04, 2023 at 06:17:35PM +0100, Hannes Reinecke wrote:
> > On 3/4/23 17:47, Matthew Wilcox wrote:
> > > On Sat, Mar 04, 2023 at 12:08:36PM +0100, Hannes Reinecke wrote:
> > > > We could implement a (virtual) zoned device, and expose each zone as a
> > > > block. That gives us the required large block characteristics, and with
> > > > a bit of luck we might be able to dial up to really large block sizes
> > > > like the 256M sizes on current SMR drives.
> > > > ublk might be a good starting point.
> > >
> > > Ummmm. Is supporting 256MB block sizes really a desired goal? I suggest
> > > that is far past the knee of the curve; if we can only write 256MB chunks
> > > as a single entity, we're looking more at a filesystem redesign than we
> > > are at making filesystems and the MM support 256MB size blocks.
> > >
> > Naa, not really. It _would_ be cool as we could get rid of all the kludges
> > which we have nowadays re sequential writes.
> > And, remember, 256M is just a number someone thought to be a good
> > compromise. If we end up with a lower number (16M?) we might be able
> > to convince the powers that be to change their zone size.
> > Heck, with 16M block size there wouldn't be a _need_ for zones in
> > the first place.
> >
> > But yeah, 256M is excessive. Initially I would shoot for something
> > like 2M.
>
> I think we're talking about different things (probably different storage
> vendors want different things, or even different people at the same
> storage vendor want different things).
>
> Luis and I are talking about larger LBA sizes. That is, the minimum
> read/write size from the block device is 16kB or 64kB or whatever.
> In this scenario, the minimum amount of space occupied by a file goes
> up from 512 bytes or 4kB to 64kB. That's doable, even if somewhat
> suboptimal.

Yes.

> Your concern seems to be more around shingled devices (or their equivalent
> in SSD terms) where there are large zones which are append-only, but
> you can still random-read 512 byte LBAs. I think there are different
> solutions to these problems, and people are working on both of these
> problems.
>
> But if storage vendors are really pushing for 256MB LBAs, then that's
> going to need a third kind of solution, and I'm not aware of anyone
> working on that.

Hannes had replied to my suggestion about a way to *optimally*
*virtualize* a real storage controller with a larger LBA. In that
thread I was hinting at avoiding cache=passthrough on the hypervisor
and instead using something like cache=writeback, or even cache=unsafe
for experimentation, with virtio-blk-pci. For a more elaborate
description of these see [0], but the skinny is that cache=writeback
and cache=unsafe rely on the host page cache, while cache=passthrough
goes to the host storage controller. The latency overhead incurred by
whatever we use to replicate larger LBAs should be mitigated, so I
don't think a zoned storage device would be a good fit for it.

I was asking whether or not experimenting with a different PAGE_SIZE
for the host page cache might help replicate things a bit more
realistically, even if it was suboptimal for the host for the reasons
previously noted as stupid.

If sticking to the existing PAGE_SIZE on the host, another idea may be
to use tmpfs + huge pages so as to at least mitigate TLB lookups.

[0] https://github.com/linux-kdevops/kdevops/commit/94844c4684a51997cb327d2fb0ce491fe4429dfc

  Luis
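
P.S.: In case a concrete example helps, below is a rough, untested
sketch of the kind of setup I have in mind. The paths, the 16k logical
block size, the image size and the tmpfs mount point are all made-up
placeholders for illustration, not recommendations:

#!/usr/bin/env python3
# Sketch only: assumes qemu-system-x86_64 with KVM is available and that
# /mnt/lbs-tmpfs is a tmpfs mounted beforehand by the admin, e.g.:
#   mount -t tmpfs -o size=2G,huge=always tmpfs /mnt/lbs-tmpfs
import os

TMPFS_DIR = "/mnt/lbs-tmpfs"   # hypothetical huge-page backed tmpfs mount
IMG = os.path.join(TMPFS_DIR, "lbs.raw")
IMG_SIZE = 1 << 30             # 1 GiB sparse backing file
LBA_SIZE = 16384               # pretend-larger logical block size
CACHE_MODE = "writeback"       # or "unsafe", for experimentation only

def create_backing_file():
    # A sparse raw image living entirely on tmpfs, so guest I/O is served
    # from host memory instead of a real storage controller.
    with open(IMG, "wb") as f:
        f.truncate(IMG_SIZE)

def qemu_cmdline():
    return [
        "qemu-system-x86_64", "-enable-kvm", "-m", "4G", "-smp", "4",
        "-drive",
        f"file={IMG},if=none,id=lbs0,format=raw,cache={CACHE_MODE}",
        "-device",
        ("virtio-blk-pci,drive=lbs0,"
         f"logical_block_size={LBA_SIZE},physical_block_size={LBA_SIZE}"),
    ]

if __name__ == "__main__":
    create_backing_file()
    # Only print the command so the sketch stays harmless; bolt on your
    # usual boot disk / -kernel / -append options to actually run a guest.
    print(" ".join(qemu_cmdline()))

The idea being that the guest sees a virtio-blk device advertising a
larger logical block size while every access is absorbed by the host
page cache (huge page backed via tmpfs), so the latency of emulating
the larger LBA is mostly hidden.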