Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations

Keith Busch <kbusch@xxxxxxxxxx> · Thu, 2 Mar 2023 18:58:58 -0700

On Tue, Feb 28, 2023 at 10:52:15PM -0500, Theodore Ts'o wrote:
> Emulated block devices offered by cloud VM’s can provide functionality
> to guest kernels and applications that traditionally have not been
> available to users of consumer-grade HDD and SSD’s.  For example,
> today it’s possible to create a block device in Google’s Persistent
> Disk with a 16k physical sector size, which promises that aligned 16k
> writes will be atomic.  With NVMe, it is possible for a storage
> device to promise this without requiring read-modify-write updates for
> sub-16k writes. 

I'm not sure it does. The NVMe spec doesn't say AWUN writes are never an RMW
operation. It suggests that aligning to NPWA is the best way to avoid RMW, but
doesn't guarantee that, nor does it require that this limit align with atomic
boundaries. NVMe provides a lot of hints, but stops short of promises. Vendors
can promise whatever they want, but that's outside the spec.
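
To make the distinction concrete, here is a rough sketch that treats the
atomic write unit and the preferred write alignment as two independent checks.
The struct and helper names are hypothetical (not from the kernel or any NVMe
library). It assumes AWUN and NPWA are 0's based counts of logical blocks, and
it ignores namespace-level atomic fields and atomic boundaries entirely.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the Identify fields discussed above. */
struct nvme_hints {
	uint32_t lba_bytes; /* logical block size in bytes */
	uint16_t awun;      /* Atomic Write Unit Normal, 0's based, in logical blocks */
	uint16_t npwa;      /* Namespace Preferred Write Alignment, 0's based, in logical blocks */
};

/* Convert the 0's based AWUN value to bytes. */
uint64_t awun_bytes(const struct nvme_hints *h)
{
	return ((uint64_t)h->awun + 1) * h->lba_bytes;
}

/* Convert the 0's based NPWA value to bytes. */
uint64_t npwa_bytes(const struct nvme_hints *h)
{
	return ((uint64_t)h->npwa + 1) * h->lba_bytes;
}

/* Size check only: says nothing about whether the device does RMW. */
bool fits_awun(const struct nvme_hints *h, uint64_t len)
{
	return len <= awun_bytes(h);
}

/* Alignment hint only: says nothing about atomicity. */
bool follows_npwa(const struct nvme_hints *h, uint64_t offset)
{
	return offset % npwa_bytes(h) == 0;
}
```

With 4k logical blocks, AWUN = 3 and NPWA = 0, a 16k write at offset 4k passes
both checks, yet neither result tells you the device avoids RMW internally.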

> All that is necessary are some changes in the block
> layer so that the kernel does not inadvertently tear a write request
> when splitting a bio because it is too large (perhaps because it got
> merged with some other request, and then it gets split at an
> inconvenient boundary).

All the limits needed to optimally split on physical boundaries exist, so I
hope we're using them correctly via get_max_io_size().
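
As a rough illustration of splitting on physical boundaries, the userspace
sketch below trims an I/O so that it ends on a physical block boundary whenever
it can reach one. It is a simplified model and not the kernel's
get_max_io_size(); the function name and parameters are made up for the
example, and pbs_sectors is assumed to be a power of two.

```c
#include <stdint.h>

/*
 * Hypothetical model: given a starting sector, a cap on the I/O size, and
 * the physical block size (all in 512-byte sectors), return how many sectors
 * to issue so the I/O ends on a physical block boundary if possible.
 */
uint32_t max_io_sectors(uint64_t start, uint32_t max_sectors,
			uint32_t pbs_sectors)
{
	/* Offset of the start within its physical block. */
	uint32_t offset = (uint32_t)(start & (pbs_sectors - 1));

	/* Round the end of the I/O down to a physical block boundary. */
	uint32_t end = (offset + max_sectors) & ~(pbs_sectors - 1);

	if (end > offset)
		return end - offset;

	/* The cap is too small to reach the next boundary; keep it as is. */
	return max_sectors;
}
```

For a 16k physical block (32 sectors), a request starting at sector 8 with a
cap of 100 sectors is trimmed to 88, so it ends at sector 96.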

That said, I was hoping you were going to suggest supporting 16k logical block
sizes. Not a problem on some architectures, but still problematic when
PAGE_SIZE is 4k. :)