On 2023/3/1 11:52, Theodore Ts'o wrote:
> Emulated block devices offered by cloud VMs can provide functionality to
> guest kernels and applications that has traditionally not been available
> to users of consumer-grade HDDs and SSDs. For example, today it's possible
> to create a block device in Google's Persistent Disk with a 16k physical
> sector size, which promises that aligned 16k writes will be atomic. With
> NVMe, it is possible for a storage device to make this promise without
> requiring read-modify-write updates for sub-16k writes. All that is
> necessary are some changes in the block layer so that the kernel does not
> inadvertently tear a write request when splitting a bio because it is too
> large (perhaps because it got merged with some other request and then gets
> split at an inconvenient boundary).
Yeah, most cloud vendors (including Alibaba Cloud) now use ext4 bigalloc to avoid MySQL's double write buffer. In addition to improving performance, this approach also minimizes unnecessary I/O traffic between compute and storage nodes. I once hacked together a COW-based in-house approach in XFS, using an optimized always_cow mode with some tricks to avoid storage dependency. But nowadays AWS and Google Cloud are all using ext4 bigalloc, so.. ;-)
> There are also more interesting, advanced optimizations that might be
> possible. For example, Jens had observed that passing hints that a write
> is a journaling write (whether from file systems or databases) could
> potentially be useful. Unfortunately, most common storage devices have not
> supported write hints, and support for write hints was ripped out of the
> kernel last year. That can be easily reversed, but there are some other
> interesting related subjects that are very much suited for LSF/MM. For
> example, most cloud storage devices do read-ahead to try to anticipate
> read requests from the VM. This can interfere with the read-ahead being
> done by the guest kernel, so it would be useful to be able to tell the
> cloud storage device whether a particular read request stems from
> read-ahead or not. At the moment, as Matthew Wilcox has pointed out, we
> use the read-ahead code path even for synchronous buffered reads, so
> plumbing this information through multiple levels of the mm, fs, and
> block layers will probably be needed.
That seems useful as well, although if my understanding is correct, it's somewhat unclear to me whether we could do more and find a better form than the current REQ_RAHEAD (at the moment, REQ_RAHEAD's use cases and impact are quite limited.)

Thanks,
Gao Xiang