Re: [LSF/MM/BPF TOPIC] Cloud storage optimizations

Hi Ted!

> With NVMe, it is possible for a storage device to promise this without
> requiring read-modify-write updates for sub-16k writes.  All that is
> necessary are some changes in the block layer so that the kernel does
> not inadvertently tear a write request when splitting a bio because it
> is too large (perhaps because it got merged with some other request,
> and then it gets split at an inconvenient boundary).

We have been working on support for atomic writes, and it is not as
simple as it sounds. Atomic operations in SCSI and NVMe have semantic
differences which are challenging to reconcile. On top of that, both the
SCSI and NVMe specs are buggy in the atomics department. We are working
to get things fixed in both standards and aim to discuss our
implementation at LSF/MM.
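
To make the splitting concern concrete, here is a rough and untested
userspace sketch (not actual block layer code) of the constraint the
splitter has to honour: if a merged request must be split, the split
point needs to land on a multiple of the device's atomic write unit or
somebody's write gets torn. The 16k unit below is just an example.

/*
 * Toy model of the split constraint.  Sizes are in 512-byte sectors
 * and the atomic write unit (16k here) is a made-up device limit.
 */
#include <stdio.h>

#define ATOMIC_UNIT_SECTORS	32u	/* 16k / 512 */

/*
 * How many sectors of a request starting at 'start_sector' may go
 * into the first fragment of a split, given a transfer cap of
 * 'max_sectors', without tearing any atomic unit.  Returns 0 if no
 * safe split point exists.
 */
static unsigned int safe_split_sectors(unsigned long long start_sector,
				       unsigned int nr_sectors,
				       unsigned int max_sectors)
{
	unsigned int split = nr_sectors < max_sectors ? nr_sectors : max_sectors;
	unsigned int overhang = (start_sector + split) % ATOMIC_UNIT_SECTORS;

	if (overhang >= split)
		return 0;	/* splitting here would tear a write */

	return split - overhang;
}

int main(void)
{
	/* 64k request at sector 0, transport caps transfers at 40
	 * sectors: the split must back off to 32 sectors (16k). */
	printf("split at %u sectors\n", safe_split_sectors(0, 128, 40));
	return 0;
}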

> There are also more interesting, advanced optimizations that might be
> possible.  For example, Jens had observed that passing hints which
> identify journaling writes (either from file systems or databases)
> could potentially be useful.

Yep. We got very impressive results identifying journal writes and the
kernel implementation was completely trivial, but...

> Unfortunately most common storage devices have not supported write
> hints, and support for write hints was ripped out last year.  That
> can be easily reversed, but there are some other interesting related
> subjects that are very much suited for LSF/MM.

Hinting didn't see widespread adoption because we in Linux, as well as
the various interested databases, preferred hints to be per-I/O
properties, whereas $OTHER_OS insisted that hints should be statically
assigned to LBA ranges on media. This left vendors having to choose
between two very different approaches, and consequently they chose not
to support either of them.

However, hints are coming back in various forms for non-enterprise and
cloud storage devices, so it's good to revive this discussion.
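
For what it's worth, the userspace side of a per-file lifetime hint is
trivial. Something like the following (untested; the F_SET_RW_HINT
constants are copied from <linux/fcntl.h> so the snippet stands alone,
and whether anything downstream acts on the hint depends on the
kernel) is all a database would need to do to mark its journal:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#ifndef F_SET_RW_HINT
#define F_LINUX_SPECIFIC_BASE	1024
#define F_SET_RW_HINT		(F_LINUX_SPECIFIC_BASE + 12)
#endif
#ifndef RWH_WRITE_LIFE_SHORT
#define RWH_WRITE_LIFE_SHORT	2	/* short-lived, journal-style data */
#endif

int main(void)
{
	int fd = open("journal.wal", O_WRONLY | O_CREAT | O_APPEND, 0600);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	uint64_t hint = RWH_WRITE_LIFE_SHORT;
	if (fcntl(fd, F_SET_RW_HINT, &hint) < 0)
		perror("F_SET_RW_HINT");	/* older kernel, hint ignored */

	/* Writes to fd now carry the hint as far as the kernel plumbs it. */
	close(fd);
	return 0;
}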

> For example, most cloud storage devices are doing read-ahead to try to
> anticipate read requests from the VM.  This can interfere with the
> read-ahead being done by the guest kernel.  So it would be useful to
> be able to tell the cloud storage device whether a particular read
> request stems from a read-ahead or not.

Indeed. In our experience the hints that work best are the ones which
convey to the storage device why the I/O is being performed.
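
To put it differently, the hint rides with the command and names the
reason for the I/O, not a property of an LBA range. A toy model of the
distinction (purely illustrative, not a real or proposed interface):

/* Toy model only: a per-command reason tag. */
#include <stdio.h>

enum io_reason {
	IO_REASON_NONE = 0,
	IO_REASON_DEMAND_READ,	/* someone is waiting on the data        */
	IO_REASON_READAHEAD,	/* speculative; don't read ahead further */
	IO_REASON_JOURNAL,	/* short-lived, sequential journal write */
	IO_REASON_METADATA,	/* filesystem metadata                   */
};

struct io_cmd {
	unsigned long long	lba;
	unsigned int		nr_sectors;
	int			is_write;
	enum io_reason		reason;		/* travels with the command */
};

static void submit(const struct io_cmd *cmd)
{
	/* A real transport would encode cmd->reason into the protocol,
	 * e.g. an NVMe dataset management hint, instead of printing it. */
	printf("%s lba=%llu len=%u reason=%d\n",
	       cmd->is_write ? "write" : "read",
	       cmd->lba, cmd->nr_sectors, (int)cmd->reason);
}

int main(void)
{
	struct io_cmd ra = {
		.lba = 2048, .nr_sectors = 256,
		.is_write = 0, .reason = IO_REASON_READAHEAD,
	};

	submit(&ra);
	return 0;
}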

-- 
Martin K. Petersen	Oracle Linux Engineering



