Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ

On Mon, Mar 17, 2025 at 01:57:53PM -0400, Martin K. Petersen wrote:
> 
> Kent,
> 
> >> At least for SCSI, given how FUA is usually implemented, I consider
> >> it quite unlikely that two read operations back to back would somehow
> >> cause different data to be transferred. Regardless of which flags you
> >> use.
> >
> > Based on what, exactly?
> 
> Based on the fact that many devices will either blindly flush on FUA or
> they'll do the equivalent of a media verify operation. In neither case
> will you get different data returned. The emphasis for FUA is on media
> durability, not caching.
> 
> In most implementations the cache isn't an optional memory buffer thingy
> that can be sidestepped. It is the only access mechanism that exists
> between the media and the host interface. Working memory if you will. So
> bypassing the device cache is not really a good way to think about it.
> 
> The purpose of FUA is to ensure durability for future reads, it is a
> media management flag. As such, any effect FUA may have on the device
> cache is incidental.
> 
> For SCSI there is a different flag to specify caching behavior. That
> flag is orthogonal to FUA and did not get carried over to NVMe.
> 
> > We _know_ devices are not perfect, and your claim that "it's quite
> > unlikely that two reads back to back would return different data"
> > amounts to claiming that there are no bugs in a good chunk of the IO
> > path and all that is implemented perfectly.
> 
> I'm not saying that devices are perfect or that the standards make
> sense. I'm just saying that your desired behavior does not match the
> reality of how a large number of these devices are actually implemented.
> 
> The specs are largely written by device vendors and therefore
> deliberately ambiguous. Many of the explicit cache management bits and
> bobs have been removed from SCSI or are defined as hints because device
> vendors don't want the OS to interfere with how they manage resources,
> including caching. I get what your objective is. I just don't think FUA
> offers sufficient guarantees in that department.

If you're saying this is going to be a work in progress to get the
behaviour we need in this scenario - yes, absolutely.

Beyond making sure that retries go to the physical media, there's "retry
level" in the NVMe spec that needs to be plumbed, and that one will be
particularly useful in multi-device scenarios (crank the retry level up
or down based on whether we can retry from a different device).
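
Concretely, the policy I have in mind is roughly this - a sketch only:
nr_replicas is a stand-in, nothing in the block layer carries an NVMe
retry hint today, and REQ_FAILFAST_DEV is just the nearest existing
knob (it only suppresses kernel-side retries, it says nothing about how
hard the device itself tries):

        blk_opf_t opf = REQ_OP_READ | REQ_FUA;

        /*
         * If we hold other replicas of this extent, ask for fail-fast
         * behaviour so we can bounce to another device quickly; if this
         * is the last copy, let the drive retry as hard as it can.
         */
        if (nr_replicas > 1)
                opf |= REQ_FAILFAST_DEV;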

But we've got to start somewhere, and given that the spec says "bypass
the cache", that looks like the place to start. If devices don't
support the behaviour we want today, then nudging the drive
manufacturers to support it is infinitely saner than getting a whole
new bit plumbed through the NVMe standard, especially given that the
letter of the spec does describe exactly what we want.

So: I understand your concerns, but they're out of scope for the moment.
As long as nothing actively breaks when we feed it READ|FUA, it's
totally fine.
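
For clarity, "feed it READ|FUA" means nothing fancier than this on the
retry path (a sketch, not the actual bcachefs code; bdev, sector and
page stand in for the failed extent):

        struct bio *bio;
        int ret;

        /* retry the failed read, asking the device to go to the media: */
        bio = bio_alloc(bdev, 1, REQ_OP_READ | REQ_FUA, GFP_NOIO);
        bio->bi_iter.bi_sector = sector;
        __bio_add_page(bio, page, PAGE_SIZE, 0);

        ret = submit_bio_wait(bio);     /* 0 on success, -EIO etc. on failure */
        bio_put(bio);

Drivers that can't honour FUA on reads just need to not choke on that
bi_opf - that's the "nothing actively breaks" part.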

Later on I can imagine us adding some basic sanity tests to alert if
READ|FUA isn't behaving as expected; it's pretty easy to compare the
latency of random reads vs. repeated reads to the same location vs. FUA
reads to the same location.
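
Rough shape of such a check (again just a sketch - real block layer
calls, but the thresholds and plumbing are hand-waved): time repeated
reads to one sector with and without REQ_FUA, and warn if the FUA
variant isn't noticeably slower than what should be a cache hit.

        static u64 time_reads(struct block_device *bdev, struct page *page,
                              sector_t sector, blk_opf_t opf, unsigned nr)
        {
                u64 start = ktime_get_ns();
                unsigned i;

                for (i = 0; i < nr; i++) {
                        struct bio *bio = bio_alloc(bdev, 1, opf, GFP_NOIO);

                        bio->bi_iter.bi_sector = sector;
                        __bio_add_page(bio, page, PAGE_SIZE, 0);
                        submit_bio_wait(bio);
                        bio_put(bio);
                }

                return ktime_get_ns() - start;
        }

        /* warn if the REQ_OP_READ|REQ_FUA timing is ~equal to plain rereads */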

So yes: there will be more work to do in this area.

> Also, given the amount of hardware checking done at the device level, my
> experience tells me that you are way more likely to have undetected
> corruption problems on the host side than inside the storage device. In
> general storage devices implement very extensive checking on both
> control and data paths. And they will return an error if there is a
> mismatch (as opposed to returning random data).

Bugs on the host side are very much a concern, yes (we can only do so
much against memory stompers, and the RMW that buffered IO does is a
giant hole), but at the filesystem level there are techniques for
avoiding corruption that are not available at the block level. The main
one being that every pointer carries the checksum that validates the
data it points to - the ZFS model.
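
To make that concrete, the shape of the idea is roughly this (a toy
sketch, nothing like the real bcachefs structures; read_from_device()
is a stand-in):

        struct extent_ptr {
                u64     dev_offset;     /* where the data lives */
                u32     csum;           /* checksum of the data it points to */
        };

        static int read_and_verify(const struct extent_ptr *ptr, void *buf, size_t len)
        {
                int ret = read_from_device(ptr->dev_offset, buf, len); /* stand-in */

                if (ret)
                        return ret;

                /* the pointer, not the data block itself, carries the expected checksum: */
                return crc32c(0, buf, len) == ptr->csum ? 0 : -EIO;
        }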

So we do that, and additionally, when moving data around, we're _very_
careful to carry the existing checksums with it, validating the old
checksum after generating the new one (or summing up the new checksums
to validate them against the old directly).
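
In terms of the toy structures above, the move path looks roughly like
this (again just a sketch; write_to_device() is a stand-in, and the
destination pointer is assumed to be allocated already):

        static int move_extent(const struct extent_ptr *old, struct extent_ptr *new,
                               void *buf, size_t len)
        {
                int ret = read_from_device(old->dev_offset, buf, len); /* stand-in */

                if (ret)
                        return ret;

                new->csum = crc32c(0, buf, len);

                /*
                 * Check the old checksum *after* generating the new one: if
                 * the buffer gets stomped in between, the old check fails and
                 * we bail, instead of committing a new checksum that blesses
                 * corrupt data.
                 */
                if (crc32c(0, buf, len) != old->csum)
                        return -EIO;

                return write_to_device(new->dev_offset, buf, len); /* stand-in */
        }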

IOW - in my world, it's the filesystem's job to verify and authenticate
everything.

And layers above bcachefs do their own verification and authentication
as well - NixOS package builders and git being the big ones that have
caught bugs for us.

It all matters...



