On Mon, Mar 17, 2025 at 03:40:10PM -0400, Kent Overstreet wrote:
> On Mon, Mar 17, 2025 at 01:24:23PM -0600, Keith Busch wrote:
> > This is a bit dangerous to assume. I don't find anywhere in any nvme
> > specifications (also checked T10 SBC) any text saying anything similar
> > to "bypass" in relation to the cache for FUA reads. I am reasonably
> > confident some vendors, especially ones developing active-active
> > controllers, will fight you to their win on the spec committee for
> > this if you want to take it up in those forums.
>
> "Read will come direct from media" reads pretty clear to me.
>
> But even if it's not supported the way we want, I'm not seeing anything
> dangerous about using it this way. Worst case, our retries aren't as
> good as we want them to be, and it'll be an item to work on in the
> future.

I don't think you're appreciating the complications that active-active
and multi-host bring to the scenario. Those are why this is not the
forum to solve it. The specs need to be clear on the guarantees, and
what they currently guarantee might provide some overlap with what
you're seeking in specific scenarios, but I really think (and I believe
Martin agrees) your use is outside its targeted use case.

> As long as drives aren't barfing when we give them a read fua (and so
> far they haven't when running this code), we're fine for now.

In this specific regard, I think it's safe to assume the devices will
remain operational.

> > > If devices don't support the behaviour we want today, then nudging the
> > > drive manufacturers to support it is infinitely saner than getting a
> > > whole nother bit plumbed through the NVME standard, especially given
> > > that the letter of the spec does describe exactly what we want.
> >
> > In my experience, the storage standards committees are more aligned to
> > accommodate appliance vendors than anything Linux specific.
> > Your desired
> > read behavior would almost certainly be a new TPAR in NVMe to get spec
> > defined behavior. It's not impossible, but I'll just say it is an uphill
> > battle and the end result may or may not look like what you have in
> > mind.
>
> I'm not so sure.
>
> If there are users out there depending on a different meaning of read
> fua, then yes, absolutely (and it sounds like Martin might have been
> alluding to that - but why wouldn't the write have been done fua? I'd
> want to hear more about that)

As I mentioned, Read FUA provides an optimization opportunity that can
be used instead of Write + Flush or Write FUA when the host isn't sure
about the persistence needs at the time of the initial Write: it can be
used as a checkpoint on a specific block range that you may have written
and overwritten. This kind of "read" command provides a well-defined
persistence barrier. Thinking of Read FUA as a barrier is better aligned
with how the standards committees and device makers intended it to be
used.

> If, OTOH, this is just something that hasn't come up before - the
> language in the spec is already there, so once code is out there with
> enough users and a demonstrated use case then it might be a pretty
> simple nudge - "shoot down this range of the cache, don't just flush it"
> is a pretty simple code change, as far as these things go.

So you're telling me you've never written SSD firmware and then waited
for the manufacturer to release it to your users? Yes, I jest, and maybe
YMMV; but relying on that process is putting your destiny in the wrong
hands.