Re: [PATCH 13/14] block: Allow REQ_FUA|REQ_READ

On Mon, Mar 17, 2025 at 02:39:07PM -0600, Keith Busch wrote:
> On Mon, Mar 17, 2025 at 03:40:10PM -0400, Kent Overstreet wrote:
> > On Mon, Mar 17, 2025 at 01:24:23PM -0600, Keith Busch wrote:
> > > This is a bit dangerous to assume. I don't find anything in any nvme
> > > specification (I also checked T10 SBC) saying anything similar
> > > to "bypass" in relation to the cache for FUA reads. I am reasonably
> > > confident some vendors, especially ones developing active-active
> > > controllers, will fight you to their win on the spec committee for
> > > this if you want to take it up in those forums.
> > 
> > "Read will come direct from media" reads pretty clear to me.
> >
> > But even if it's not supported the way we want, I'm not seeing anything
> > dangerous about using it this way. Worst case, our retries aren't as
> > good as we want them to be, and it'll be an item to work on in the
> > future.
> 
> I don't think you're appreciating the complications that active-active
> and multi-host brings to the scenario. Those are why this is not the
> forum to solve it. The specs need to be clear on the guarantees, and
> what they currently guarantee might provide some overlap with what
> you're seeking in specific scenarios, but I really think (and I believe
> Martin agrees) your use is outside its targeted use case.

You do realize this is a single node filesystem we're talking about,
right?

Crazy multi-homing stuff, while cool, has no bearing here.
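
For what it's worth, here's roughly what the retry-path usage looks like
from the filesystem side - a hedged sketch only, with a made-up helper
name; the REQ_OP_READ | REQ_FUA combination is what this series lets
through the block layer, and whether a given device actually goes to
media when it sees it is exactly the question above:

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Sketch: retry a failed read, asking the device to fetch from media
 * rather than serve a (possibly bad) cached copy.  Helper name and
 * retry policy are hypothetical.
 */
static int retry_read_from_media(struct block_device *bdev, sector_t sector,
                                 struct page *page, unsigned int len)
{
        struct bio *bio;
        int ret;

        bio = bio_alloc(bdev, 1, REQ_OP_READ | REQ_FUA, GFP_NOIO);
        bio->bi_iter.bi_sector = sector;
        __bio_add_page(bio, page, len, 0);

        ret = submit_bio_wait(bio);
        bio_put(bio);
        return ret;
}

If the device honours FUA on reads, the retry sees what's actually on
flash; if it quietly ignores it, worst case we just re-read the cached
copy - which is the "not dangerous" part.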

>  
> > As long as drives aren't barfing when we give them a read fua (and so
> > far they haven't when running this code), we're fine for now.
> 
> In this specific regard, I think it's safe to assume the devices will
> remain operational.
> 
> > > > If devices don't support the behaviour we want today, then nudging the
> > > > drive manufacturers to support it is infinitely saner than getting a
> > > > whole nother bit plumbed through the NVME standard, especially given
> > > > that the letter of the spec does describe exactly what we want.
> > > 
> > > In my experience, the storage standards committees are more aligned to
> > > accommodate appliance vendors than anything Linux specific. Your desired
> > > read behavior would almost certainly be a new TPAR in NVMe to get spec
> > > defined behavior. It's not impossible, but I'll just say it is an uphill
> > > battle and the end result may or may not look like what you have in
> > > mind.
> > 
> > I'm not so sure.
> > 
> > If there are users out there depending on a different meaning of read
> > fua, then yes, absolutely (and it sounds like Martin might have been
> > alluding to that - but why wouldn't the write have been done fua? I'd
> > want to hear more about that)
> 
> As I mentioned, READ FUA provides an optimization opportunity that
> can be used instead of Write + Flush or WriteFUA when the host isn't
> sure about the persistence needs at the time of the initial Write: it
> can be used as a checkpoint on a specific block range that you may have
> written and overwritten. This kind of "read" command provides a well
> defined persistence barrier. Thinking of Read FUA as a barrier is better
> aligned with how the standards and device makers intended it to be used.

Yeah, I got that. Again, a neat trick, but who in their right mind would
use that sort of thing when the sane thing to do is just write fua?

So I'm skeptical that that sort of thing is an actual use case with any
bearing on the devices any single node filesystem targets.

Now, in crazy enterprise multi-homing land, sure.

Now, if you're saying you think the standard should be interpreted in a
way such that read fua does what it seems you and Martin want it to do
in crazy enterprise multi-homing land... now _that_ would be a fun
argument to have in a standards committee :)

But I mostly jest, because my suspicion is that there wouldn't be any
real conflict, just a bit of the "I hadn't thought to use it that way"
anxiety.
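
To put the contrast in concrete terms, here's a hedged sketch of the
options under discussion, using only standard block layer calls; the
helper names are made up, and (c) assumes both this patch set letting
the flag combination through and the barrier semantics described above:

#include <linux/bio.h>
#include <linux/blkdev.h>

/* (a) The sane thing: make it durable at write time with a FUA write. */
static int write_range_fua(struct block_device *bdev, sector_t sector,
                           struct page *page, unsigned int len)
{
        struct bio *bio = bio_alloc(bdev, 1, REQ_OP_WRITE | REQ_FUA, GFP_NOIO);
        int ret;

        bio->bi_iter.bi_sector = sector;
        __bio_add_page(bio, page, len, 0);
        ret = submit_bio_wait(bio);
        bio_put(bio);
        return ret;
}

/* (b) Decide later, pay for everything: plain writes now, a full cache
 * flush when persistence is actually needed. */
static int checkpoint_with_flush(struct block_device *bdev)
{
        return blkdev_issue_flush(bdev);
}

/* (c) The barrier reading of read FUA: plain writes (and overwrites)
 * now, then one FUA read over just that range when it needs to be
 * durable. */
static int checkpoint_with_fua_read(struct block_device *bdev, sector_t sector,
                                    struct page *page, unsigned int len)
{
        struct bio *bio = bio_alloc(bdev, 1, REQ_OP_READ | REQ_FUA, GFP_NOIO);
        int ret;

        bio->bi_iter.bi_sector = sector;
        __bio_add_page(bio, page, len, 0);
        ret = submit_bio_wait(bio);
        bio_put(bio);
        return ret;
}

(c) only beats (a) if you're frequently overwriting data before deciding
it needs to be durable - which is exactly the "who would actually do
this" question.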

> > If, OTOH, this is just something that hasn't come up before - the
> > language in the spec is already there, so once code is out there with
> > enough users and a demonstrated use case then it might be a pretty
> > simple nudge - "shoot down this range of the cache, don't just flush it"
> > is a pretty simple code change, as far as these things go.
> 
> So you're telling me you've never written SSD firmware then waited for
> the manufacturer to release it to your users? Yes, I jest, and maybe
> YMMV; but relying on that process is putting your destiny in the wrong
> hands.

Nah, back when I was at an employer that did SSD drivers/firmware, it
was all in house - no waiting to ship required :)

And incidentally, it's been fun watching the "FTL host side or device
side" thing go back and forth since then; the same back-and-forth that's
now happening with NVMe and ZNS played out between different generations
of the internal stuff being built at Google.

The appeal of a host side FTL was fairly obvious back then, but the FTL
ended up moving from the host to the device because people wanted to do
complete kernel IO stack bypass. The AIO and DIO code were really bad
back then, especially in certain multithreaded scenarios - profiles were
absolutely atrocious. And the performance gains of a host side FTL only
really happen if you can do it in combination with the filesystem,
cutting out a lot of redundancy and improving GC efficiency (this is a
big one), because in the filesystem you have a lot more information
about what goes with what, information that's lost at the block layer.
IO tail latency in particular improves, especially on loaded machines.

But a filesystem meant for that didn't exist at the time, nor did
hardware with any kind of an open interface...

Since then the kernel IO stack has gotten massively faster, ZNS hardware
exists, and bcachefs was pretty much designed from the start for
directly driving flash. There's about a month of work left, max, to
finish that off and drive hardware I have sitting on my desk.

Which means there should be more interesting things happening in the
NVMe transport area in the future. In particular, moving the FTL into the
filesystem ought to allow for much more gracefully degrading failure
modes instead of the whole SSD going offline (and signals to the user
about when flash is going bad! flash ECC algorithms give you this, it's
just not exposed!).

So should be fun times.



