On Mon, Mar 17, 2025 at 08:13:59AM -0600, Keith Busch wrote:
> On Mon, Mar 17, 2025 at 08:15:51AM -0400, Kent Overstreet wrote:
> > On Sun, Mar 16, 2025 at 11:00:20PM -0700, Christoph Hellwig wrote:
> > > On Sat, Mar 15, 2025 at 02:41:33PM -0400, Kent Overstreet wrote:
> > > > That's the sort of thing that process is supposed to avoid. To be clear,
> > > > you and Christoph are the two reasons I've had to harp on process in the
> > > > past - everyone else in the kernel has been fine.
> > > >
> > > > ...
> > >
> > > In this case you very clearly misunderstood how the FUA bit works on
> > > reads (and clearly documented that misunderstanding in the commit log).
> >
> > FUA reads are clearly documented in the NVME spec.
>
> NVMe's Read with FUA documentation is pretty short:
>
>   Force Unit Access (FUA): If this bit is set to '1', then for data and
>   metadata, if any, associated with logical blocks specified by the Read
>   command, the controller shall:
>
>   1) commit that data and metadata, if any, to non-volatile medium; and
>   2) read the data and metadata, if any, from non-volatile medium.
>
> So this aligns with ATA and SCSI.
>
> The concern about what you're doing is with respect to "1". Avoiding that
> requires all writes also set FUA, or you send a flush command before doing any
> reads. Otherwise how would you know whether or not read data came from a
> previous cached write?

Item 1 is not the relevant concern here. That would be relevant if we were
trying to verify the integrity of writes end-to-end immediately after they'd
been completed - and that would be a lovely thing to do in the spirit of "true
end to end data integrity", but we're not ready to contemplate such things due
to the obvious performance implications.

The scenario here is: we go to read old data, we get a bit error, and then we
want to retry the read to verify that it wasn't a transient error before
declaring it dead (marking it poisoned).

In this scenario, "dirty data in the writeback cache" is something that would
only ever happen if userspace is doing very funny things with O_DIRECT, and the
drive flushing that data on a FUA read is totally fine. It's an irrelevant
corner case.

But we do absolutely require checking for transient read errors, and we will
miss things and corrupt data unnecessarily without FUA reads that bypass the
controller cache.

> I'm actually more curious how you came to use this mechanism for recovery.
> Have you actually seen this approach fix a real problem? In my experience, a
> Read FUA is used to ensure what you're reading has been persistently committed.
> It can be used as an optimization to flush a specific LBA range, but I'd not
> seen this used as a bit error recovery mechanism.

It's not data recovery in the conventional sense. Typically this doesn't come
up on conventional filesystems, because most filesystems don't even do data
checksumming - and even on a filesystem with data checksumming it won't
automatically come up, because, like other transient errors, it's generally
perfectly acceptable to return the error and let the upper layer retry later;
the filesystem generally doesn't have to care whether the error is transient
or not.

Where this becomes a filesystem concern is when those errors start driving
state changes within the filesystem, i.e. when we need to track and know about
them: in that case, we absolutely do need to try hard to distinguish between
transient and permanent errors.
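To make the retry concrete, here's a minimal sketch of what a single-page FUA
retry looks like at the bio layer - this is not the actual bcachefs code, it
just illustrates the mechanism. It assumes the driver passes REQ_FUA through
on reads, and verify_csum() is a made-up stand-in for whatever checksum
verification the caller does:

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Sketch only: retry a failing read with REQ_FUA set, so the controller has
 * to go back to the media instead of serving the same (possibly bit-flipped)
 * data out of its cache.
 */
static int read_retry_fua(struct block_device *bdev, sector_t sector,
			  struct page *page, unsigned int len)
{
	struct bio_vec bv;
	struct bio bio;
	int ret;

	bio_init(&bio, bdev, &bv, 1, REQ_OP_READ | REQ_FUA);
	bio.bi_iter.bi_sector = sector;
	__bio_add_page(&bio, page, len, 0);

	ret = submit_bio_wait(&bio);
	bio_uninit(&bio);
	if (ret)
		return ret;			/* hard I/O error */

	return verify_csum(page, len);		/* 0 if the data now checks out */
}

If that retry comes back clean, the original failure was transient and nothing
needs to change on disk; only if it fails again do we have to start treating
the extent as genuinely bad.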
Specifically, bcachefs needs to be able to move data around - even if it's
bitrotted and generating checksum errors. There are a couple of different
reasons we may need to move data:

- rebalance: the task that does arbitrary data processing in the background
  based on user policy (moving data to a slower device, compressing it or
  recompressing it with a more expensive algorithm)

- copygc, which compacts fragmented data in mostly-empty buckets and is
  required for the allocator and the whole system to be able to make forward
  progress

- device evacuate/removal

Rebalance is optional (though it still needs bitrotted extents to be tracked
so that it doesn't spin on them, retrying them endlessly), but copygc and
evacuate are not. And the only way for these processes to handle bitrotted
extents (besides throwing them out, which obviously we're not going to do) is
to:

- generate a new checksum

- mark them as poisoned, so that reads to userspace still return the
  appropriate error (unless it's a read done with an "I know there are errors,
  give me what you still have" API).

But the critical thing here is: marking an extent as poisoned will turn a
transient corruption into a permanent error unless we've made damn sure, with
additional FUA retries, that there is no transient error.
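To illustrate the ordering that matters, here's a sketch of the move path with
entirely hypothetical helpers - this is not the real bcachefs implementation,
just the shape of the logic:

/*
 * Sketch with made-up helpers (read_extent(), extent_rewrite_csum(),
 * extent_set_poisoned(), write_extent_to_new_location()) - not the actual
 * bcachefs move path.  The point is the ordering: poison only after the
 * FUA retry has also failed, because poisoning makes the error permanent.
 */
static int move_extent(struct extent *e)
{
	int ret = read_extent(e, 0);			/* normal read */

	if (ret == -EBADMSG) {				/* checksum error */
		ret = read_extent(e, REQ_FUA);		/* retry from media, bypassing controller cache */
		if (ret == -EBADMSG) {
			/*
			 * Genuinely bitrotted: keep the data anyway.
			 * Recompute the checksum over what's left so the
			 * extent can still be moved, and poison it so reads
			 * to userspace keep returning an error.
			 */
			extent_rewrite_csum(e);
			extent_set_poisoned(e);
			ret = 0;
		}
	}
	if (ret)
		return ret;

	return write_extent_to_new_location(e);
}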