On Mon, Mar 17, 2025 at 08:13:59AM -0600, Keith Busch wrote:
> On Mon, Mar 17, 2025 at 08:15:51AM -0400, Kent Overstreet wrote:
> > On Sun, Mar 16, 2025 at 11:00:20PM -0700, Christoph Hellwig wrote:
> > > On Sat, Mar 15, 2025 at 02:41:33PM -0400, Kent Overstreet wrote:
> > > > That's the sort of thing that process is supposed to avoid. To be clear,
> > > > you and Christoph are the two reasons I've had to harp on process in the
> > > > past - everyone else in the kernel has been fine.
> > > >
> > > > ...
> > >
> > > In this case you very clearly misunderstood how the FUA bit works on
> > > reads (and clearly documented that misunderstanding in the commit log).
> >
> > FUA reads are clearly documented in the NVME spec.
>
> NVMe's Read with FUA documentation is pretty short:
>
>   Force Unit Access (FUA): If this bit is set to '1', then for data and
>   metadata, if any, associated with logical blocks specified by the Read
>   command, the controller shall:
>
>   1) commit that data and metadata, if any, to non-volatile medium; and
>   2) read the data and metadata, if any, from non-volatile medium.
>
> So this aligns with ATA and SCSI.
>
> The concern about what you're doing is with respect to "1". Avoiding that
> requires all writes also set FUA, or you send a flush command before doing any
> reads. Otherwise how would you know whether or not read data came from a
> previous cached write?

Item 1 is not the relevant concern here. That would be relevant if we were
trying to verify the integrity of writes end-to-end immediately after they'd
been completed - and that would be a lovely thing to do in the spirit of "true
end to end data integrity", but we're not ready to contemplate such things due
to the obvious performance implications.

The scenario here is: we go to read old data, we get a bit error, and then we
want to retry the read to verify that it wasn't a transient error before
declaring it dead (marking it poisoned).

In this scenario, "dirty data in the writeback cache" is something that would
only ever happen if userspace is doing very funny things with O_DIRECT, and the
drive flushing that data on a FUA read is totally fine. It's an irrelevant
corner case.

But we do absolutely require checking for transient read errors, and we will
miss things and corrupt data unnecessarily without FUA reads that bypass the
controller cache.

> I'm actually more curious how you came to use this mechanism for recovery.
> Have you actually seen this approach fix a real problem? In my experience, a
> Read FUA is used to ensure what you're reading has been persistently committed.
> It can be used as an optimization to flush a specific LBA range, but I'd not
> seen this used as a bit error recovery mechanism.

It's not data recovery in the conventional sense. Typically this doesn't come
up on conventional filesystems, because most filesystems don't even do data
checksumming - and even on a filesystem with data checksumming it won't
automatically come up, because, like other transient errors, it's generally
perfectly acceptable to return the error and let the upper layer retry later;
the filesystem generally doesn't have to care whether the error is transient
or not.

Where this becomes a filesystem concern is when those errors start driving
state changes within the filesystem, i.e. when we need to track and know about
them: in that case, we absolutely do need to try hard to distinguish between
transient and permanent errors.
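To make the retry concrete, here's a minimal sketch of what a single-page FUA
retry looks like at the bio layer - this is not the actual bcachefs code, it
just illustrates the mechanism. It assumes the driver passes REQ_FUA through
on reads, and verify_csum() is a made-up stand-in for whatever checksum
verification the caller does:

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Sketch only: retry a failing read with REQ_FUA set, so the controller has
 * to go back to the media instead of serving the same (possibly bit-flipped)
 * data out of its cache.
 */
static int read_retry_fua(struct block_device *bdev, sector_t sector,
			  struct page *page, unsigned int len)
{
	struct bio_vec bv;
	struct bio bio;
	int ret;

	bio_init(&bio, bdev, &bv, 1, REQ_OP_READ | REQ_FUA);
	bio.bi_iter.bi_sector = sector;
	__bio_add_page(&bio, page, len, 0);

	ret = submit_bio_wait(&bio);
	bio_uninit(&bio);
	if (ret)
		return ret;			/* hard I/O error */

	return verify_csum(page, len);		/* 0 if the data now checks out */
}

If that retry comes back clean, the original failure was transient and nothing
needs to change on disk; only if it fails again do we have to start treating
the extent as genuinely bad.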
Specifically, bcachefs needs to be able to move data around - even if it's
bitrotted and generating checksum errors. There are a couple of different
reasons we may need to move data:

- rebalance: the task that does arbitrary data processing in the background
  based on user policy (moving data to a slower device, compressing it or
  recompressing it with a more expensive algorithm)

- copygc, which compacts fragmented data in mostly-empty buckets and is
  required for the allocator and the whole system to be able to make forward
  progress

- device evacuate/removal

Rebalance is optional (though it still needs bitrotted extents to be tracked
so that it doesn't spin on them, retrying them endlessly), but copygc and
evacuate are not. And the only way for these processes to handle bitrotted
extents (besides throwing them out, which obviously we're not going to do) is
to:

- generate a new checksum

- mark them as poisoned, so that reads to userspace still return the
  appropriate error (unless it's a read done with an "I know there are errors,
  give me what you still have" API).

But the critical thing here is: marking an extent as poisoned will turn a
transient corruption into a permanent error unless we've made damn sure, with
additional FUA retries, that there is no transient error.
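To illustrate the ordering that matters, here's a sketch of the move path with
entirely hypothetical helpers - this is not the real bcachefs implementation,
just the shape of the logic:

/*
 * Sketch with made-up helpers (read_extent(), extent_rewrite_csum(),
 * extent_set_poisoned(), write_extent_to_new_location()) - not the actual
 * bcachefs move path.  The point is the ordering: poison only after the
 * FUA retry has also failed, because poisoning makes the error permanent.
 */
static int move_extent(struct extent *e)
{
	int ret = read_extent(e, 0);			/* normal read */

	if (ret == -EBADMSG) {				/* checksum error */
		ret = read_extent(e, REQ_FUA);		/* retry from media, bypassing controller cache */
		if (ret == -EBADMSG) {
			/*
			 * Genuinely bitrotted: keep the data anyway.
			 * Recompute the checksum over what's left so the
			 * extent can still be moved, and poison it so reads
			 * to userspace keep returning an error.
			 */
			extent_rewrite_csum(e);
			extent_set_poisoned(e);
			ret = 0;
		}
	}
	if (ret)
		return ret;

	return write_extent_to_new_location(e);
}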