Re: [PATCH 00/14] better handling of checksum errors/bitrot

On Tue, Mar 18, 2025 at 10:47:51AM -0400, John Stoffel wrote:
> >>>>> "Kent" == Kent Overstreet <kent.overstreet@xxxxxxxxx> writes:
> 
> > On Mon, Mar 17, 2025 at 04:55:24PM -0400, John Stoffel wrote:
> >> >>>>> "Kent" == Kent Overstreet <kent.overstreet@xxxxxxxxx> writes:
> >> 
> >> > Roland Vet spotted a good one: currently, rebalance/copygc get stuck if
> >> > we've got an extent that can't be read, due to checksum error/bitrot.
> >> 
> >> > This took some doing to fix properly, because
> >> 
> >> > - We don't want to just delete such data (replace it with
> >> >   KEY_TYPE_error); we never want to delete anything except when
> >> >   explicitly told to by the user, and while we don't yet have an API for
> >> >   "read this file even though there's errors, just give me what you
> >> >   have" we definitely will in the future.
> >> 
> >> So will open() just return an error?  How will this look from
> >> userspace?  
> 
> > Not the open, the read - the typical case is only a single extent goes
> > bad; it's like any other IO error.
> 
> Good. But then how would an application know we got a checksum error
> for data corruption?  Would I have to update all my code to do another
> special call when I get a read/write error to see if it was a
> corruption issue?  

We can only return -EIO via the normal IO paths.
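
For illustration, that's all userspace sees: the open() succeeds, and the
read() that hits the bad extent fails with errno set to EIO, the same as
any other I/O error. A minimal sketch in plain POSIX, nothing
bcachefs-specific:

  /*
   * Illustrative only: a checksum failure surfaces to userspace as an
   * ordinary failed read() with errno == EIO, same as any other I/O
   * error.  No filesystem-specific calls are involved.
   */
  #include <errno.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  static int read_whole_file(const char *path)
  {
          char buf[65536];
          int fd = open(path, O_RDONLY);   /* the open() itself succeeds */

          if (fd < 0)
                  return -errno;

          for (;;) {
                  ssize_t r = read(fd, buf, sizeof(buf));

                  if (r > 0)
                          continue;        /* good data, keep going */
                  if (r == 0)
                          break;           /* EOF */
                  if (errno == EINTR)
                          continue;
                  /* a bad extent lands here as EIO */
                  fprintf(stderr, "read failed: %s\n", strerror(errno));
                  break;
          }
          close(fd);
          return 0;
  }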

> 
> >> > - Not being able to move data is a non-option: that would block copygc,
> >> >   device removal, etc.
> >> 
> >> > - And, moving said extent requires giving it a new checksum - strictly
> >> >   required if the move has to fragment it, teaching the write/move path
> >> >   about extents with bad checksums is unpalatable, and anyway we'd
> >> >   like to be able to guard against more bitrot, if we can.
> >> 
> >> Why does it need a new checksum if there are read errors?  What
> >> happens if there are more read errors?   If I have a file on a
> >> filesystem which is based in spinning rust and I get a single bit
> >> flip, I'm super happy you catch it.  
> 
> > The data move paths very strictly verify checksums as they move data
> > around so they don't introduce bitrot.
> 
> Good.  This is something I really liked as an idea in ZFS, happy to
> see it coming to bcachefs as well. 

Coming? It's been that way since long before bcachefs was merged.

> > The aforementioned poison bit ensures that you still get the error from
> > the original checksum error when you read that data - unless you use an
> > appropriate "give it to me anyways" API.
> 
> So this implies that I need to do extra work to A) know I'm on a
> bcachefs filesystem, and B) when I get a read/write error, do some
> more checks to see what exactly the error is.
> 
> And if I want to re-write the file, I can copy it to a new name, but
> only when I use the new API to say "give me all the data, even if you
> have a checksum error".

This is only if you want an API for "give me possibly corrupted data".

That's pretty niche.
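
For the common case you can already salvage what's readable with plain
POSIX calls, ddrescue-style: read block by block and skip past whatever
comes back EIO. You don't get the corrupted bytes themselves, just
everything around them. Rough sketch, nothing bcachefs-specific:

  /*
   * Generic ddrescue-style salvage sketch: plain POSIX, no
   * bcachefs-specific API.  Read block by block; when a block comes
   * back EIO, note the hole and skip past it instead of giving up.
   */
  #include <errno.h>
  #include <stdio.h>
  #include <unistd.h>

  #define BLK 4096

  static void salvage(int fd)
  {
          char buf[BLK];
          off_t off = 0;

          for (;;) {
                  ssize_t r = pread(fd, buf, BLK, off);

                  if (r == 0)
                          break;           /* EOF */
                  if (r < 0) {
                          if (errno == EINTR)
                                  continue;
                          /* unreadable block: record it and move on */
                          fprintf(stderr, "skipping block at %lld\n",
                                  (long long)off);
                          off += BLK;
                          continue;
                  }
                  /* write the r good bytes to the destination here */
                  off += r;
          }
  }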

> > The process doing the read() can't be killed during this, no. If
> > requested this could be changed, but keep in mind retries are limited in
> > number.
> 
> How limited?  And in the worst case, if I have 10 or 100 readers of a
> file with checksum errors, now I've blocked all those processes for X
> amount of time.  Will this info be logged somewhere so a sysadm could
> possibly just do an 'rm' on the file to nuke it, or have some means of
> forcing a scrub?  

The poison bit means we won't attempt additional retries on an extent,
once we've reached the (configurable) limit. Future IOs will just return
an error without doing the actual read, since we already know it's bad.

So no, this won't bog down your system.
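
To make that concrete, here's a conceptual sketch of what the poison bit
buys you; the names are illustrative, not the actual bcachefs code:

  /*
   * Conceptual sketch only - illustrative names, not the actual
   * bcachefs code.  Once an extent has used up its retry budget it is
   * marked poisoned, and later reads fail immediately instead of
   * re-issuing device I/O.
   */
  #include <errno.h>
  #include <stdbool.h>

  #define MAX_RETRIES 3                    /* configurable; defaults to 3 */

  struct extent_state {
          bool poisoned;                   /* set once retries are exhausted */
  };

  /* read_fn does one read + checksum verify, returning 0 on success */
  static int read_extent(struct extent_state *e, int (*read_fn)(void))
  {
          if (e->poisoned)
                  return -EIO;             /* fail fast: no device I/O */

          for (int i = 0; i < MAX_RETRIES; i++)
                  if (read_fn() == 0)
                          return 0;        /* found a good copy */

          e->poisoned = true;              /* remember it's bad */
          return -EIO;
  }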

> > Nothing else is "locked up", everything else proceeds as normal.
> 
> But is the filesystem able to be unmounted when there's a locked up
> process?  I'm just thinking in terms of system shutdowns when you have
> failing hardware and want to get things closed as cleanly as possible
> since you're going to clone the underlying block device onto new media
> ASAP in an offline manner.  

The number of retries defaults to 3, and there's no real reason to make
it more than 5, so this isn't a real change over the current situation,
where drives sometimes take forever on a read as they're going out.

Eventually we could add a configurable hard limit on the amount of time
we spend trying to service an individual read, but that's niche, so it's
further down the road.



