>>>>> "Kent" == Kent Overstreet <kent.overstreet@xxxxxxxxx> writes: > Roland Vet spotted a good one: currently, rebalance/copygc get stuck if > we've got an extent that can't be read, due to checksum error/bitrot. > This took some doing to fix properly, because > - We don't want to just delete such data (replace it with > KEY_TYPE_error); we never want to delete anything except when > explicitly told to by the user, and while we don't yet have an API for > "read this file even though there's errors, just give me what you > have" we definitely will in the future. So will open() just return an error? How will this look from userspace? > - Not being able to move data is a non-option: that would block copygc, > device removal, etc. > - And, moving said extent requires giving it a new checksum - strictly > required if the move has to fragment it, teaching the write/move path > about extents with bad checksums is unpalateable, and anyways we'd > like to be able to guard against more bitrot, if we can. Why does it need a new checksum if there are read errors? What happens if there are more read errors? If I have a file on a filesystem which is based in spinning rust and I get a single bit flip, I'm super happy you catch it. But now you re-checksum the file, with the read error, and return it? I'm stupid and just a user/IT guy. I want notifications, but I don't want my application to block so I can't kill it, or unmount the filesystem. Or continue to use it if I like. > So that means: > - Extents need a poison bit: "reads return errors, even though it now > has a good checksum" - this was added in a separate patch queued up > for 6.15. Sorry for being dense, but does a file have one or more extents? Or is this at a level below that? > It's an incompat feature because it's a new extent field, and old > versions can't parse extents with unknown field types, since they > won't know their sizes - meaning users will have to explicitly do an > incompat upgrade to make use of this stuff. > - The read path needs to do additional retries after checksum errors > before giving up and marking it poisoned, so that we don't > accidentally convert a transient error to permanent corruption. When doing these retries, is the filesystem locked up or will the process doing the read() be blocked from being killed? > - The read path gets a whole bunch of work to plumb precise modern error > codes around, so that e.g. the retry path, the data move path, and the > "mark extent poisoned" path all know exactly what's going on. > - Read path is responsible for marking extents poisoned after sufficient > retry attempts (controlled by a new filesystem option) > - Data move path is allowed to move extents after a read error, if it's > a checksum error (giving it a new checksum) if it's been poisoned > (i.e. the extent flags feature is enabled). So if just a single bit flips, the extent gets moved onto better storage, and the file gets re-checksummed. But what about if more bits go bad afterwards? > Code should be more or less finalized - still have more tests for corner > cases to write, but "write corrupt data and then tell rebalance to move > it to another device" works as expected. > TODO: > - NVME has a "read recovery level" attribute that controlls how hard the > erasure coding algorithms work - we want that plumbed. > Before we give up and move data that we know is bad, we need to try > _as hard as possible_ to get a successful read. What does this mean exactly in terms of end user and SysAdmin point of view? 
> - The read path gets a whole bunch of work to plumb precise modern error
>   codes around, so that e.g. the retry path, the data move path, and the
>   "mark extent poisoned" path all know exactly what's going on.
>
> - Read path is responsible for marking extents poisoned after sufficient
>   retry attempts (controlled by a new filesystem option)
>
> - Data move path is allowed to move extents after a read error, if it's
>   a checksum error (giving it a new checksum), provided it's been
>   poisoned (i.e. the extent flags feature is enabled).

So if just a single bit flips, the extent gets moved onto better storage,
and the file gets re-checksummed. But what about if more bits go bad
afterwards?

> Code should be more or less finalized - still have more tests for corner
> cases to write, but "write corrupt data and then tell rebalance to move
> it to another device" works as expected.
>
> TODO:
>
> - NVME has a "read recovery level" attribute that controls how hard the
>   erasure coding algorithms work - we want that plumbed.
>
>   Before we give up and move data that we know is bad, we need to try
>   _as hard as possible_ to get a successful read.

What does this mean exactly from an end user and SysAdmin point of view?
Locking up the system (no new writes to the now-problematic storage) in a
way that I can't kill would be annoyingly painful.

Don't get me wrong, I'm happy to see data integrity stuff get added; I'm
just trying to understand how it interacts with userspace.

John