On 2023/3/1 04:49, Darrick J. Wong wrote:
Hello fsdevel people,
Five years ago[0], we started a conversation about cross-filesystem
userspace tooling for online fsck. I think enough time has passed for
us to have another one, since a few things have happened since then:
1. ext4 has gained the ability to send corruption reports to a userspace
monitoring program via fsnotify. Thanks, Collabora!
I'm not familiar with the new fsnotify mechanism; is there an article to
start with?
I really believe we should have a generic interface to report errors.
Currently btrfs reports extra details only through dmesg (like the
logical/physical location of the corruption, the reason, the involved
inodes, etc.), which is far from ideal.
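For context, the userspace side of (1) is fanotify's FAN_FS_ERROR event
(merged in 5.16): a monitor marks a whole filesystem, then reads back an
errno, an occurrence count, and a file handle identifying the affected
object. A minimal, untested sketch (needs CAP_SYS_ADMIN; error handling
trimmed):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/fanotify.h>

int main(int argc, char *argv[])
{
        char buf[4096];
        /* FAN_REPORT_FID is required for FAN_FS_ERROR. */
        int fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID, O_RDONLY);

        /* Watch the whole filesystem that argv[1] lives on. */
        fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_FILESYSTEM,
                      FAN_FS_ERROR, AT_FDCWD, argv[1]);

        for (;;) {
                ssize_t len = read(fd, buf, sizeof(buf));
                struct fanotify_event_metadata *ev = (void *)buf;

                if (len <= 0)
                        break;
                while (FAN_EVENT_OK(ev, len)) {
                        /* Info records follow the metadata: TYPE_ERROR has
                         * the errno and count, TYPE_FID the file handle. */
                        struct fanotify_event_info_header *hdr =
                                (void *)((char *)ev + ev->metadata_len);
                        size_t rest = ev->event_len - ev->metadata_len;

                        while (rest >= sizeof(*hdr) && hdr->len >= sizeof(*hdr)) {
                                if (hdr->info_type == FAN_EVENT_INFO_TYPE_ERROR) {
                                        struct fanotify_event_info_error *e =
                                                (void *)hdr;
                                        printf("error %d, seen %u times\n",
                                               e->error, e->error_count);
                                }
                                rest -= hdr->len;
                                hdr = (void *)((char *)hdr + hdr->len);
                        }
                        ev = FAN_EVENT_NEXT(ev, len);
                }
        }
        return 0;
}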
2. XFS now tracks successful scrubs and corruptions seen during runtime
and during scrubs. Userspace can query this information.
3. Directory parent pointers, which enable online repair of the
directory tree, are nearing completion.
4. Dave and I are working on merging online repair of space metadata for
XFS. Online repair of directory trees is feature complete, but we
still have one or two unresolved questions in the parent pointer
code.
5. I've gotten a bit better[1] at writing systemd service descriptions
for scheduling and performing background online fsck.
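Regarding (2), that incore state is already queryable from userspace; for
example, per-AG health can be read back with the XFS_IOC_AG_GEOMETRY
ioctl. A minimal, untested sketch, assuming the uapi headers shipped with
xfsprogs:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>    /* XFS_IOC_AG_GEOMETRY, struct xfs_ag_geometry */

int main(int argc, char *argv[])
{
        /* argv[1]: any file or directory on the XFS filesystem. */
        int fd = open(argv[1], O_RDONLY);
        struct xfs_ag_geometry ageo = { .ag_number = 0 };

        /* Query AG 0; a real tool would loop over all AGs. */
        if (fd < 0 || ioctl(fd, XFS_IOC_AG_GEOMETRY, &ageo) < 0) {
                perror("XFS_IOC_AG_GEOMETRY");
                return 1;
        }
        /* ag_checked: metadata that has been scrubbed;
         * ag_sick: metadata known to be bad. */
        printf("ag 0: checked 0x%x, sick 0x%x\n", ageo.ag_checked, ageo.ag_sick);
        return 0;
}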
Now that fsnotify_sb_error exists as a result of (1), I think we
should figure out how to plumb calls into the readahead and writeback
code so that IO failures can be reported to the fsnotify monitor. I
suspect there may be a few difficulties here since fsnotify (iirc)
allocates memory and takes locks.
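The plumbing itself would mostly be a call to the existing helper at each
failure site; something like the following hypothetical writeback-completion
hook (a sketch, not a patch, and the caveat above about allocation and
locking context applies):

#include <linux/fs.h>
#include <linux/fsnotify.h>

/* Hypothetical hook; only fsnotify_sb_error() is real. */
static void example_end_writeback(struct inode *inode, int error)
{
        if (!error)
                return;
        /* May run from IO completion context, where fsnotify's memory
         * allocations and locks need a careful look. */
        fsnotify_sb_error(inode->i_sb, inode, error);
}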
As a result of (2), XFS now retains quite a bit of incore state about
its own health. The structure that fsnotify gives to userspace is very
generic (superblock, inode, errno, errno count). How might XFS export
a greater amount of information via this interface? We can provide
details at finer granularity -- for example, a specific data structure
under an allocation group or an inode, or specific quota records.
The same applies to btrfs.
Some btrfs-specific info, like the subvolume id, is also needed to locate
the corrupted inode (an inode number is not unique across the full fs,
only within one subvolume).
And something like the file path of the corrupted inode is also very
helpful for end users to locate (and normally delete) the offending file.
With (4) on the way, I can envision wanting a system service that would
watch for these fsnotify events, and transform the error reports into
targeted repair calls in the kernel.
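For XFS, such a service could translate a report naming an inode into an
XFS_IOC_SCRUB_METADATA call with the repair flag set. A rough, untested
sketch; the helper and its caller are hypothetical, the ioctl and structure
come from the xfsprogs uapi headers:

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>    /* XFS_IOC_SCRUB_METADATA, struct xfs_scrub_metadata */

/* fsfd: any open fd on the filesystem; ino/gen: from the error report. */
static int repair_inode_record(int fsfd, unsigned long long ino, unsigned int gen)
{
        struct xfs_scrub_metadata sm;

        memset(&sm, 0, sizeof(sm));
        sm.sm_type = XFS_SCRUB_TYPE_INODE;      /* check the inode record... */
        sm.sm_flags = XFS_SCRUB_IFLAG_REPAIR;   /* ...and repair if needed */
        sm.sm_ino = ino;
        sm.sm_gen = gen;

        if (ioctl(fsfd, XFS_IOC_SCRUB_METADATA, &sm) < 0)
                return -1;
        if (sm.sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
                fprintf(stderr, "inode %llu: still corrupt\n", ino);
        return 0;
}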
Btrfs has two ways of repair:
- Read-time repair
  This happens automatically for both involved data and metadata, as
  long as the fs is mounted RW.
- Scrub-time repair
  The repair is also automatic.
  The main difference is that scrub is manually triggered by user space.
  Otherwise it can be considered a full read of the fs (both metadata
  and data).
But btrfs repair only involves using the extra copies; it was never
intended to repair things like directories.
(That's still the job of btrfs check, and the complex cross-references
of btrfs are not designed to be repaired at runtime.)
Currently both repair paths result in a dmesg-based report, while scrub
has its own interface to report some very basic accounting, like how many
sectors are corrupted and how many are repaired.
A full-featured and generic interface to report errors is definitely a
good direction to go.
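For reference, the scrub accounting mentioned above comes back through the
BTRFS_IOC_SCRUB ioctl, which blocks until the scrub of one device finishes
and then returns the counters. A rough, untested sketch:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

int main(int argc, char *argv[])
{
        /* argv[1]: the btrfs mount point. */
        int fd = open(argv[1], O_RDONLY);
        struct btrfs_ioctl_scrub_args args;

        memset(&args, 0, sizeof(args));
        args.devid = 1;         /* one device at a time; 1 is just an example */
        args.start = 0;
        args.end = (__u64)-1;   /* whole device */

        if (fd < 0 || ioctl(fd, BTRFS_IOC_SCRUB, &args) < 0) {
                perror("BTRFS_IOC_SCRUB");
                return 1;
        }
        printf("csum errors %llu, corrected %llu, uncorrectable %llu\n",
               (unsigned long long)args.progress.csum_errors,
               (unsigned long long)args.progress.corrected_errors,
               (unsigned long long)args.progress.uncorrectable_errors);
        return 0;
}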
This of course would be very
filesystem specific, but I would also like to hear from anyone pondering
other usecases for fsnotify filesystem error monitors.
Btrfs also has internal error counters, but those are accumulated values;
sometimes they're not that helpful and can even be confusing.
If we had such an interface, we could more or less get rid of the internal
error counters and rely on user space to do the history recording.
Once (3) lands, XFS gains the ability to translate a block device IO
error to an inode number and file offset, and then the inode number to a
path. In other words, your file breaks and now we can tell applications
which file it was so they can failover or redownload it or whatever.
Ric Wheeler mentioned this in 2018's session.
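The last hop can already be attempted from userspace today: given the file
handle from a FAN_EVENT_INFO_TYPE_FID record, open_by_handle_at(2) plus
/proc gives back a path; whether a full path comes back for a disconnected
regular file is exactly where parent pointers help. A minimal sketch (needs
CAP_DAC_READ_SEARCH; ESTALE means the file is already gone):

#define _GNU_SOURCE
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

/* mount_fd: an fd on the filesystem's mount point;
 * fh: the struct file_handle out of the fanotify FID record. */
static void print_affected_path(int mount_fd, struct file_handle *fh)
{
        char proc[64], path[PATH_MAX];
        int fd = open_by_handle_at(mount_fd, fh, O_RDONLY | O_PATH);
        ssize_t n;

        if (fd < 0) {
                perror("open_by_handle_at");
                return;
        }
        snprintf(proc, sizeof(proc), "/proc/self/fd/%d", fd);
        n = readlink(proc, path, sizeof(path) - 1);
        if (n > 0) {
                path[n] = '\0';
                printf("affected file: %s\n", path);
        }
        close(fd);
}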
Yeah, if a user space daemon can automatically (at least by some policy)
delete offending files, it can be a great help.
We have hit several reports where corrupted files (with no extra copy to
recover from) prevent btrfs balance from completing, and users have to
locate the file from dmesg, then delete the file and retry balancing.
Thus such an interface can greatly improve the user experience.
Thanks,
Qu
The final topic from that 2018 session concerned generic wrappers for
fsscrub. I haven't pushed hard on that topic because XFS hasn't had
much to show for that. Now that I'm better versed in systemd services,
I envision three ways to interact with online fsck:
- A CLI program that can be run by anyone.
- Background systemd services that fire up periodically.
- A dbus service that programs can bind to and request a fsck.
I still think there's an opportunity to standardize the naming to make
it easier to use a variety of filesystems. I propose for the CLI:
  /usr/sbin/fsscrub $mnt
which calls
  /usr/sbin/fsscrub.$FSTYP $mnt
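A dispatcher along those lines needs to do little more than look up the
mountpoint's filesystem type and exec the per-filesystem tool; a sketch
(only the fsscrub.$FSTYP naming comes from the proposal above, the rest is
illustrative):

#include <mntent.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        FILE *m;
        struct mntent *ent;
        char helper[256] = "";

        if (argc != 2) {
                fprintf(stderr, "usage: fsscrub <mountpoint>\n");
                return 2;
        }
        m = setmntent("/proc/self/mounts", "r");
        while (m && (ent = getmntent(m)))
                if (!strcmp(ent->mnt_dir, argv[1]))     /* last match wins */
                        snprintf(helper, sizeof(helper),
                                 "/usr/sbin/fsscrub.%s", ent->mnt_type);
        if (m)
                endmntent(m);
        if (!helper[0]) {
                fprintf(stderr, "%s: not a mountpoint\n", argv[1]);
                return 1;
        }
        execl(helper, helper, argv[1], (char *)NULL);
        perror(helper);
        return 1;
}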
For systemd services, I propose "fsscrub@<escaped mountpoint>". I
suspect we want a separate background service that itself runs
periodically and invokes the fsscrub@$mnt services. xfsprogs already
has an xfs_scrub_all service that does that. The services are nifty
because it's really easy to restrict privileges, implement resource
usage controls, and use private name/mount namespaces to isolate the
process from the rest of the system.
dbus is a bit trickier, since there's no precedent at all. I guess
we'd have to define an interface for a filesystem "object". Then we could
write a service that establishes a well-known bus name and maintains
object paths for each mounted filesystem. Each of those objects would
export the filesystem interface, and that's how programs would call
online fsck as a service.
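To make that concrete, an sd-bus skeleton for such a service could look
roughly like this; since there's no precedent, every bus name, object path,
and interface name below is made up:

#include <stdint.h>
#include <systemd/sd-bus.h>

static int method_scrub(sd_bus_message *m, void *userdata, sd_bus_error *err)
{
        /* Here the service would kick off fsscrub for the filesystem that
         * this object path represents, then return a status code. */
        return sd_bus_reply_method_return(m, "u", (uint32_t)0);
}

static const sd_bus_vtable fs_vtable[] = {
        SD_BUS_VTABLE_START(0),
        SD_BUS_METHOD("Scrub", "", "u", method_scrub, 0),
        SD_BUS_VTABLE_END
};

int main(void)
{
        sd_bus *bus = NULL;

        if (sd_bus_open_system(&bus) < 0)
                return 1;
        /* One object path per mounted filesystem. */
        sd_bus_add_object_vtable(bus, NULL, "/org/example/fsscrub/home",
                                 "org.example.fsscrub.Filesystem",
                                 fs_vtable, NULL);
        sd_bus_request_name(bus, "org.example.fsscrub", 0);
        for (;;) {
                int r = sd_bus_process(bus, NULL);
                if (r < 0)
                        break;
                if (r == 0)
                        sd_bus_wait(bus, UINT64_MAX);
        }
        return 0;
}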
Ok, that's enough for a single session topic. Thoughts? :)
--D
[0] https://lwn.net/Articles/754504/
[1] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=scrub-optimize-by-default