Re: [RFC] Filesystem error notifications proposal

Dave Chinner <david@xxxxxxxxxxxxx> · Tue, 9 Feb 2021 09:19:16 +1100

On Mon, Feb 08, 2021 at 01:49:41PM -0500, Gabriel Krisman Bertazi wrote:
> "Theodore Ts'o" <tytso@xxxxxxx> writes:
> 
> > On Tue, Feb 02, 2021 at 03:26:35PM -0500, Gabriel Krisman Bertazi wrote:
> >> 
> >> Thanks for the explanation.  That makes sense to me.  For corruptions
> >> where it is impossible to map to a mountpoint, I thought they could be
> >> considered global filesystem errors, being exposed only to someone
> >> watching the entire filesystem (like FAN_MARK_FILESYSTEM).
> >
> > At least for ext4, there are only 3 ext4_error_*() that we could map
> > to a subtree without having to make changes to the call points:
> >
> > % grep -i ext4_error_file\( fs/ext4/*.c  | wc -l
> > 3
> > % grep -i ext4_error_inode\( fs/ext4/*.c  | wc -l
> > 79
> > % grep -i ext4_error\( fs/ext4/*.c  | wc -l
> > 42
> >
> > So in practice, unless we want to make a lot of changes to ext4, most
> > of them will be global file system errors....
> >
> >> But, as you mentioned regarding the google use case, the entire idea of
> >> watching a subtree is a bit beyond the scope of my use-case, and was
> >> only added given the feedback on the previous proposal of this feature.
> >> While nice to have, I don't have the need to watch different mountpoints
> >> for errors, only the entire filesystem.
> >
> > I suspect that for most use cases, the most interesting thing is the
> > first error.  We already record this in the ext4 superblock, because
> > unfortunately, I can't guarantee that system administrators have
> > correctly configured their system logs, so when handling upstream bug
> > reports, I can just ask them to run dumpe2fs -h on the file system:
> >
> > FS Error count:           2
> > First error time:         Tue Feb  2 16:27:42 2021
> > First error function:     ext4_lookup
> > First error line #:       1704
> > First error inode #:      12
> > First error err:          EFSCORRUPTED
> > Last error time:          Tue Feb  2 16:27:59 2021
> > Last error function:      ext4_lookup
> > Last error line #:        1704
> > Last error inode #:       12
> > Last error err:           EFSCORRUPTED
> >
> > So it's not just the Google case.  I'd argue for most system
> > administrator, one of the most useful things is when the file system
> > was first found to be corrupted, so they can try correlating file
> > system corruptions, with, say, reports of I/O errors, or OOM kils,
> > etc.  This can also be useful for correlating the start of file system
> > problems with problems at the application layer --- say, MongoDB,
> > MySQL, etc.
> >
> > The reason why a notification system useful is because if you are
> > using database some kind of high-availability replication system, and
> > if there are problems detected in the file system of the primary MySQL
> > server, you'd want to have the system fail over to the secondary MySQL
> > server.  Sure, you *could* do this by polling the superblock, but
> > that's not the most efficient way to do things.
> 
> Hi Ted,
> 
> I think this closes a full circle back to my original proposal.  It
> doesn't have the complexities of objects other than superblock
> notifications, doesn't require allocations.  I sent an RFC for that a
> while ago [1] which resulted in this discussion and the current
> implementation.

Yup, we're back to "Design for Google/ext4 requirements only", and
ignoring that other filesystems and users also have non-trivial
requirements for userspace error notifications.

> For the sake of a having a proposal and a way to move forward, I'm not
> sure what would be the next step here.  I could revive the previous
> implementation, addressing some issues like avoiding the superblock
> name, the way we refer to blocks and using CAP_SYS_ADMIN.  I think that
> implementation solves the usecase you explained with more simplicity.
> But I'm not sure Darrick and Dave (all in cc) will be convinced by this
> approach of global pipe where we send messages for the entire
> filesystem, as Dave described it in the previous implementation.

Nope, not convinced at all. As a generic interface, it cannot be
designed explicitly for the needs of a single filesystem, especially
when there are other filesystems needing to implement similar
functionality.

As Amir pointed up earlier in the thread, XFS already already has
extensive per-object verification and error reporting facilicities
that we would like to hook up to such a generic error reporting
mechanism. These use generic mechanisms within XFS, and we've
largely standardised the code interally to implement this (e.g. the
xfs_failaddr as a mechanism of efficiently encoding the exact check
that failed out of the hundreds of verification checks we do).

If we've already got largely standardised, efficient mechanisms for
doing all of this in a filesystem, then why would we want to throw
that all away when implementing a generic userspace notification
channel? We know exactly what we need to present with userspace, so
even if other filesystems don't need exactly the same detail of
information, they still need to supply a subset of that same
information to userspace.

The ext4-based proposals keep mentioning dumping text strings and
encoded structures that are ext4 error specific, instead of starting
from a generic notification structure that defines the object in the
filesystem and location within the object that the notification is
for. e.g the {bdev, object, offset, length} object ID tuple I
mention here:

https://lore.kernel.org/linux-ext4/20201210220914.GG4170059@xxxxxxxxxxxxxxxxxxx/

For XFS, we want to be able to hook up the verifier error reports
to a notification. We want to be able to hook all our corruption
reports to a notification. We want to be able to hook all our
writeback errors to a notification. We want to be able to hook all
our ENOSPC and EDQUOT errors to a notification. And that's just the
obvious stuff that notifications are useful for.

If you want an idea of all the different types of metadata objects
we need to have different notifications for, look at the GETFSMAP
ioctl man page. It lists all the different types of objects we are
likely to emit notifications for from XFS (e.g. free space
btree corruption at record index X to Y) because, well, that's the
sort of information we're already dumping to the kernel log....

Hence from a design perspective, we need to separate the contents of
the notification from the mechanism used to configure, filter and
emit notifications to userspace.  That is, it doesn't matter if we
add a magic new syscall or use fanotify to configure watches and
transfer messages to userspace, the contents of the message is going
to be the exactly the same, and the API that the filesystem
implementations are going to call to emit a notification to
userspace is exactly the same.

So a generic message structure looks something like this:

<notification type>		(msg structure type)
<notification location>		(encoded file/line info)
<object type>			(inode, dir, btree, bmap, block, etc)
<object ID>			{bdev, object}
<range>				{offset, length} (range in object)
<notification version>		(notification data version)
<notification data>		(filesystem specific data)

The first six fields are generic and always needed and defined by
the kernel/user ABI (i.e. fixed forever in time). The notification
data is the custom message information from the filesystem, defined
by the filesystem, and is not covered by kernel ABI requirements
(similar to current dmesg output). That message could be a string,
an encoded structure, etc, but it's opaque to the notification layer
and just gets dumped to userspace in the notification. Generic tools
can parse the generic fields to give basic notification information,
debug/diagnostic tools can turn that fs specific information into
something useful for admins and support engineers.

IOWs, there is nothing that needs to be ext4 or XFS specific in the
notification infrastructure, just enough generic information for
generic tools to be useful, and a custom field for filesystem
specific diagnostic information to be included in the notification.

At this point, we just don't care about where in the filesystem the
notification is generated - the notification infrastructure assigns
the errors according to the scope the filesystem maps the object
type to. e.g. if fanotify is the userspace ABI, then global metadata
corruptions go to FA_MARK_FILESYSTEM watchers but not FA_MARK_MOUNT.
The individual operations that get an -EFSCORRUPTED error emits a
notification to FA_MARK_INODE watchers on that inode. And so on.

If a new syscall is added, then it also needs to be able to scope
error notifications like this because we would really like to have
per-directory and per-inode notifications supported if at all
possible. But that is a separate discussion to the message contents
and API filesystems will use to create notifications on demand...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx