On Tue, Feb 09, 2021 at 06:35:43PM +0100, Jan Kara wrote:
> On Tue 09-02-21 09:19:16, Dave Chinner wrote:
> > On Mon, Feb 08, 2021 at 01:49:41PM -0500, Gabriel Krisman Bertazi wrote:
> > > "Theodore Ts'o" <tytso@xxxxxxx> writes:
> >
> > For XFS, we want to be able to hook up the verifier error reports
> > to a notification. We want to be able to hook all our corruption
> > reports to a notification. We want to be able to hook all our
> > writeback errors to a notification. We want to be able to hook all
> > our ENOSPC and EDQUOT errors to a notification. And that's just the
> > obvious stuff that notifications are useful for.
>
> I agree with you here but I'd like to get the usecases spelled out to
> be able to better evaluate the information we need to pass. I can
> imagine for ENOSPC errors this can be stuff like thin provisioning
> sending a red alert to the sysadmin - this would be a fs-wide event. I
> have a somewhat hard time coming up with a case where notification of
> ENOSPC / EDQUOT for a particular file / dir would be useful.

An example is containers that the admins configure with project
quotas as directory quotas, so that individual containers have
their own independent space accounting and enforcement by the host.
Apps inside the container can then monitor for their own ENOSPC
events (triggered by project quota EDQUOT) instead of the
filesystem wide ENOSPC.

> I can see a usecase where an application wishes to monitor all its
> files / dirs for any type of fatal error (ENOSPC, EDQUOT, EIO).

*nod*

We also have cluster level management tools wanting to know about
failure events inside data stores that they hand out to containers
and/or guests. That's where things like corruption reports come in
- being able to flag errors at the management interface that
something went wrong with the filesystem used by container X, with
some level of detail of what actually got damaged (e.g. file X at
offset Y for length Z is bad).

> Here scoping makes a lot of sense from the application POV. It may be
> somewhat tricky to reliably provide the notification though. If we,
> say, spot an inconsistency in a block allocation structure during page
> writeback (be it a btree in the XFS case or a bitmap in the ext4
> case), we report the error there in the code for that structure, but
> that code is not necessarily aware of the inode, so we need to make
> sure to generate another notification in upper layers where we can
> associate the error with the inode as well.

Yes, that's what we already do in XFS. The initial corruption
detection site generates the corruption warning, and then, if
higher layers can't back out because the fs is in an unrecoverable
state, the filesystem shuts down and more error messages are
generated.

There are multiple levels of warnings/error messages in
filesystems. I thought that was pretty clear to everyone, so I'm
really very surprised that nobody is thinking that notifications
have different scopes, levels and meanings, just like the messages
we send to syslog do....

Indeed, once the filesystem is in a global shutdown or error state,
we don't emit further corruption errors, so we wouldn't emit
further error notifications, either. Essentially, we're not talking
about anything new here - this is already how we use the syslog for
corruption and shutdown reporting. I'm not sure why using a
"notification" instead of a "printk()" seems to make people think
this is an unsolvable problem, because we have already solved
it....
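To make that suppression policy concrete, here's a totally untested
sketch - all names are invented for illustration, this is not an
existing kernel or userspace API - of "broadcast the first fatal
event, then stay silent":

#include <stdatomic.h>
#include <stdbool.h>

/* Per-filesystem notification state - illustrative only. */
struct fsn_state {
	atomic_bool	shut_down;	/* set once on first fatal error */
};

/*
 * Decide whether an event should reach the watchers at all. The
 * first fatal event is broadcast; after that the filesystem has
 * already said "everything has failed", so we stay silent.
 */
static bool fsn_should_emit(struct fsn_state *st, bool fatal)
{
	if (!fatal)
		return !atomic_load(&st->shut_down);

	/* Only the first fatal event wins the exchange. */
	return !atomic_exchange(&st->shut_down, true);
}

i.e. exactly the same "warn once, then shut up" behaviour we
already have for syslog corruption reports.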
> Even worse, if we spot some error e.g. during journal commit, we (at
> least in the ext4 case) don't have enough information to trace back
> the affected inodes anymore.

Failure in the journal is a fatal error, and we shut down. That
generates the shutdown notification, and we don't emit anything
else once the shutdown is complete. Further analysis is up to the
admin, not the notification subsystem.

> So how do we handle such cases? Do we actively queue error
> notifications for all inodes? Or do we lazily wait for some operation
> associated with a particular inode to fail to queue a notification? I
> can see pros and cons for both...

I'd say that you're vastly overcomplicating the problem. There is
no point in generating a notification storm from the filesystem
once a fatal error has already been tripped over and the filesystem
shut down. We don't flood the syslog like this, and we shouldn't
flood the system with unnecessary notifications, either.

This implies that "fatal error" notifications should probably be
broadcast over all "error" watches on that filesystem, regardless
of their scope, because the filesystem is basically saying
"everything has failed". And then no further error notifications
are generated, because everything has already been told "it's
broken". But, really, that's a scoping discussion, not a use
case....

> What usecases did you have in mind?

Data loss events being reported to userspace so desktop
notifications can be raised. Or management interface notifications
can be raised. Or repair utilities can determine if the problem can
be fixed automatically.

I mean, that's the whole sticking point with DAX+reflink - being
able to reverse map the physical storage to the user data so that
when the storage gets torched by an MCE we can do the right thing.
And part of that "right thing" is notifying the apps and admins
that their data just went up in a cloud of high energy particles...

Then there's stuff that is indicative of imminent failure:
notification of transient errors during metadata operations, the
number of retries before success, when we end up permanently
retrying writes because the storage is actually toast so unmount
will eventually fail, etc.

When there is a filesystem health status change. Notification that
a filesystem's capacity has changed (e.g. grow/shrink).
Notification that a filesystem has been frozen. That allocation
groups are running low on space, that we are out of inode space,
that the reserve block pool has been depleted, etc.

IOWs, storage management and monitoring is a common case I keep
hearing about. I hear more vague requirements from higher level
application stacks (cloudy stuff) that they need stuff like
per-container space management and notifications. But the one thing
that nobody wants to do is scrape and/or parse text messages.

Another class of use case is applications being able to monitor
their files for writeback errors, with such notifications
containing the inode, offset and length of the failure so that the
actual data loss can be dealt with (e.g. by rewriting the data)
before the application has removed it from its write buffers. Right
now we have no way to tell the user application where the writeback
error occurred, just that EIO happened -some where- at -some time
in the past- when they next do something with the data...
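As an illustration only - the message layout below is hypothetical,
there is no such UAPI today - the point is that {inode, offset,
length} in the notification is sufficient for the application to
replay the failed write from its own buffers:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical wire format for a writeback error notification. */
struct fs_wb_err_msg {
	uint64_t	ino;	/* inode the write failed on */
	uint64_t	offset;	/* byte offset of the failed range */
	uint64_t	len;	/* length of the failed range */
	int32_t		error;	/* -EIO, -ENOSPC, ... */
};

/* Application-specific: rewrite the range from the app's buffers. */
static int replay_write(uint64_t ino, uint64_t offset, uint64_t len)
{
	/* stub - a real app would look up its write buffer here */
	(void)ino; (void)offset; (void)len;
	return 0;
}

static void handle_wb_error(const struct fs_wb_err_msg *msg)
{
	fprintf(stderr, "writeback error %d: ino %llu, range [%llu, %llu)\n",
		msg->error,
		(unsigned long long)msg->ino,
		(unsigned long long)msg->offset,
		(unsigned long long)(msg->offset + msg->len));

	/* The data is still in our write buffers - just write it again. */
	if (replay_write(msg->ino, msg->offset, msg->len) < 0)
		fprintf(stderr, "replay failed, surfacing error to user\n");
}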
> > If you want an idea of all the different types of metadata objects
> > we need to have different notifications for, look at the GETFSMAP
> > ioctl man page. It lists all the different types of objects we are
> > likely to emit notifications for from XFS (e.g. free space btree
> > corruption at record index X to Y) because, well, that's the sort
> > of information we're already dumping to the kernel log....
> >
> > Hence from a design perspective, we need to separate the contents
> > of the notification from the mechanism used to configure, filter
> > and emit notifications to userspace. That is, it doesn't matter if
> > we add a magic new syscall or use fanotify to configure watches and
> > transfer messages to userspace, the contents of the message is
> > going to be exactly the same, and the API that the filesystem
> > implementations are going to call to emit a notification to
> > userspace is exactly the same.
> >
> > So a generic message structure looks something like this:
> >
> >	<notification type>	(msg structure type)
> >	<notification location>	(encoded file/line info)
> >	<object type>		(inode, dir, btree, bmap, block, etc)
> >	<object ID>		{bdev, object}
> >	<range>			{offset, length} (range in object)
> >	<notification version>	(notification data version)
> >	<notification data>	(filesystem specific data)
>
> There's a caveat though that 'object type' is necessarily filesystem
> specific and with new filesystems wanting to support this we'll
> likely need to add more object types. So it is questionable how a
> "generic error parser" would be able to use this type of information
> and whether this doesn't need to be in the fs-specific blob.

Well, there are only so many generic types. If we start with the
basic ones such as "regular file", "directory", "user data extent"
and "internal metadata" we cover most bases. That's the reason I
said "filesystem specific diagnostic data can follow the generic
message". This allows the filesystem to say "fatal internal
metadata error" to userspace and then in its custom field say
"journal write IO error at block XYZ". Monitoring tools don't need
to know it was a journal error; the context they react to is "fatal
error". Whoever (or whatever) is tasked with responding to that
error can then look at the diagnostic information supplied with
the notification. i.e.:

	Severity:	 fatal
	Scope:		 global
	Type:		 internal metadata
	Object:		 journal
	Location:	 <bdev>
	Range:		 <extent>
	Error:		 ENODEV
	Diagnostic data: "write error ENODEV in journal at block XYZ"

A data loss event would indicate that a data extent went bad,
identifying it as belonging to inode X at offset Y, length Z. i.e.:

	Severity:	 data corruption
	Scope:		 inode
	Type:		 user data
	Object:		 <extent>
	Location:	 <inode>
	Range:		 <logical offset, length>
	Diagnostic data: "writeback failed at inode X, offset Y,
			  len Z due to ENOSPC from bdev"

If the data corruption happens in the inode metadata (e.g. the
block map), the event would be a little different:

	Severity:	 data corruption
	Scope:		 inode
	Type:		 internal metadata
	Object:		 <extent>
	Location:	 <inode>
	Range:		 <logical offset, length>
	Diagnostic data: "BMBT block at block XYZ failed checksum,
			  cannot read extent records"

So they tell userspace the same thing, but the actual details of
the cause of the data loss over that range of the file are quite
different.

Perhaps a space usage event:

	Severity:	 information
	Scope:		 global
	Type:		 capacity
	Total space:	 X
	Available space: Y

Or a directory quota warning:

	Severity:	 warning
	Scope:		 project quota
	Type:		 low capacity
	Object:		 <project id>
	Total space:	 X
	Available space: Y

IOWs, the notification message header is nothing but a
classification scheme that the notification scoping subsystem uses
for filtering and distribution. If we just stick to the major
objects a filesystem exposes to users (regular files, directories,
extended attributes, quota, capacity and internal metadata) and
important events (corruption, errors and emergency actions) then
we cover most of what all filesystems are going to need to tell
userspace.
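Roughly C-ified, the fixed header might look something like this
(again, just a sketch: none of these names exist in any current
UAPI, and the field sizes and enum values are arbitrary):

#include <stdint.h>

/* Generic classification - the only part a generic tool interprets. */
enum fsn_severity { FSN_INFO, FSN_WARNING, FSN_CORRUPTION, FSN_FATAL };
enum fsn_scope	  { FSN_GLOBAL, FSN_INODE, FSN_PROJECT };
enum fsn_obj_type { FSN_USER_DATA, FSN_METADATA, FSN_CAPACITY, FSN_QUOTA };

struct fsn_msg {
	uint32_t	version;	/* layout version of trailing blob */
	uint16_t	severity;	/* enum fsn_severity */
	uint16_t	scope;		/* enum fsn_scope */
	uint32_t	type;		/* enum fsn_obj_type */
	int32_t		error;		/* errno value, 0 if none */
	uint64_t	object;		/* inode number, project id, ... */
	uint64_t	offset;		/* start of affected range in object */
	uint64_t	length;		/* length of affected range */
	uint32_t	data_len;	/* size of fs-specific blob */
	/* data_len bytes of fs-specific diagnostic data follow */
};

A generic monitoring tool filters purely on (severity, scope, type)
and never has to look inside the trailing blob; only fs-specific
tooling needs to parse that.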
> Also versioning fs specific blobs with 'notification version' tends
> to get somewhat cumbersome if you need to update the scheme, thus
> bump the version, which breaks all existing parsers (and I won't even
> speak about the percentage of parsers that won't bother checking the
> version and just blindly try to parse whatever they get, assuming
> incorrect things ;). We've been there more than once... But this is
> more of a side remark - once other problems are settled I believe we
> can come up with a reasonably extensible scheme for blob passing
> pretty easily.

Yup, I just threw it in there because we need to ensure that the
message protocol format is both extensible and revocable. We will
make mistakes, but we can also ensure we don't have to live with
those mistakes forever.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx