Amir Goldstein <amir73il@xxxxxxxxx> writes:

>> I see. But the visibility is of a watcher who can see an object, not
>> the application that caused the error. The fact that the error happened
>> outside the context of the containerized process should not be a problem
>> here, right? As long as the watcher is watching a mountpoint that can
>> reach the failed inode, that inode should be accessible to the watcher
>> and it should receive a notification. No?
>>
>
> No, because the mount/path is usually not available in file system
> internal context. Even in vfs, many operations have no mnt context,
> which is the reason that some fanotify event types are available for
> FAN_MARK_FILESYSTEM and not for FAN_MARK_MOUNT.

Hi Amir, thanks for the explanation.

> I understand the use case of monitoring a fleet of machines to know
> when some machine in the fleet has a corruption.
> I don't understand why the monitoring messages need to carry all the
> debugging info of that corruption.
>
> For corruption detection use case, it seems more logical to configure
> machines in the fleet to errors=remount-ro and then you'd only ever
> need to signal that a corruption was detected on a filesystem and the
> monitoring agent can access that machine to get more debugging
> info from dmesg or from filesystem recorded first/last error.

The main use case, as Ted mentioned, is corruption detection across a
fleet of machines: while allowing them to continue operating where
possible, schedule the execution of repair tasks and/or data rebuilding
for specific files.

You are right that we don't need to provide full debugging information,
but the ext4 error message, for instance, would be useful. This is
closer to my previous RFC at https://lwn.net/Articles/839310/

There are other use cases that require us to provide somewhat more
information, in particular where in the code the error was raised and
the type of error, for pattern analysis.
So just reporting corruption via sysfs, for instance, wouldn't suffice.

> You may be able to avoid allocation in fanotify if a group keeps
> a pre-allocated "emergency" event, but you won't be able to
> avoid taking locks in fanotify. Even fsnotify takes srcu_read_lock
> and spin_lock in some cases, so you'd have to be carefull with the
> context you call fsnotify from.
>
> If you agree with my observation that filesystem can abort itself
> on corruption and keep the details internally, then the notification
> of a corrupted state can always be made from a safe context
> sometime after the corruption was detected, regardless of the
> context in which ext4_error() was called.
>
> IOW, if the real world use cases you have are reporting
> writeback errors and signalling that the filesystem entered a corrupted
> state, then fanotify might be the right tool for the job and you should
> have no need for variable size detailed event info.
> If you want a netoops equivalent reporting infrastructure, then
> you should probably use a different tool.

The main reason I was looking at fanotify was the ability to watch
different mountpoints and objects without watching the entire
filesystem. This was a requirement raised against my previous
submission linked above, which provided only a watch_queue-based
mechanism for watching the entire filesystem.

If we agree that we no longer need to watch specific subtrees, I think
it makes sense to revert to the previous proposal and drop fanotify
altogether for this use case.

-- 
Gabriel Krisman Bertazi