Re: [RFC] Filesystem error notifications proposal

Amir Goldstein <amir73il@xxxxxxxxx> · Fri, 22 Jan 2021 09:36:18 +0200

> >>   - Visibility of objects.  A bind mount of a subtree shouldn't receive
> >>     notifications of objects outside of that bind mount.
> >
> > So this is scope creep beyond the original goals of the project.  I
> > understand that there is a desire by folks in the community to support
> > various containerization use cases where only only a portion of file
> > system is visible in a container due to a bind mount.
> >
> > However, we need to recall that ext4_error messages can originate in
> > fairly deep inside the ext4 file system.  They indicate major problems
> > with the file system --- the kind that might require drastic system
> > administration reaction.  As such, at the point where we discover a
> > problem with an inode, that part of the ext4 code might not have
> > access to the pathname that was used to originally access the inode.
> >
> > We might be inside a workqueue handler, for example, so we might not
> > even running in the same process that had been containerized.  We
> > might be holding various file system mutexes or even in some cases a
> > spinlock.
>
> I see.  But the visibility is of a watcher who can see an object, not
> the application that caused the error.  The fact that the error happened
> outside the context of the containerized process should not be a problem
> here, right?  As long as the watcher is watching a mountpoint that can
> reach the failed inode, that inode should be accessible to the watcher
> and it should receive a notification. No?
>

No, because the mount/path is usually not available in file system
internal context. Even in vfs, many operations have no mnt context,
which is the reason that some fanotify event types are available for
FAN_MARK_FILESYSTEM and not for FAN_MARK_MOUNT.

> > What follows from that is that it's not really going to be possible to
> > filter notifications to a subtree.  Furthermore, if fanotify requires
> > memory allocation, that's going to be problematic, we may not be in a
> > context where memory allocation is possible.  So for that reason, it's
> > not clear to me that fanotify is going to be a good match for this use
> > case.
>
> I see.  Do you think we would be able to refactor the error handling
> code, such that we can drop spinlocks and do some non-reclaiming
> allocations at least?  I noticed Google's code seems to survive doing
> some allocations with GFP_ATOMIC in their internal-to-Google netlink
> notification system, and even GFP_KERNEL on some other scenarios.  I
> might not be seeing the full picture though.
>
> I think changing fanotify to avoid allocations in the submission path
> might be just intrusive enough for the patch to be rejected by Jan. If
> we cannot do allocations at all, I would suggest I move this feature out
> of fanotify, but stick with fsnotify, for its ability to link
> inodes/mntpoints/superblock.
>

Gabriel,

I understand the use case of monitoring a fleet of machines to know
when some machine in the fleet has a corruption.
I don't understand why the monitoring messages need to carry all the
debugging info of that corruption.

For corruption detection use case, it seems more logical to configure
machines in the fleet to errors=remount-ro and then you'd only ever
need to signal that a corruption was detected on a filesystem and the
monitoring agent can access that machine to get more debugging
info from dmesg or from filesystem recorded first/last error.

You may be able to avoid allocation in fanotify if a group keeps
a pre-allocated "emergency" event, but you won't be able to
avoid taking locks in fanotify. Even fsnotify takes srcu_read_lock
and spin_lock in some cases, so you'd have to be carefull with the
context you call fsnotify from.

If you agree with my observation that filesystem can abort itself
on corruption and keep the details internally, then the notification
of a corrupted state can always be made from a safe context
sometime after the corruption was detected, regardless of the
context in which ext4_error() was called.

IOW, if the real world use cases you have are reporting
writeback errors and signalling that the filesystem entered a corrupted
state, then fanotify might be the right tool for the job and you should
have no need for variable size detailed event info.
If you want a netoops equivalent reporting infrastructure, then
you should probably use a different tool.

Thanks,
Amir.