On Tue, Mar 23, 2021 at 11:35:46AM +0200, Amir Goldstein wrote:
> On Tue, Mar 23, 2021 at 9:26 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > On Tue, Mar 23, 2021 at 06:50:44AM +0200, Amir Goldstein wrote:
> > > On Tue, Mar 23, 2021 at 1:03 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> For most use cases, getting a unique fsid that is not "persistent"
> would be fine. Many use case will probably be watching a single
> filesystem and then the value of fsid in the event doesn't matter at all.
>
> If, however, at some point in the future, someone were to write
> a listener that stores events in a persistent queue for later processing
> it would be more "convenient" if fsid values were "persistent".

Ok, that is what I suspected - that you want to write filehandles to a
userspace database of some sort for later processing and so you need
to also store the filesystem that the filehandle belongs to.

FWIW, that's something XFS started doing a couple of decades ago for
DMAPI based proprietary HSM implementations. They were built around a
massive userspace database that indexed the contents of the filesystem
via file handles and were kept up to date via notifications through
the DMAPI interface.

> > > > However, changing the uuid on XFS is an offline (unmounted)
> > > > operation, so there will be no fanotify marks present when it is
> > > > changed. Hence when it is remounted, there will be a new f_fsid
> > > > returned in statvfs(), just like what happens now, and all
> > > > applications dependent on "persistent" fsids (and persistent
> > > > filehandles for that matter) will now get ESTALE errors...
> > > >
> > > > And, worse, mp->m_fixed_fsid (and XFS superblock UUIDs in general)
> > > > are not unique if you've got snapshots and they've been mounted via
> > > > "-o nouuid" to avoid XFS's duplicate uuid checking. This is one of
> > > > the reasons that the duplicate checking exists - so that fshandles
> > > > are unique and resolve to a single filesystem....
> > >
> > > Both of the caveats of uuid you mentioned are not a big concern for
> > > fanotify because the nature of f_fsid can be understood by the event
> > > listener before setting the multi-fs watch (i.e. in case of fsid collision).
> >
> > Sorry, I don't understand what "the nature of f_fsid can be
> > understood" means. What meaning are you trying to infer from 8 bytes
> > of opaque data in f_fsid?
> >
>
> When the program is requested to watch multiple filesystems, it starts by
> querying their fsid. In case of an fsid collision, the program knows that it
> will not be able to tell which filesystem the event originated in, so the
> program can print a descriptive error to the user.

Ok, so it can handle collisions, but it cannot detect things like two
filesystems swapping fsids because device ordering changed at boot
time. i.e. there's no way to determine the difference between an
f_fsid change vs the same filesystems with stable f_fsid being mounted
in different locations....

> > > Regardless of the fanotify uapi and whether it's good or bad, do you insist
> > > that the value of f_fsid exposed by xfs needs to be the bdev number and
> > > not derived from uuid?
> >
> > I'm not insisting that it needs to be the bdev number. I'm trying to
> > understand why it needs to be changed, what the impact of that
> > change is and whether there are other alternatives before I form an
> > opinion on whether we should make this user visible filesystem
> > identifier change or not...
> >
> > > One thing we could do is in the "-o nouuid" case that you mentioned
> > > we continue to use the bdev number for f_fsid.
> > > Would you like me to make that change?
> >
> > No, I need to you stop rushing around in a hurry to change code and
> > assuming that everyone knows every little detail of fanotify and the
> > problems that need to be solved. Slow down and explain clearly and
> > concisely why f_fsid needs to be persistent, how it gets used to
> > optimise <something> when it is persistent, and what the impact of
> > it not being persistent is.
> >
>
> Fair enough.
> I'm in a position of disadvantage having no real users to request this change.
> XFS is certainly "not broken", so the argument for "not fixing" it is valid.
>
> Nevertheless bdev can change on modern systems even without reboot
> for example for loop mounted images, so please consider investing the time
> in forming an opinion about making this change for the sake of making f_fsid
> more meaningful for any caller of statfs(2) not only for fanotify listeners.
>
> Leaving fanotify out of the picture, the question that the prospect user is
> trying answer is:
> "Is the object at $PATH or at $FD the same object that was observed at
> 'an earlier time'?"
>
> With XFS, that question can be answered (< 100% certainty)
> using the XFS_IOC_PATH_TO_FSHANDLE interface.

Actually, there's a bit more to it than that. See below.

> name_to_handle_at(2) + statfs(2) is a generic interface that provides
> this answer with less certainty, but it could provide the answer
> with the same certainty for XFS.

Let me see if I get this straight....

Because the VFS filehandle interface does not cater for this by giving
you a fshandle that is persistent, you have to know what path the
filehandle was derived from to be able to open a mountfd for
open_by_handle_at() on the file handle you have stored in userspace.
And name_to_handle_at() only returns an ephemeral mount ID, so the
kernel does not provide what you need to use open_by_handle_at()
immediately.

To turn this ephemeral mount ID into a stable identifier you have to
look up /proc/self/mountinfo to find the mount point, then statvfs()
the mount point to get the f_fsid. To use the handle, you then need to
open the path to the stored mount point, check that f_fsid still
matches what you originally looked up, then you can run
open_by_handle_at() on the file handle. If you have an open fd on the
filesystem and f_fsid matches, you have the filesystem pinned until
you close the mount fd, and so you can just sort your queued
filehandles by f_fsid and process them all while you have the mount fd
open....

Is that right?

But that still leaves a problem in that the VFS filehandle does not
contain a filesystem identifier itself, so you can never actually
verify that the filehandle belongs to the mount that you opened for
that f_fsid. i.e. the file handle is built by exportfs_encode_fh(),
which filesystems use to encode inode/gen/parent information. Nothing
else is placed in the VFS handle, so the file handle cannot be used to
identify what filesystem it came from.

These seem like fundamental problems for storing VFS handles across
reboots: identifying the filesystem is not atomic with the file handle
generation, and that identification is not encoded into the file
handle for later verification.
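Written out, that dance is something like the following. Sketch only:
the structure and function names are mine, the /proc/self/mountinfo
walk that maps the stored f_fsid back to a candidate mount point is
elided, and so is most error handling:

/*
 * Rough sketch of the record/resolve dance described above. Assumes
 * the caller has already mapped the stored f_fsid to a candidate
 * mount point; elides most error handling and cleanup.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/statfs.h>
#include <unistd.h>

struct stored_handle {
	struct statfs		sfs;	/* f_fsid sampled at record time */
	struct file_handle	*fh;	/* from name_to_handle_at() */
};

/* Record time: generate the handle and remember which fs it came from. */
static struct stored_handle *record_handle(const char *path)
{
	struct stored_handle	*sh = calloc(1, sizeof(*sh));
	int			mount_id;

	sh->fh = calloc(1, sizeof(*sh->fh) + MAX_HANDLE_SZ);
	sh->fh->handle_bytes = MAX_HANDLE_SZ;
	if (name_to_handle_at(AT_FDCWD, path, sh->fh, &mount_id, 0) < 0)
		return NULL;
	/* mount_id is ephemeral - f_fsid is all we can persist */
	if (statfs(path, &sh->sfs) < 0)
		return NULL;
	return sh;
}

/*
 * Use time: check the candidate mount point still carries the same
 * f_fsid before trusting the handle. The open mntfd pins the
 * filesystem while the handle is resolved.
 */
static int open_stored_handle(const char *mntpath, struct stored_handle *sh)
{
	struct statfs	sfs;
	int		mntfd, fd;

	mntfd = open(mntpath, O_RDONLY | O_DIRECTORY);
	if (mntfd < 0)
		return -1;
	if (statfs(mntpath, &sfs) < 0 ||
	    memcmp(&sfs.f_fsid, &sh->sfs.f_fsid, sizeof(sfs.f_fsid))) {
		close(mntfd);
		return -1;	/* not the filesystem we recorded */
	}
	fd = open_by_handle_at(mntfd, sh->fh, O_RDONLY);
	close(mntfd);
	return fd;
}

Note that the memcmp() on f_fsid is the only thing tying the stored
handle to the filesystem it eventually gets resolved against.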
IOWs, if you get the fsid translation wrong, the filehandle will end
up being used on the wrong filesystem and userspace has no way of
knowing that this occurred - it will get ESTALE or data that isn't
what it expected. Either way, it'll look like data corruption to the
application(s). Using f_fsid for this seems fragile to me and has the
potential to break in unexpected ways in highly dynamic environments.

The XFS filehandle exposed by the ioctls, and the NFS filehandle for
that matter, both include an identifier for the filesystem they belong
to in the file handle. This identifier matches the stable filesystem
identifier held by the filesystem (or the NFS export table), and hence
the kernel could resolve whether the filehandle itself has been
directed at the correct filesystem.

The XFS ioctls do not do this fshandle checking - this is something
performed by the libhandle library (part of xfsprogs). libhandle knows
the format of the XFS filehandles, so it peeks inside them to extract
the fsid to determine where to direct them.

Applications must first initialise the filesystems that file handles
can be used on by calling path_to_fshandle() to populate an open file
cache. Behind the scenes, this calls XFS_IOC_PATH_TO_FSHANDLE to
associate a {fd, path, fshandle} tuple for that filesystem. The
libhandle operations then match the fsid embedded in the file handle
to the known open fshandles, and if they match it uses the associated
fd to issue the ioctl to the correct filesystem. This fshandle fd is
held open for as long as the application is running, so it pins the
filesystem and so the fshandle obtained at init time is guaranteed to
be valid until the application exits.

Hence on startup an app simply needs to walk the paths it is
interested in and call path_to_fshandle() on all of them, but
otherwise it does not need to know what filesystem a filehandle
belongs to - the libhandle implementation takes care of that
entirely....

IOWs, this whole "find the right filesystem for the file handle"
implementation is largely abstracted away from userspace by libhandle.
Hence just looking at what the XFS ioctls do does not give the whole
picture of how stable filehandles were actually used by
applications...

I suspect that the right thing to do here is extend the VFS
filehandles to contain an 8 byte fsid prefix (supplied by a new export
op) and an AT_FSID flag for name_to_handle_at() to return just the 8
byte fsid that is used by handles on that filesystem. This provides
the filesystems with a well defined API for providing a stable
identifier instead of redefining what filesystems need to return in
some other UAPI.

This also means that userspace can be entirely filesystem agnostic and
it doesn't need to rely on parsing proc files to translate ephemeral
mount IDs to paths, statvfs() and hoping that f_fsid is stable enough
that it doesn't get the destination wrong. It also means that the
fanotify UAPI probably no longer needs to supply an f_fsid with the
filehandle because it is built into the filehandle....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
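PS: for anyone who hasn't seen it, this is roughly what driving
libhandle looks like from an application. Sketch only - based on the
xfsprogs libhandle API (link with -lhandle), error handling trimmed:

/* Sketch of typical libhandle usage; link with -lhandle (xfsprogs). */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <xfs/xfs.h>
#include <xfs/handle.h>

int main(int argc, char **argv)
{
	void	*fshanp, *hanp;
	size_t	fshlen, hlen;
	int	fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <path-on-xfs>\n", argv[0]);
		return 1;
	}

	/*
	 * Init time: XFS_IOC_PATH_TO_FSHANDLE under the covers. This
	 * caches an open fd for the filesystem inside libhandle.
	 */
	if (path_to_fshandle(argv[1], &fshanp, &fshlen) < 0) {
		perror("path_to_fshandle");
		return 1;
	}

	/* Generate a file handle that could be stashed in a database... */
	if (path_to_handle(argv[1], &hanp, &hlen) < 0) {
		perror("path_to_handle");
		return 1;
	}

	/*
	 * ...and resolve it later. libhandle peeks at the fsid embedded
	 * in the handle to pick the matching cached filesystem fd.
	 */
	fd = open_by_handle(hanp, hlen, O_RDONLY);
	if (fd < 0) {
		perror("open_by_handle");
		return 1;
	}
	close(fd);

	free_handle(hanp, hlen);
	free_handle(fshanp, fshlen);
	return 0;
}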