On Wed, 12.08.20 12:50, Linus Torvalds (torvalds@xxxxxxxxxxxxxxxxxxxx) wrote:

> On Wed, Aug 12, 2020 at 12:34 PM Steven Whitehouse <swhiteho@xxxxxxxxxx> wrote:
> >
> > The point of this is to give us the ability to monitor mounts from
> > userspace.
>
> We haven't had that before, I don't see why it's suddenly such a big deal.
>
> The notification side I understand. Polling /proc files is not the answer.
>
> But the whole "let's design this crazy subsystem for it" seems way
> overkill. I don't see anybody caring that deeply.
>
> It really smells like "do it because we can, not because we must".

With my systemd maintainer hat on (and that of other userspace stuff),
there are a couple of things I really want from the kernel because
they would fix real problems for us:

1. We want mount notifications that don't require us to rescan
   /proc/self/mountinfo in full every time things change, over and
   over again, simply because that doesn't scale. We have various
   bugs open about this performance bottleneck that I could point you
   to, but I figure it's easy to see why this currently doesn't
   scale...

2. We want an unprivileged API to query (and maybe set) the fs UUID,
   like we have nowadays for the fs label with FS_IOC_[GS]ETFSLABEL.

3. We want an API to query the time granularity of file system
   timestamps. Otherwise it's so hard in userspace to reproducibly
   re-generate directory trees. We need to know, for example, that
   some fs only has 2s granularity (like fat).

4. Similarly, we want to know if an fs is case-sensitive for file
   names. Or case-preserving. And which charset it accepts for
   filenames.

5. We want to know if a file system supports access modes, xattrs,
   file ownership, device nodes, symlinks, hardlinks, fifos, atimes,
   btimes, ACLs and so on. All these things currently can only be
   figured out by changing things and reading back if it worked.
   Which sucks hard of course.

6. We'd like to know the max file size on a file system.

7. Right now it's hard to figure out the mount options used for the
   fs backing some file: you can now statx() the file, determine the
   mnt_id from that, and then search for it in /proc/self/mountinfo,
   but it's slow, because again we need to scan the whole file until
   we find the entry we need. And that can be huge IRL.

8. Similarly: we quite often want to know the submounts of a mount.
   It would be great if for that kind of information (i.e. the list
   of mnt_ids below some other mnt_id) we wouldn't have to scan the
   whole of /p/s/mi again. In many cases in our code we operate
   recursively, and want to know the mounts below some specific dir,
   but currently pay a performance price for it if the number of file
   systems on the host is huge. This doesn't sound like a biggie, but
   actually is a biggie. In systemd we spend a lot of time scanning
   /p/s/mi...

9. How are file locks implemented on this fs? Are they local only,
   and orthogonal to remote locks? Are POSIX and BSD locks possibly
   merged at the backend? Do they work at all?

I don't really care too much what an API for this looks like, but let
me just say that I am not a fan of APIs that require allocating an fd
for querying info about an fd. This 'feels' a bit too recursive: if
you expose information about some fd in some magic procfs subdir, or
even in some virtual pseudo-file below the file's path, then this
means we have to allocate a new fd to figure out things about the
first fd, and if we wanted the same info about that one, we'd
theoretically recurse down. Now of course, most likely IRL we wouldn't
actually recurse down, but it is still smelly. In particular if fd
limits are tight. I mean, I really don't care if you expose
non-file-system stuff via the fs, if that's what you want, but I think
exposing *fs* metainfo in the *fs* is just ugly.

I generally detest APIs that have no chance of ever returning multiple
bits of information atomically. Splitting up querying of multiple
attributes into multiple system calls means they couldn't possibly be
determined in a congruent way. I much prefer APIs where we provide a
struct to fill in and do a single syscall, and at least for some
fields we'd know afterwards that the fields were filled in together
and are congruent with each other.

I am a fan of the statx() system call I must say. If we had something
like this for the file system itself I'd be quite happy, it could tick
off many of the requests I list above.

Hope this is useful,

Lennart

--
Lennart Poettering, Berlin
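
For illustration only, here is a minimal sketch of the full scan that points 7 and 8 describe. It is not systemd code: the names `parse_mountinfo`, `submounts` and `SAMPLE` are hypothetical, and the sample table is made up. It assumes the /proc/self/mountinfo layout documented in proc(5), where the first field is the mount ID, the second the parent ID and the fifth the mount point. Note that answering "what is below mount 24?" still means parsing every line first.

```python
from collections import defaultdict

# Made-up sample in /proc/self/mountinfo format (fields per proc(5)).
SAMPLE = """\
22 1 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw
23 22 0:21 / /proc rw,nosuid shared:5 - proc proc rw
24 22 0:22 / /sys rw,nosuid shared:6 - sysfs sysfs rw
25 24 0:7 / /sys/kernel/security rw shared:7 - securityfs securityfs rw
26 22 8:2 / /home rw,relatime shared:8 - ext4 /dev/sda2 rw
"""


def parse_mountinfo(text):
    """Return {mnt_id: (parent_id, mount_point)} for every line in text."""
    mounts = {}
    for line in text.splitlines():
        # Spaces inside paths are escaped as \040, so a plain split is safe.
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        mounts[int(fields[0])] = (int(fields[1]), fields[4])
    return mounts


def submounts(mounts, mnt_id):
    """Return the sorted IDs of all mounts below mnt_id."""
    children = defaultdict(list)
    for mid, (parent, _mount_point) in mounts.items():
        if mid != parent:  # the top of a mount tree may list itself as parent
            children[parent].append(mid)
    found, stack = [], [mnt_id]
    while stack:
        for child in children[stack.pop()]:
            found.append(child)
            stack.append(child)
    return sorted(found)
```

With the sample above, `submounts(parse_mountinfo(SAMPLE), 24)` yields `[25]`. On a real system the text would come from reading /proc/self/mountinfo, and the whole table has to be rebuilt from scratch whenever anything changes, which is the scaling problem described in points 1, 7 and 8.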