Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> All your documentation (both commit logs, man-pages and in-kernel
> actual docs you add) only talk about "what".
>
> They don't talk about _why_.
>
> I can imagine why's. But I think that the "why" is actually way more
> important than the what. At no point did I see a "this is the current
> interface, and it doesn't work for xyz, so here's the new interface
> that allows us to do stuff".

Firstly, there are a bunch of problems with the current mount(2) syscall:

 (1) It's actually six or seven different interfaces rolled into one, and
     weird combinations of flags make it do different things beyond the
     original specification of the syscall.

 (2) It produces a particularly large and diverse set of errors, which have
     to be mapped back to a small error code. Yes, there's dmesg - if you
     have it configured - but you can't necessarily see that if you're doing
     a mount inside of a container.

 (3) It copies a PAGE_SIZE block of data for each of the type, device name
     and options.

 (4) The size of the buffers is PAGE_SIZE - and this is arch dependent.

 (5) You can't mount into another mount namespace. If that were possible, I
     could, for example, build a container from outside without having to be
     in that container's namespace.

 (6) It's not really geared for the specification of multiple sources, but
     some filesystems really want that - overlayfs, for example.

and some problems in the internal kernel API:

 (1) There's no defined way to supply namespace configuration for the
     superblock - so, for instance, I can't say that I want to create a
     superblock in a particular network namespace (on automount, say).

     NFS hacks around this by creating multiple shadow file_system_types
     with different ->mount() ops.

 (2) When calling mount internally, unless you have NFS-like hacks, you have
     to generate or otherwise provide text config data which then gets
     parsed, when some of the time you could bypass the parsing stage
     entirely.

 (3) The amount of data in the data buffer is not known, but the data buffer
     might be on a kernel stack somewhere, leading to the possibility of
     tripping the stack underrun guard.

and other issues too:

 (1) Superblock remount in some filesystems applies options on an as-parsed
     basis, so if there's a parse failure, a partial alteration with no
     rollback is effected.

 (2) Under some circumstances, the mount data may get copied multiple times
     so that it can have multiple parsers applied to it or because it has to
     be parsed multiple times - for instance, once to get the preliminary
     info required to access the on-disk superblock and then again to update
     the superblock record in the kernel.

I want to be able to add support for a bunch of things:

 (1) UID, GID and Project ID mapping/translation. I want to be able to
     install a translation table of some sort on the superblock to translate
     source identifiers (which may be foreign numeric UIDs/GIDs, text names,
     GUIDs) into system identifiers. This needs to be done before the
     superblock is published[*].

     Note that this may, for example, involve using the context and the
     superblock held therein to issue an RPC to a server to look up
     translations.

     [*] By "published" I mean made available through mount so that other
         userspace processes can access it by path.

     Maybe specifying a translation range element with something like:

         write(fd, "t uid <srcuid> <nsuid> <count>");

     The translation information also needs to propagate over an automount
     in some circumstances.

 (2) Namespace configuration. I want to be able to tell the superblock
     creation process what namespaces should be applied when it is created
     (in particular the userns and netns) for containerisation purposes,
     e.g.:

         write(fd, "n user=<fd> net=<fd>");

 (3) Namespace propagation. I want to have a properly defined mechanism for
     propagating namespace configuration over automounts within the kernel.
     This will be particularly useful for network filesystems.

 (4) Pre-mount attribute query.
     A chunk of the changes is actually the fsinfo() syscall to query
     attributes of the filesystem beyond what's available in statx() and
     statfs(). This will allow a created superblock to be queried before it
     is published.

 (5) Upcall for configuration. I would like to be able to query
     configuration that's stored in userspace when an automount is made. For
     instance, to look up network parameters for NFS or to find a cache
     selector for fscache.

     The internal fs_context could be passed to the upcall process or the
     kernel could read a config file directly if named appropriately for the
     superblock, perhaps:

         [/etc/fscontext.d/afs/example.com/cell.cfg]
         realm = EXAMPLE.COM
         translation = uid,3000,4000,100
         fscache = tag=fred

 (6) Event notifications. I want to be able to install a watch on a
     superblock before it is published to catch things like quota events and
     EIO.

 (7) Large and binary parameters. There might at some point be a need to
     pass large/binary objects like Microsoft PACs around. If I understand
     PACs correctly, you can obtain these from the Kerberos server and then
     pass them to the file server when you connect.

     Being able to pass large or binary objects as individual writes makes
     parsing these trivial. OTOH, some or all of this can potentially be
     handled with the use of the keyrings interface - as the afs filesystem
     does for passing kerberos tokens around; it's just that that seems
     overkill for a parameter you may only need once.

> When you have a diffstat like this:
>
>    171 files changed, 7147 insertions(+), 1805 deletions(-)
>
> I sure want to see an explanation for *WHY* it adds 5000+ lines of core code.

Note that there's a chunk more core code to be removed too, once all the
filesystems have been converted, including some of the added code.

> Also, I want to hear about sane security models. One of the things
> people really want to do is have users do their own mounts. We've had
> security issues in that area.
> Why does this improve on it, or make it
> even worse?

At the moment, I think it's fairly neutral in that regard. Currently, you
have to have CAP_SYS_ADMIN to call fsopen() and again to call fsmount().

To supervise user-triggered mounting, I might need to add something to permit
upcalling for permission or configuration. The handler for this could be in
the parent of a container, say, or something dispatched from systemd in the
system root. It should be able to restrict the sources and options that a
non-privileged or container-based mount request is given.

An arbiter receiving the upcall could be passed the fs-context fd as an
argument and could then use fsinfo() to query the context, including the
option flags.

It also might be possible to handle this through LSM policy, particularly if
I formalise the specification of *all* sources in the context. For example, I
could require things like:

    write(fd, "s store /dev/sda1"); // Specify the storage device
    write(fd, "s jnl /dev/sda2"); // Specify a separate journal
    write(fd, "s nfs example.com"); // Specify an NFS server
    write(fd, "s afs example.com"); // Specify an AFS cell

Then the LSMs could be asked to rule on whether the "store" and "jnl" block
devices could be used for those purposes by the caller and "nfs" or "afs"
names could be looked up in the DNS.

David