On Sat, 27 Jan 2024 at 10:06, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> I'd suggest that eventfs and shiftfs are not "simple filesystems".
> They're synthetic filesystems that want to do very different things
> from block filesystems and network filesystems. We have a lot of
> infrastructure in place to help authors of, say, bcachefs, but not a lot
> of infrastructure for synthetic filesystems (procfs, overlayfs, sysfs,
> debugfs, etc).

Indeed.

I think it's worth pointing out three very _fundamental_ design issues
here, which all mean that a "regular filesystem" is in many ways much
simpler than a virtual one:

 (a) this is what the VFS has literally primarily been designed for.

When you look at a lot of VFS issues, they almost all come from just
very basic "this is what a filesystem needs" issues, and particularly
performance issues. And when you talk "performance", the #1 thing is
caching. In fact, I'd argue that #2 is caching too. Caching is just
*so* important, and it really shows in the VFS. Think about just about
any part of the VFS, and it's all about caching filesystem data. It's
why the dentry cache exists, it's why the page / folios exist, it's
what 99% of all the VFS code is about.

And that performance / caching issue isn't just why most of the VFS
code exists, it's literally also the reason for most of the design
decisions. The dentry cache is a hugely complicated beast, and a *lot*
of the complications are directly related to one thing, and one thing
only: performance. It's why locking is so incredibly baroque.

Yes, there are other complications. The whole notion of "bind mounts"
is a huge complication that arguably isn't performance-related, and
it's why we have that combination of "vfsmount" and "dentry" that we
together call a "path".
And that tends to confuse low-level filesystem people, because the
other thing the VFS layer does is to try to shield the low-level
filesystem from higher-level concepts like that, so that the low-level
filesystem literally doesn't have to know about "oh, this same
filesystem is mounted in five different places". The VFS layer takes
care of that, and the filesystem doesn't need to know.

So part of it is that the VFS has been designed for regular
filesystems, but the *other* part of the puzzle is on the other side:

 (b) regular filesystems have been designed to be filesystems.

Ok, that may sound like a stupid truism, but when it comes to the
discussion of virtual filesystems and relative simplicity, it's quite
a big deal. The fact is, a regular filesystem has literally been
designed from the ground up to do regular filesystem things. And that
matters.

Yes, yes, many filesystems then made various bad design decisions, and
the world isn't perfect. But basically things like "read a directory"
and "read and write files" and "rename things" are all things that the
filesystem was *designed* for.

So the VFS layer was designed for real filesystems, and real
filesystems were designed to do filesystem operations, so they are not
just made to fit together, they are also all made to expose all the
normal read/write/open/stat/whatever system calls.

 (c) none of the above is generally true of virtual filesystems

Sure, *some* virtual filesystems are designed to act like a filesystem
from the ground up. Something like "tmpfs" is obviously a virtual
filesystem, but it's "virtual" only in the sense that it doesn't have
much of a backing store. It's still designed primarily to *be* a
filesystem, and the only operations that happen on it are filesystem
operations.

So ignore 'tmpfs' here, and think about all the other virtual
filesystems we have.
And realize that they aren't really designed to be filesystems per
se - they are literally designed to be something entirely different,
and the filesystem interface is then only a secondary thing - it's a
window into a strange non-filesystem world where normal filesystem
operations don't even exist, even if sometimes there can be some kind
of convoluted transformation for them.

So you have "simple" things like just plain read-only files in /proc,
and despite being about as simple as they come, they fail miserably at
the most fundamental part of a file: you can't even 'stat()' them and
get sane file size data from them.

And "caching" - which was the #1 reason for most of the filesystem
code - ends up being much less so, although it turns out that it's
still hugely important because of the abstraction interface it allows.

So all those dentries, and all the complicated lookup code, end up
still being quite important to make the virtual filesystem look like a
filesystem at all: it's what gives you the 'getcwd()' system call,
it's what still gives you the whole bind mount thing, it really ends
up giving a lot of "structure" to the virtual filesystem that would be
an absolute nightmare without it.

But it's a structure that is really designed for something else.

Because the non-filesystem virtual part that a virtual filesystem is
actually trying to expose _as_ a filesystem to user space usually has
lifetime rules (and other rules) that are *entirely* unrelated to any
filesystem activity.

A user can "chdir()" into a directory that describes a process, but
the lifetime of that process is then entirely unrelated to that, and
it can go away as a process, while the directory still has to
virtually exist.

That's part of what the VFS code gives a virtual filesystem: the
dentries etc end up being those things that hang around even when the
virtual part that they described may have disappeared. And you *need*
that, just to get sane UNIX 'home directory' semantics.
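
The 'stat()' complaint about /proc files is easy to see from user
space. What follows is only a minimal sketch, assuming a Linux system
with procfs mounted at /proc; the helper names
(stat_size_vs_read_size, regular_file_example) are made up for
illustration. It compares the size that stat() reports with the number
of bytes a read actually returns, first for an ordinary file and then
for /proc/self/status, where the reported size is typically 0.

```python
import os
import tempfile


def stat_size_vs_read_size(path):
    """Return (size reported by stat(), number of bytes actually read)."""
    reported = os.stat(path).st_size
    with open(path, "rb") as f:
        actual = len(f.read())
    return reported, actual


def regular_file_example():
    """Write a 6-byte ordinary file and compare the two sizes for it."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "example.txt")
        with open(path, "wb") as f:
            f.write(b"hello\n")
        return stat_size_vs_read_size(path)


if __name__ == "__main__":
    # Ordinary file: stat() and read() agree on the size.
    print("regular file:", regular_file_example())

    # procfs file (assumption: Linux with procfs mounted at /proc).
    # stat() typically reports 0 here even though reading returns data.
    proc_file = "/proc/self/status"
    if os.path.exists(proc_file):
        print(proc_file + ":", stat_size_vs_read_size(proc_file))
```

For the ordinary file both numbers are the same. For the procfs file
the content is generated at read time, so the size from stat() tells
you nothing useful about how much data a read will produce.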

I think people often don't think of how much that VFS infrastructure
protects them from.

But it's also why virtual filesystems are generally a complete mess:
you have these two pieces, and they are really doing two *COMPLETELY*
different things.

It's why I told Steven so forcefully that tracefs must not mess around
with VFS internals. A virtual filesystem either needs to be a "real
filesystem" aka tmpfs and just leave it *all* to the VFS layer, or it
needs to just treat the dentries as a separate cache that the virtual
filesystem is *not* in charge of, and trust the VFS layer to do the
filesystem parts.

But no. You should *not* look at a virtual filesystem as a guide how
to write a filesystem, or how to use the VFS. Look at a real FS. A
simple one, and preferably one that is built from the ground up to
look like a POSIX one, so that you don't end up getting confused by
all the nasty hacks to make it all look ok.

IOW, while FAT is a simple filesystem, don't look at that one, just
because then you end up with all the complications that come from
decades of non-UNIX filesystem history.

I'd say "look at minix or sysv filesystems", except those may be
simple but they also end up being so legacy that they aren't good
examples. You shouldn't use buffer-heads for anything new. But they
are still probably good examples for one thing: if you want to
understand the real power of dentries, look at either of the minix or
sysv 'namei.c' files. Just *look* at how simple they are. Ignore the
internal implementation of how a directory entry is then looked up on
disk - because that's obviously filesystem-specific - and instead just
look at the interface.

            Linus