Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Sat, 27 Jan 2024 11:44:45 -0800

On Sat, 27 Jan 2024 at 10:06, Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> I'd suggest that eventfs and shiftfs are not "simple filesystems".
> They're synthetic filesystems that want to do very different things
> from block filesystems and network filesystems.  We have a lot of
> infrastructure in place to help authors of, say, bcachefs, but not a lot
> of infrastructure for synthetic filesystems (procfs, overlayfs, sysfs,
> debugfs, etc).

Indeed.

I think it's worth pointing out three very _fundamental_ design issues
here, which all mean that a "regular filesystem" is in many ways much
simpler than a virtual one:

 (a) this is what the VFS has literally primarily been designed for.

When you look at a lot of VFS issues, they almost all come from just
very basic "this is what a filesystem needs" issues, and particularly
performance issues. And when you talk "performance", the #1 thing is
caching. In fact, I'd argue that #2 is caching too. Caching is just
*so* important, and it really shows in the VFS. Think about just about
any part of the VFS, and it's all about caching filesystem data. It's
why the dentry cache exists, it's why the page cache / folios exist, it's
what 99% of all the VFS code is about.

And that performance / caching issue isn't just why most of the VFS
code exists, it's literally also the reason for most of the design
decisions. The dentry cache is a hugely complicated beast, and a *lot*
of the complications are directly related to one thing, and one thing
only: performance. It's why locking is so incredibly baroque.

Yes, there are other complications. The whole notion of "bind mounts"
is a huge complication that arguably isn't performance-related, and
it's why we have that combination of "vfsmount" and "dentry" that we
together call a "path". And that tends to confuse low-level filesystem
people, because the other thing the VFS layer does is to try to shield
the low-level filesystem from higher-level concepts like that, so that
the low-level filesystem literally doesn't have to know about "oh,
this same filesystem is mounted in five different places". The VFS
layer takes care of that, and the filesystem doesn't need to know.
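
As a rough illustration of that "vfsmount + dentry" pairing, here is a toy
model in Python. It is only a sketch: the class names (Dentry, Mount, Path)
and the render() helper are made up for this example and are not the
kernel's data structures or API. The point is just that the same dentry
shows up under different names depending on which mount you reach it
through, and the dentry itself never records where it is mounted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Dentry:
    """Toy directory entry: a name plus a link to its parent."""
    name: str
    parent: Optional["Dentry"] = None


@dataclass(frozen=True)
class Mount:
    """Toy mount: where a dentry tree is attached in the namespace."""
    mountpoint: str
    root: Dentry


@dataclass(frozen=True)
class Path:
    """Toy 'path': a (mount, dentry) pair, neither is enough alone."""
    mnt: Mount
    dentry: Dentry


def render(path: Path) -> str:
    """Walk from the dentry up to the mount's root and build a name."""
    parts = []
    d = path.dentry
    while d is not None and d is not path.mnt.root:
        parts.append(d.name)
        d = d.parent
    if d is None:
        raise ValueError("dentry is not below the root of this mount")
    prefix = path.mnt.mountpoint.rstrip("/")
    return "/".join([prefix] + parts[::-1]) or "/"


if __name__ == "__main__":
    fs_root = Dentry("")
    etc = Dentry("etc", fs_root)
    hosts = Dentry("hosts", etc)

    first = Mount("/mnt/a", fs_root)   # the original mount
    second = Mount("/mnt/b", fs_root)  # same tree, mounted again
    sub = Mount("/srv", etc)           # bind mount of a subdirectory

    for mnt in (first, second, sub):
        print(render(Path(mnt, hosts)))
```

Nothing in the toy Dentry refers to a Mount, which mirrors the shielding
described above: only the layer that holds the pair can tell the five
mounts apart.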

So part of it is that the VFS has been designed for regular
filesystems, but the *other* part of the puzzle is on the other side:

 (b) regular filesystems have been designed to be filesystems.

Ok, that may sound like a stupid truism, but when it comes to the
discussion of virtual filesystems and relative simplicity, it's quite
a big deal. The fact is, a regular filesystem has literally been
designed from the ground up to do regular filesystem things. And that
matters.

Yes, yes, many filesystems then made various bad design decisions, and
the world isn't perfect. But basically things like "read a directory"
and "read and write files" and "rename things" are all things that the
filesystem was *designed* for.

So the VFS layer was designed for real filesystems, and real
filesystems were designed to do filesystem operations, so they are not
just made to fit together, they are also all made to expose all the
normal read/write/open/stat/whatever system calls.

 (c) none of the above is generally true of virtual filesystems.

Sure, *some* virtual filesystems are designed to act like a filesystem
from the ground up. Something like "tmpfs" is obviously a virtual
filesystem, but it's "virtual" only in the sense that it doesn't have
much of a backing store. It's still designed primarily to *be* a
filesystem, and the only operations that happen on it are filesystem
operations.

So ignore 'tmpfs' here, and think about all the other virtual
filesystems we have.

And realize that they aren't really designed to be filesystems per se -
they are literally designed to be something entirely different, and
the filesystem interface is then only a secondary thing - it's a
window into a strange non-filesystem world where normal filesystem
operations don't even exist, even if sometimes there can be some kind
of convoluted transformation for them.

So you have "simple" things like just plain read-only files in /proc,
and despite being about as simple as they come, they fail miserably
at the most fundamental part of a file: you can't even 'stat()' them
and get sane file size data from them.
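
A quick way to see this from user space is to compare what stat() reports
with what read() actually returns. The sketch below assumes a Linux system
with procfs mounted at /proc; the choice of /proc/self/status and the
helper name are just for illustration. On a typical Linux kernel the
reported size is 0 even though the file has content.

```python
import os


def reported_vs_actual(path="/proc/self/status"):
    """Return (st_size from stat(), bytes actually read), or None if
    the path does not exist (for example, on a system without /proc)."""
    if not os.path.exists(path):
        return None
    reported = os.stat(path).st_size
    with open(path, "rb") as f:
        actual = len(f.read())
    return reported, actual


if __name__ == "__main__":
    result = reported_vs_actual()
    if result is None:
        print("no /proc/self/status here")
    else:
        reported, actual = result
        print(f"stat() says {reported} bytes, read() returned {actual}")
```

This is why tools that size a buffer from st_size before reading get
nothing useful out of such files and have to read until EOF instead.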

And "caching" - which was the #1 reason for most of the filesystem
code - ends up being much less so, although it turns out that it's
still hugely important because of the abstraction interface it allows.

So all those dentries, and all the complicated lookup code, end up
still being quite important to make the virtual filesystem look like a
filesystem at all: it's what gives you the 'getcwd()' system call,
it's what still gives you the whole bind mount thing, it really ends
up giving a lot of "structure" to the virtual filesystem that would be
an absolute nightmare without it.  But it's a structure that is really
designed for something else.

Because the non-filesystem virtual part that a virtual filesystem is
actually trying to expose _as_ a filesystem to user space usually has
lifetime rules (and other rules) that are *entirely* unrelated to any
filesystem activity. A user can "chdir()" into a directory that
describes a process, but the lifetime of that process is then entirely
unrelated to that, and it can go away as a process, while the
directory still has to virtually exist.

That's part of what the VFS code gives a virtual filesystem: the
dentries etc end up being those things that hang around even when the
virtual part that they described may have disappeared. And you *need*
that, just to get sane UNIX 'home directory' semantics.
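
The same "it hangs around after the thing is gone" behaviour can be seen
without /proc at all. The sketch below uses an ordinary temporary directory
as a stand-in for the object that disappears, which is an assumption made
for this example and not something specific to procfs. It relies on
Linux-like semantics where removing a process's current directory is
allowed. The function name is made up for illustration.

```python
import os
import stat
import tempfile


def cwd_outlives_directory():
    """chdir into a directory, remove it, and look at what is left.

    Returns (name_gone, still_a_directory, getcwd_failed)."""
    original = os.getcwd()
    victim = tempfile.mkdtemp()
    try:
        os.chdir(victim)
        os.rmdir(victim)  # the name is removed from the namespace

        name_gone = not os.path.exists(victim)
        # "." still refers to the old directory object, held by our cwd.
        still_a_directory = stat.S_ISDIR(os.stat(".").st_mode)
        try:
            os.getcwd()
            getcwd_failed = False
        except OSError:
            # On Linux this is typically ENOENT: there is no name left
            # to report, even though the cwd itself is still valid.
            getcwd_failed = True
        return name_gone, still_a_directory, getcwd_failed
    finally:
        os.chdir(original)  # always put the process back


if __name__ == "__main__":
    print(cwd_outlives_directory())
```

The process keeps a perfectly usable current directory whose name no
longer exists, which is the kind of lifetime decoupling the dentries give
you for free.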

I think people often don't think of how much that VFS infrastructure
protects them from.

But it's also why virtual filesystems are generally a complete mess:
you have these two pieces, and they are really doing two *COMPLETELY*
different things.

It's why I told Steven so forcefully that tracefs must not mess around
with VFS internals. A virtual filesystem either needs to be a "real
filesystem" aka tmpfs and just leave it *all* to the VFS layer, or it
needs to just treat the dentries as a separate cache that the virtual
filesystem is *not* in charge of, and trust the VFS layer to do the
filesystem parts.

But no. You should *not* look at a virtual filesystem as a guide to how
to write a filesystem, or how to use the VFS. Look at a real FS. A
simple one, and preferably one that is built from the ground up to
look like a POSIX one, so that you don't end up getting confused by
all the nasty hacks to make it all look ok.

IOW, while FAT is a simple filesystem, don't look at that one, just
because then you end up with all the complications that come from
decades of non-UNIX filesystem history.

I'd say "look at minix or sysv filesystems", except those may be
simple but they also end up being so legacy that they aren't good
examples. You shouldn't use buffer-heads for anything new. But they
are still probably good examples for one thing: if you want to
understand the real power of dentries, look at either of the minix or
sysv 'namei.c' files. Just *look* at how simple they are. Ignore the
internal implementation of how a directory entry is then looked up on
disk - because that's obviously filesystem-specific - and instead just
look at the interface.
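
To make that split concrete, here is a toy model in Python of the division
of labour. It is a sketch only: ToyFS, ToyVFS and their methods are
invented for this example and are not the kernel interface or the minix
code. The low-level side answers one question ("is this name in this
directory?"), and the generic side owns the cache, including negative
entries for names that do not exist.

```python
class ToyFS:
    """Low-level side: only knows how to find a name in a directory."""

    def __init__(self, tree):
        self.tree = tree      # {directory inode: {name: inode}}
        self.lookups = 0      # how often the generic layer had to ask

    def lookup(self, dir_ino, name):
        self.lookups += 1
        return self.tree.get(dir_ino, {}).get(name)


class ToyVFS:
    """Generic side: path walking and the name cache live here."""

    def __init__(self, fs, root_ino=1):
        self.fs = fs
        self.root_ino = root_ino
        self.dcache = {}      # (dir inode, name) -> inode, or None

    def walk(self, path):
        ino = self.root_ino
        for name in filter(None, path.split("/")):
            key = (ino, name)
            if key not in self.dcache:
                # Cache miss: the only point where the fs is involved.
                self.dcache[key] = self.fs.lookup(ino, name)
            ino = self.dcache[key]
            if ino is None:   # negative entry: name does not exist
                return None
        return ino


if __name__ == "__main__":
    fs = ToyFS({1: {"etc": 2}, 2: {"hosts": 3}})
    vfs = ToyVFS(fs)
    print(vfs.walk("/etc/hosts"), fs.lookups)
    print(vfs.walk("/etc/hosts"), fs.lookups)    # served from the cache
    print(vfs.walk("/etc/missing"), fs.lookups)
    print(vfs.walk("/etc/missing"), fs.lookups)  # negative entry, cached
```

The filesystem side never sees the cache, never invalidates it and never
walks a path, which is roughly why the real 'namei.c' files can stay so
small.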

           Linus