Re: [LSF/MM TOPIC] Making pseudo file systems inodes/dentries more like normal file systems

On Mon, 29 Jan 2024 16:08:33 +0100
Christian Brauner <brauner@xxxxxxxxxx> wrote:

> > But no. You should *not* look at a virtual filesystem as a guide how
> > to write a filesystem, or how to use the VFS. Look at a real FS. A
> > simple one, and preferably one that is built from the ground up to
> > look like a POSIX one, so that you don't end up getting confused by
> > all the nasty hacks to make it all look ok.
> > 
> > IOW, while FAT is a simple filesystem, don't look at that one, just
> > because then you end up with all the complications that come from
> > decades of non-UNIX filesystem history.
> > 
> > I'd say "look at minix or sysv filesystems", except those may be
> > simple but they also end up being so legacy that they aren't good
> > examples. You shouldn't use buffer-heads for anything new. But they
> > are still probably good examples for one thing: if you want to
> > understand the real power of dentries, look at either of the minix or
> > sysv 'namei.c' files. Just *look* at how simple they are. Ignore the
> > internal implementation of how a directory entry is then looked up on
> > disk - because that's obviously filesystem-specific - and instead just
> > look at the interface.  
> 
> I agree and I have to say I'm getting annoyed with this thread.
> 
> And I want to fundamentally oppose the notion that it's too difficult to
> write a virtual filesystem. Just one look at how many virtual

I guess you mean pseudo file systems? Somewhere along the discussion we
switched from saying pseudo to virtual. I may have been the culprit; I
don't remember, and I'm not re-reading the thread to find out.

> filesystems we already have and how many are proposed. Recent example is
> that KVM wanted to implement restricted memory as a stacking layer on
> top of tmpfs which I luckily caught early and told them not to do.
> 
> If anything, a surprising number of people who have nothing to do with
> filesystems manage to write filesystem drivers quickly and propose them
> upstream. And I hope people take a couple of months to write a decently
> sized/complex (virtual) filesystem.

I spent a lot of time on this. Let me give you a bit of history of where
tracefs/eventfs came from.

When we first started the tracing infrastructure, I wanted it to be easy to
debug embedded devices. I used to have my own tracer called "logdev", which
was a character device in /dev called /dev/logdev. I could write into it
for simple control actions.

But we needed a more complex system when we started integrating the
PREEMPT_RT latency tracer which eventually became the ftrace infrastructure.

As I still wanted busybox to be enough to interact with it, I chose to use
files and not system calls. I was recommended to use debugfs, and I did. It
became /sys/kernel/debug/tracing.

After a while, when tracing started to become useful in production systems,
people wanted access to tracing without having to have debugfs mounted.
That's because debugfs is a dumping ground for a lot of interactions with
the kernel, and people were legitimately worried about security
vulnerabilities it could expose.

I then asked about how to make /sys/kernel/debug/tracing its own file
system and was recommended to just start with debugfs (it's the easiest of
all the file systems to understand conceptually), and since tracing already
used the debugfs API (with dentries as the handle), it made sense.

That created tracefs. Now you could mount tracefs at /sys/kernel/tracing
and even have debugfs configured out.

When the eBPF folks were using trace_printk directly into the main trace
buffer, I asked them to please use an instance instead. They told me that
an instance adds too much memory overhead. Over 20MB! When I investigated,
I found that they were right. And most of that overhead was all the dentries
and inodes that were created for every directory and file that was used for
events. As there are tens of thousands of files and directories, that adds
up. And if you create a new instance, you create another tens of thousands
of files and directories that are basically all the same.
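
For scale, a rough back-of-the-envelope check (the per-object cost is my
assumption: roughly 1KB combined for a dentry, an inode, and the
filesystem-private data behind each file or directory; the counts come
from the numbers above):

	20,000 objects x ~1KB/object ~= 20MB per instance

which lines up with the overhead the eBPF folks measured.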

This led to the effort to create eventfs, which would remove the overhead of
these inodes and dentries by using just a lightweight descriptor for every
directory. As there are only around 2,000 directories, it's the files that
take up most of the memory.

What got us here is the evolution of changes that were made. Now you can
argue that when tracefs was first moved out of debugfs I should have based
it on kernfs. I actually did look at that, but because it behaved so
differently from debugfs (which was the only thing in VFS that I was
familiar with), I chose debugfs instead.

The biggest savings in eventfs comes from the fact that it has no metadata
for files. All the directories in eventfs have a fixed number of files when
they are created. Creating a directory passes in an array that has a list
of names and callbacks to call when the file needs to be accessed. Note,
this array is static for all events. That is, there's one array for all
event files, and one array for all event systems; they are not allocated
per directory.
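
To make that concrete, here is a minimal sketch of the shape of that
design. The names (pseudo_fs_entry, event_file_callback, and so on) are
hypothetical, not the actual tracefs/eventfs API; it only illustrates the
one-shared-static-table, small-descriptor-per-directory idea described
above.

#include <linux/fs.h>		/* struct file_operations */

/*
 * Hypothetical sketch: a file's fops are resolved by a callback only
 * when the file is actually looked up or opened, so no per-file
 * metadata is stored anywhere.
 */
typedef int (*pseudo_fs_callback)(const char *name, void **data,
				  const struct file_operations **fops);

struct pseudo_fs_entry {
	const char		*name;		/* file name within the directory */
	pseudo_fs_callback	callback;	/* fills in *fops/*data on demand */
};

/* Hypothetical callback: would pick the right file_operations for 'name'. */
static int event_file_callback(const char *name, void **data,
			       const struct file_operations **fops)
{
	/* look up 'name' and set *fops / *data here */
	return 1;	/* nonzero: this file should exist */
}

/*
 * One static array describes the files of *every* event directory.
 * Nothing in it is allocated per directory.
 */
static const struct pseudo_fs_entry event_entries[] = {
	{ "enable", event_file_callback },
	{ "filter", event_file_callback },
	{ "format", event_file_callback },
};

/*
 * The only per-directory allocation: a small descriptor pointing at the
 * shared table, instead of thousands of up-front dentries and inodes.
 */
struct pseudo_fs_dir {
	const char			*name;
	const struct pseudo_fs_entry	*entries;
	int				nr_entries;
	void				*data;	/* e.g. the event this dir represents */
};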


> 
> And specifically for virtual filesystems they often aren't alike at
> all. And that's got nothing to do with the VFS abstractions. It's
> simply because a virtual filesystem is often used for purposes when
> developers think that they want a filesystem like userspace interface
> but don't want all of the actual filesystem semantics that come with it.
> So they all differ from each other in what functionality they actually
> implement.

I agree with the above.

> 
> And I somewhat oppose the notion that the VFS isn't documented. We do
> have extensive documentation for locking rules, a constantly updated
> changelog with fundamental changes to all VFS APIs and expectations
> around it. Including very intricate details for the reader that really
> needs to know everything. I wrote a whole document just on permission
> checking and idmappings when we added that to the VFS. Both
> implementation and theoretical background. 

I spent a lot of time reading the VFS documentation. The problem I had was
that it's very much focused on its main purpose, which is real file
systems. It was hard to know what would apply to a pseudo file system and
what would not.

So I don't want to say that VFS isn't well documented. I would say that VFS
is a very big beast, and the documentation is focused on what the majority
want to do with it.

It's us outliers (pseudo file systems) that are screwing this up. And when
you come from an approach of "I just want a file-system-like interface",
you really just want to know the bare minimum of VFS to get that done.

I've been approached countless times by the embedded community (including
those who worked on the Mars helicopter) thanking me for having such a
nice file-system-like interface into tracing.

-- Steve
