Re: Switching to iterate_shared

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Wed, 17 Aug 2022 10:05:53 -0700

On Wed, Aug 17, 2022 at 12:25 AM Amir Goldstein <amir73il@xxxxxxxxx> wrote:
>
> All the "complexity" is really very tidy and neat inside iterate_dir()
> I honestly don't see what the big fuss is about.

The worst part of it is conceptual, and the conditional locking.

No, it's not a huge maintenance burden, but it's quite the ugly wart
in our vfs layer that generally has pretty clean abstractions.

The non-shared version can also be a real performance problem on some
loads, but that's mainly a "real local filesystem" issue. Samba used
to (and presumably still does) do a *lot* of readdir calls for all the
name mangling.

We used to have similar horrors wrt ioctl's with the big kernel lock -
you can still see signs of that in the naming (ie the 'ioctl' function
pointer in the file operations structure is called 'unlocked_ioctl' -
even though it's been over a decade since we got rid of the 'locked'
one).

So that thing is just nasty.

> If filesystems do need to synchronize creates/deletes with readdir
> they would need to add a new internal lock and take it for every
> dir operation - it seems like a big hustle for little good reason.

Most of the time, they have that lock *anyway* - or, as I looked at
two cases, don't need any locking at all.

The pathname lookup parts of a filesystem need to be able to handle
concurrency within a directory anyway, because of how '->lookup()'
works.  Yes, our dentry layer helps a lot and serializes individual
name lookups, but basically no filesystem can rely on _just_ i_rwsem
anyway.

Of course, those locks are often about things like "read page X from
directory Y", so the locking can be about things like block allocation
etc. The bigger conceptual races - filename creation, lookup of the
same entry concurrently - are already handled by the VFS layer.

So those locks generally already exist. They just need to be checked
cover readdir too.

Normally readdir iteration is very similar to lookup().

The *one* exception tends to be directory entry caches, although
sometimes those too are shared with lookup.

> Perhaps it would have been more elegant to replace the
> iterate_shared/iterate duality with an fs_type flag.
> And it would be very simple to make that change.

No.

You're missing the point. It's not the *flag* and "we have two
different iterate functions" that is the ugly part.

Making it a filesystem flag would not have made any difference - it
would still be the same horrendous wart at the vfs level, and it would
be in some sense worse because now it's split into two different
independent things that are tied together.

> Hmm, I wonder. Can gentents() proceed without f_pos_lock on
> single f_count and then a cloned fd created and another gentents()
> races with the first one?

That would break much more important system calls than getdents (ie
basic read/write).

So if fdget_pos() were to be broken, we have much bigger issues.

In case  you care, the thing that protects us against another thread
using the same file descriptor is

  fdget_pos ->
    __fdget_pos ->
      __fdget ->
        __fdget_light ->
          check current->files->count
          do full __fdget() if it's > 1

and then __fdget_pos() does that

                if (file_count(file) > 1) {
                        v |= FDPUT_POS_UNLOCK;
                        mutex_lock(&file->f_pos_lock);
                }

check afterwards.

IOW, this code is carefully written to avoid both the file pointer
reference increment *and* the f_pos_lock in the common case of a
single-threaded process that only has the file open once.

But if 'current->files->count' is elevated - iow some other thread has
access to the file array and might be open/close/dup'ing files, we
take the file reference.

And then if the file is open through multiple file descriptors (which
might be in another file table entirely), _or_ if we just took the
file refernce due to that threading issue - then in either of those
cases we will also now take the file pos lock.

End result: either this system call is the only one that can access
that file, or we have taken the file pos lock.

Either way, getdents() ends up being serialized wrt that particular open file.

Now, you can actually screw this up if you really try: the file pos
lock obviously also is conditional on FMODE_ATOMIC_POS.

So if you have a filesystem that does "stream_open()" to clear
FMODE_ATOMIC_POS, that can avoid the file pos lock entirely.

So don't do that. I'm looking at fuse and FOPEN_STREAM, and that looks
a bit iffy.

(Of course, for a virtual filesystem, if you want to have a getdents()
that doesn't allow lseek, and that doesn't use f_pos for getdents but
just some internal other logic, clearing FMODE_ATOMIC_POS might
actually be entirely intentional. It's not like the file pos lock is
really _required_).

> I tried to look at ovl_iterate(). The subtlety is regarding the internal
> readdir cache.

Ok, from a quick look, you do have

        cache = ovl_dir_cache(d_inode(dentry));

so yeah, the cache is per-inode and thus different 'struct file'
entries for the same directory will need locking.

> Compared to fuse_readdir(), which is already "iterate_shared"
> and also implements internal readdir cache, fuse adds an internal
> mutex (quote) "against (crazy) parallel readdir on same open file".

Maybe the FUSE lock is *because* fuse allows that FOPEN_STREAM?

> Considering this protection is already provided by f_pos_lock
> 99% of the time,

It's not 99% of the time.

f_pos_lock is reliable. You have to explicitly turn it off. FUSE is
either confused and does pointless locking, or has done that
stream_open() thing to explicitly turn off the vfs locking.

Or, probably more realistically, the fuse code goes back to before we
did f_pos_lock.

                Linus