Re: [GIT PULL] bcachefs

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Thu, 10 Aug 2023 16:47:22 -0700

On Thu, 10 Aug 2023 at 15:39, Darrick J. Wong <djwong@xxxxxxxxxx> wrote:
>
> FWIW I recently fixed all my stupid debian package dependencies so that
> I could actually install liburing again, and rebuilt fstests.  The very
> next morning I noticed a number of new test failures in /exactly/ the
> way that Kent said to expect:
>
> fsstress -d /mnt & <sleep then simulate fs crash>; \
>         umount /mnt; mount /dev/sda /mnt
>
> Here, umount exits before the filesystem is really torn down, and then
> mount fails because it can't get an exclusive lock on the device.

I agree that that obviously sounds like mount is just returning either
too early. Or too eagerly.

But I suspect any delayed fput() issues (whether from aio or io_uring)
are then just a way to trigger the problem, not the fundamental cause.

Because even if the fput() is delayed, the mntput() part of that
delayed __fput action is the one that *should* have kept the
filesystem mounted until it is no longer busy.

And more importantly, having some of the common paths synchronize
*their* fput() calls only affects those paths.

It doesn't affect the fundamental issue that the last fput() can
happen in odd contexts when the file descriptor was used for something
a bit stranger.

So I do feel like the fput patch I saw looked more like a "hide the
problem" than a real fix.

Put another way: I would not be surprised in the *least* if then
adding more synchronization to fput would basically hide any issue,
particularly from tests that then use those things that you added
synchronization for.

But it really smells like it's purely hiding the symptom to me.

If I were a betting man, I'd look at ->mnt_count.  I'm not saying
that's the problem, but the mnt refcount handling is more than a bit
scary.

It is so hugely performance-critical (ie every single path access)
that we use those percpu counters for it, and I'm not at all sure it's
all race-free.

Just as an example, mnt_add_count() has this comment above it:

 * vfsmount lock must be held for read

but afaik the main way it gets called is through mntget(), and I see
no vfsmount lock held anywhere there (think "path_get() and friends).
Maybe I'm missing something obvious.

So I think that comment is some historical left-over that hasn't been
true for ages.

And all of the counter updates should be consistent even in the
absence of said lock, so it's not an issue.

Except when it is: it does look like it *would* screw up
mnt_get_count() that tries to add up all those percput counters with

        for_each_possible_cpu(cpu) {
                count += per_cpu_ptr(mnt->mnt_pcp, cpu)->mnt_count;
        }

and that one has that

 * vfsmount lock must be held for write

comment, which makes sense as a "that would indeed synchronize if
others held it for read". But...

And where is that sum used? Very much in things like may_umount_tree().

Anyway, I'm absolutely not saying this is the actual problem - we
probably at least partly just have stale or incomplete comments, and
maybe I think the fput() side is good mainly because I'm *much* more
familiar with that side than I am with the actual mount code these
days.

So I might be barking up entirely the wrong tree.

But I do feel like the fput patch I saw looked more like a "hide the
problem" than a real fix. Because the mount counting *should* be
entirely independent of when exactly a fput happens.

So I believe having the test-case then do some common fput's
synchronously pretty much by definition can't fix any issues, but it
*can* make sure that any normal test using just regular system calls
then never triggers the "oh, in other situations the last fput will be
delayed".

So that's why I'm very leery of the fput patch I saw. I don't think it
makes sense.

That does *not* mean that I don't believe that umount can have serious problems.

I suspect we get very little coverage of that in normal situations.

And yes, obviously io_uring does add *way* more asynchronicity, and
I'd not be surprised at all if it uncovers problems.

In most other situations, the main source of file counts are purely
open/close system calls, which are in many ways "simple" (and where
things like process exit obviously does the closing part).

                Linus