On Thu, 10 Aug 2023 at 15:39, Darrick J. Wong <djwong@xxxxxxxxxx> wrote: > > FWIW I recently fixed all my stupid debian package dependencies so that > I could actually install liburing again, and rebuilt fstests. The very > next morning I noticed a number of new test failures in /exactly/ the > way that Kent said to expect: > > fsstress -d /mnt & <sleep then simulate fs crash>; \ > umount /mnt; mount /dev/sda /mnt > > Here, umount exits before the filesystem is really torn down, and then > mount fails because it can't get an exclusive lock on the device. I agree that that obviously sounds like mount is just returning either too early. Or too eagerly. But I suspect any delayed fput() issues (whether from aio or io_uring) are then just a way to trigger the problem, not the fundamental cause. Because even if the fput() is delayed, the mntput() part of that delayed __fput action is the one that *should* have kept the filesystem mounted until it is no longer busy. And more importantly, having some of the common paths synchronize *their* fput() calls only affects those paths. It doesn't affect the fundamental issue that the last fput() can happen in odd contexts when the file descriptor was used for something a bit stranger. So I do feel like the fput patch I saw looked more like a "hide the problem" than a real fix. Put another way: I would not be surprised in the *least* if then adding more synchronization to fput would basically hide any issue, particularly from tests that then use those things that you added synchronization for. But it really smells like it's purely hiding the symptom to me. If I were a betting man, I'd look at ->mnt_count. I'm not saying that's the problem, but the mnt refcount handling is more than a bit scary. It is so hugely performance-critical (ie every single path access) that we use those percpu counters for it, and I'm not at all sure it's all race-free. Just as an example, mnt_add_count() has this comment above it: * vfsmount lock must be held for read but afaik the main way it gets called is through mntget(), and I see no vfsmount lock held anywhere there (think "path_get() and friends). Maybe I'm missing something obvious. So I think that comment is some historical left-over that hasn't been true for ages. And all of the counter updates should be consistent even in the absence of said lock, so it's not an issue. Except when it is: it does look like it *would* screw up mnt_get_count() that tries to add up all those percput counters with for_each_possible_cpu(cpu) { count += per_cpu_ptr(mnt->mnt_pcp, cpu)->mnt_count; } and that one has that * vfsmount lock must be held for write comment, which makes sense as a "that would indeed synchronize if others held it for read". But... And where is that sum used? Very much in things like may_umount_tree(). Anyway, I'm absolutely not saying this is the actual problem - we probably at least partly just have stale or incomplete comments, and maybe I think the fput() side is good mainly because I'm *much* more familiar with that side than I am with the actual mount code these days. So I might be barking up entirely the wrong tree. But I do feel like the fput patch I saw looked more like a "hide the problem" than a real fix. Because the mount counting *should* be entirely independent of when exactly a fput happens. So I believe having the test-case then do some common fput's synchronously pretty much by definition can't fix any issues, but it *can* make sure that any normal test using just regular system calls then never triggers the "oh, in other situations the last fput will be delayed". So that's why I'm very leery of the fput patch I saw. I don't think it makes sense. That does *not* mean that I don't believe that umount can have serious problems. I suspect we get very little coverage of that in normal situations. And yes, obviously io_uring does add *way* more asynchronicity, and I'd not be surprised at all if it uncovers problems. In most other situations, the main source of file counts are purely open/close system calls, which are in many ways "simple" (and where things like process exit obviously does the closing part). Linus