On Sat, Dec 12, 2020 at 8:51 AM Jens Axboe <axboe@xxxxxxxxx> wrote: > > We handle it for the path resolution itself, but we should also factor > it in for open_last_lookups() and tmpfile open. So I think this one is fundamentally wrong, for two reasons. One is that "nonblock" shouldn't necessarily mean "take no locks at all". That directory inode lock is very very different from "go down to the filesystem to do IO". No other NONBLOCK thing has ever been "no locks at all", they have all been about possibly long-term blocking. The other this is that the mnt_want_write() check really smells no different at all from any of the "some ->open functions might block". Which you don't handle, and which again is entirely different from the pathname resolution itself blocking. So I don't think either of these cases are about LOOKUP_NONBLOCK, the same way they aren't about LOOKUP_RCU. The mnt_want_write() in particular is much more about the kinds of things we already check for in do_open(). In fact, looking at this patch, I think mnt_want_write() itself is actually conceptually buggy, although it doesn't really matter: think about device nodes etc. Opening a device node for writing doesn't write to the filesystem that the device node is on. Why does that code care about O_WRONLY | O_RDWR? That has *nothing* to do with the open() wanting to write to the filesystem. We don't even hold that lock after the open - we'll always drop it even for a successful open. Only O_CREAT | O_TRUNC should matter, since those are the ones that cause writes as part of the *open*. Again - I don't think that this is really a problem. I mention it more as a "this shows how the code is _conceptually_ wrong", and how it's not really about the pathname resolution any more. In fact, I think that how we pass on that "got_write" to lookup_open() is just another sign of how we do that mnt_want_write() in the wrong place and at trhe wrong level. We shouldn't have been doing that mnt_want_write() for writable oipens (that's a different thing), and looking at lookup_open(), I think we should have waited with doing it at all until we've checked if we even need it (ie O_CREAT on a file that already exists does *not* need the mnt_want_write() at all, and we'll see that when we get to that /* Negative dentry, just create the file */ if (!dentry->d_inode && (open_flag & O_CREAT)) { thing. So I think this patch actually shows that we do things wrong in this area, and I think the kinds of things it does are questionable as a result. In particular, I'm not convinced that the directory semaphore thing should be tied to LOOKUP_NONBLOCK, and I don't think the mnt_want_write() logic is right at all. The first one would be fixed by a separate flag. The second one I think is more about "we are doing mnt_want_write() at the wrong point, and if we move mnt_want_write() to the right place we'd just fail it explicitly for the LOOKUP_NONBLOCK case" - the same way a truncate should just be an explicit fail, not a "trylock" failure. I also think we need to deal with the O_NONBLOCK kind of situation at the actual ->open() time (ie the whole "oh, NFS wants to revalidate caches due to open/close consistency, named pipes want to wait for pairing, device nodes want to wait for device". Again, I think that's separate from LOOKUP_NONBLOCK, though. Linus