Re: [PATCH 4/5] fs: honor LOOKUP_NONBLOCK for the last part of file open

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Sat, 12 Dec 2020 10:57:34 -0800

On Sat, Dec 12, 2020 at 8:51 AM Jens Axboe <axboe@xxxxxxxxx> wrote:
>
> We handle it for the path resolution itself, but we should also factor
> it in for open_last_lookups() and tmpfile open.

So I think this one is fundamentally wrong, for two reasons.

One is that "nonblock" shouldn't necessarily mean "take no locks at
all". That directory inode lock is very very different from "go down
to the filesystem to do IO". No other NONBLOCK thing has ever been "no
locks at all", they have all been about possibly long-term blocking.

The other this is that the mnt_want_write() check really smells no
different at all from any of the "some ->open functions might block".
Which you don't handle, and which again is entirely different from the
pathname resolution itself blocking.

So I don't think either of these cases are about LOOKUP_NONBLOCK, the
same way they aren't about LOOKUP_RCU.

The  mnt_want_write() in particular is much more about the kinds of
things we already check for in do_open().

In fact, looking at this patch, I think mnt_want_write() itself is
actually conceptually buggy, although it doesn't really matter: think
about device nodes etc. Opening a device node for writing doesn't
write to the filesystem that the device node is on.

Why does that code care about O_WRONLY | O_RDWR? That has *nothing* to
do with the open() wanting to write to the filesystem. We don't even
hold that lock after the open - we'll always drop it even for a
successful open.

Only O_CREAT | O_TRUNC should matter, since those are the ones that
cause writes as part of the *open*.

Again - I don't think that this is really a problem. I mention it more
as a "this shows how the code is _conceptually_ wrong", and how it's
not really about the pathname resolution any more.

In fact, I think that how we pass on that "got_write" to lookup_open()
is just another sign of how we do that mnt_want_write() in the wrong
place and at trhe wrong level. We shouldn't have been doing that
mnt_want_write() for writable oipens (that's a different thing), and
looking at lookup_open(), I think we should have waited with doing it
at all until we've checked if we even need it (ie O_CREAT on a file
that already exists does *not* need the mnt_want_write() at all, and
we'll see that when we get to that

        /* Negative dentry, just create the file */
        if (!dentry->d_inode && (open_flag & O_CREAT)) {

thing.

So I think this patch actually shows that we do things wrong in this
area, and I think the kinds of things it does are questionable as a
result. In particular, I'm not convinced that the directory semaphore
thing should be tied to LOOKUP_NONBLOCK, and I don't think the
mnt_want_write() logic is right at all.

The first one would be fixed by a separate flag.

The second one I think is more about "we are doing mnt_want_write() at
the wrong point, and if we move mnt_want_write() to the right place
we'd just fail it explicitly for the LOOKUP_NONBLOCK case" - the same
way a truncate should just be an explicit fail, not a "trylock"
failure.

I also think we need to deal with the O_NONBLOCK kind of situation at
the actual ->open() time (ie the whole "oh, NFS wants to revalidate
caches due to open/close consistency, named pipes want to wait for
pairing, device nodes want to wait for device". Again, I think that's
separate from LOOKUP_NONBLOCK, though.

               Linus