Re: [PATCH 4/5] fs: honor LOOKUP_NONBLOCK for the last part of file open

Dave Chinner <david@xxxxxxxxxxxxx> · Mon, 14 Dec 2020 09:50:22 +1100

On Sat, Dec 12, 2020 at 02:25:00PM -0700, Jens Axboe wrote:
> On 12/12/20 11:57 AM, Linus Torvalds wrote:
> > On Sat, Dec 12, 2020 at 8:51 AM Jens Axboe <axboe@xxxxxxxxx> wrote:
> >>
> >> We handle it for the path resolution itself, but we should also factor
> >> it in for open_last_lookups() and tmpfile open.
> > 
> > So I think this one is fundamentally wrong, for two reasons.
> > 
> > One is that "nonblock" shouldn't necessarily mean "take no locks at
> > all". That directory inode lock is very very different from "go down
> > to the filesystem to do IO". No other NONBLOCK thing has ever been "no
> > locks at all", they have all been about possibly long-term blocking.
> 
> Do we ever do long term IO _while_ holding the directory inode lock? If
> we don't, then we can probably just ignore that side altogether.

Yes - "all the time" is the simple answer.

Readdir is a simple example, but that is just extent mapping tree
and directory block IO you'd have to wait for, just like regular
file IO.

The big problem is that modifications to directories are atomic and
transactional in most filesystems, which means we might block a
create/unlink/attr/etc in a transaction start for an indefinite
amount of time while we wait for metadata writeback to free up
journal/reservation space. And while we are doing this, nothing else
can access the directory because the VFS holds the directory inode
lock....

We also have metadata IO within transactions, but in most
journalling filesystems once we've started the transaction we can't
back out and return -EAGAIN. So once we are in a transaction
context, the filesystem will block as necessary to run the operation
to completion.

So, really, at the filesystem level I don't see much value in trying
to push non-blocking directory modifications down to the filesystem.
The commonly used filesystems will mostly have to return -EAGAIN
immediately without being able to do anything at all because they
simply aren't architected with the modification rollback
capabilities needed to run fully non-blocking transactional
modification operations.
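
To make the transaction-start point concrete, here is a toy userspace
model. It is only a sketch: struct toy_journal and toy_dir_modify()
are made-up names, not kernel or filesystem APIs, and "waiting for
metadata writeback" is modelled as an instant refill. It shows that the
only place a non-blocking caller can get -EAGAIN is before the
transaction starts.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical toy model; nothing here exists in the kernel. */
struct toy_journal {
	unsigned int free_space;	/* reservation units available now */
};

/*
 * Model one directory modification (create/unlink/...).
 * Returns 0 on success, or -EAGAIN if the caller asked not to block
 * and the reservation would have to wait for space to be freed.
 */
int toy_dir_modify(struct toy_journal *j, unsigned int needed, bool nonblock)
{
	if (j->free_space < needed) {
		if (nonblock)
			return -EAGAIN;	/* last point where we can back out */
		/* A blocking caller would sleep here until writeback frees
		 * space; the toy model simply pretends that has happened. */
		j->free_space = needed;
	}

	/* Transaction started: from here on there is no rollback path,
	 * so any metadata IO would block until the operation completes. */
	j->free_space -= needed;
	return 0;
}
```

In this model a non-blocking caller either gets the reservation without
waiting or is bounced with -EAGAIN having done no work, which is the
"return -EAGAIN immediately" outcome described above.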

> > Why does that code care about O_WRONLY | O_RDWR? That has *nothing* to
> > do with the open() wanting to write to the filesystem. We don't even
> > hold that lock after the open - we'll always drop it even for a
> > successful open.
> > 
> > Only O_CREAT | O_TRUNC should matter, since those are the ones that
> > cause writes as part of the *open*.

And __O_TMPFILE, which is the same as O_CREAT.
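
As a rough userspace illustration of which open flags imply a
modification as part of the open itself (a sketch, not the code from
the patch; open_modifies_at_open() is a made-up helper name):

```c
#define _GNU_SOURCE		/* exposes O_TMPFILE in <fcntl.h> on Linux */
#include <fcntl.h>
#include <stdbool.h>

/* Hypothetical helper, not a kernel function. */
bool open_modifies_at_open(int flags)
{
	/* O_CREAT and O_TRUNC can modify the filesystem during open(). */
	if (flags & (O_CREAT | O_TRUNC))
		return true;

	/* The userspace O_TMPFILE value includes the O_DIRECTORY bit, so
	 * require a full match rather than testing for any overlap. */
	return (flags & O_TMPFILE) == O_TMPFILE;
}
```

Note that O_WRONLY and O_RDWR on their own do not make this return
true, which matches the point made in the quoted text.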

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx