On Sat, Dec 12, 2020 at 02:25:00PM -0700, Jens Axboe wrote:
> On 12/12/20 11:57 AM, Linus Torvalds wrote:
> > On Sat, Dec 12, 2020 at 8:51 AM Jens Axboe <axboe@xxxxxxxxx> wrote:
> >>
> >> We handle it for the path resolution itself, but we should also factor
> >> it in for open_last_lookups() and tmpfile open.
> >
> > So I think this one is fundamentally wrong, for two reasons.
> >
> > One is that "nonblock" shouldn't necessarily mean "take no locks at
> > all". That directory inode lock is very very different from "go down
> > to the filesystem to do IO". No other NONBLOCK thing has ever been "no
> > locks at all", they have all been about possibly long-term blocking.
>
> Do we ever do long term IO _while_ holding the directory inode lock? If
> we don't, then we can probably just ignore that side altogether.

Yes - "all the time" is the simple answer. Readdir is a simple
example, but that is just extent mapping tree and directory block IO
you'd have to wait for, just like regular file IO. The big problem
is that modifications to directories are atomic and transactional in
most filesystems, which means we might block a
create/unlink/attr/etc in a transaction start for an indefinite
amount of time while we wait for metadata writeback to free up
journal/reservation space. And while we are doing this, nothing else
can access the directory because the VFS holds the directory inode
lock....

We also have metadata IO within transactions, but in most
journalling filesystems once we've started the transaction we can't
back out and return -EAGAIN. So once we are in a transaction
context, the filesystem will block as necessary to run the operation
to completion.

So, really, at the filesystem level I don't see much value in trying
to push non-blocking directory modifications down to the filesystem.
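
To make the point about transaction start concrete, here is a minimal
userspace sketch in C. Everything in it (toy_journal, toy_txn,
toy_txn_start(), toy_txn_commit()) is hypothetical and is not taken from
any real filesystem. It only models the idea that a non-blocking caller
can be refused with -EAGAIN before the transaction starts, and that no
such exit exists once the transaction is running.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy journal: tracks reservation space only. */
struct toy_journal {
    size_t free_blocks;   /* space that can be reserved right now */
    size_t dirty_blocks;  /* space that writeback could free, after a wait */
};

/* Hypothetical transaction handle. */
struct toy_txn {
    struct toy_journal *journal;
    size_t reserved;
};

/*
 * Start a transaction that needs 'need' blocks of reservation.
 * This is the only place where a non-blocking caller is turned away.
 */
static int toy_txn_start(struct toy_journal *j, struct toy_txn *t,
                         size_t need, bool nonblock)
{
    if (j->free_blocks < need) {
        if (nonblock)
            return -EAGAIN;   /* would have to wait for writeback */

        /* Stand-in for blocking on metadata writeback. */
        j->free_blocks += j->dirty_blocks;
        j->dirty_blocks = 0;
        if (j->free_blocks < need)
            return -ENOSPC;
    }

    j->free_blocks -= need;
    t->journal = j;
    t->reserved = need;
    return 0;
}

/*
 * Commit a started transaction. There is deliberately no error return:
 * once started, the operation runs to completion.
 */
static void toy_txn_commit(struct toy_txn *t)
{
    assert(t->journal != NULL);
    /* The reserved blocks now hold dirty metadata awaiting writeback. */
    t->journal->dirty_blocks += t->reserved;
    t->reserved = 0;
}
```

In this model a caller passing nonblock = true either gets the
reservation at once or gets -EAGAIN with the journal untouched. Nothing
between toy_txn_start() succeeding and toy_txn_commit() returning can
hand back -EAGAIN.
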
The commonly used filesystems will mostly have to return -EAGAIN
immediately without being able to do anything at all because they
simply aren't architected with the modification rollback
capabilities needed to run fully non-blocking transactional
modification operations.

> > Why does that code care about O_WRONLY | O_RDWR? That has *nothing* to
> > do with the open() wanting to write to the filesystem. We don't even
> > hold that lock after the open - we'll always drop it even for a
> > successful open.
> >
> > Only O_CREAT | O_TRUNC should matter, since those are the ones that
> > cause writes as part of the *open*.

And __O_TMPFILE, which is the same as O_CREAT.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
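
The flag distinction discussed in the quoted text can be sketched as a
small userspace check in C. The helper name open_modifies_fs() is
hypothetical and is not a kernel function. It uses the userspace
O_TMPFILE flag in place of the kernel-internal __O_TMPFILE, and it only
illustrates that the access mode (O_WRONLY, O_RDWR) does not decide
whether the open itself writes to the filesystem.

```c
#define _GNU_SOURCE   /* exposes O_TMPFILE on glibc */
#include <assert.h>   /* only needed for the checks in the usage note */
#include <fcntl.h>
#include <stdbool.h>

/*
 * Hypothetical helper: does the open() call itself cause a write to the
 * filesystem? Only creation, truncation and tmpfile creation count;
 * the access mode is ignored on purpose.
 */
static bool open_modifies_fs(int flags)
{
    if (flags & (O_CREAT | O_TRUNC))
        return true;
#ifdef O_TMPFILE
    /* O_TMPFILE includes the O_DIRECTORY bit, so compare all of its bits. */
    if ((flags & O_TMPFILE) == O_TMPFILE)
        return true;
#endif
    return false;
}
```

With this helper, assert(!open_modifies_fs(O_RDWR)) and
assert(open_modifies_fs(O_WRONLY | O_CREAT)) both hold, while a plain
O_RDONLY | O_DIRECTORY open is not mistaken for a tmpfile open.
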