On Wed, Aug 17, 2022 at 2:02 AM Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote: > > On Tue, Aug 16, 2022 at 1:14 PM Jan Harkes <jaharkes@xxxxxxxxxx> wrote: > > > > So good to know in advance a change like this is coming. I'll have to > > educate myself on this shared vs non-shared filldir. > > Well, that change isn't necessarily "coming" - we've had this horrid > duality for years. "iterate_shared" goes back to 2016, and has been > marked "recommended" since then. > > But the plain old "iterate" does continue to work, and having both is > only an ugly wart - I can't honestly claim that it's a big and > pressing problem. > All the "complexity" is really very tidy and neat inside iterate_dir() I honestly don't see what the big fuss is about. If filesystems do need to synchronize creates/deletes with readdir they would need to add a new internal lock and take it for every dir operation - it seems like a big hustle for little good reason. Perhaps it would have been more elegant to replace the iterate_shared/iterate duality with an fs_type flag. And it would be very simple to make that change. > But it would be *really* nice if filesystems did switch to it. For > most filesystems, it is probably trivial. The only real difference is > that the "iterate_shared" directory walker is called with the > directory inode->i_rwsem held for reading, rather than for writing. > > End result: there can be multiple concurrent "readdir" iterators > running on the same directory at the same time. But there's still > exclusion wrt things like file creation, so a filesystem doesn't have > to worry about that. > > Also, the concurrency is only between different 'struct file' > entities. The file position lock still serializes all getdents() calls > on any individual open directory 'struct file'. > Hmm, I wonder. Can gentents() proceed without f_pos_lock on single f_count and then a cloned fd created and another gentents() races with the first one? A very niche corner case, but perhaps we can close this hole for filesystems that opt-in with fs_type flag, because... (see below) > End result: most filesystems probably can move to the 'iterate_shared' > version with no real issues. > > Looking at coda in particular, I don't see any reason to not just > switch over to 'iterate_shared' with no code modifications at all. > Nothing in there seems to have any "I rely on the directory inode > being exclusively locked". > > In fact, for coda in particular, it would be really nice if we could > just get rid of the old "iterate" function for the host filesystem > entirely. Which honestly I'd expect you could _also_ do, because > almost all serious local filesystems have long since been converted. > > So coda looks like it could trivially move over. But I might be > missing something. > > I suspect the same is true of most other filesystems, but it requires > people go check and go think about it. > I tried to look at ovl_iterate(). The subtlety is regarding the internal readdir cache. Compared to fuse_readdir(), which is already "iterate_shared" and also implements internal readdir cache, fuse adds an internal mutex (quote) "against (crazy) parallel readdir on same open file". Considering this protection is already provided by f_pos_lock 99% of the time, perhaps an opt-in flag to waiver the single f_count optimization would be enough to be able to make also ovl_iterate() shared? Thanks, Amir.