Re: Switching to iterate_shared

Amir Goldstein <amir73il@xxxxxxxxx> · Wed, 17 Aug 2022 10:25:30 +0300

On Wed, Aug 17, 2022 at 2:02 AM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Tue, Aug 16, 2022 at 1:14 PM Jan Harkes <jaharkes@xxxxxxxxxx> wrote:
> >
> > So good to know in advance a change like this is coming. I'll have to
> > educate myself on this shared vs non-shared filldir.
>
> Well, that change isn't necessarily "coming" - we've had this horrid
> duality for years. "iterate_shared" goes back to 2016, and has been
> marked "recommended" since then.
>
> But the plain old "iterate" does continue to work, and having both is
> only an ugly wart - I can't honestly claim that it's a big and
> pressing problem.
>

All the "complexity" is really very tidy and neat inside iterate_dir()
I honestly don't see what the big fuss is about.

If filesystems do need to synchronize creates/deletes with readdir
they would need to add a new internal lock and take it for every
dir operation - it seems like a big hustle for little good reason.

Perhaps it would have been more elegant to replace the
iterate_shared/iterate duality with an fs_type flag.
And it would be very simple to make that change.

> But it would be *really* nice if filesystems did switch to it. For
> most filesystems, it is probably trivial. The only real difference is
> that the "iterate_shared" directory walker is called with the
> directory inode->i_rwsem held for reading, rather than for writing.
>
> End result: there can be multiple concurrent "readdir" iterators
> running on the same directory at the same time. But there's still
> exclusion wrt things like file creation, so a filesystem doesn't have
> to worry about that.
>
> Also, the concurrency is only between different 'struct file'
> entities. The file position lock still serializes all getdents() calls
> on any individual open directory 'struct file'.
>

Hmm, I wonder. Can gentents() proceed without f_pos_lock on
single f_count and then a cloned fd created and another gentents()
races with the first one?

A very niche corner case, but perhaps we can close this hole
for filesystems that opt-in with fs_type flag, because... (see below)

> End result: most filesystems probably can move to the 'iterate_shared'
> version with no real issues.
>
> Looking at coda in particular, I don't see any reason to not just
> switch over to 'iterate_shared' with no code modifications at all.
> Nothing in there seems to have any "I rely on the directory inode
> being exclusively locked".
>
> In fact, for coda in particular, it would be really nice if we could
> just get rid of the old "iterate" function for the host filesystem
> entirely. Which honestly I'd expect you could _also_ do, because
> almost all serious local filesystems have long since been converted.
>
> So coda looks like it could trivially move over. But I might be
> missing something.
>
> I suspect the same is true of most other filesystems, but it requires
> people go check and go think about it.
>

I tried to look at ovl_iterate(). The subtlety is regarding the internal
readdir cache.

Compared to fuse_readdir(), which is already "iterate_shared"
and also implements internal readdir cache, fuse adds an internal
mutex (quote) "against (crazy) parallel readdir on same open file".

Considering this protection is already provided by f_pos_lock
99% of the time, perhaps an opt-in flag to waiver the single
f_count optimization would be enough to be able to make
also ovl_iterate() shared?

Thanks,
Amir.