On Wed, Aug 17, 2022 at 12:25 AM Amir Goldstein <amir73il@xxxxxxxxx> wrote: > > All the "complexity" is really very tidy and neat inside iterate_dir() > I honestly don't see what the big fuss is about. The worst part of it is conceptual, and the conditional locking. No, it's not a huge maintenance burden, but it's quite the ugly wart in our vfs layer that generally has pretty clean abstractions. The non-shared version can also be a real performance problem on some loads, but that's mainly a "real local filesystem" issue. Samba used to (and presumably still does) do a *lot* of readdir calls for all the name mangling. We used to have similar horrors wrt ioctl's with the big kernel lock - you can still see signs of that in the naming (ie the 'ioctl' function pointer in the file operations structure is called 'unlocked_ioctl' - even though it's been over a decade since we got rid of the 'locked' one). So that thing is just nasty. > If filesystems do need to synchronize creates/deletes with readdir > they would need to add a new internal lock and take it for every > dir operation - it seems like a big hustle for little good reason. Most of the time, they have that lock *anyway* - or, as I looked at two cases, don't need any locking at all. The pathname lookup parts of a filesystem need to be able to handle concurrency within a directory anyway, because of how '->lookup()' works. Yes, our dentry layer helps a lot and serializes individual name lookups, but basically no filesystem can rely on _just_ i_rwsem anyway. Of course, those locks are often about things like "read page X from directory Y", so the locking can be about things like block allocation etc. The bigger conceptual races - filename creation, lookup of the same entry concurrently - are already handled by the VFS layer. So those locks generally already exist. They just need to be checked cover readdir too. Normally readdir iteration is very similar to lookup(). The *one* exception tends to be directory entry caches, although sometimes those too are shared with lookup. > Perhaps it would have been more elegant to replace the > iterate_shared/iterate duality with an fs_type flag. > And it would be very simple to make that change. No. You're missing the point. It's not the *flag* and "we have two different iterate functions" that is the ugly part. Making it a filesystem flag would not have made any difference - it would still be the same horrendous wart at the vfs level, and it would be in some sense worse because now it's split into two different independent things that are tied together. > Hmm, I wonder. Can gentents() proceed without f_pos_lock on > single f_count and then a cloned fd created and another gentents() > races with the first one? That would break much more important system calls than getdents (ie basic read/write). So if fdget_pos() were to be broken, we have much bigger issues. In case you care, the thing that protects us against another thread using the same file descriptor is fdget_pos -> __fdget_pos -> __fdget -> __fdget_light -> check current->files->count do full __fdget() if it's > 1 and then __fdget_pos() does that if (file_count(file) > 1) { v |= FDPUT_POS_UNLOCK; mutex_lock(&file->f_pos_lock); } check afterwards. IOW, this code is carefully written to avoid both the file pointer reference increment *and* the f_pos_lock in the common case of a single-threaded process that only has the file open once. But if 'current->files->count' is elevated - iow some other thread has access to the file array and might be open/close/dup'ing files, we take the file reference. And then if the file is open through multiple file descriptors (which might be in another file table entirely), _or_ if we just took the file refernce due to that threading issue - then in either of those cases we will also now take the file pos lock. End result: either this system call is the only one that can access that file, or we have taken the file pos lock. Either way, getdents() ends up being serialized wrt that particular open file. Now, you can actually screw this up if you really try: the file pos lock obviously also is conditional on FMODE_ATOMIC_POS. So if you have a filesystem that does "stream_open()" to clear FMODE_ATOMIC_POS, that can avoid the file pos lock entirely. So don't do that. I'm looking at fuse and FOPEN_STREAM, and that looks a bit iffy. (Of course, for a virtual filesystem, if you want to have a getdents() that doesn't allow lseek, and that doesn't use f_pos for getdents but just some internal other logic, clearing FMODE_ATOMIC_POS might actually be entirely intentional. It's not like the file pos lock is really _required_). > I tried to look at ovl_iterate(). The subtlety is regarding the internal > readdir cache. Ok, from a quick look, you do have cache = ovl_dir_cache(d_inode(dentry)); so yeah, the cache is per-inode and thus different 'struct file' entries for the same directory will need locking. > Compared to fuse_readdir(), which is already "iterate_shared" > and also implements internal readdir cache, fuse adds an internal > mutex (quote) "against (crazy) parallel readdir on same open file". Maybe the FUSE lock is *because* fuse allows that FOPEN_STREAM? > Considering this protection is already provided by f_pos_lock > 99% of the time, It's not 99% of the time. f_pos_lock is reliable. You have to explicitly turn it off. FUSE is either confused and does pointless locking, or has done that stream_open() thing to explicitly turn off the vfs locking. Or, probably more realistically, the fuse code goes back to before we did f_pos_lock. Linus