Al Viro <viro@xxxxxxxxxxxxxxxxxx> writes: > On Thu, Oct 13, 2022 at 03:27:19PM -0700, Stephen Brennan wrote: > >> So I have two more narrowly scoped strategies to improve the situation. Both are >> included in the hacky, awful patch below. >> >> First, is to let __fsnotify_update_child_dentry_flags() sleep. This means nobody >> is holding the spinlock for several seconds at a time. We can actually achieve >> this via a cursor, the same way that simple_readdir() is implemented. I think >> this might require moving the declaration of d_alloc_cursor() and maybe >> exporting it. I had to #include fs/internal.h which is not ok. > > Er... Won't that expose every filesystem to having to deal with cursors? > Currently it's entirely up to the filesystem in question and I wouldn't > be a dime on everyone being ready to cope with such surprises... Hi Al, I wanted to follow-up on this. Yeah, I didn't realize that this could cause issues for some filesystems when I wrote it, since I hadn't considered that rather few filesystems use dcache_readdir(), and so most aren't prepared to encounter a cursor. Thanks for that catch. I think I came up with a better solution, which you can see in context in v3 [1]. The only change I have from the posting there is to eliminate the unnecssary "child->d_parent != parent" check. So the new idea is to take a reference to the current child and then go to sleep. That alone should be enough to prevent dentry_kill() from removing the child from its parent's list. However, it wouldn't prevent d_move() from moving it to a new list, nor would it prevent it from moving along the list if it were a cursor. However, in this situation, we also hold the parent i_rwsem in write mode, which means that no concurrent rename or readdir can happen. You can see my full analysis here [2]. So here's the new approach, if you can call out anything I've missod or confirm that it's sound, I'd really appreciate it! /* Must be called with inode->i_rwsem in held in write mode */ static bool __fsnotify_update_children_dentry_flags(struct inode *inode) { struct dentry *child, *alias, *last_ref = NULL; alias = d_find_any_alias(inode); /* * These lists can get very long, so we may need to sleep during * iteration. Normally this would be impossible without a cursor, * but since we have the inode locked exclusive, we're guaranteed * that the directory won't be modified, so whichever dentry we * pick to sleep on won't get moved. */ spin_lock(&alias->d_lock); list_for_each_entry(child, &alias->d_subdirs, d_child) { if (need_resched()) { /* * We need to hold a reference while we sleep. But when * we wake, dput() could free the dentry, invalidating * the list pointers. We can't look at the list pointers * until we re-lock the parent, and we can't dput() once * we have the parent locked. So the solution is to hold * onto our reference and free it the *next* time we drop * alias->d_lock: either at the end of the function, or * at the time of the next sleep. */ dget(child); spin_unlock(&alias->d_lock); dput(last_ref); last_ref = child; cond_resched(); spin_lock(&alias->d_lock); } if (!child->d_inode) continue; spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED); if (watched) child->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED; else child->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED; spin_unlock(&child->d_lock); } spin_unlock(&alias->d_lock); dput(last_ref); dput(alias); return watched; } Thanks, Stephen [1]: https://lore.kernel.org/linux-fsdevel/20221028001016.332663-4-stephen.s.brennan@xxxxxxxxxx/ [2]: https://lore.kernel.org/linux-fsdevel/874jvigye9.fsf@xxxxxxxxxx/