Amir Goldstein <amir73il@xxxxxxxxx> writes: > On Fri, Oct 28, 2022 at 3:10 AM Stephen Brennan > <stephen.s.brennan@xxxxxxxxxx> wrote: >> >> With very large d_subdirs lists, iteration can take a long time. Since >> iteration needs to hold parent->d_lock, this can trigger soft lockups. >> It would be best to make this iteration sleepable. Since we have the >> inode locked exclusive, we can drop the parent->d_lock and sleep, >> holding a reference to a child dentry, and continue iteration once we >> wake. >> >> Signed-off-by: Stephen Brennan <stephen.s.brennan@xxxxxxxxxx> >> --- >> > > Reviewed-by: Amir Goldstein <amir73il@xxxxxxxxx> > > some comment nits and one fortify suggestion > >> Notes: >> v3: >> - removed if statements around dput() >> v2: >> - added a check for child->d_parent != alias and retry logic >> >> fs/notify/fsnotify.c | 36 ++++++++++++++++++++++++++++++++---- >> 1 file changed, 32 insertions(+), 4 deletions(-) >> >> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c >> index ccb8a3a6c522..34e5d18235a7 100644 >> --- a/fs/notify/fsnotify.c >> +++ b/fs/notify/fsnotify.c >> @@ -102,10 +102,12 @@ void fsnotify_sb_delete(struct super_block *sb) >> * on a child we run all of our children and set a dentry flag saying that the >> * parent cares. Thus when an event happens on a child it can quickly tell >> * if there is a need to find a parent and send the event to the parent. >> + * >> + * Context: inode locked exclusive >> */ >> static bool __fsnotify_update_children_dentry_flags(struct inode *inode) >> { >> - struct dentry *alias, *child; >> + struct dentry *child, *alias, *last_ref = NULL; >> int watched; >> >> if (!S_ISDIR(inode->i_mode)) >> @@ -120,12 +122,37 @@ static bool __fsnotify_update_children_dentry_flags(struct inode *inode) >> alias = d_find_any_alias(inode); >> >> /* >> - * run all of the children of the original inode and fix their >> - * d_flags to indicate parental interest (their parent is the >> - * original inode) >> + * These lists can get very long, so we may need to sleep during >> + * iteration. Normally this would be impossible without a cursor, >> + * but since we have the inode locked exclusive, we're guaranteed >> + * that the directory won't be modified, so whichever dentry we >> + * pick to sleep on won't get moved. So, start a manual iteration >> + * over d_subdirs which will allow us to sleep. >> */ >> spin_lock(&alias->d_lock); >> +retry: >> list_for_each_entry(child, &alias->d_subdirs, d_child) { >> + if (need_resched()) { >> + /* >> + * We need to hold a reference while we sleep. But when >> + * we wake, dput() could free the dentry, invalidating >> + * the list pointers. We can't look at the list pointers >> + * until we re-lock the parent, and we can't dput() once >> + * we have the parent locked. So the solution is to hold >> + * onto our reference and free it the *next* time we drop >> + * alias->d_lock: either at the end of the function, or >> + * at the time of the next sleep. >> + */ > > My personal preference would be to move this above if (needed_reschd()) > it is not any less clear when this comment is above the condition > and less indented will read nicer. Definitely. >> + dget(child); >> + spin_unlock(&alias->d_lock); >> + dput(last_ref); >> + last_ref = child; >> + cond_resched(); >> + spin_lock(&alias->d_lock); >> + if (child->d_parent != alias) >> + goto retry; > > Is this expected? If not, then we need a WARN_ON_ONCE(). > Also, I wonder if it would be better to break out and leave > dentry flags as they are instead of risking some endless > or very long retry loop? > > And how about asserting on unexpected !list_empty(&child->d_child) > to avoid an endless loop in list_for_each_entry()? I was trying to think this through as I prepared v3, and ended I up leaving it as-is, out of a hope that it was doing something helpful. But I'm pretty sure Al would ask why I believe that this could happen (e.g. what scenario am I guarding against happening?). I didn't have a clear idea of what I was guarding against here when I wrote it, which is my fault. What's necessary is an audit of the places where d_child is modified, so we can understand the risky places. What we're doing here is dropping parent->d_lock and going to sleep, while still having parent->d_inode->i_rwsem held in write mode. The risk we have is that this particular dentry is removed from this list while we slept. Checking d_parent seems particularly unhelpful for that, since what we care far more about is the d_child list pointers. I did a code audit of all locations where d_child is modified. Here's all I could identify: 1. dentry_unlist() in fs/dcache.c 2. __d_move() in fs/dcache.c 3. Initialization code in d_alloc() family 4. Cursor code in fs/libfs.c for dcache_readdir() For case #1, dentry_unlist() is a helper of dentry_kill(). The real guts of dentry_kill() must happen with parent->d_lock held. Since we get a reference to the child with parent->d_lock held, we can be confident that any dentry_kill() which could have been happening concurrently will bail out after seeing that we got a reference. At least, that's my reading of the code. For case #2, __d_move() can (should?) NOT happen, because the caller must hold the i_mutex (read i_rwsem in write mode), but we are holding that ourselves. For case #3, it doesn't matter... it's initialization time, and we couldn't have gotten a reference to the dentry until the dentry is put into the tree. For case #4, the simple_readdir() code should hold the inode rwsem in read mode, so we have mutual exclusion from this code too. While it's possible that we could go to sleep holding a reference to a cursor, I believe that the cursor could not move without holding i_rwsem in read mode. So we could never wake up to find we've skipped part of the list. So to summarize, I don't think there's a case where we could actually expect that d_child pointers get updated while we sleep with a reference held and the parent i_rwsem held exclusive. Consequently, no place where d_parent would change from under our feet. So I will remove this check form the patch. Thanks, Stephen