When an inode is interested in events on its children, it must set the
DCACHE_FSNOTIFY_PARENT_WATCHED flag on all its children. Currently, when
the fsnotify connector is removed and i_fsnotify_mask becomes zero, we
lazily allow __fsnotify_parent() to clear these flags the next time we
see an event on a child. However, if the list of children is very long
(e.g., in the millions), and lots of activity is occurring on the
directory, then it's possible for many CPUs to end up blocked on the
inode spinlock in __fsnotify_update_child_dentry_flags(). Each CPU will
then redundantly iterate over the very long list of children. This
situation can cause soft lockups.

To avoid this, stop lazily updating child flags in __fsnotify_parent().
Protect the child flag update with i_rwsem held exclusive, to ensure
that we only iterate over the child list when it's absolutely necessary,
and even then, only once.

Signed-off-by: Stephen Brennan <stephen.s.brennan@xxxxxxxxxx>
---

Notes:
    It seems that there are two implementation options for this,
    regarding what i_rwsem protects:

    1. Both updates to i_fsnotify_mask, and the child dentry flags, or
    2. Only updates to the child dentry flags

    I wanted to do #1, but it got really tricky with
    fsnotify_put_mark(). We don't want to hold the inode lock whenever
    we decrement the refcount, but if we don't, then we're stuck holding
    a spinlock when the refcount goes to zero, and we need to grab the
    inode rwsem to synchronize the update to the child flags. I'm sure
    there's a way around this, but I didn't keep going with it. (A
    minimal sketch of the locking-order constraint involved is appended
    after the patch.)

    With #2, as currently implemented, we have the unfortunate effect
    that a mark can be added, see that no flag update is required, and
    return, despite the fact that the flag update is still in progress
    on a different CPU/thread. From our discussion, that seems to be the
    current status quo, but I wanted to explicitly point that out. If we
    want to move to #1, it should be possible with some work.

 fs/notify/fsnotify.c | 12 ++++++++--
 fs/notify/mark.c     | 55 ++++++++++++++++++++++++++++++++++----------
 2 files changed, 53 insertions(+), 14 deletions(-)

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 7974e91ffe13..e887a195983b 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -207,8 +207,16 @@ int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
 	parent = dget_parent(dentry);
 	p_inode = parent->d_inode;
 	p_mask = fsnotify_inode_watches_children(p_inode);
-	if (unlikely(parent_watched && !p_mask))
-		__fsnotify_update_child_dentry_flags(p_inode);
+	if (unlikely(parent_watched && !p_mask)) {
+		/*
+		 * Flag would be cleared soon by
+		 * __fsnotify_update_child_dentry_flags(), but as an
+		 * optimization, clear it now.
+		 */
+		spin_lock(&dentry->d_lock);
+		dentry->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
+		spin_unlock(&dentry->d_lock);
+	}
 
 	/*
 	 * Include parent/name in notification either if some notification
diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index c74ef947447d..da9f944fcbbb 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -184,15 +184,36 @@ static void *__fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
  */
 void fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
 {
+	struct inode *inode = NULL;
+	int watched_before, watched_after;
+
 	if (!conn)
 		return;
 
-	spin_lock(&conn->lock);
-	__fsnotify_recalc_mask(conn);
-	spin_unlock(&conn->lock);
-	if (conn->type == FSNOTIFY_OBJ_TYPE_INODE)
-		__fsnotify_update_child_dentry_flags(
-					fsnotify_conn_inode(conn));
+	if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) {
+		/*
+		 * For inodes, we may need to update flags on the child
+		 * dentries. To ensure these updates occur exactly once,
+		 * synchronize the recalculation with the inode mutex.
+		 */
+		inode = fsnotify_conn_inode(conn);
+		spin_lock(&conn->lock);
+		watched_before = fsnotify_inode_watches_children(inode);
+		__fsnotify_recalc_mask(conn);
+		watched_after = fsnotify_inode_watches_children(inode);
+		spin_unlock(&conn->lock);
+
+		inode_lock(inode);
+		if ((watched_before && !watched_after) ||
+		    (!watched_before && watched_after)) {
+			__fsnotify_update_child_dentry_flags(inode);
+		}
+		inode_unlock(inode);
+	} else {
+		spin_lock(&conn->lock);
+		__fsnotify_recalc_mask(conn);
+		spin_unlock(&conn->lock);
+	}
 }
 
 /* Free all connectors queued for freeing once SRCU period ends */
@@ -295,6 +316,8 @@ void fsnotify_put_mark(struct fsnotify_mark *mark)
 	struct fsnotify_mark_connector *conn = READ_ONCE(mark->connector);
 	void *objp = NULL;
 	unsigned int type = FSNOTIFY_OBJ_TYPE_DETACHED;
+	struct inode *inode = NULL;
+	int watched_before, watched_after;
 	bool free_conn = false;
 
 	/* Catch marks that were actually never attached to object */
@@ -311,17 +334,31 @@ void fsnotify_put_mark(struct fsnotify_mark *mark)
 	if (!refcount_dec_and_lock(&mark->refcnt, &conn->lock))
 		return;
 
+	if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) {
+		inode = fsnotify_conn_inode(conn);
+		watched_before = fsnotify_inode_watches_children(inode);
+	}
+
 	hlist_del_init_rcu(&mark->obj_list);
 	if (hlist_empty(&conn->list)) {
 		objp = fsnotify_detach_connector_from_object(conn, &type);
 		free_conn = true;
+		watched_after = 0;
 	} else {
 		objp = __fsnotify_recalc_mask(conn);
 		type = conn->type;
+		watched_after = fsnotify_inode_watches_children(inode);
 	}
 	WRITE_ONCE(mark->connector, NULL);
 	spin_unlock(&conn->lock);
 
+	if (inode) {
+		inode_lock(inode);
+		if (watched_before && !watched_after)
+			__fsnotify_update_child_dentry_flags(inode);
+		inode_unlock(inode);
+	}
+
 	fsnotify_drop_object(type, objp);
 
 	if (free_conn) {
@@ -331,12 +368,6 @@ void fsnotify_put_mark(struct fsnotify_mark *mark)
 		spin_unlock(&destroy_lock);
 		queue_work(system_unbound_wq, &connector_reaper_work);
 	}
-	/*
-	 * Note that we didn't update flags telling whether inode cares about
-	 * what's happening with children. We update these flags from
-	 * __fsnotify_parent() lazily when next event happens on one of our
-	 * children.
-	 */
 	spin_lock(&destroy_lock);
 	list_add(&mark->g_list, &destroy_list);
 	spin_unlock(&destroy_lock);
-- 
2.34.1
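
P.S. For anyone skimming the locking discussion in the notes: here is a
minimal illustrative sketch of the ordering constraint that makes
option #1 tricky and that option #2 relies on. This is not part of the
patch; recalc_and_update_sketch() is a hypothetical helper, while the
other identifiers are the ones used in the diff (kernel context as in
fs/notify/mark.c assumed).

/*
 * Illustrative sketch only. conn->lock is a spinlock, while i_rwsem
 * (taken via inode_lock()) may sleep, so the rwsem must only be
 * acquired after the spinlock has been dropped:
 *
 *	spin_lock(&conn->lock);
 *	inode_lock(inode);	// WRONG: may sleep while atomic
 */
static void recalc_and_update_sketch(struct fsnotify_mark_connector *conn,
				     struct inode *inode)
{
	int watched_before, watched_after;

	/* Sample the "watches children" state under the spinlock... */
	spin_lock(&conn->lock);
	watched_before = fsnotify_inode_watches_children(inode);
	__fsnotify_recalc_mask(conn);
	watched_after = fsnotify_inode_watches_children(inode);
	spin_unlock(&conn->lock);

	/*
	 * ...then take i_rwsem, which serializes the expensive walk of
	 * the child list so it happens at most once per transition.
	 */
	inode_lock(inode);
	if ((watched_before && !watched_after) ||
	    (!watched_before && watched_after))
		__fsnotify_update_child_dentry_flags(inode);
	inode_unlock(inode);
}

The key point is that sampling the watched state while still under the
spinlock, and doing the child-list walk only under i_rwsem, keeps the
walk out of the atomic section and limits it to one pass per 0<->nonzero
transition of the mask.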