On Mon, Feb 6, 2017 at 9:44 PM, Miklos Szeredi <miklos@xxxxxxxxxx> wrote: > On Wed, Feb 1, 2017 at 11:44 AM, Jan Kara <jack@xxxxxxx> wrote: >> Currently notification marks are attached to object (inode or vfsmnt) by >> a hlist_head in the object. The list is also protected by a spinlock in >> the object. So while there is any mark attached to the list of marks, >> the object must be pinned in memory (and thus e.g. last iput() deleting >> inode cannot happen). Also for list iteration in fsnotify() to work, we >> must hold fsnotify_mark_srcu lock so that mark itself and >> mark->obj_list.next cannot get freed. Thus we are required to wait for >> response to fanotify events from userspace process with >> fsnotify_mark_srcu lock held. That causes issues when userspace process >> is buggy and does not reply to some event - basically the whole >> notification subsystem gets eventually stuck. >> >> So to be able to drop fsnotify_mark_srcu lock while waiting for >> response, we have to pin the mark in memory and make sure it stays in >> the object list (as removing the mark waiting for response could lead to >> lost notification events for groups later in the list). However we don't >> want inode reclaim to block on such mark as that would lead to system >> just locking up elsewhere. >> >> This commit tries to pave a way towards solving these conflicting >> lifetime needs. Instead of anchoring the list of marks directly in the >> object, we anchor it in a dedicated structure (fsnotify_mark_connector) >> and just point to that structure from the object. In the following patch >> we will also protect the list by a spinlock contained in that structure. >> With this, we will be able to detach notification marks from object >> without having to modify the list itself. > > This is still a bit big to review easily. I'd suggest further splitup into: > > - move hlist_head from inode/mnt into connector > - move inode/mnt ptr from mark into connector (no refcounting changes) > - inode refcounting cleanup > > More comments inline. > ... >> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c >> index b41515d3f081..421c7431a16e 100644 >> --- a/fs/notify/fsnotify.c >> +++ b/fs/notify/fsnotify.c >> @@ -193,6 +193,7 @@ int fsnotify(struct inode *to_tell, __u32 mask, const void *data, int data_is, >> struct hlist_node *inode_node = NULL, *vfsmount_node = NULL; >> struct fsnotify_mark *inode_mark = NULL, *vfsmount_mark = NULL; >> struct fsnotify_group *inode_group, *vfsmount_group; >> + struct fsnotify_mark_connector *inode_conn, *vfsmount_conn; >> struct mount *mnt; >> int idx, ret = 0; >> /* global tests shouldn't care about events on child only the specific event */ >> @@ -210,8 +211,8 @@ int fsnotify(struct inode *to_tell, __u32 mask, const void *data, int data_is, >> * SRCU because we have no references to any objects and do not >> * need SRCU to keep them "alive". >> */ >> - if (hlist_empty(&to_tell->i_fsnotify_marks) && >> - (!mnt || hlist_empty(&mnt->mnt_fsnotify_marks))) >> + if (!to_tell->i_fsnotify_marks && >> + (!mnt || !mnt->mnt_fsnotify_marks)) > > The checks against hlist_empty() could still be useful to optimize > away the memory barrier imposed by srcu_read_lock() in case there were > marks on the object but not anymore (i.e. the connector is non-NULL, > but the hlist is empty). > Is this really a case worth optimizing? Can an empty connector stay alive more then a few grace periods? Not that it is hard to check hlist_empty(). I'm just wondering, if this would be code with no visible gain.