On Mon 06-02-17 20:44:45, Miklos Szeredi wrote: > On Wed, Feb 1, 2017 at 11:44 AM, Jan Kara <jack@xxxxxxx> wrote: > > Currently notification marks are attached to object (inode or vfsmnt) by > > a hlist_head in the object. The list is also protected by a spinlock in > > the object. So while there is any mark attached to the list of marks, > > the object must be pinned in memory (and thus e.g. last iput() deleting > > inode cannot happen). Also for list iteration in fsnotify() to work, we > > must hold fsnotify_mark_srcu lock so that mark itself and > > mark->obj_list.next cannot get freed. Thus we are required to wait for > > response to fanotify events from userspace process with > > fsnotify_mark_srcu lock held. That causes issues when userspace process > > is buggy and does not reply to some event - basically the whole > > notification subsystem gets eventually stuck. > > > > So to be able to drop fsnotify_mark_srcu lock while waiting for > > response, we have to pin the mark in memory and make sure it stays in > > the object list (as removing the mark waiting for response could lead to > > lost notification events for groups later in the list). However we don't > > want inode reclaim to block on such mark as that would lead to system > > just locking up elsewhere. > > > > This commit tries to pave a way towards solving these conflicting > > lifetime needs. Instead of anchoring the list of marks directly in the > > object, we anchor it in a dedicated structure (fsnotify_mark_connector) > > and just point to that structure from the object. In the following patch > > we will also protect the list by a spinlock contained in that structure. > > With this, we will be able to detach notification marks from object > > without having to modify the list itself. > > This is still a bit big to review easily. I'd suggest further splitup into: > > - move hlist_head from inode/mnt into connector > - move inode/mnt ptr from mark into connector (no refcounting changes) > - inode refcounting cleanup OK, I got back to this after sorting out some more urgent stuff with BDIs... I did the split and the result really looks more manageable. Thanks for suggestion! > > diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c > > index 5a4ec309e283..3dd2e0ece262 100644 > > --- a/fs/notify/dnotify/dnotify.c > > +++ b/fs/notify/dnotify/dnotify.c > > @@ -69,8 +69,8 @@ static void dnotify_recalc_inode_mask(struct fsnotify_mark *fsn_mark) > > if (old_mask == new_mask) > > return; > > > > - if (fsn_mark->inode) > > - fsnotify_recalc_inode_mask(fsn_mark->inode); > > + if (fsn_mark->connector->inode > > This doesn't make sense in the context of this patch. Now > mark->connector is set to NULL where previously mark->inode was set to > null. So the right condition would be "if (fsn_mark->connector) > ...". Right, I'll fix that. > > diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c > > index b41515d3f081..421c7431a16e 100644 > > --- a/fs/notify/fsnotify.c > > +++ b/fs/notify/fsnotify.c > > @@ -193,6 +193,7 @@ int fsnotify(struct inode *to_tell, __u32 mask, const void *data, int data_is, > > struct hlist_node *inode_node = NULL, *vfsmount_node = NULL; > > struct fsnotify_mark *inode_mark = NULL, *vfsmount_mark = NULL; > > struct fsnotify_group *inode_group, *vfsmount_group; > > + struct fsnotify_mark_connector *inode_conn, *vfsmount_conn; > > struct mount *mnt; > > int idx, ret = 0; > > /* global tests shouldn't care about events on child only the specific event */ > > @@ -210,8 +211,8 @@ int fsnotify(struct inode *to_tell, __u32 mask, const void *data, int data_is, > > * SRCU because we have no references to any objects and do not > > * need SRCU to keep them "alive". > > */ > > - if (hlist_empty(&to_tell->i_fsnotify_marks) && > > - (!mnt || hlist_empty(&mnt->mnt_fsnotify_marks))) > > + if (!to_tell->i_fsnotify_marks && > > + (!mnt || !mnt->mnt_fsnotify_marks)) > > The checks against hlist_empty() could still be useful to optimize > away the memory barrier imposed by srcu_read_lock() in case there were > marks on the object but not anymore (i.e. the connector is non-NULL, > but the hlist is empty). Yes, at this point in the series it would be a worthwhile optimization. But since following patches change the rules so that empty connector structures get discarded, I don't think making the optimization in this patch is worth it. > > return 0; > > /* > > * if this is a modify event we may need to clear the ignored masks > > @@ -226,16 +227,24 @@ int fsnotify(struct inode *to_tell, __u32 mask, const void *data, int data_is, > > idx = srcu_read_lock(&fsnotify_mark_srcu); > > > > if ((mask & FS_MODIFY) || > > - (test_mask & to_tell->i_fsnotify_mask)) > > - inode_node = srcu_dereference(to_tell->i_fsnotify_marks.first, > > - &fsnotify_mark_srcu); > > + (test_mask & to_tell->i_fsnotify_mask)) { > > + inode_conn = to_tell->i_fsnotify_marks; > > Needs lockless_dereference(). Right. Will fix. > > diff --git a/fs/notify/fsnotify.h b/fs/notify/fsnotify.h > > index 0a3bc2cf192c..7a95436dc8d3 100644 > > --- a/fs/notify/fsnotify.h > > +++ b/fs/notify/fsnotify.h ... > > /* run the list of all marks associated with inode and destroy them */ > > static inline void fsnotify_clear_marks_by_inode(struct inode *inode) > > { > > + if (!inode->i_fsnotify_marks) > > + return; > > Should move the above check inside fsnotify_destroy_marks(). Locking > rules also dictate that the check be done under inode lock, although > that may not matter in practice. Yeah, it does not matter in practice but it's just nicer. Moved (and the other similar occurences as well). > > diff --git a/fs/notify/mark.c b/fs/notify/mark.c > > index 44836e539169..ce2314ae96d0 100644 > > --- a/fs/notify/mark.c > > +++ b/fs/notify/mark.c ... > > @@ -249,6 +252,16 @@ void fsnotify_destroy_marks(struct hlist_head *head, spinlock_t *lock) > > fsnotify_destroy_mark(mark, mark->group); > > fsnotify_put_mark(mark); > > } > > + /* > > + * Skip freeing of connector when inode is just unlinked. We leave that > > + * until destroy_inode() is called at which moment nobody will be > > + * looking at the inode anymore. > > + */ > > + if ((*connp)->flags & FSNOTIFY_OBJ_TYPE_INODE && > > + !((*connp)->inode->i_state & I_CLEAR)) > > + return; > > This is ugly. Why not rather add a fsnotify_connector_free() and > call that explicitly on inode/vfsmount destruction? Good idea. Changed. > > +static int fsnotify_add_mark_list(struct fsnotify_mark *mark, > > + struct inode *inode, struct vfsmount *mnt, > > + int allow_dups) > > { > > struct fsnotify_mark *lmark, *last = NULL; > > + struct fsnotify_mark_connector *conn; > > + struct fsnotify_mark_connector **connp; > > int cmp; > > + int err = 0; > > + > > + BUG_ON(!inode && !mnt); > > Use WARN_ON() instead. OK. > > + if (inode) > > + connp = &inode->i_fsnotify_marks; > > + else > > + connp = &real_mount(mnt)->mnt_fsnotify_marks; > > +restart: > > + spin_lock(&mark->lock); > > + conn = fsnotify_grab_connector(*connp); > > This looks a bit too subtle. We have the inode/vfsmount. We have the > matching lock. Could store the address of the lock in a local > variable and use that for locking. Done. > > + if (!conn) { > > + spin_unlock(&mark->lock); > > + conn = kmem_cache_alloc(fsnotify_mark_connector_cachep, > > + GFP_KERNEL); > > + if (!conn) > > + return -ENOMEM; > > + INIT_HLIST_HEAD(&conn->list); > > + if (inode) { > > + conn->flags = FSNOTIFY_OBJ_TYPE_INODE; > > + conn->inode = inode; > > + } else { > > + conn->flags = FSNOTIFY_OBJ_TYPE_VFSMOUNT; > > + conn->mnt = mnt; > > + } > > + if (cmpxchg(connp, NULL, conn)) { > > + /* Someone else created list structure for us */ > > + kmem_cache_free(fsnotify_mark_connector_cachep, conn); > > + } > > Setting *connp should be done with inode/vfsmount lock already > regained (if I understand the locking rules correctly) and then no > need to have cmpxchg magic in there. Yes, it will have to come in following patches but I agree it is not needed here. Changed. > > diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h > > index 487246546ebe..01df00327683 100644 > > --- a/include/linux/fsnotify_backend.h > > +++ b/include/linux/fsnotify_backend.h > > @@ -194,6 +194,27 @@ struct fsnotify_group { > > #define FSNOTIFY_EVENT_INODE 2 > > > > /* > > + * Inode / vfsmount point to this structure which tracks all marks attached to > > + * the inode / vfsmount. The reference to inode / vfsmount is held by this > > + * structure. We destroy this structure when there are no more marks attached > > + * to it. The structure is protected by fsnotify_mark_srcu. > > + */ > > +struct fsnotify_mark_connector { > > +#define FSNOTIFY_OBJ_TYPE_INODE 0x01 > > +#define FSNOTIFY_OBJ_TYPE_VFSMOUNT 0x02 > > + unsigned int flags; /* Type of object [lock] */ > > + union { /* Object pointer [lock] */ > > + struct inode *inode; > > + struct vfsmount *mnt; > > + }; > > + union { > > + struct hlist_head list; > > + /* Used listing heads to free after srcu period expires */ > > + struct fsnotify_mark_connector *destroy_next; > > This field is not used in this patch at all. Yes, will move to later patches. > > diff --git a/kernel/audit_tree.c b/kernel/audit_tree.c > > index 11c7ac441624..a878a98f55b3 100644 > > --- a/kernel/audit_tree.c > > +++ b/kernel/audit_tree.c > > @@ -248,7 +263,8 @@ static void untag_chunk(struct node *p) > > > > mutex_lock(&entry->group->mark_mutex); > > spin_lock(&entry->lock); > > - if (chunk->dead || !entry->inode) { > > + if (chunk->dead || !entry->connector || > > + !entry->connector->inode) { > > entry->connector->inode will always be positive, at least in this patch. Correct. I will move it to a later patch. > > diff --git a/kernel/auditsc.c b/kernel/auditsc.c > > index cf1fa43512c1..e990de915341 100644 > > --- a/kernel/auditsc.c > > +++ b/kernel/auditsc.c > > @@ -1591,7 +1591,7 @@ static inline void handle_one(const struct inode *inode) > > struct audit_tree_refs *p; > > struct audit_chunk *chunk; > > int count; > > - if (likely(hlist_empty(&inode->i_fsnotify_marks))) > > + if (likely(!inode->i_fsnotify_marks)) > > Need to check hlist_empty() as well? Perhaps add an inline helper to > fsnotify_backend.h: fsnotify_inode_marks_empty(). I've added the hlist_empty() check. I don't think fsnotify_inode_marks_empty() helper is needed as it will be used only in these two places. Thanks for review! Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR