[paulmck Cc'd] On Mon, Dec 07, 2020 at 10:49:04PM +0000, Al Viro wrote: > On Mon, Dec 07, 2020 at 10:46:57PM +0000, Al Viro wrote: > > On Fri, Nov 20, 2020 at 05:14:26PM -0600, Eric W. Biederman wrote: > > > > > /* > > > * Check whether the specified fd has an open file. > > > */ > > > -#define fcheck(fd) fcheck_files(current->files, fd) > > > +#define fcheck(fd) files_lookup_fd_rcu(current->files, fd) > > > > Huh? > > fs/file.c:1113: file = fcheck(oldfd); > > dup3(), under ->file_lock, no rcu_read_lock() in sight > > > > fs/locks.c:2548: f = fcheck(fd); > > fcntl_setlk(), ditto > > > > fs/locks.c:2679: f = fcheck(fd); > > fcntl_setlk64(), ditto > > > > fs/notify/dnotify/dnotify.c:330: f = fcheck(fd); > > fcntl_dirnotify(); this one _is_ under rcu_read_lock(). > > > > > > IOW, unless I've missed something earlier in the series, this is wrong. > > I have missed something, all right. Ignore that comment... Actually, I take that back - the use of fcheck() in dnotify _is_ interesting. At the very least it needs to be commented upon; what that code is trying to prevent is a race between fcntl_dirnotify() and close(2)/dup2(2) closing the descriptor in question. The problem is, dnotify marks are removed when we detach from descriptor table; that's done by filp_close() calling dnotify_flush(). Suppose fcntl(fd, F_NOTIFY, 0) in one thread races with close(fd) in another (both sharing the same descriptor table). If the former had created and inserted a mark *after* the latter has gotten past dnotify_flush(), there would be nothing to evict that mark. That's the reason that fcheck() is there. rcu_read_lock() used to be sufficient, but the locking has changed since then and even if it is still enough, that's not at all obvious. Exclusion is not an issue; barriers, OTOH... Back then we had ->i_lock taken both by dnotify_flush() before any checks and by fcntl_dirnotify() around the fcheck+insertion. So on close side we had store NULL into descriptor table acquire ->i_lock fetch ->i_dnotify ... release ->i_lock while on fcntl() side we had acquire ->i_lock fetch from descriptor table, sod off if not our file ... store ->i_dnotify ... release ->i_lock Storing NULL into descriptor table could get reordered into ->i_lock-protected area in dnotify_flush(), but it could not get reordered past releasing ->i_lock. So fcntl_dirnotify() either grabbed ->i_lock before dnotify_flush() (in which case missing the store of NULL into descriptor table wouldn't matter, since dnotify_flush() would've observed the mark we'd inserted) or it would've seen that store to descriptor table. Nowadays it's nowhere near as straightforward; in fcntl_dirnotify() we have /* this is needed to prevent the fcntl/close race described below */ mutex_lock(&dnotify_group->mark_mutex); and it would appear to be similar to the original situation, with ->mark_mutex serving in place of ->i_lock. However, dnotify_flush() might not take that mutex at all - it has fsn_mark = fsnotify_find_mark(&inode->i_fsnotify_marks, dnotify_group); if (!fsn_mark) return; before grabbing that thing. So the things are trickier - we actually rely upon the barriers in fsnotify_find_mark(). And those are complicated. The case when it returns non-NULL is not a problem - there we have that mutex providing the barriers we need. NULL can be returned in two cases: a) ->i_fsnotify_marks is not empty, but it contains no dnotify marks. In that case we have ->i_fsnotify_marks.lock acquired and released. By the time it gets around to fcheck(), fcntl_dirnotify() has either found or created and inserted a dnotify mark into that list, with ->i_fsnotify_marks.lock acquired/released around the insertion, so we are fine - either fcntl_dirnotify() gets there first (in which case dnotify_flush() will observe it), or release of that lock by fsnotify_find_mark() called by dnotify_flush() will serve as a barrier, making sure that store to descriptor table will be observed. b) fsnotify_find_mark() (fsnotify_grab_connector() it calls, actually) finds ->i_fsnotify_marks empty. That's where the things get interesting; we have idx = srcu_read_lock(&fsnotify_mark_srcu); conn = srcu_dereference(*connp, &fsnotify_mark_srcu); if (!conn) goto out; on dnotify_flush() side. The matching store of fcntl_dirnotify() side would be in fsnotify_attach_connector_to_object(), where we have /* * cmpxchg() provides the barrier so that readers of *connp can see * only initialized structure */ if (cmpxchg(connp, NULL, conn)) { /* Someone else created list structure for us */ So we have A: store NULL to descriptor table srcu_read_lock srcu_dereference fetches NULL from ->i_fsnotify_marks vs. B: cmpxchg replaces NULL with non-NULL in ->i_fsnotify_marks fetch from descriptor table, can't miss the store done by A Which might be safe, but the whole thing *RELLY* needs to be discussed in fcntl_dirnotify() in more details. fs/notify/* guts are convoluted enough to confuse anyone unfamiliar with them.