On Wed, Dec 09, 2020 at 04:54:11PM +0000, Al Viro wrote:
> [paulmck Cc'd]
> 
> On Mon, Dec 07, 2020 at 10:49:04PM +0000, Al Viro wrote:
> > On Mon, Dec 07, 2020 at 10:46:57PM +0000, Al Viro wrote:
> > > On Fri, Nov 20, 2020 at 05:14:26PM -0600, Eric W. Biederman wrote:
> > > 
> > > >  /*
> > > >   * Check whether the specified fd has an open file.
> > > >   */
> > > > -#define fcheck(fd)	fcheck_files(current->files, fd)
> > > > +#define fcheck(fd)	files_lookup_fd_rcu(current->files, fd)
> > > 
> > > Huh?
> > > fs/file.c:1113:		file = fcheck(oldfd);
> > > dup3(), under ->file_lock, no rcu_read_lock() in sight
> > > 
> > > fs/locks.c:2548:	f = fcheck(fd);
> > > fcntl_setlk(), ditto
> > > 
> > > fs/locks.c:2679:	f = fcheck(fd);
> > > fcntl_setlk64(), ditto
> > > 
> > > fs/notify/dnotify/dnotify.c:330:	f = fcheck(fd);
> > > fcntl_dirnotify(); this one _is_ under rcu_read_lock().
> > > 
> > > 
> > > IOW, unless I've missed something earlier in the series, this is wrong.
> > 
> > I have missed something, all right.  Ignore that comment...
> 
> Actually, I take that back - the use of fcheck() in dnotify _is_ interesting.
> At the very least it needs to be commented upon; what that code is trying
> to prevent is a race between fcntl_dirnotify() and close(2)/dup2(2) closing
> the descriptor in question.  The problem is, dnotify marks are removed
> when we detach from the descriptor table; that's done by filp_close() calling
> dnotify_flush().
> 
> Suppose fcntl(fd, F_NOTIFY, 0) in one thread races with close(fd) in another
> (both sharing the same descriptor table).  If the former had created and
> inserted a mark *after* the latter has gotten past dnotify_flush(), there
> would be nothing to evict that mark.
> 
> That's the reason that fcheck() is there.  rcu_read_lock() used to be
> sufficient, but the locking has changed since then and even if it is
> still enough, that's not at all obvious.
> 
> Exclusion is not an issue; barriers, OTOH...
> Back then we had
> ->i_lock taken both by dnotify_flush() before any checks and
> by fcntl_dirnotify() around the fcheck+insertion.  So on the close
> side we had
> 	store NULL into descriptor table
> 	acquire ->i_lock
> 	fetch ->i_dnotify
> 	...
> 	release ->i_lock
> while on the fcntl() side we had
> 	acquire ->i_lock
> 	fetch from descriptor table, sod off if not our file
> 	...
> 	store ->i_dnotify
> 	...
> 	release ->i_lock
> Storing NULL into the descriptor table could get reordered into the
> ->i_lock-protected area in dnotify_flush(), but it could not
> get reordered past releasing ->i_lock.  So fcntl_dirnotify()
> either grabbed ->i_lock before dnotify_flush() (in which case
> missing the store of NULL into the descriptor table wouldn't
> matter, since dnotify_flush() would've observed the mark
> we'd inserted) or it would've seen that store to the descriptor
> table.
> 
> Nowadays it's nowhere near as straightforward; in fcntl_dirnotify()
> we have
> 	/* this is needed to prevent the fcntl/close race described below */
> 	mutex_lock(&dnotify_group->mark_mutex);
> and it would appear to be similar to the original situation, with
> ->mark_mutex serving in place of ->i_lock.  However, dnotify_flush()
> might not take that mutex at all - it has
> 	fsn_mark = fsnotify_find_mark(&inode->i_fsnotify_marks, dnotify_group);
> 	if (!fsn_mark)
> 		return;
> before grabbing that thing.  So things are trickier - we actually
> rely upon the barriers in fsnotify_find_mark().  And those are complicated.
> The case when it returns non-NULL is not a problem - there we have that
> mutex providing the barriers we need.  NULL can be returned in two cases:
> 	a) ->i_fsnotify_marks is not empty, but it contains no
> dnotify marks.  In that case we have ->i_fsnotify_marks.lock acquired
> and released.
> By the time it gets around to fcheck(), fcntl_dirnotify() has
> either found or created and inserted a dnotify mark into that list, with
> ->i_fsnotify_marks.lock acquired/released around the insertion, so we
> are fine - either fcntl_dirnotify() gets there first (in which case
> dnotify_flush() will observe it), or release of that lock by
> fsnotify_find_mark() called by dnotify_flush() will serve as a barrier,
> making sure that the store to the descriptor table will be observed.
> 	b) fsnotify_find_mark() (fsnotify_grab_connector() it calls,
> actually) finds ->i_fsnotify_marks empty.  That's where things
> get interesting; we have
> 	idx = srcu_read_lock(&fsnotify_mark_srcu);
> 	conn = srcu_dereference(*connp, &fsnotify_mark_srcu);
> 	if (!conn)
> 		goto out;
> on the dnotify_flush() side.  The matching store on the fcntl_dirnotify()
> side would be in fsnotify_attach_connector_to_object(), where
> we have
> 	/*
> 	 * cmpxchg() provides the barrier so that readers of *connp can see
> 	 * only initialized structure
> 	 */
> 	if (cmpxchg(connp, NULL, conn)) {
> 		/* Someone else created list structure for us */
> 
> So we have
> A:
> 	store NULL to descriptor table
> 	srcu_read_lock
> 	srcu_dereference fetches NULL from ->i_fsnotify_marks
> vs.
> B:
> 	cmpxchg replaces NULL with non-NULL in ->i_fsnotify_marks
> 	fetch from descriptor table, can't miss the store done by A
> 
> Which might be safe, but the whole thing *REALLY* needs to be discussed
> in fcntl_dirnotify() in more detail.  fs/notify/* guts are convoluted
> enough to confuse anyone unfamiliar with them.

This ordering relies on the full barrier in srcu_read_lock().  There was
a time when srcu_read_lock() did not have a full barrier, and there might
well be a time in the future when srcu_read_lock() no longer has a full
barrier.  So please add smp_mb__after_srcu_read_unlock() immediately
after the srcu_read_lock() if you are relying on that full barrier.

							Thanx, Paul