On Tue, Oct 22, 2019 at 12:43 PM David Wysochanski <dwysocha@xxxxxxxxxx> wrote:
>
> On Tue, Oct 22, 2019 at 12:09 PM Pavel Shilovsky <piastryyy@xxxxxxxxx> wrote:
> >
> > пн, 21 окт. 2019 г. в 20:05, David Wysochanski <dwysocha@xxxxxxxxxx>:
> > > > ? If yes, then we can create another work queue that will be
> > > > re-locking files that have been re-opened. In this case
> > > > cifs_reopen_file will simply submit the "re-lock" request to the queue
> > > > without the need to grab the lock one more time.
> > > >
> > > I think your approach of a work queue may work but it would have to
> > > be on a per-pid basis, as would any solution.
> >
> > Why do you think we should have a queue per pid? What are the possible
> > problems with having a single queue per SMB connection or Share (Tree)
> > connection?
> >
>
> What you're proposing may work but I would have to see the code and how
> lock_sem would be used. If you're serializing the relocking to only
> one pid it should be fine.
>
> I can describe what I tried (attached patch) and why it did not work.
> Here's the problem code path. Note that down_read(lock_sem) #2 is
> only a problem if down_write(lock_sem) happens between #1 and #2, but
> we're trying to avoid #2 entirely to avoid the deadlock.
>
> PID: 5464   TASK: ffff9862640b2f80  CPU: 1   COMMAND: "python3-openloo"
>  #0 [ffffbd2d81203a20] __schedule at ffffffff9aa629e3
>  #1 [ffffbd2d81203ac0] schedule at ffffffff9aa62fe8
>  #2 [ffffbd2d81203ac8] rwsem_down_read_failed at ffffffff9aa660cd
>  #3 [ffffbd2d81203b58] cifs_reopen_file at ffffffffc0529dce [cifs]
>                                          <--- down_read (lock_sem) #2
>  #4 [ffffbd2d81203c00] cifs_read.constprop.46 at ffffffffc052bbe0 [cifs]
>  #5 [ffffbd2d81203ca8] cifs_readpage_worker at ffffffffc052c0a7 [cifs]
>  #6 [ffffbd2d81203ce0] cifs_readpage at ffffffffc052c427 [cifs]
>  #7 [ffffbd2d81203d20] generic_file_buffered_read at ffffffff9a40f2b1
>  #8 [ffffbd2d81203e10] cifs_strict_readv at ffffffffc0533a7b [cifs]
>                                          <---- down_read (lock_sem) #1
>  #9 [ffffbd2d81203e40] new_sync_read at ffffffff9a4b6e21
> #10 [ffffbd2d81203ec8] vfs_read at ffffffff9a4b9671
> #11 [ffffbd2d81203f00] ksys_read at ffffffff9a4b9aaf
> #12 [ffffbd2d81203f38] do_syscall_64 at ffffffff9a2041cb
>
> Note that each of these down_read() calls is per-pid (each pid has to
> get the sem).
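
To make the failure mode above easier to see outside the kernel, here is a
small stand-alone user-space sketch. It is only an analogy, not kernel code:
it uses a writer-preferring pthread rwlock to mimic how an rwsem queues new
readers behind a waiting writer. The program hangs at the second rdlock,
which is exactly the shape of the PID 5464 trace: read lock #1 held, a
writer queues, read lock #2 blocks forever.

/* lock_sem_demo.c - build with: gcc -pthread lock_sem_demo.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t lock_sem;

/* stands in for the down_write() callers (the close/open paths) */
static void *writer(void *arg)
{
        pthread_rwlock_wrlock(&lock_sem);   /* queues behind read lock #1 */
        pthread_rwlock_unlock(&lock_sem);
        return NULL;
}

int main(void)
{
        pthread_t w;
        pthread_rwlockattr_t attr;

        pthread_rwlockattr_init(&attr);
        /* writer-priority, like an rwsem once a writer is waiting */
        pthread_rwlockattr_setkind_np(&attr,
                        PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
        pthread_rwlock_init(&lock_sem, &attr);

        pthread_rwlock_rdlock(&lock_sem);       /* #1: cifs_strict_readv()  */
        pthread_create(&w, NULL, writer, NULL); /* a down_write() arrives   */
        sleep(1);                               /* let the writer queue up  */

        printf("taking read lock #2 (cifs_reopen_file())...\n");
        pthread_rwlock_rdlock(&lock_sem);       /* blocks behind the writer */
        printf("never reached\n");
        return 0;
}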
> I tried this patch:
> --- a/fs/cifs/cifsglob.h
> +++ b/fs/cifs/cifsglob.h
> @@ -1404,6 +1404,7 @@ struct cifsInodeInfo {
>  #define CIFS_INO_DELETE_PENDING   (3) /* delete pending on server */
>  #define CIFS_INO_INVALID_MAPPING  (4) /* pagecache is invalid */
>  #define CIFS_INO_LOCK             (5) /* lock bit for synchronization */
> +#define CIFS_LOCK_SEM_HELD        (6) /* the lock_sem is held */
>   unsigned long flags;
>   spinlock_t writers_lock;
>   unsigned int writers; /* Number of writers on this inode */
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> index 5ad15de2bb4f..153adcd3ad16 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -620,11 +620,15 @@ cifs_relock_file(struct cifsFileInfo *cfile)
>   struct cifsInodeInfo *cinode = CIFS_I(d_inode(cfile->dentry));
>   struct cifs_tcon *tcon = tlink_tcon(cfile->tlink);
>   int rc = 0;
> + int lock_sem_held = 0;
>
> - down_read_nested(&cinode->lock_sem, SINGLE_DEPTH_NESTING);
> + lock_sem_held = test_bit(CIFS_LOCK_SEM_HELD, &cinode->flags);
> + if (!lock_sem_held)
> +         down_read_nested(&cinode->lock_sem, SINGLE_DEPTH_NESTING);
>   if (cinode->can_cache_brlcks) {
>           /* can cache locks - no need to relock */
> -         up_read(&cinode->lock_sem);
> +         if (!lock_sem_held)
> +                 up_read(&cinode->lock_sem);
>           return rc;
>   }
>
> @@ -635,7 +639,8 @@ cifs_relock_file(struct cifsFileInfo *cfile)
>   else
>           rc = tcon->ses->server->ops->push_mand_locks(cfile);
>
> - up_read(&cinode->lock_sem);
> + if (!lock_sem_held)
> +         up_read(&cinode->lock_sem);
>   return rc;
>  }
>
> @@ -3888,11 +3893,13 @@ cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to)
>    * with a brlock that prevents reading.
>    */
>   down_read(&cinode->lock_sem);
> + set_bit(CIFS_LOCK_SEM_HELD, &cinode->flags);
>   if (!cifs_find_lock_conflict(cfile, iocb->ki_pos, iov_iter_count(to),
>                                tcon->ses->server->vals->shared_lock_type,
>                                0, NULL, CIFS_READ_OP))
>           rc = generic_file_read_iter(iocb, to);
>   up_read(&cinode->lock_sem);
> + clear_bit(CIFS_LOCK_SEM_HELD, &cinode->flags);
>   return rc;
>  }
>
> Note that the bit is associated with the cifsInodeInfo (it is per inode),
> and we only have one file but multiple pids, so it's not an effective
> solution. To spell it out: the patch doesn't work because when the server
> goes down, you have many processes all having passed down_read() #1,
> and they are blocked waiting for a response. When the server comes
> back, at least one will see the bit is set and will skip down_read()
> #2, but then it will proceed to finish cifs_strict_readv(), and so
> clear the bit, and other pids may not see the bit set anymore, and
> will call down_read() #2, leading to the same deadlock.
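
For comparison, the work queue idea from the top of the thread could look
roughly like the sketch below. This is only an untested illustration with
invented names (cifs_relock_work, cifs_relock_worker, cifs_queue_relock);
it assumes cifs_reopen_file() would call cifs_queue_relock() instead of
cifs_relock_file(), so the re-lock runs in worker context and the reconnect
path never takes lock_sem nested under a reader that already holds it. It
does not by itself answer the per-pid serialization question raised above.

/* Untested sketch only - all names below are invented for illustration. */
struct cifs_relock_work {
        struct work_struct work;
        struct cifsFileInfo *cfile;
};

static void cifs_relock_worker(struct work_struct *work)
{
        struct cifs_relock_work *rw =
                container_of(work, struct cifs_relock_work, work);

        /* takes lock_sem in the worker, not nested under a reader */
        cifs_relock_file(rw->cfile);
        cifsFileInfo_put(rw->cfile);
        kfree(rw);
}

/* would be called from cifs_reopen_file() in place of cifs_relock_file() */
static int cifs_queue_relock(struct cifsFileInfo *cfile)
{
        struct cifs_relock_work *rw;

        rw = kmalloc(sizeof(*rw), GFP_KERNEL);
        if (!rw)
                return -ENOMEM;

        INIT_WORK(&rw->work, cifs_relock_worker);
        rw->cfile = cifsFileInfo_get(cfile);
        queue_work(cifsiod_wq, &rw->work);
        return 0;
}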
One other thing to think about here is this. The other part of the
deadlock, as seen in the reproducer, is due to this process:

PID: 5411   TASK: ffff9862788fc740  CPU: 3   COMMAND: "python3-openloo"
 #0 [ffffbd2d80fbfc60] __schedule at ffffffff9aa629e3
 #1 [ffffbd2d80fbfd00] schedule at ffffffff9aa62fe8
 #2 [ffffbd2d80fbfd08] rwsem_down_write_failed at ffffffff9aa65dbc
 #3 [ffffbd2d80fbfda8] down_write at ffffffff9aa65be9
 #4 [ffffbd2d80fbfdb8] _cifsFileInfo_put at ffffffffc052da65 [cifs]
 #5 [ffffbd2d80fbfe98] cifs_close at ffffffffc052dd3f [cifs]
 #6 [ffffbd2d80fbfea8] __fput at ffffffff9a4ba5c7
 #7 [ffffbd2d80fbfee8] task_work_run at ffffffff9a2d1c5a
 #8 [ffffbd2d80fbff20] exit_to_usermode_loop at ffffffff9a203b4b
 #9 [ffffbd2d80fbff38] do_syscall_64 at ffffffff9a2042f2
#10 [ffffbd2d80fbff50] entry_SYSCALL_64_after_hwframe at ffffffff9ac000ad

What is that process doing? Well, it is blocked in down_write(), trying
to delete and free all entries on cifs_file->llist->locks:

void
_cifsFileInfo_put(struct cifsFileInfo *cifs_file, bool wait_oplock_handler)
        /*
         * Delete any outstanding lock records. We'll lose them when the file
         * is closed anyway.
         */
        down_write(&cifsi->lock_sem);
        list_for_each_entry_safe(li, tmp, &cifs_file->llist->locks, llist) {
                list_del(&li->llist);
                cifs_del_lock_waiters(li);
                kfree(li);
        }

The other trace we see where someone is trying to down_write() is here:

PID: 14207  TASK: ffff8ac1f3369e40  CPU: 6   COMMAND: "python3-openloo"
 #0 [ffff9cfc41ccb990] __schedule at ffffffff9f9d2d29
 #1 [ffff9cfc41ccba20] schedule at ffffffff9f9d31a9
 #2 [ffff9cfc41ccba38] rwsem_down_write_slowpath at ffffffff9f131b0a
 #3 [ffff9cfc41ccbb08] cifs_new_fileinfo at ffffffffc065833a [cifs]
 #4 [ffff9cfc41ccbb78] cifs_open at ffffffffc0658a8c [cifs]
 #5 [ffff9cfc41ccbcd0] do_dentry_open at ffffffff9f2f09ca
 #6 [ffff9cfc41ccbd00] path_openat at ffffffff9f3050ad
 #7 [ffff9cfc41ccbdd8] do_filp_open at ffffffff9f307431
 #8 [ffff9cfc41ccbee0] do_sys_open at ffffffff9f2f2154
 #9 [ffff9cfc41ccbf38] do_syscall_64 at ffffffff9f0042bb
#10 [ffff9cfc41ccbf50] entry_SYSCALL_64_after_hwframe at ffffffff9fa0008c

What is that process doing? It is creating a new empty list of
cifs_fid_locks and adding it onto cinode->llist:

cifs_new_fileinfo(struct cifs_fid *fid, struct file *file,
                  struct tcon_link *tlink, __u32 oplock)
...
        cfile = kzalloc(sizeof(struct cifsFileInfo), GFP_KERNEL);
        if (cfile == NULL)
                return cfile;

        fdlocks = kzalloc(sizeof(struct cifs_fid_locks), GFP_KERNEL);
        if (!fdlocks) {
                kfree(cfile);
                return NULL;
        }

        INIT_LIST_HEAD(&fdlocks->locks);
        fdlocks->cfile = cfile;
        cfile->llist = fdlocks;
        down_write(&cinode->lock_sem);
        list_add(&fdlocks->llist, &cinode->llist);
        up_write(&cinode->lock_sem);

So back to the reader: why did it take down_read() and hold the semaphore
while going to sleep waiting for the read response (or a reconnect)?

cifs_strict_readv(struct kiocb *iocb, struct iov_iter *to)
...
        /*
         * We need to hold the sem to be sure nobody modifies lock list
         * with a brlock that prevents reading.
         */
        down_read(&cinode->lock_sem);
        if (!cifs_find_lock_conflict(cfile, iocb->ki_pos, iov_iter_count(to),
                                     tcon->ses->server->vals->shared_lock_type,
                                     0, NULL, CIFS_READ_OP))
                rc = generic_file_read_iter(iocb, to);
        up_read(&cinode->lock_sem);

Neither of the above two callers of down_write() is doing the operation
that down_read() #1 is meant to guard against - "modifying the lock list
with a brlock that prevents reading". So if there is some way to pull
that piece out, the deadlock can be solved. The existing use of lock_sem
both to a) protect reads against a brlock that prevents reading, and
b) protect modifications to cinode->llist causes unnecessary contention
on the semaphore and leads to this deadlock.
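
To make that last point concrete, one possible shape of such a split is
sketched below. This is purely hypothetical - the llist_lock field and the
exact call sites are invented, and every existing walker of cinode->llist
(cifs_find_lock_conflict() and friends) would also have to take the new
lock for this to be safe - but it shows the direction: open/close stop
taking lock_sem for plain list membership, and lock_sem is left to do only
what down_read() #1 actually needs.

/* Hypothetical sketch only - llist_lock does not exist today. */
struct cifsInodeInfo {
        /* ... existing fields ... */
        struct rw_semaphore lock_sem;  /* brlock contents vs. readers       */
        spinlock_t llist_lock;         /* (new) membership of cinode->llist */
};

/*
 * cifs_new_fileinfo(): adding an empty per-fd lock list cannot create a
 * brlock that conflicts with a read, so it would no longer need
 * down_write(&cinode->lock_sem).
 */
spin_lock(&cinode->llist_lock);
list_add(&fdlocks->llist, &cinode->llist);
spin_unlock(&cinode->llist_lock);

/*
 * _cifsFileInfo_put(): deleting this fd's lock records removes
 * restrictions rather than adding them; it would take llist_lock for the
 * list manipulation instead of down_write(&cinode->lock_sem), with the
 * lock walkers taking llist_lock as well while they traverse the lists.
 */
spin_lock(&cinode->llist_lock);
list_for_each_entry_safe(li, tmp, &cifs_file->llist->locks, llist) {
        list_del(&li->llist);
        cifs_del_lock_waiters(li);
        kfree(li);
}
list_del(&cifs_file->llist->llist);
spin_unlock(&cinode->llist_lock);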