Mon, 21 Oct 2019 at 10:23, David Wysochanski <dwysocha@xxxxxxxxxx>:
>
> Hi all,
>
> We recently found another problem when running a simple test case
> where multiple processes open/read/close a single file, and a network
> disruption or smb restart occurs.
>
> After just a few minutes of restarts, a few of the processes error
> out, but a few remain. If you stop the disruption, the share becomes
> accessible again, but the original processes stay blocked indefinitely
> waiting on lock_sem.
>
> Once this occurs, you'll see two threads deadlocked as below. Basically:
> thread1: down_read(lock_sem) inside cifs_strict_readv(); does not
> block, but then stalls on the network temporarily
> thread2: down_write(lock_sem) inside _cifsFileInfo_put; blocks due to thread1
> thread1: down_read(lock_sem) inside cifs_reopen_file (after
> reconnect); unblocks to reconnect, but then blocks due to thread2
>
> At this point there is a deadlock: both threads are waiting for each
> other to release lock_sem.
>
> The root of the problem seems to be that we hold the read semaphore
> across a network disruption, and there is no check when the connection
> comes back before we down_read a second time.
> Inside cifs_strict_readv there is a comment explaining why we need to
> hold the semaphore over the read: so no one can modify the lock list
> with a brlock that prevents writing.
> However, in this case at least, we are only opening files for reading,
> and inside _cifsFileInfo_put we need to down_write to delete the
> outstanding lock records.
> So we are not really hitting that case, but we are modifying the list.
>
> One approach to avoid the deadlock is for the reading process to drop
> lock_sem before re-acquiring it inside cifs_reopen_file.
> We would drop the lock just before the call, and then inside we would
> need to check for the conflicting-brlock condition and possibly error
> out somehow.
> But that is probably not the preferred approach.
>
> Instead, can the reader take down_read only once? cifs_relock_file is
> called from only one place (cifs_reopen_file), and only when we are
> reconnecting, but cifs_reopen_file is called from many places. I'm not
> sure whether there is something existing we can use to indicate that we
> don't need to re-acquire it, or whether we need to add a flag or
> something?
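For what it's worth, the lock ordering described above can be reproduced
outside the kernel. The standalone C program below is only an
illustration, not CIFS code: it uses glibc's writer-preferring rwlock
kind as a stand-in for the rwsem fairness that parks a second down_read()
behind a waiting writer, and all names in it are invented for the demo.
Running it deadlocks by design.

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t lock_sem;

/* Plays the part of the cifs_strict_readv() task (pid 13275 above). */
static void *reader(void *arg)
{
	(void)arg;
	pthread_rwlock_rdlock(&lock_sem);   /* T+0: first read lock, succeeds */
	printf("reader: holding read lock, simulating the network stall\n");
	sleep(2);                           /* the writer queues up meanwhile */
	printf("reader: reopen path takes the read lock a second time...\n");
	pthread_rwlock_rdlock(&lock_sem);   /* T+2: queues behind the writer */
	printf("reader: never reached\n");
	return NULL;
}

/* Plays the part of the _cifsFileInfo_put() task (pid 13279 above). */
static void *writer(void *arg)
{
	(void)arg;
	sleep(1);                           /* arrive while the read lock is held */
	printf("writer: requesting write lock (blocks on the reader)\n");
	pthread_rwlock_wrlock(&lock_sem);   /* T+1: waits, and parks new readers */
	printf("writer: never reached\n");
	return NULL;
}

int main(void)
{
	pthread_rwlockattr_t attr;
	pthread_t r, w;

	pthread_rwlockattr_init(&attr);
	/* Writer-preferring, non-recursive readers: mimics rwsem fairness. */
	pthread_rwlockattr_setkind_np(&attr,
			PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
	pthread_rwlock_init(&lock_sem, &attr);

	pthread_create(&r, NULL, reader, NULL);
	pthread_create(&w, NULL, writer, NULL);
	pthread_join(r, NULL);              /* never returns: the threads deadlock */
	pthread_join(w, NULL);
	return 0;
}

Build with "cc -pthread demo.c"; both threads hang exactly like the two
tasks in the crash session that follows.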
>
>
> Here is the live crash session
>
> crash> ps -m | grep UN
> [0 01:54:10.166] [UN] PID: 13271 TASK: ffff93b8a7349e40 CPU: 2 COMMAND: "python3-openloo"
> [0 01:54:10.167] [UN] PID: 13273 TASK: ffff93b8b3310000 CPU: 2 COMMAND: "python3-openloo"
> [0 01:54:10.169] [UN] PID: 13269 TASK: ffff93b8a734dac0 CPU: 0 COMMAND: "python3-openloo"
> [0 01:54:10.208] [UN] PID: 13272 TASK: ffff93b8a7348000 CPU: 7 COMMAND: "python3-openloo"
> [0 01:54:10.216] [UN] PID: 13265 TASK: ffff93b8a7358000 CPU: 6 COMMAND: "python3-openloo"
> [0 01:54:10.172] [UN] PID: 13263 TASK: ffff93b8b3313c80 CPU: 2 COMMAND: "python3-openloo"
> [0 01:54:10.229] [UN] PID: 13267 TASK: ffff93b8a7359e40 CPU: 5 COMMAND: "python3-openloo"
> [0 01:54:10.234] [UN] PID: 13277 TASK: ffff93b8a7dddac0 CPU: 3 COMMAND: "python3-openloo"
> [0 01:54:10.235] [UN] PID: 13270 TASK: ffff93b8a734bc80 CPU: 1 COMMAND: "python3-openloo"
> [0 01:54:10.244] [UN] PID: 13276 TASK: ffff93b8a7dd8000 CPU: 6 COMMAND: "python3-openloo"
> [0 01:54:10.260] [UN] PID: 13264 TASK: ffff93b8b3315ac0 CPU: 6 COMMAND: "python3-openloo"
> [0 01:54:10.257] [UN] PID: 13274 TASK: ffff93b8a7ddbc80 CPU: 1 COMMAND: "python3-openloo"
> [0 01:54:31.262] [UN] PID: 13279 TASK: ffff93b8a7d90000 CPU: 1 COMMAND: "python3-openloo" <--- block same instant
> [0 01:54:31.265] [UN] PID: 13275 TASK: ffff93b8a7dd9e40 CPU: 4 COMMAND: "python3-openloo" <--- block same instant
>
> crash> bt 13279
> PID: 13279 TASK: ffff93b8a7d90000 CPU: 1 COMMAND: "python3-openloo"
>  #0 [ffffaae1c0a77c20] __schedule at ffffffffa19d4239
>  #1 [ffffaae1c0a77cb0] schedule at ffffffffa19d46b9
>  #2 [ffffaae1c0a77cc8] rwsem_down_write_slowpath at ffffffffa1131ada  <------------- T+1: pid 13279 down_write(... lock_sem); blocks due to pid 13275
>  #3 [ffffaae1c0a77d98] _cifsFileInfo_put at ffffffffc0759818 [cifs]
>  #4 [ffffaae1c0a77e78] cifs_close at ffffffffc0759b4f [cifs]
>  #5 [ffffaae1c0a77e88] __fput at ffffffffa12f6e51
>  #6 [ffffaae1c0a77ed0] task_work_run at ffffffffa10fc827
>  #7 [ffffaae1c0a77f08] exit_to_usermode_loop at ffffffffa1003f83
>  #8 [ffffaae1c0a77f38] do_syscall_64 at ffffffffa10043c2
>  #9 [ffffaae1c0a77f50] entry_SYSCALL_64_after_hwframe at ffffffffa1a0008c
>     RIP: 00007fc6caca1778  RSP: 00007ffed0c43de8  RFLAGS: 00000246
>     RAX: 0000000000000000  RBX: 00007fc6bd5a5f30  RCX: 00007fc6caca1778
>     RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000003
>     RBP: 0000558ebb9ca0d0   R8: 00007fc6cab941a8   R9: 0000000000000000
>     R10: 0000558ebb9ca0d0  R11: 0000000000000246  R12: 0000000000000000
>     R13: 0000000000000000  R14: 00007fc6cab4df0c  R15: 00007ffed0c43f10
>     ORIG_RAX: 0000000000000003  CS: 0033  SS: 002b
>
> crash> bt 13275
> PID: 13275 TASK: ffff93b8a7dd9e40 CPU: 4 COMMAND: "python3-openloo"
>  #0 [ffffaae1c0a57a00] __schedule at ffffffffa19d4239
>  #1 [ffffaae1c0a57a90] schedule at ffffffffa19d46b9
>  #2 [ffffaae1c0a57aa8] rwsem_down_read_slowpath at ffffffffa19d72b4
>  #3 [ffffaae1c0a57b60] cifs_reopen_file at ffffffffc0755352 [cifs]  <------------- T+2: pid 13275 down_read(... lock_sem); blocks due to pid 13279
>  #4 [ffffaae1c0a57c08] cifs_readpage_worker at ffffffffc07575f0 [cifs]
>  #5 [ffffaae1c0a57cd0] cifs_readpage at ffffffffc0758096 [cifs]
>  #6 [ffffaae1c0a57d08] generic_file_read_iter at ffffffffa123acd2
>  #7 [ffffaae1c0a57e10] cifs_strict_readv at ffffffffc07606a5 [cifs]  <------------- T+0: pid 13275 down_read(... lock_sem); does not block
>  #8 [ffffaae1c0a57e40] new_sync_read at ffffffffa12f336a
>  #9 [ffffaae1c0a57ec8] vfs_read at ffffffffa12f5fad
> #10 [ffffaae1c0a57f00] ksys_read at ffffffffa12f62df
> #11 [ffffaae1c0a57f38] do_syscall_64 at ffffffffa10042bb
> #12 [ffffaae1c0a57f50] entry_SYSCALL_64_after_hwframe at ffffffffa1a0008c
>     RIP: 00007f6ed0b7c055  RSP: 00007ffeea637e68  RFLAGS: 00000246
>     RAX: ffffffffffffffda  RBX: 00007f6ed05cd600  RCX: 00007f6ed0b7c055
>     RDX: 000000000000001e  RSI: 00007f6ec3404c10  RDI: 0000000000000003
>     RBP: 000000000000001e   R8: 00007f6ed0a6f1a8   R9: 0000000000000000
>     R10: 00007f6ed090faf0  R11: 0000000000000246  R12: 00007f6ec3480f40
>     R13: 0000000000000003  R14: 00007f6ec3404c10  R15: 000055c449ea60d0
>     ORIG_RAX: 0000000000000000  CS: 0033  SS: 002b
>
>
> Here is my tracing from the attached debug patch, which shows the
> conditions and points at the bugzilla we have open:
> [ 233.588248] CIFS VFS: file=fs/cifs/file.c line=480 pid=13279 cifs_sem_op op=up_write sem=ffff93b8a7424bd8
> [ 233.589462] CIFS VFS: file=fs/cifs/file.c line=311 pid=13279 cifs_sem_op op=down_write sem=ffff93b8a7424bd8
> [ 233.592333] CIFS VFS: file=fs/cifs/file.c line=314 pid=13275 cifs_sem_op op=up_write sem=ffff93b8a7424bd8
> [ 233.612324] CIFS VFS: file=fs/cifs/file.c line=314 pid=13279 cifs_sem_op op=up_write sem=ffff93b8a7424bd8
> [ 233.612510] CIFS VFS: file=fs/cifs/file.c line=3920 pid=13275 cifs_sem_op op=down_read sem=ffff93b8a7424bd8
> [ 233.616487] CIFS VFS: file=fs/cifs/file.c line=3920 pid=13279 cifs_sem_op op=down_read sem=ffff93b8a7424bd8   T0: https://bugzilla.redhat.com/show_bug.cgi?id=1760835#c9
> [ 254.389458] CIFS VFS: Send error in read = -9
> [ 254.391943] CIFS VFS: file=fs/cifs/file.c line=3926 pid=13279 cifs_sem_op op=up_read sem=ffff93b8a7424bd8
> [ 254.397453] CIFS VFS: file=fs/cifs/file.c line=471 pid=13279 cifs_sem_op op=down_write sem=ffff93b8a7424bd8   T1: https://bugzilla.redhat.com/show_bug.cgi?id=1760835#c9
> [ 254.399798] CIFS VFS: file=fs/cifs/file.c line=630 pid=13275 cifs_sem_op op=down_read sem=ffff93b8a7424bd8   T2: https://bugzilla.redhat.com/show_bug.cgi?id=1760835#c9
>
>
> There was a patch that removed a lockdep warning, but it looks like
> this removed a semi-valid warning, because clearly we can deadlock
> here:
>
> commit 560d388950ceda5e7c7cdef7f3d9a8ff297bbf9d
> Author: Rabin Vincent <rabinv@xxxxxxxx>
> Date:   Wed May 3 17:17:21 2017 +0200
>
>     CIFS: silence lockdep splat in cifs_relock_file()
>
>     cifs_relock_file() can perform a down_write() on the inode's lock_sem even
>     though it was already performed in cifs_strict_readv(). Lockdep complains
>     about this. AFAICS, there is no problem here, and lockdep just needs to be
>     told that this nesting is OK.
>
>     =============================================
>     [ INFO: possible recursive locking detected ]
>     4.11.0+ #20 Not tainted
>     ---------------------------------------------
>     cat/701 is trying to acquire lock:
>      (&cifsi->lock_sem){++++.+}, at: cifs_reopen_file+0x7a7/0xc00
>
>     but task is already holding lock:
>      (&cifsi->lock_sem){++++.+}, at: cifs_strict_readv+0x177/0x310
>
>     other info that might help us debug this:
>      Possible unsafe locking scenario:
>
>            CPU0
>            ----
>       lock(&cifsi->lock_sem);
>       lock(&cifsi->lock_sem);
>
>      *** DEADLOCK ***
>
>      May be due to missing lock nesting notation
>
>     1 lock held by cat/701:
>      #0:  (&cifsi->lock_sem){++++.+}, at: cifs_strict_readv+0x177/0x310
>
>     stack backtrace:
>     CPU: 0 PID: 701 Comm: cat Not tainted 4.11.0+ #20
>     Call Trace:
>      dump_stack+0x85/0xc2
>      __lock_acquire+0x17dd/0x2260
>      ? trace_hardirqs_on_thunk+0x1a/0x1c
>      ? preempt_schedule_irq+0x6b/0x80
>      lock_acquire+0xcc/0x260
>      ? lock_acquire+0xcc/0x260
>      ? cifs_reopen_file+0x7a7/0xc00
>      down_read+0x2d/0x70
>      ? cifs_reopen_file+0x7a7/0xc00
>      cifs_reopen_file+0x7a7/0xc00
>      ? printk+0x43/0x4b
>      cifs_readpage_worker+0x327/0x8a0
>      cifs_readpage+0x8c/0x2a0
>      generic_file_read_iter+0x692/0xd00
>      cifs_strict_readv+0x29f/0x310
>      generic_file_splice_read+0x11c/0x1c0
>      do_splice_to+0xa5/0xc0
>      splice_direct_to_actor+0xfa/0x350
>      ? generic_pipe_buf_nosteal+0x10/0x10
>      do_splice_direct+0xb5/0xe0
>      do_sendfile+0x278/0x3a0
>      SyS_sendfile64+0xc4/0xe0
>      entry_SYSCALL_64_fastpath+0x1f/0xbe
>
>     Signed-off-by: Rabin Vincent <rabinv@xxxxxxxx>
>     Acked-by: Pavel Shilovsky <pshilov@xxxxxxxxxxxxx>
>     Signed-off-by: Steve French <smfrench@xxxxxxxxx>

So, even after we have already grabbed the semaphore, it doesn't allow
us to acquire it one more time because of the conflicting waiting
down_write?

If yes, then we can create another work queue that re-locks files that
have been re-opened. In that case cifs_reopen_file would simply submit
the "re-lock" request to the queue without needing to grab the lock one
more time.

Thoughts?

--
Best regards,
Pavel Shilovsky
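To make the work-queue idea concrete, here is a rough, untested sketch
of what a deferred relock could look like. This is not an actual patch:
the cifs_relock_work / cifs_relock_worker / cifs_queue_relock names are
invented for illustration, and the fragment assumes the fs/cifs/file.c
context where cifs_relock_file() and cifsiod_wq are visible.

/* Hypothetical sketch only -- not a tested fix. */
struct cifs_relock_work {
	struct work_struct work;
	struct cifsFileInfo *cfile;	/* reference held until the work runs */
};

static void cifs_relock_worker(struct work_struct *work)
{
	struct cifs_relock_work *rw =
		container_of(work, struct cifs_relock_work, work);

	/*
	 * cifs_relock_file() takes lock_sem itself; from this clean worker
	 * context that acquisition no longer nests inside the down_read()
	 * still held by the original reader in cifs_strict_readv(), so the
	 * circular wait with the down_write() in _cifsFileInfo_put() is
	 * broken.
	 */
	cifs_relock_file(rw->cfile);

	cifsFileInfo_put(rw->cfile);
	kfree(rw);
}

/* Called from cifs_reopen_file() instead of relocking inline. */
static int cifs_queue_relock(struct cifsFileInfo *cfile)
{
	struct cifs_relock_work *rw = kzalloc(sizeof(*rw), GFP_KERNEL);

	if (!rw)
		return -ENOMEM;

	INIT_WORK(&rw->work, cifs_relock_worker);
	rw->cfile = cifsFileInfo_get(cfile);
	queue_work(cifsiod_wq, &rw->work);
	return 0;
}

The open question with something like this is what the read that
triggered the reconnect should do while the relock is still pending,
i.e. whether it can safely proceed before the brlocks have been
re-established on the server.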