пн, 28 окт. 2019 г. в 05:13, Frank Sorenson <sorenson@xxxxxxxxxx>: > > On 10/26/19 4:04 PM, Ronnie Sahlberg wrote: > > Steve, Pavel, Frank > > > > This patch moves the logic in cifsFileInfo_put() into a work-queue so that > > it will be processed in a different thread which, importantly, does not hold > > any other locks. > > > > This should address the deadlock that Frank reported in the thread: > > Yet another hang after network disruption or smb restart -- multiple file writers > > Pavel, > > Thanks for understanding my report and translating it into what it meant. > > Ronnie, > Thanks for the patch. I'm happy to say that my attempt at the same > looks like it would have been similar, had I known what I was doing to > begin with :) > > > Unfortunately, I think the patch really only moves the problem from one > place to another. I'm guessing that means we have a second underlying > deadlock. (this reproducer appears to be quite effective) > > > On the plus side, only the one file involved is jammed in the deadlock > now, as opposed to the parent directory as well, so it is at least progress. > Hi Frank, Ronnie, Ok, it seems there is another bug resulting in a deadlock between a page lock and writeback (WB) flag in cifs_writev_complete and cifs_writepages. 1) cifs_writev_complete. in the case of WB_SYNC_ALL and EAGAIN (reconnect) it calls cifs_writev_requeue. At this point it still has WB flag set for all the pages being retried. Inside cifs_writev_requeue it tries to lock the page: kworker/4:3+cifsiod #0 __schedule at ffffffffab4f1a45 #1 schedule at ffffffffab4f1ea0 #2 io_schedule at ffffffffab4f22a2 #3 __lock_page at ffffffffaae3f66f #4 cifs_writev_complete at ffffffffc07df762 #5 process_one_work at ffffffffaacecc11 #6 worker_thread at ffffffffaaced0f0 #7 kthread at ffffffffaacf2ac2 #8 ret_from_fork at ffffffffab600215 2) cifs_writepages. It locks the page and then waits on WB flag to be cleared before processing pages: 1 openloop.sh #0 __schedule at ffffffffab4f1a45 #1 schedule at ffffffffab4f1ea0 #2 io_schedule at ffffffffab4f22a2 #3 wait_on_page_bit at ffffffffaae3fbc6 #4 cifs_writepages at ffffffffc07ffc73 #5 do_writepages at ffffffffaae4ba71 #6 __filemap_fdatawrite_range at ffffffffaae4321b #7 filemap_write_and_wait at ffffffffaae432aa #8 cifs_flush at ffffffffc0800fe3 #9 filp_close at ffffffffaaeed181 #10 do_dup2 at ffffffffaaf12c0b #11 __x64_sys_dup2 at ffffffffaaf1306a #12 do_syscall_64 at ffffffffaac04345 The bug is in cifs_writev_complete and cifs_writev_requeue: the 1st needs to clear the WB flag and re-dirty pages before trying to retry the IO; the 2nd needs additionally to skip pages that are no longer dirty (e.g. written by another thread) and re-set the WB flag before sending pages over the wire. -- Best regards, Pavel Shilovsky