On Thu, Oct 15, 2020 at 11:16:06AM -0400, Vivek Goyal wrote:
> On Wed, Oct 14, 2020 at 07:44:16PM -0700, Linus Torvalds wrote:
> > On Wed, Oct 14, 2020 at 6:48 PM Qian Cai <cai@xxxxxx> wrote:
> > >
> > > While on this topic, I just want to bring up a bug report that we are chasing an
> > > issue that a process is stuck in the loop of wait_on_page_bit_common() for more
> > > than 10 minutes before I gave up.
> >
> > Judging by call trace, that looks like a deadlock rather than a missed wakeup.
> >
> > The trace isn't reliable, but I find it suspicious that the call trace
> > just before the fault contains that
> > "iov_iter_copy_from_user_atomic()".
> >
> > IOW, I think you're in fuse_fill_write_pages(), which has allocated
> > the page, locked it, and then it takes a page fault.
> >
> > And the page fault waits on a page that is locked.
> >
> > This is a classic deadlock.
> >
> > The *intent* is that iov_iter_copy_from_user_atomic() returns zero,
> > and you retry without the page lock held.
> >
> > HOWEVER.
> >
> > That's not what fuse actually does. Fuse will do multiple pages, and
> > it will unlock only the _last_ page. It keeps the other pages locked,
> > and puts them in an array:
> >
> >         ap->pages[ap->num_pages] = page;
> >
> > And after the iov_iter_copy_from_user_atomic() fails, it does that
> > "unlock" and repeat.
> >
> > But while the _last_ page was unlocked, the *previous* pages are still
> > locked in that array. Deadlock.
> >
> > I really don't think this has anything at all to do with page locking,
> > and everything to do with fuse_fill_write_pages() having a deadlock if
> > the source of data is a mmap of one of the pages it is trying to write
> > to (just with an offset, so that it's not the last page).
> >
> > See a similar code sequence in generic_perform_write(), but notice how
> > that code only has *one* page that it locks, and never holds an array
> > of pages around over that iov_iter_fault_in_readable() thing.
>
> Indeed. This is a deadlock in fuse. Thanks for the analysis. I can now
> trivially reproduce it with following program.

I am wondering how I should fix this issue. Is it enough to drop the
page lock (but keep the reference) inside the loop and, once copying
from user space is done, take the page locks for all pages again
(patch attached below)? Or does dropping the page lock mean there is
no guarantee that the page did not get written back and removed from
the address space, with a new page placed at the same offset? Does
that mean I should instead be looking up the page cache again after
copying from user space is done?

---
 fs/fuse/file.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

Index: redhat-linux/fs/fuse/file.c
===================================================================
--- redhat-linux.orig/fs/fuse/file.c	2020-10-15 14:11:23.464720742 -0400
+++ redhat-linux/fs/fuse/file.c	2020-10-15 15:23:17.716569698 -0400
@@ -1117,7 +1117,7 @@ static ssize_t fuse_fill_write_pages(str
 	struct fuse_conn *fc = get_fuse_conn(mapping->host);
 	unsigned offset = pos & (PAGE_SIZE - 1);
 	size_t count = 0;
-	int err;
+	int err, i;
 
 	ap->args.in_pages = true;
 	ap->descs[0].offset = offset;
@@ -1155,6 +1155,14 @@ static ssize_t fuse_fill_write_pages(str
 			goto again;
 		}
 
+		/*
+		 * Unlock page but retain reference count. Unlock is required,
+		 * otherwise it is possible that the next loop iteration tries
+		 * to fault in a page (source address) we have a lock on and
+		 * we will deadlock. So drop the lock but keep a reference on
+		 * the page. Re-acquire page locks after breaking out of the
+		 * loop.
+		 */
+		unlock_page(page);
 		err = 0;
 		ap->pages[ap->num_pages] = page;
 		ap->descs[ap->num_pages].length = tmp;
@@ -1171,7 +1179,15 @@ static ssize_t fuse_fill_write_pages(str
 	} while (iov_iter_count(ii) && count < fc->max_write &&
 		 ap->num_pages < max_pages && offset == 0);
 
-	return count > 0 ? count : err;
+	if (count <= 0)
+		return err;
+
+	for (i = 0; i < ap->num_pages; i++) {
+		/* Take page lock */
+		lock_page(ap->pages[i]);
+	}
+
+	return count;
 }
 
 static inline unsigned int fuse_wr_pages(loff_t pos, size_t len,
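
In case it helps to see the scenario Linus describes in one place, here
is a rough, illustrative reproducer sketch (not the actual program
referenced above; the mount point, file name, page size and offsets are
assumptions and may need tweaking): mmap a file on a fuse mount and
write() back into the same file from that mapping, with the destination
shifted by one page, so that the source page for the second destination
page is the first destination page, which fuse_fill_write_pages() still
holds locked.

/*
 * Illustrative sketch only (assumed path, assumed 4 KiB pages): write()
 * into a fuse file using an mmap of the same file as the source buffer,
 * with the destination shifted by one page.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/mnt/fuse/testfile";	/* assumed fuse mount */
	const size_t pgsz = 4096;			/* assumed page size */
	const size_t len = 2 * pgsz;
	char *addr;
	int fd;

	fd = open(path, O_RDWR | O_CREAT, 0644);
	if (fd < 0 || ftruncate(fd, 4 * pgsz) < 0) {
		perror("open/ftruncate");
		return 1;
	}

	addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/*
	 * Destination is file pages 1 and 2; source is file pages 0 and 1
	 * via the mapping. Once fuse has destination page 1 locked in
	 * ap->pages[], faulting in the source for destination page 2 needs
	 * to lock file page 1 again, and the task should hang in
	 * wait_on_page_bit_common(), as in the report above.
	 */
	if (pwrite(fd, addr, len, pgsz) < 0)
		perror("pwrite");

	munmap(addr, len);
	close(fd);
	return 0;
}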