On 18 Jul 2016, at 10:36, Weston Andros Adamson wrote:
On Jul 18, 2016, at 10:32 AM, Alexey Dobriyan <adobriyan@xxxxxxxxx> wrote:On Mon, Jul 18, 2016 at 3:35 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:On Jul 18, 2016, at 07:43, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:On 11 Jul 2016, at 9:28, Trond Myklebust wrote:On Jul 11, 2016, at 08:59, Alexey Dobriyan <adobriyan@xxxxxxxxx> wrote:We have a customer who was able to reliably reproduce the following hang: (hang itself is rare but there are many machines, so it is not rare)INFO: task ascp:66692 blocked for more than 120 seconds.bt 66692PID: 66692 TASK: ffff883f124ba280 CPU: 5 COMMAND: "ascp" __schedule schedule schedule_timeout io_schedule_timeout io_schedule nfs_wait_bit_uninterruptible __wait_on_bit out_of_line_wait_on_bit nfs_wait_on_request nfs_try_to_update_request nfs_setup_write_request nfs_writepage_setup nfs_updatepage nfs_write_end generic_file_buffered_write __generic_file_aio_write generic_file_aio_write nfs_file_write do_sync_write vfs_write sys_write system_call_fastpath ascp is Aspera secure copy program. It is multithreaded. When the hang happens, 5 threads are in S-state and 1 on D-state.The workload is copying ~100+ GB files over NFS3 with 10GiB ethernetwith the following mount options: rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=X.X.X.X,mountvers=3,mountport=300,mountproto=udp,local_lock=none,addr=X.X.X.X We got coredump (2 actually) and observations are below:1) one of the inode's pagecache pages has page->private available:struct nfs_page { wb_list = { next = 0xffff885623ac4f80, prev = 0xffff885623ac4f80 }, wb_page = 0xffffea01218c2600, wb_context = 0xffff887f2265de00, wb_lock_context = 0xffff887f2265de00, wb_index = 2649328, wb_offset = 0, wb_pgbase = 0, wb_bytes = 0, wb_kref = { refcount = { counter = 3 } }, wb_flags = 19, wb_verf = { data = "\000\000\000\000\000\000\000" }, wb_this_page = 0xffff885623ac4f80, wb_head = 0xffff885623ac4f80 } ->wb_list is always empty, ->wb_bytes is always 0 (!). ->wb_kref is always 3. ->wb_flags = PG_BUSY|PG_MAPPED|PG_INODE_REF, page->flags = 0x2869 = PG_writeback|PG_private|PG_active|PF_lru|PG_uptodate|PG_locked The zero-length request is created at nfs_create_request(): WARN_ON_ONCE(req->wb_bytes == 0) nfs_create_request nfs_setup_write_request nfs_updatepage nfs_write_end generic_perform_write generic_file_buffered_write __generic_file_aio_write generic_file_aio_write nfs_file_write do_sync_write vfs_write SyS_write with count=0 coming from generic_perform_write (where else?).Customer is saying that RHEL6 was totally OK, and there are several reportsof other people hitting same bug with RHEL7: https://gist.github.com/Millnert/ecc10d8cc79c81b55d7f https://bugs.centos.org/view.php?id=9284Why is this being reported here and not to Red Hat? Is the bug reproducible on the upstream kernel?There's a RHEL BZ open for it: https://bugzilla.redhat.com/show_bug.cgi?id=132463Upstream has the problem, too. The pgio layer doesn't expect zero-lengthrequests, so an nfs_page can get locked but never added to a pageio descriptor.To create such a problem, writeback has to happen just after nfs_write_end()has created a nfs_page with wb_bytes = 0. This can happen if iov_iter_copy_from_user_atomic() fails to copy pages ingeneric_perform_write(). Normally, we'd just go around the loop and try the copy again, but if writeback sneaks in then the request gets locked but never added to a pgio descriptor due to the way we use conditions on zerobytes to move requests around. I see two potential ways to fix:- Just skip the creation of the request if the length is zero. We'rebound to just come around and created it again.- Allow zero-length requests, but fix up __nfs_pageio_add_request,nfs_generic_pg_test, and others that use conditions on zero.I think the first option to handle the case early is the simpler fix, and best approach, since: what do we do with a zero-length request? Notcreating them helps us be conservative in what we send out to the nfs_pageio_ops.I don't understand what conditions can cause iov_iter_copy_from_user_atomicto fail since pagefaults are disabled. Can anyone enlighten me?To reproduce this upstream, I've simulated iov_iter_copy_from_user_atomic() returning zero once, then inserted a small delay after nfs_write_end toallow me to sneak in a sync.Any advice on how to proceed? I'll send a patch to fix the first way otherwise. If anyone has trouble with that RHEL BZ, send me your bugzillaemail and I'll give you access.I think adding a check to nfs_updatepage() to bail out if ‘count’ is zero would seem to be the right thing to do.This is exactly what we did as a workaround for customer, so I can already confirm it helps. :^)Makes sense to me too! Who wants to take the lead and post a patch? Ben?
Sure.. I'll send it. I was hunting around trying to figure out an appropriate version number for a stable tag.. I think it became a problem in v3.15 or so -- but there's a lot that moved around so I am not sure.
Ben -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html