Re: RHEL7 hang (and probably mainline) while copying big files because of subrequests merging(?)

"Benjamin Coddington" <bcodding@xxxxxxxxxx> · Mon, 18 Jul 2016 10:40:01 -0400

On 18 Jul 2016, at 10:36, Weston Andros Adamson wrote:

On Jul 18, 2016, at 10:32 AM, Alexey Dobriyan <adobriyan@xxxxxxxxx> 
wrote:

On Mon, Jul 18, 2016 at 3:35 PM, Trond Myklebust
<trondmy@xxxxxxxxxxxxxxx> wrote:

On Jul 18, 2016, at 07:43, Benjamin Coddington 
<bcodding@xxxxxxxxxx> wrote:

On 11 Jul 2016, at 9:28, Trond Myklebust wrote:

On Jul 11, 2016, at 08:59, Alexey Dobriyan <adobriyan@xxxxxxxxx> 
wrote:

We have a customer who was able to reliably reproduce the 
following hang:
(hang itself is rare but there are many machines, so it is not 
rare)

INFO: task ascp:66692 blocked for more than 120 seconds.

bt 66692
PID: 66692  TASK: ffff883f124ba280  CPU: 5   COMMAND: "ascp"

__schedule
schedule
schedule_timeout
io_schedule_timeout
io_schedule
nfs_wait_bit_uninterruptible
__wait_on_bit
out_of_line_wait_on_bit
nfs_wait_on_request
nfs_try_to_update_request
nfs_setup_write_request
nfs_writepage_setup
nfs_updatepage
nfs_write_end
generic_file_buffered_write
__generic_file_aio_write
generic_file_aio_write
nfs_file_write
do_sync_write
vfs_write
sys_write
system_call_fastpath

ascp is Aspera secure copy program. It is multithreaded.
When the hang happens, 5 threads are in S-state and 1 on D-state.

The workload is copying ~100+ GB files over NFS3 with 10GiB 
ethernet
with the following mount options:
rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=X.X.X.X,mountvers=3,mountport=300,mountproto=udp,local_lock=none,addr=X.X.X.X

We got coredump (2 actually) and observations are below:

1) one of the inode's pagecache pages has page->private 
available:
struct nfs_page {
wb_list = {
 next = 0xffff885623ac4f80,
 prev = 0xffff885623ac4f80
},
wb_page = 0xffffea01218c2600,
wb_context = 0xffff887f2265de00,
wb_lock_context = 0xffff887f2265de00,
wb_index = 2649328,
wb_offset = 0,
wb_pgbase = 0,
wb_bytes = 0,
wb_kref = {
 refcount = {
   counter = 3
 }
},
wb_flags = 19,
wb_verf = {
 data = "\000\000\000\000\000\000\000"
},
wb_this_page = 0xffff885623ac4f80,
wb_head = 0xffff885623ac4f80
}

->wb_list is always empty,
->wb_bytes is always 0 (!).
->wb_kref is always 3.
->wb_flags = PG_BUSY|PG_MAPPED|PG_INODE_REF,

page->flags = 0x2869 =
PG_writeback|PG_private|PG_active|PF_lru|PG_uptodate|PG_locked

The zero-length request is created at nfs_create_request():
WARN_ON_ONCE(req->wb_bytes == 0)
nfs_create_request
nfs_setup_write_request
nfs_updatepage
nfs_write_end
generic_perform_write
generic_file_buffered_write
__generic_file_aio_write
generic_file_aio_write
nfs_file_write
do_sync_write
vfs_write
SyS_write

with count=0 coming from generic_perform_write (where else?).

Customer is saying that RHEL6 was totally OK, and there are 
several reports
of other people hitting same bug with RHEL7:
https://gist.github.com/Millnert/ecc10d8cc79c81b55d7f
https://bugs.centos.org/view.php?id=9284

Why is this being reported here and not to Red Hat? Is the bug 
reproducible on the upstream kernel?

There's a RHEL BZ open for it:
https://bugzilla.redhat.com/show_bug.cgi?id=132463

Upstream has the problem, too.  The pgio layer doesn't expect 
zero-length
requests, so an nfs_page can get locked but never added to a pageio
descriptor.

To create such a problem, writeback has to happen just after 
nfs_write_end()
has created a nfs_page with wb_bytes = 0.  This can happen if
iov_iter_copy_from_user_atomic() fails to copy pages in
generic_perform_write().  Normally, we'd just go around the loop 
and try the
copy again, but if writeback sneaks in then the request gets locked 
but
never added to a pgio descriptor due to the way we use conditions 
on zero
bytes to move requests around.

I see two potential ways to fix:
  - Just skip the creation of the request if the length is zero.  
We're
    bound to just come around and created it again.

  - Allow zero-length requests, but fix up 
__nfs_pageio_add_request,
    nfs_generic_pg_test, and others that use conditions on zero.

I think the first option to handle the case early is the simpler 
fix, and
best approach, since: what do we do with a zero-length request?  
Not
creating them helps us be conservative in what we send out to the
nfs_pageio_ops.

I don't understand what conditions can cause 
iov_iter_copy_from_user_atomic
to fail since pagefaults are disabled.  Can anyone enlighten me?

To reproduce this upstream, I've simulated 
iov_iter_copy_from_user_atomic()
returning zero once, then inserted a small delay after 
nfs_write_end to
allow me to sneak in a sync.

Any advice on how to proceed?  I'll send a patch to fix the first 
way
otherwise.  If anyone has trouble with that RHEL BZ, send me your 
bugzilla
email and I'll give you access.

I think adding a check to nfs_updatepage() to bail out if 
‘count’ is zero would seem to be the right thing to do.

This is exactly what we did as a workaround for customer,
so I can already confirm it helps. :^)

Makes sense to me too!

Who wants to take the lead and post a patch? Ben?

Sure.. I'll send it.  I was hunting around trying to figure out an 
appropriate version number for a stable tag..   I think it became a 
problem in v3.15 or so -- but there's a lot that moved around so I am 
not sure.

Ben
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html