> On Jul 18, 2016, at 10:32 AM, Alexey Dobriyan <adobriyan@xxxxxxxxx> wrote:
>
> On Mon, Jul 18, 2016 at 3:35 PM, Trond Myklebust
> <trondmy@xxxxxxxxxxxxxxx> wrote:
>>
>>> On Jul 18, 2016, at 07:43, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
>>>
>>> On 11 Jul 2016, at 9:28, Trond Myklebust wrote:
>>>
>>>>> On Jul 11, 2016, at 08:59, Alexey Dobriyan <adobriyan@xxxxxxxxx> wrote:
>>>>>
>>>>> We have a customer who was able to reliably reproduce the following hang
>>>>> (the hang itself is rare, but with many machines it shows up often
>>>>> enough to reproduce):
>>>>>
>>>>> INFO: task ascp:66692 blocked for more than 120 seconds.
>>>>>
>>>>>> bt 66692
>>>>> PID: 66692  TASK: ffff883f124ba280  CPU: 5  COMMAND: "ascp"
>>>>>
>>>>> __schedule
>>>>> schedule
>>>>> schedule_timeout
>>>>> io_schedule_timeout
>>>>> io_schedule
>>>>> nfs_wait_bit_uninterruptible
>>>>> __wait_on_bit
>>>>> out_of_line_wait_on_bit
>>>>> nfs_wait_on_request
>>>>> nfs_try_to_update_request
>>>>> nfs_setup_write_request
>>>>> nfs_writepage_setup
>>>>> nfs_updatepage
>>>>> nfs_write_end
>>>>> generic_file_buffered_write
>>>>> __generic_file_aio_write
>>>>> generic_file_aio_write
>>>>> nfs_file_write
>>>>> do_sync_write
>>>>> vfs_write
>>>>> sys_write
>>>>> system_call_fastpath
>>>>>
>>>>> ascp is the Aspera secure copy program. It is multithreaded.
>>>>> When the hang happens, 5 threads are in S state and 1 is in D state.
>>>>>
>>>>> The workload is copying ~100+ GB files over NFSv3 on 10 Gigabit Ethernet
>>>>> with the following mount options:
>>>>> rw,relatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=X.X.X.X,mountvers=3,mountport=300,mountproto=udp,local_lock=none,addr=X.X.X.X
>>>>>
>>>>> We got core dumps (two, actually) and the observations are below:
>>>>>
>>>>> 1) one of the inode's pagecache pages has page->private available:
>>>>> struct nfs_page {
>>>>>   wb_list = {
>>>>>     next = 0xffff885623ac4f80,
>>>>>     prev = 0xffff885623ac4f80
>>>>>   },
>>>>>   wb_page = 0xffffea01218c2600,
>>>>>   wb_context = 0xffff887f2265de00,
>>>>>   wb_lock_context = 0xffff887f2265de00,
>>>>>   wb_index = 2649328,
>>>>>   wb_offset = 0,
>>>>>   wb_pgbase = 0,
>>>>>   wb_bytes = 0,
>>>>>   wb_kref = {
>>>>>     refcount = {
>>>>>       counter = 3
>>>>>     }
>>>>>   },
>>>>>   wb_flags = 19,
>>>>>   wb_verf = {
>>>>>     data = "\000\000\000\000\000\000\000"
>>>>>   },
>>>>>   wb_this_page = 0xffff885623ac4f80,
>>>>>   wb_head = 0xffff885623ac4f80
>>>>> }
>>>>>
>>>>> ->wb_list is always empty,
>>>>> ->wb_bytes is always 0 (!),
>>>>> ->wb_kref is always 3,
>>>>> ->wb_flags = PG_BUSY|PG_MAPPED|PG_INODE_REF.
>>>>>
>>>>> page->flags = 0x2869 =
>>>>> PG_writeback|PG_private|PG_active|PG_lru|PG_uptodate|PG_locked
>>>>>
>>>>> The zero-length request is created at nfs_create_request():
>>>>> WARN_ON_ONCE(req->wb_bytes == 0)
>>>>> nfs_create_request
>>>>> nfs_setup_write_request
>>>>> nfs_updatepage
>>>>> nfs_write_end
>>>>> generic_perform_write
>>>>> generic_file_buffered_write
>>>>> __generic_file_aio_write
>>>>> generic_file_aio_write
>>>>> nfs_file_write
>>>>> do_sync_write
>>>>> vfs_write
>>>>> SyS_write
>>>>>
>>>>> with count=0 coming from generic_perform_write() (where else?).
>>>>>
>>>>> The customer says that RHEL6 was completely fine, and there are several
>>>>> reports of other people hitting the same bug with RHEL7:
>>>>> https://gist.github.com/Millnert/ecc10d8cc79c81b55d7f
>>>>> https://bugs.centos.org/view.php?id=9284
>>>>>
>>>>
>>>> Why is this being reported here and not to Red Hat? Is the bug
>>>> reproducible on the upstream kernel?
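[For context, not part of the original report: a heavily abbreviated sketch of
the generic_perform_write() loop in mm/filemap.c that the second backtrace
points at. Declarations, locking and error handling are elided and the exact
code differs between kernel versions; the point is only that a copy of zero
bytes still reaches ->write_end(), i.e. nfs_write_end() -> nfs_updatepage(),
with count == 0 before the loop retries.]

	do {
		/* Get (or create) the pagecache page for this range. */
		status = a_ops->write_begin(file, mapping, pos, bytes,
					    flags, &page, &fsdata);
		if (status < 0)
			break;

		/*
		 * Page faults are disabled here, so if the user buffer is
		 * not resident this atomic copy can legitimately return 0.
		 */
		copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
		flush_dcache_page(page);

		/*
		 * ->write_end() is called even when copied == 0.  For NFS
		 * this is nfs_write_end() -> nfs_updatepage(), which is
		 * where the zero-length nfs_page in the core dumps is
		 * created.
		 */
		status = a_ops->write_end(file, mapping, pos, bytes, copied,
					  page, fsdata);

		/*
		 * Only after write_end() does the loop notice copied == 0,
		 * fault the user pages in, and go around again to retry
		 * the copy.
		 */
		if (unlikely(copied == 0))
			continue;

		pos += copied;
		written += copied;
	} while (iov_iter_count(i));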
>>>
>>> There's a RHEL BZ open for it:
>>> https://bugzilla.redhat.com/show_bug.cgi?id=132463
>>>
>>> Upstream has the problem, too. The pgio layer doesn't expect zero-length
>>> requests, so an nfs_page can get locked but never added to a pageio
>>> descriptor.
>>>
>>> To hit this, writeback has to happen just after nfs_write_end() has
>>> created an nfs_page with wb_bytes = 0. That can happen if
>>> iov_iter_copy_from_user_atomic() fails to copy pages in
>>> generic_perform_write(). Normally we'd just go around the loop and try
>>> the copy again, but if writeback sneaks in then the request gets locked
>>> but never added to a pgio descriptor, because of the way we use
>>> zero-byte conditions to move requests around.
>>>
>>> I see two potential ways to fix this:
>>>
>>> - Just skip creation of the request if the length is zero. We're bound
>>>   to come around and create it again anyway.
>>>
>>> - Allow zero-length requests, but fix up __nfs_pageio_add_request,
>>>   nfs_generic_pg_test, and everything else that relies on zero-byte
>>>   conditions.
>>>
>>> I think the first option, handling the case early, is the simpler fix
>>> and the better approach: what would we even do with a zero-length
>>> request? Not creating them keeps us conservative in what we send out to
>>> the nfs_pageio_ops.
>>>
>>> I don't understand what conditions can cause iov_iter_copy_from_user_atomic
>>> to fail, since page faults are disabled. Can anyone enlighten me?
>>>
>>> To reproduce this upstream, I simulated iov_iter_copy_from_user_atomic()
>>> returning zero once, then inserted a small delay after nfs_write_end to
>>> let me sneak in a sync.
>>>
>>> Any advice on how to proceed? Otherwise I'll send a patch implementing
>>> the first approach. If anyone has trouble with that RHEL BZ, send me
>>> your bugzilla email and I'll give you access.
>>
>> I think adding a check to nfs_updatepage() to bail out if 'count' is zero
>> would be the right thing to do.
>
> This is exactly what we did as a workaround for the customer,
> so I can already confirm it helps. :^)

Makes sense to me too! Who wants to take the lead and post a patch? Ben?

Thanks!
-dros
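[For reference, a minimal sketch of the kind of check Trond suggests. It is
not the verbatim fs/nfs/write.c code: the prototype matches nfs_updatepage()
of this era, but the coalescing, stats and dirty-page handling around the
early bail-out are elided and approximate.]

int nfs_updatepage(struct file *file, struct page *page,
		   unsigned int offset, unsigned int count)
{
	struct nfs_open_context *ctx = nfs_file_open_context(file);
	int status = 0;

	/*
	 * Nothing was copied into the page (e.g. the atomic copy in
	 * generic_perform_write() returned 0); the caller will retry the
	 * copy, so don't set up a write request covering zero bytes that
	 * the pgio layer can never coalesce or complete.
	 */
	if (count == 0)
		return 0;

	/* ... existing extend-to-whole-page and stats logic elided ... */

	status = nfs_writepage_setup(ctx, page, offset, count);
	if (status < 0)
		nfs_set_pageerror(page);
	return status;
}

[If the second option were taken instead, the same zero-byte conditions
would have to be audited in __nfs_pageio_add_request(), nfs_generic_pg_test()
and friends, as Ben notes above.]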