Re: scp bug due to progress indicator when copying from remote to local on Linux

Steve French <smfrench@xxxxxxxxx> · Fri, 11 Jan 2019 17:18:31 -0600

On Fri, Jan 11, 2019 at 5:05 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Fri, Jan 11, 2019 at 03:50:02PM -0600, Steve French wrote:
> > On Fri, Jan 11, 2019 at 3:22 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > > Right ... so the code never calls ftruncate() again.  Changing all of
> > > userspace is just not going to happen; maybe you could get stuff fixed in
> > > libc, but really ftruncate() should only be interrupted by a fatal signal
> > > and not by SIGALRM.
> >
> > Looking at the places wait_event_interruptible is done I didn't see code
> > in fs/cifs that would match the (presumably) code path, mostly those
> > calls are in smbdirect (RDMA) code - for example cifs_setattr does call
> > filemap_write_and_wait but as it goes down into the mm layer and then to
> > cifs_writepages and the SMB3 write code, I didn't spot a
> > "wait_event_interruptible" in that path (I might have missed something
> > in the mm layer).  I do see one in the cifs reconnect path, but that is
> > not what we are typically hitting.   Any ideas how to match what we are
> > blocked in when we get the annoying SIGALRM?  Another vague thought - is
> > it possible to block SIGALRM across all of cifs_setattr?  If it is - why
> > do so few (only 3!) file systems (ceph, jffs2, ocfs2) ever call
> > sigprocmask?
>
> You can see where a task is currently sleeping with 'cat /proc/$pid/stack'.
> If you can provoke a long duration ftruncate, that'd be a good place to
> start looking.

Not surprisingly it is waiting in mm code:

root@smf-copy-test3:~# cat /proc/92189/stack
[<0>] io_schedule+0x16/0x40
[<0>] wait_on_page_bit_common+0x14f/0x350
[<0>] __filemap_fdatawait_range+0x104/0x160
[<0>] filemap_write_and_wait+0x4d/0x90
[<0>] cifs_setattr+0xc9/0xe80 [cifs]
[<0>] notify_change+0x2d2/0x460
[<0>] do_truncate+0x78/0xc0
[<0>] do_sys_ftruncate+0x14c/0x1c0
[<0>] __x64_sys_ftruncate+0x1b/0x20
[<0>] do_syscall_64+0x5a/0x110
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff

So that brings me back to thinking about whether it is practical to mask
non-fatal signals in a few places in cifs.ko, as apparently at least a
few file systems (ceph, jffs2 and ocfs2) do.  In particular, mask SIGALRM
across calls to filemap_write_and_wait.

-- 
Thanks,

Steve