Re: scp bug due to progress indicator when copying from remote to local on Linux

Steve French <smfrench@xxxxxxxxx> · Fri, 11 Jan 2019 15:50:02 -0600

On Fri, Jan 11, 2019 at 3:22 PM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Fri, Jan 11, 2019 at 03:13:05PM -0600, Steve French wrote:
> > On Fri, Jan 11, 2019 at 7:28 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> > > Are you saying the SIGALRM interrupts ftruncate() and causes the ftruncate
> > > to fail?
> >
> > So ftruncate does not really fail (the file contents and size match on
> > source and target after the copy) but the scp 'fails' and the user
> > would be quite confused (and presumably the network stack doesn't like
> > this signal, which can cause disconnects etc. which in theory could
> > cause reconnect/data loss issues in some corner cases).
>
> You've run into the problem that userspace simply doesn't check the
> return value from syscalls.  It's not just scp, it's every program.
> Looking through cifs, you seem to do a lot of wait_event_interruptible()
> where you maybe should be doing wait_event_killable()?
>
> > ftruncate(3, 262144000)                 = ? ERESTARTSYS (To be restarted if SA_RESTART is set)
> > --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
> > --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
> > rt_sigreturn({mask=[ALRM]})             = 0
> > ioctl(1, TIOCGWINSZ, {ws_row=51, ws_col=156, ws_xpixel=0, ws_ypixel=0}) = 0
> > getpgrp()                               = 82563
>
> Right ... so the code never calls ftruncate() again.  Changing all of
> userspace is just not going to happen; maybe you could get stuff fixed in
> libc, but really ftruncate() should only be interrupted by a fatal signal
> and not by SIGALRM.

Looking at the places where wait_event_interruptible is called, I didn't
see code in fs/cifs that would match the (presumed) code path; those calls
are mostly in the smbdirect (RDMA) code.  For example, cifs_setattr does
call filemap_write_and_wait, but as it goes down into the mm layer and then
to cifs_writepages and the SMB3 write code, I didn't spot a
"wait_event_interruptible" in that path (I might have missed something in
the mm layer).  I do see one in the cifs reconnect path, but that is not
what we are typically hitting.  Any ideas how to identify what we are
blocked in when we get the annoying SIGALRM?  Another vague thought: is it
possible to block SIGALRM across all of cifs_setattr?  If it is, why do so
few (only 3!) file systems (ceph, jffs2, ocfs2) ever call sigprocmask?

-- 
Thanks,

Steve