Re: scp bug due to progress indicator when copying from remote to local on Linux

Steve French <smfrench@xxxxxxxxx> · Fri, 11 Jan 2019 15:13:05 -0600

On Fri, Jan 11, 2019 at 7:28 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>
> On Fri, Jan 11, 2019 at 12:11:00AM -0600, Steve French wrote:
> > In discussing an interesting problem with scp recently with Pavel, we
> > found a fairly obvious bug in scp when it is run with a progress
> > indicator (which is the default when the source file is remote).
> >
> > scp triggers "SIGALRM" probably from update_progress_meter() in
> > progressmeter.c when executed with "scp localhost:somelargefile /mnt"
> > ie with an ssh source path but a local path for the target, as long as
> > the flush of the large amount of cached data to disk (which seems to
> > be triggered by the ftruncate call in scp.c) takes more than a few
> > seconds (which can be common depending on disk or network speed).
>
> Are you saying the SIGALRM interrupts ftruncate() and causes the ftruncate
> to fail?

So ftruncate does not really fail (the file contents and size match on
source and target after the copy) but the scp 'fails' and the user
would be quite confused (and presumably the network stack doesn't like
this signal, which can cause disconnects etc. which in theory could
cause reconnect/data loss issues in some corner cases).

Here is sample output from rerunning scp a few minutes ago.  (in
tracing it earlier I can see scp triggering the SIGALRM in the
progress indicator code).  scp returned non-zero value (failure), but
scp 'succeeded' (file was copied fine).  If I run with -q (quiet mode,
no progress indicator, there is no problem). For a relatively small
file size like this it fails less than 1/2 of the time, but if the
file size is much larger (gigabytes) the ftruncate obviously takes
longer and SIGALRM will always trigger.

smfrench@smf-copy-test3:~$ scp localhost:250M /mnt/trgtfile
smfrench@localhost's password:
250M
                                                   100%  250MB
18.6MB/s   00:13    /mnt/trgtfile: truncate: Interrupted system call

smfrench@smf-copy-test3:~$ echo $?
1

smfrench@smf-copy-test3:~$ diff 250M /mnt/trgtfile

Here is sample strace output from another run, note a very long series
of crazy small write sizes (16K) which complete quite quickly since
cached by Linux mm, then a relatively long (few second) pause as
ftruncate triggers the flush which causes the SIGALRM time out in the
scp progress indicator.

<snip>
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
16384) = 16384
read(7, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
16384) = 16384
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
16384) = 16384
read(7, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
16384) = 16384
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
16384) = 16384
fcntl(7, F_GETFL)                       = 0x800 (flags O_RDONLY|O_NONBLOCK)
fcntl(7, F_SETFL, O_RDONLY)             = 0
ftruncate(3, 262144000)                 = ? ERESTARTSYS (To be
restarted if SA_RESTART is set)
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
--- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---
rt_sigreturn({mask=[ALRM]})             = 0
ioctl(1, TIOCGWINSZ, {ws_row=51, ws_col=156, ws_xpixel=0, ws_ypixel=0}) = 0
getpgrp()                               = 82563

> > It is interesting that this can be avoided easily by running in quiet
> > mode ("-q") which disables the progress meter in scp

--
Thanks,

Steve