On Fri, Jan 11, 2019 at 7:28 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote: > > On Fri, Jan 11, 2019 at 12:11:00AM -0600, Steve French wrote: > > In discussing an interesting problem with scp recently with Pavel, we > > found a fairly obvious bug in scp when it is run with a progress > > indicator (which is the default when the source file is remote). > > > > scp triggers "SIGALRM" probably from update_progress_meter() in > > progressmeter.c when executed with "scp localhost:somelargefile /mnt" > > ie with an ssh source path but a local path for the target, as long as > > the flush of the large amount of cached data to disk (which seems to > > be triggered by the ftruncate call in scp.c) takes more than a few > > seconds (which can be common depending on disk or network speed). > > Are you saying the SIGALRM interrupts ftruncate() and causes the ftruncate > to fail? So ftruncate does not really fail (the file contents and size match on source and target after the copy) but the scp 'fails' and the user would be quite confused (and presumably the network stack doesn't like this signal, which can cause disconnects etc. which in theory could cause reconnect/data loss issues in some corner cases). Here is sample output from rerunning scp a few minutes ago. (in tracing it earlier I can see scp triggering the SIGALRM in the progress indicator code). scp returned non-zero value (failure), but scp 'succeeded' (file was copied fine). If I run with -q (quiet mode, no progress indicator, there is no problem). For a relatively small file size like this it fails less than 1/2 of the time, but if the file size is much larger (gigabytes) the ftruncate obviously takes longer and SIGALRM will always trigger. smfrench@smf-copy-test3:~$ scp localhost:250M /mnt/trgtfile smfrench@localhost's password: 250M 100% 250MB 18.6MB/s 00:13 /mnt/trgtfile: truncate: Interrupted system call smfrench@smf-copy-test3:~$ echo $? 1 smfrench@smf-copy-test3:~$ diff 250M /mnt/trgtfile Here is sample strace output from another run, note a very long series of crazy small write sizes (16K) which complete quite quickly since cached by Linux mm, then a relatively long (few second) pause as ftruncate triggers the flush which causes the SIGALRM time out in the scp progress indicator. <snip> write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 16384 read(7, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 16384 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 16384 read(7, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 16384 write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 16384) = 16384 fcntl(7, F_GETFL) = 0x800 (flags O_RDONLY|O_NONBLOCK) fcntl(7, F_SETFL, O_RDONLY) = 0 ftruncate(3, 262144000) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) --- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} --- --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} --- rt_sigreturn({mask=[ALRM]}) = 0 ioctl(1, TIOCGWINSZ, {ws_row=51, ws_col=156, ws_xpixel=0, ws_ypixel=0}) = 0 getpgrp() = 82563 > > It is interesting that this can be avoided easily by running in quiet > > mode ("-q") which disables the progress meter in scp -- Thanks, Steve