Re: scp bug due to progress indicator when copying from remote to local on Linux

Andreas Dilger <adilger@xxxxxxxxx> · Fri, 11 Jan 2019 16:51:00 -0700

On Jan 11, 2019, at 3:02 PM, Steve French <smfrench@xxxxxxxxx> wrote:
> 
> On Fri, Jan 11, 2019 at 3:48 PM Andreas Dilger <adilger@xxxxxxxxx> wrote:
>> 
>> On Jan 11, 2019, at 2:13 PM, Steve French <smfrench@xxxxxxxxx> wrote:
>>> 
>>> On Fri, Jan 11, 2019 at 7:28 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
>>>> 
>>>> On Fri, Jan 11, 2019 at 12:11:00AM -0600, Steve French wrote:
>>>>> In discussing an interesting problem with scp recently with Pavel, we
>>>>> found a fairly obvious bug in scp when it is run with a progress
>>>>> indicator (which is the default when the source file is remote).
>>>>> 
>>>>> scp triggers "SIGALRM" probably from update_progress_meter() in
>>>>> progressmeter.c when executed with "scp localhost:somelargefile /mnt"
>>>>> ie with an ssh source path but a local path for the target, as long as
> <snip>
> 
>> My typical response for issues like this is to find the github repo for
>> the code and push a patch to fix the code myself.  If I can't find the
>> upstream project/source repo easily, download the .src.rpm and make a
>> patch and submit it to the major distro package maintainers (RHEL, SLES,
>> Ubuntu or Debian), so that at least it is fixed for the vast majority
>> of users, and those maintainers can handle pushing the patch further
>> upstream (and smaller distros will copy patches/sources from them).
> 
> I have built the scp code as an experiment and by the way there are many,
> many things that could be changed (other copy tools e.g. robocopy even rsync
> and cp) look much more mature (with lots more useful performance options).
> My earlier experiments e.g. changing the scp i/o size to something larger
> did help as expected for various cases (16K is suboptimal in many cases).

Years ago I submitted a patch to fix GNU fileutils so that "cp" could use
st_blksize to determine the data IO size for the copy.  Originally it was
allocate its copy buffer on the stack, but when Lustre returned st_blksize
of 8MB it exploded.  2MB was safe at the time, and what we used for many
years but fixing "cp" allowed us to increase st_blksize over time. Probably
scp should be modified to do the same.

It is often just a case of the developer didn't know any better, or the code
was written so long ago that 16KB transfers were seen as a huge improvement
over 4KB or 1KB transfers, and nobody has looked at it again.  Since we all
depend on other developers to know their side of the house (e.g. I'm wholly
dependent on others for the crypto implementation of SSH/scp), we may as
well provide some help when there is something as widely used as scp that
we can improve in some useful way.

> But in this case it seems much simpler - the code path is basically
> read 16K over ssh
> write 16K to local/network file system (cached)
> (repeated 1000s of times)
> then ftruncate

Even if the filesystem is transferring large IOs from disk/network, doing
many thousands of excess system calls is pure overhead that could be avoided.
Also, even if scp could only do 16KB network transfers, staying in tight
"scp copies data over ssh" loop is better for icache/dcache/context, then
doing a single large write call to/from the filesystem is most efficient
on that side of the house.

>> ... why would the truncate() take a long time even if the file
>> is large, since it shouldn't actually be doing any work to change the
>> size of the file (I'm assuming that the target file is either empty or
>> about the same size during an overwrite, you didn't mention anything
>> unusual as a requirement).
> 
> ftruncate if it were simply setting the file size would be very fast,
> except that setting the file size needs to have the data written out
> (writes otherwise could risk extending the file size) so triggers a
> call to writepages (filemap_write_and_wait) as a sideeffect
> so all the cached data is finally written out which can take a few
> seconds and cause the scp progress indicator to malfunction
> and trigger the SIGALRM.  This is normally not a problem
> fsync etc. can take longer than a single i/o operation.

This seems more like a kernel/filesystem issue.  We shouldn't have to flush
the entire file to disk just to change the file size.  In the exceedingly
common case of the current size == truncate() size, the truncate should be
a no-op unless the on-disk size is > new size?  In the unlikely case that
file size > new size then it could truncate only the pages and on-disk
blocks beyond the new size?

I don't see how pages in cache < new size could cause the file size to be
extended.  Is this a VFS/MM thing, or CIFS-specific?

Cheers, Andreas

Attachment:
signature.asc

Description: Message signed with OpenPGP