On Fri, Jan 11, 2019 at 5:51 PM Andreas Dilger <adilger@xxxxxxxxx> wrote:
>
> On Jan 11, 2019, at 3:02 PM, Steve French <smfrench@xxxxxxxxx> wrote:
> >
> > On Fri, Jan 11, 2019 at 3:48 PM Andreas Dilger <adilger@xxxxxxxxx> wrote:
> >>
> >> On Jan 11, 2019, at 2:13 PM, Steve French <smfrench@xxxxxxxxx> wrote:
> >>>
> >>> On Fri, Jan 11, 2019 at 7:28 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> >>>>
> >>>> On Fri, Jan 11, 2019 at 12:11:00AM -0600, Steve French wrote:
> >>>>> In discussing an interesting problem with scp recently with Pavel, we
> >>>>> found a fairly obvious bug in scp when it is run with a progress
> >>>>> indicator (which is the default when the source file is remote).
> >>>>>
> >>>>> scp triggers "SIGALRM", probably from update_progress_meter() in
> >>>>> progressmeter.c, when executed with "scp localhost:somelargefile /mnt",
> >>>>> i.e. with an ssh source path but a local path for the target, as long as
> > <snip>
> >
> >> My typical response for issues like this is to find the github repo for
> >> the code and push a patch to fix the code myself. If I can't find the
> >> upstream project/source repo easily, I download the .src.rpm, make a
> >> patch, and submit it to the major distro package maintainers (RHEL, SLES,
> >> Ubuntu, or Debian), so that at least it is fixed for the vast majority
> >> of users, and those maintainers can handle pushing the patch further
> >> upstream (and smaller distros will copy patches/sources from them).
> >
> > I have built the scp code as an experiment, and by the way there are
> > many, many things that could be changed; other copy tools (e.g.
> > robocopy, even rsync and cp) look much more mature, with lots more
> > useful performance options. My earlier experiments, e.g. changing the
> > scp i/o size to something larger, did help as expected for various
> > cases (16K is suboptimal in many cases).
>
> Years ago I submitted a patch to fix GNU fileutils so that "cp" could use
> st_blksize to determine the data IO size for the copy. Originally it would
> allocate its copy buffer on the stack, but when Lustre returned an
> st_blksize of 8MB it exploded. 2MB was safe at the time, and is what we
> used for many years, but fixing "cp" allowed us to increase st_blksize
> over time. Probably scp should be modified to do the same.
>
> It is often just a case of the developer not knowing any better, or the
> code having been written so long ago that 16KB transfers were seen as a
> huge improvement over 4KB or 1KB transfers, and nobody has looked at it
> again. Since we all depend on other developers to know their side of the
> house (e.g. I'm wholly dependent on others for the crypto implementation
> of SSH/scp), we may as well provide some help when there is something as
> widely used as scp that we can improve in some useful way.
>
> > But in this case it seems much simpler - the code path is basically:
> >    read 16K over ssh
> >    write 16K to the local/network file system (cached)
> >    (repeated 1000s of times)
> >    then ftruncate
>
> Even if the filesystem is transferring large IOs from disk/network, doing
> many thousands of excess system calls is pure overhead that could be
> avoided. Also, even if scp can only do 16KB network transfers, staying
> in a tight "scp copies data over ssh" loop is better for
> icache/dcache/context reasons, and then doing a single large write call
> to/from the filesystem is most efficient on that side of the house.
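Something like the sketch below is what I had in mind when experimenting
with larger i/o sizes - illustration only, not scp's actual code (copy_fd
and the 64K fallback are made up for the example):

    #include <stdlib.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /*
     * Size the copy buffer from the target's st_blksize, as the old
     * fileutils fix did for cp.  The buffer is heap-allocated, so a
     * filesystem advertising a large st_blksize (e.g. Lustre's 8MB)
     * cannot blow the stack the way the original cp did.
     */
    static ssize_t copy_fd(int in_fd, int out_fd)
    {
            struct stat st;
            size_t bufsize = 64 * 1024;     /* arbitrary fallback */
            ssize_t n, total = 0;
            char *buf;

            if (fstat(out_fd, &st) == 0 && st.st_blksize > 0)
                    bufsize = st.st_blksize;

            buf = malloc(bufsize);
            if (buf == NULL)
                    return -1;

            /* short writes left unhandled to keep the sketch small */
            while ((n = read(in_fd, buf, bufsize)) > 0) {
                    if (write(out_fd, buf, n) != n) {
                            n = -1;
                            break;
                    }
                    total += n;
            }
            free(buf);
            return n < 0 ? -1 : total;
    }

On a filesystem advertising an 8MB st_blksize this collapses hundreds of
16K read/write pairs into one, which is exactly the excess-syscall
overhead you are describing.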
> >> ... why would the truncate() take a long time even if the file
> >> is large, since it shouldn't actually be doing any work to change the
> >> size of the file (I'm assuming that the target file is either empty or
> >> about the same size during an overwrite; you didn't mention anything
> >> unusual as a requirement).
> >
> > ftruncate, if it were simply setting the file size, would be very fast,
> > except that setting the file size needs to have the data written out
> > (writes could otherwise risk extending the file size), so it triggers a
> > call to writepages (filemap_write_and_wait) as a side effect. All the
> > cached data is finally written out, which can take a few seconds,
> > causing the scp progress indicator to malfunction and triggering the
> > SIGALRM. This is normally not a problem; fsync etc. can take longer
> > than a single i/o operation.
>
> This seems more like a kernel/filesystem issue. We shouldn't have to flush
> the entire file to disk just to change the file size. In the exceedingly
> common case of the current size == truncate() size, the truncate should be
> a no-op unless the on-disk size is > new size? In the unlikely case that
> the file size is > new size, it could truncate only the pages and on-disk
> blocks beyond the new size?
>
> I don't see how pages in cache < new size could cause the file size to be
> extended. Is this a VFS/MM thing, or CIFS-specific?

I suspect that for various other network/cluster file systems, flushing
(not to disk, but at least to the server, i.e. across the network) is
considered safer to do before certain metadata updates (including setting
the file size).
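That said, the "size unchanged" case could presumably be detected before
deciding whether to flush. A hand-wavy sketch of what a filesystem's
->setattr could do - not actual cifs.ko code (example_setattr is a made-up
name), and whether skipping the flush is safe for a given filesystem's
coherency model is exactly the open question:

    #include <linux/fs.h>

    /*
     * Only flush cached writes when the size is actually changing,
     * so an ftruncate() to the current in-core size - the common scp
     * case - does not block on writeback.
     */
    static int example_setattr(struct dentry *dentry, struct iattr *attr)
    {
            struct inode *inode = d_inode(dentry);

            if ((attr->ia_valid & ATTR_SIZE) &&
                attr->ia_size != i_size_read(inode))
                    filemap_write_and_wait(inode->i_mapping);

            /* ... remaining attribute updates as before ... */
            return 0;
    }

-- 
Thanks,

Steve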