On Fri, Jan 11, 2019 at 5:51 PM Andreas Dilger <adilger@xxxxxxxxx> wrote:
>
> On Jan 11, 2019, at 3:02 PM, Steve French <smfrench@xxxxxxxxx> wrote:
> >
> > On Fri, Jan 11, 2019 at 3:48 PM Andreas Dilger <adilger@xxxxxxxxx> wrote:
> >>
> >> On Jan 11, 2019, at 2:13 PM, Steve French <smfrench@xxxxxxxxx> wrote:
> >>>
> >>> On Fri, Jan 11, 2019 at 7:28 AM Matthew Wilcox <willy@xxxxxxxxxxxxx> wrote:
> >>>>
> >>>> On Fri, Jan 11, 2019 at 12:11:00AM -0600, Steve French wrote:
> >>>>> In discussing an interesting problem with scp recently with Pavel, we
> >>>>> found a fairly obvious bug in scp when it is run with a progress
> >>>>> indicator (which is the default when the source file is remote).
> >>>>>
> >>>>> scp triggers "SIGALRM", probably from update_progress_meter() in
> >>>>> progressmeter.c, when executed with "scp localhost:somelargefile /mnt",
> >>>>> i.e. with an ssh source path but a local path for the target, as long as
> > <snip>
> >
> >> My typical response for issues like this is to find the github repo for
> >> the code and push a patch to fix the code myself. If I can't find the
> >> upstream project/source repo easily, I download the .src.rpm, make a
> >> patch, and submit it to the major distro package maintainers (RHEL, SLES,
> >> Ubuntu, or Debian), so that at least it is fixed for the vast majority
> >> of users, and those maintainers can handle pushing the patch further
> >> upstream (and smaller distros will copy patches/sources from them).
> >
> > I have built the scp code as an experiment, and by the way there are
> > many, many things that could be changed; other copy tools (e.g.
> > robocopy, even rsync and cp) look much more mature, with lots more
> > useful performance options. My earlier experiments, e.g. changing the
> > scp i/o size to something larger, did help as expected for various
> > cases (16K is suboptimal in many cases).
>
> Years ago I submitted a patch to fix GNU fileutils so that "cp" could use
> st_blksize to determine the data IO size for the copy. Originally it would
> allocate its copy buffer on the stack, but when Lustre returned an
> st_blksize of 8MB it exploded. 2MB was safe at the time, and is what we
> used for many years, but fixing "cp" allowed us to increase st_blksize
> over time. Probably scp should be modified to do the same.
>
> It is often just a case of the developer not knowing any better, or the
> code having been written so long ago that 16KB transfers were seen as a
> huge improvement over 4KB or 1KB transfers, and nobody has looked at it
> again. Since we all depend on other developers to know their side of the
> house (e.g. I'm wholly dependent on others for the crypto implementation
> of SSH/scp), we may as well provide some help when there is something as
> widely used as scp that we can improve in some useful way.
>
> > But in this case it seems much simpler - the code path is basically:
> >    read 16K over ssh
> >    write 16K to the local/network file system (cached)
> >    (repeated 1000s of times)
> >    then ftruncate
>
> Even if the filesystem is transferring large IOs from disk/network, doing
> many thousands of excess system calls is pure overhead that could be
> avoided. Also, even if scp can only do 16KB network transfers, staying
> in a tight "scp copies data over ssh" loop is better for
> icache/dcache/context reasons, and then doing a single large write call
> to/from the filesystem is most efficient on that side of the house.
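Something like the sketch below is what I had in mind when experimenting
with larger i/o sizes - illustration only, not scp's actual code (copy_fd
and the 64K fallback are made up for the example):

    #include <stdlib.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /*
     * Size the copy buffer from the target's st_blksize, as the old
     * fileutils fix did for cp.  The buffer is heap-allocated, so a
     * filesystem advertising a large st_blksize (e.g. Lustre's 8MB)
     * cannot blow the stack the way the original cp did.
     */
    static ssize_t copy_fd(int in_fd, int out_fd)
    {
            struct stat st;
            size_t bufsize = 64 * 1024;     /* arbitrary fallback */
            ssize_t n, total = 0;
            char *buf;

            if (fstat(out_fd, &st) == 0 && st.st_blksize > 0)
                    bufsize = st.st_blksize;

            buf = malloc(bufsize);
            if (buf == NULL)
                    return -1;

            /* short writes left unhandled to keep the sketch small */
            while ((n = read(in_fd, buf, bufsize)) > 0) {
                    if (write(out_fd, buf, n) != n) {
                            n = -1;
                            break;
                    }
                    total += n;
            }
            free(buf);
            return n < 0 ? -1 : total;
    }

On a filesystem advertising an 8MB st_blksize this collapses hundreds of
16K read/write pairs into one, which is exactly the excess-syscall
overhead you are describing.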
> >> ... why would the truncate() take a long time even if the file
> >> is large, since it shouldn't actually be doing any work to change the
> >> size of the file (I'm assuming that the target file is either empty or
> >> about the same size during an overwrite; you didn't mention anything
> >> unusual as a requirement).
> >
> > ftruncate, if it were simply setting the file size, would be very fast,
> > except that setting the file size needs to have the data written out
> > (writes could otherwise risk extending the file size), so it triggers a
> > call to writepages (filemap_write_and_wait) as a side effect. All the
> > cached data is finally written out, which can take a few seconds,
> > causing the scp progress indicator to malfunction and triggering the
> > SIGALRM. This is normally not a problem; fsync etc. can take longer
> > than a single i/o operation.
>
> This seems more like a kernel/filesystem issue. We shouldn't have to flush
> the entire file to disk just to change the file size. In the exceedingly
> common case of the current size == truncate() size, the truncate should be
> a no-op unless the on-disk size is > new size? In the unlikely case that
> the file size is > new size, it could truncate only the pages and on-disk
> blocks beyond the new size?
>
> I don't see how pages in cache < new size could cause the file size to be
> extended. Is this a VFS/MM thing, or CIFS-specific?

I suspect that for various other network/cluster file systems, flushing
(not to disk, but at least to the server, i.e. across the network) is
considered safer to do before certain metadata updates (including setting
the file size).
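That said, the "size unchanged" case could presumably be detected before
deciding whether to flush. A hand-wavy sketch of what a filesystem's
->setattr could do - not actual cifs.ko code (example_setattr is a made-up
name), and whether skipping the flush is safe for a given filesystem's
coherency model is exactly the open question:

    #include <linux/fs.h>

    /*
     * Only flush cached writes when the size is actually changing,
     * so an ftruncate() to the current in-core size - the common scp
     * case - does not block on writeback.
     */
    static int example_setattr(struct dentry *dentry, struct iattr *attr)
    {
            struct inode *inode = d_inode(dentry);

            if ((attr->ia_valid & ATTR_SIZE) &&
                attr->ia_size != i_size_read(inode))
                    filemap_write_and_wait(inode->i_mapping);

            /* ... remaining attribute updates as before ... */
            return 0;
    }

-- 
Thanks,

Steve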