Re: Copy tools on Linux

Hi Steve,

On 06-29 21:37, Steve French wrote:
> I have been looking at i/o patterns from various copy tools on Linux,
> and it is pretty discouraging - I am hoping that I am forgetting an
> important one that someone can point me to ...
> 
> Some general problems:
> 1) if source and target on the same file system it would be nice to
> call the copy_file_range syscall (AFAIK only test tools call that),
> although in some cases at least cp can do it for --reflink

I have submitted a patch set [1] that lets copy_file_range() work across
filesystems, at least by falling back to splice(), as part of enabling
hole handling in copy_file_range(). It has not been merged so far.
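
For illustration, here is a minimal userspace sketch of the same "try the
in-kernel copy, otherwise fall back" idea. It is not the patch set in [1].
The function name copy_file, the 8M chunk size and the set of errno values
treated as "fall back" are my assumptions.

```python
import errno
import os

# Assumption: these errno values mean "the kernel cannot do this copy here".
FALLBACK_ERRNOS = {errno.EXDEV, errno.ENOSYS, errno.EINVAL, errno.EOPNOTSUPP}


def copy_file(src_path, dst_path, chunk=8 * 1024 * 1024):
    """Copy src_path to dst_path; return the number of bytes copied.

    Hypothetical helper: prefers copy_file_range() and switches to a plain
    read()/write() loop if the kernel rejects the call.
    """
    src = os.open(src_path, os.O_RDONLY)
    try:
        dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            use_cfr = hasattr(os, "copy_file_range")  # Python 3.8+, Linux
            copied = 0
            while True:
                if use_cfr:
                    try:
                        # Uses and advances both file offsets.
                        n = os.copy_file_range(src, dst, chunk)
                    except OSError as e:
                        if e.errno not in FALLBACK_ERRNOS:
                            raise
                        use_cfr = False  # continue from the current offsets
                        continue
                else:
                    buf = os.read(src, chunk)
                    n = len(buf)
                    view = memoryview(buf)
                    while view:  # os.write() may write less than asked
                        view = view[os.write(dst, view):]
                if n == 0:  # end of file
                    return copied
                copied += n
        finally:
            os.close(dst)
    finally:
        os.close(src)
```

Usage would be something like copy_file("/mnt/a/big.img", "/mnt/b/big.img"),
with both paths being placeholders.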

> 2) if source and target on different file systems there are multiple problems
>     a) smaller i/o  (rsync e.g. maxes at 128K!)
>     b) no async parallelized writes sent down to the kernel so writes
> get serialized (either through page cache, or some fs offer option to
> disable it - but it still is one thread at a time)
>     c) sparse file support is mediocre (although cp has some support
> for it, and can call fiemap in some cases)
>     d) for file systems that prefer setting the file size first (to
> avoid metadata penalties with multiple extending writes) - AFAIK only
> rsync offers that, but rsync is one of the slowest tools otherwise
> 
> I have looked at cp, dd, scp, rsync, gio, gcp ... are there others?
> 
> What I am looking for (and maybe we just need to patch cp and rsync
> etc.) is more like what you see with other OS ...
> 1) options for large i/o sizes (network latencies in network/cluster
> fs can be large, so prefer larger 1M or 8M in some cases I/Os)

Unfortunately, tools derive the I/O size from stat.st_blksize, which may
be too small for efficient I/O. However, tools such as cp also scan for
runs of zeros to convert into holes, and a small block size works well
for that. OTOH, that is not the most common use case for these tools,
which I agree could be made faster.
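
A rough sketch of how the two concerns could be separated: read with a
large buffer, but still detect zero runs at st_blksize granularity. This
is not how cp is implemented. The names sparse_copy and write_all and the
1M minimum I/O size are assumptions for illustration.

```python
import os


def write_all(fd, data):
    """Write all of data to fd, retrying on short writes."""
    view = memoryview(data)
    while view:
        view = view[os.write(fd, view):]


def sparse_copy(src_path, dst_path, min_io=1024 * 1024):
    """Copy a file, turning blocks of zeros into holes; return its size."""
    src = os.open(src_path, os.O_RDONLY)
    try:
        dst = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            st = os.fstat(src)
            hole_block = st.st_blksize or 4096   # zero-detection granularity
            io_size = max(st.st_blksize, min_io)  # read size, at least min_io
            zeros = bytes(hole_block)
            total = 0
            while True:
                buf = os.read(src, io_size)
                if not buf:
                    break
                for off in range(0, len(buf), hole_block):
                    piece = buf[off:off + hole_block]
                    if piece == zeros[:len(piece)]:
                        # Skip instead of writing: leaves a hole on
                        # filesystems that support sparse files.
                        os.lseek(dst, len(piece), os.SEEK_CUR)
                    else:
                        write_all(dst, piece)
                total += len(buf)
            # A trailing hole does not extend the file by itself.
            os.ftruncate(dst, total)
            return total
        finally:
            os.close(dst)
    finally:
        os.close(src)
```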


> 2) parallelizing writes so not just one write in flight at a time

What would the resulting file be in case of errors? Should the
destination file be considered partially copied? man cp does not cover
the error case, but currently it is assumed that the file is partially
copied and correct up to the point of error.
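
To make the question concrete, here is one possible answer as a sketch:
issue chunked pread()/pwrite() calls from a thread pool and, on failure,
report the lowest failed offset so the caller still knows the prefix that
is correct. The name parallel_copy, the chunk size and the worker count
are assumptions. This is not a proposal for cp semantics.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def parallel_copy(src_fd, dst_fd, size, chunk=1024 * 1024, workers=4):
    """Copy size bytes using several writes in flight.

    Returns (valid_prefix, failed_offsets). valid_prefix is the number of
    bytes from offset 0 known to be copied; chunks beyond it may or may
    not have been written.
    """
    def copy_chunk(off):
        end = min(off + chunk, size)
        pos = off
        while pos < end:
            buf = os.pread(src_fd, end - pos, pos)
            if not buf:
                raise EOFError("source shrank during copy")
            view = memoryview(buf)
            while view:  # handle short writes
                n = os.pwrite(dst_fd, view, pos)
                view = view[n:]
                pos += n

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {off: pool.submit(copy_chunk, off)
                   for off in range(0, size, chunk)}
    # Leaving the "with" block waits for every chunk to finish.
    failed = sorted(off for off, fut in futures.items()
                    if fut.exception() is not None)
    valid_prefix = failed[0] if failed else size
    return valid_prefix, failed
```

A caller that wants today's behaviour could ftruncate() the destination to
valid_prefix after a failure.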

> 3) options to turn off the page cache (large number of large file
> copies are not going to benefit from reuse of pages in the page cache
> so going through the page cache may be suboptimal in that case)

In most cases, the page cache is faster than direct I/O. Yes, large files
may not benefit from page reuse, but it would still be faster to use up
the memory and defer writeback.

> 4) option to set the file size first, and then fill in writes (so
> non-extending writes)

File size or file allocation? How would you determine what file
size to set? Consider the case where the source file is sparse. It can
be calculated, but needs more thought.
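
To show the difference, a small sketch: ftruncate() sets only the size,
while posix_fallocate() reserves blocks, and the data ranges of a sparse
source can be found with SEEK_DATA/SEEK_HOLE. The helper names
data_extents and prepare_destination are made up for this example.

```python
import errno
import os


def data_extents(fd):
    """Yield (start, end) byte ranges of fd that contain data.

    Note: this moves the file offset of fd. Filesystems without hole
    support report the whole file as one data extent.
    """
    size = os.fstat(fd).st_size
    pos = 0
    while pos < size:
        try:
            start = os.lseek(fd, pos, os.SEEK_DATA)
        except OSError as e:
            if e.errno == errno.ENXIO:  # only a hole remains after pos
                break
            raise
        end = os.lseek(fd, start, os.SEEK_HOLE)  # EOF counts as a hole
        yield start, end
        pos = end


def prepare_destination(src_fd, dst_fd, preallocate=False):
    """Set the destination size up front; optionally allocate blocks."""
    size = os.fstat(src_fd).st_size
    os.ftruncate(dst_fd, size)  # file size only, no blocks allocated
    if preallocate:
        # Allocate only where the source has data, so holes stay holes.
        for start, end in data_extents(src_fd):
            os.posix_fallocate(dst_fd, start, end - start)
    return size
```

Subsequent pwrite() calls into the data extents are then non-extending
writes.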

> 5) sparse file support
> (and it would also be nice to support copy_file_range syscall ... but
> that is unrelated to the above)

Yup, that is the main objective of [1].

> 
> Am I missing some magic tool?  Seems like Windows has various options
> for copy tools - but looking at Linux i/o patterns from these tools
> was pretty depressing - I am hoping that there are other choices.
> 


[1] https://www.spinics.net/lists/linux-fsdevel/msg128450.html

-- 
Goldwyn
