On 04/27/2015 03:00 PM, Ernie Dunbar wrote:
On 2015-04-27 14:09, Joe Julian wrote:
I've also noticed that if I increase the count of those writes, the
transfer speed increases as well:
2097152 bytes (2.1 MB) copied, 0.036291 s, 57.8 MB/s
root@backup:/home/webmailbak# dd if=/dev/zero of=/mnt/testfile count=2048 bs=1024; sync
2048+0 records in
2048+0 records out
2097152 bytes (2.1 MB) copied, 0.0362724 s, 57.8 MB/s
root@backup:/home/webmailbak# dd if=/dev/zero of=/mnt/testfile count=2048 bs=1024; sync
2048+0 records in
2048+0 records out
2097152 bytes (2.1 MB) copied, 0.0360319 s, 58.2 MB/s
root@backup:/home/webmailbak# dd if=/dev/zero of=/mnt/testfile count=10240 bs=1024; sync
10240+0 records in
10240+0 records out
10485760 bytes (10 MB) copied, 0.127219 s, 82.4 MB/s
root@backup:/home/webmailbak# dd if=/dev/zero of=/mnt/testfile count=10240 bs=1024; sync
10240+0 records in
10240+0 records out
10485760 bytes (10 MB) copied, 0.128671 s, 81.5 MB/s
This is correct: small files carry a per-file overhead, so the smaller
the file, the less throughput you get. That said, since the files are
smaller you should get more files / second but fewer MB / second. I have
found that once you go under 16k, changing the file size doesn't matter:
you will get the same number of 16k files / second as you do 1k files.
The overhead happens regardless. You just notice it more when you're
doing it a lot more frequently.
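
If you want to see the files / second side of this on your own mount, something like the following is a minimal sketch. The function name, the file naming and the list of sizes are made up for illustration, and TARGET is assumed to point at a directory on the gluster mount (it falls back to a local temp directory). date +%s only has one-second resolution, so raise COUNT until each run takes several seconds.

```shell
#!/bin/sh
# Hypothetical helper: write COUNT files of SIZE bytes into DIR, then
# report how long that took. Not part of gluster or rsync.
bench_small_files() {
    dir=$1; size=$2; count=$3
    start=$(date +%s)
    i=0
    while [ "$i" -lt "$count" ]; do
        # One dd per file, so the per-file overhead is paid every time.
        dd if=/dev/zero of="$dir/f_${size}_$i" bs="$size" count=1 2>/dev/null
        i=$((i + 1))
    done
    sync
    end=$(date +%s)
    elapsed=$((end - start))
    if [ "$elapsed" -gt 0 ]; then
        echo "$size bytes: $count files in ${elapsed}s ($((count / elapsed)) files/s)"
    else
        echo "$size bytes: $count files in under 1s (raise the count for a usable rate)"
    fi
}

# Assumption: TARGET is a directory on the gluster mount, e.g. /mnt/benchdir.
TARGET=${TARGET:-$(mktemp -d)}
COUNT=${COUNT:-200}
for size in 1024 4096 16384 65536; do
    bench_small_files "$TARGET" "$size" "$COUNT"
done
```

Comparing the 1024 and 16384 rows against the 65536 row is the quickest way to check the "same files / second under 16k" observation on a given volume.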
Well, it would be helpful to know what specifically rsync is trying to
do when it's sitting there making overhead, and whether it's possible
to tell rsync to avoid doing it, and just copy files instead (which it
does quite quickly).
I suppose technically speaking, it's an rsync-specific question, but
it's all about making rsync and glusterfs play nice, and we pretty
much all need to know that!
Yes, that's very rsync specific. Rsync not only checks the file's
metadata, but it also does a hash comparison.
Each lookup() of each file requires a lookup from *each* replica server.
Lookups are triggered on open, or even fstat. Since rsync requests the
stat of every file for comparison, this requires a little extra network
time. The client queries all the replicas, in case one is out of date, to
ensure it's returning accurate results (and to heal a replica if it needs it).
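
A rough way to see what the stat traffic alone costs, before any data is copied, is to stat every file in a tree and time it. This is only a sketch: stat_walk is a made-up helper name, and the two paths in the comments are placeholders for a directory on the gluster mount and a copy of the same tree on local disk.

```shell
#!/bin/sh
# Hypothetical helper: force a stat of every regular file under DIR
# (ls -ld has to stat each one) and print how many files were visited.
# File names containing newlines would be miscounted.
stat_walk() {
    find "$1" -type f -exec ls -ld {} + | wc -l
}

# Usage, assuming bash so that `time` can wrap a shell function:
#   time stat_walk /mnt/some/dir      # placeholder: on the gluster mount
#   time stat_walk /srv/local/copy    # placeholder: same tree on local disk
```

The difference between the two timings is roughly the per-file lookup cost that rsync pays on every run, whether or not anything has changed.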
After building a list of files that differ between the source and the
target, rsync copies each file to a temporary filename. After completing
the temporary file, rsync renames it to the target filename. This has
the disadvantage of putting the target file on the "wrong" DHT subvolume,
because the hash of the temporary filename is different from the hash of
the target filename.
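
If the temp-file-and-rename step is what hurts, rsync's --inplace option writes straight to the target filename instead. A minimal sketch, with a made-up wrapper name and placeholder paths; adding --whole-file is my own assumption that skipping the delta-transfer algorithm is what you want for a bulk copy:

```shell
#!/bin/sh
# Hypothetical wrapper: SRC and DEST are placeholders supplied by the caller.
rsync_to_gluster() {
    src=$1; dest=$2
    # -a            archive mode (recursive, keep permissions and times)
    # --inplace     write directly to the target name, no temp file + rename
    # --whole-file  send whole files, skipping the delta-transfer algorithm
    rsync -a --inplace --whole-file "$src" "$dest"
}

# Example call (placeholder paths):
#   rsync_to_gluster /home/webmail/ /mnt/webmail/
```

The trade-off is that an interrupted transfer leaves a partially written file under its real name, so this fits an initial load better than a sync against files that are in use.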
rsync's network transfer is over ssh by default, so you're also hitting
encryption and buffer overhead.
The optimum process for *initially* copying large numbers of files to
gluster would be to blindly copy a list of files from the source to the
target without reading the target. If copying across a network,
maximizing the packet size is also advantageous. I've found tar ( + pv
if you want to monitor throughput) + netcat (or socat) to be much faster
than rsync.
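
For reference, the tar approach looks roughly like this. It is a sketch, not a tested recipe for your setup: tar_copy is a made-up helper that does the blind copy between two local paths, and the host name, port 9999 and directories in the commented network variant are placeholders. The socat stream is not encrypted, so only use it on a network you trust.

```shell
#!/bin/sh
# Hypothetical helper: stream SRC into DEST through a tar pipe.
# Nothing on the target is read or compared; files are simply written.
tar_copy() {
    src=$1; dest=$2
    mkdir -p "$dest"
    tar -C "$src" -cf - . | tar -C "$dest" -xf -
}

# Network variant (placeholders: receiver-host, port 9999, both paths).
# Receiver, on a machine with the gluster volume mounted; start it first:
#   socat -u TCP-LISTEN:9999,reuseaddr - | tar -C /mnt/target -xf -
# Sender; pv is optional and only reports throughput:
#   tar -C /source/dir -cf - . | pv | socat -u - TCP:receiver-host:9999
```

Because nothing on the target is compared first, this only makes sense for the initial copy; later incremental runs still need something like rsync.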
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users