Hi Vijay

Thanks - I didn't expect a reply on a Sunday.

It amazes me sometimes that I can use a tool like tar as a familiar
friend without ever thinking about the inner workings!

-P in the GNU version (in Debian wheezy) is used to preserve leading '/'
on file names. I'll try it later just in case it's an undocumented
feature.

Please tell me more about hashed read-child - a link to a web page would
be great. I had hoped that all reads would be to the local host (exactly
2 nodes and exactly 2 replicas).

Thanks again

Allan

On 14/07/13 16:31, Vijay Bellur wrote:
> On 07/14/2013 02:34 PM, Allan Latham wrote:
>> Hi all
>>
>> I'm running some initial sanity and performance checks on this:
>>
>> root@h06 /root # glusterd -V
>> glusterfs 3.4.0beta4 built on Jul 10 2013 15:14:50
>>
>> The setup is a two-node replicating cluster (h06 and h65); the client
>> is on h06.
>>
>> The network is limited to 100 Mbit/s and rtt is 0.6ms so
>> performance is not spectacular but that is not the problem which
>> currently concerns me.
>>
>> The test was to take a typical Linux rootfs with 52K files of various
>> sizes totaling 1.9GByte and copy it to a gluster mount using tar:
>>
>> root@h06 /root/mnt/h06 # sync;time (tar -c *|(cd /gluster/mnt/20 && tar -x);sync)
>>
>> real 20m24.836s
>> user 0m3.517s
>> sys 0m39.814s
>>
>> Times are not brilliant but it will be OK for the usage scenario.
>>
>> Extracts from 'mount':
>>
>> /dev/mapper/vs-h06 on /root/mnt/h06 type ext4
>> (rw,relatime,barrier=1,data=ordered)
>>
>> /var/lib/glusterd/vols/gl/gl-fuse.vol on /gluster/mnt type fuse.glusterfs
>> (rw,nosuid,nodev,noatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
>>
>>
>> Among other tests I wanted to see that the copy in /gluster/mnt/20 was
>> identical to the original in /root/mnt/h06
>>
>> The original has 5261 symbolic links:
>>
>> root@h06 /root/mnt/h06 # find -type l |wc
>> 5261 5261 182824
>>
>> The copy has only 4089:
>>
>> root@h65 /gluster/mnt/20 # find -type l |wc
>> 4089 4089 141644
>>
>> Here is an example:
>>
>> root@h06 /root # ls -ld ~/mnt/h06/bin/netcat
>> lrwxrwxrwx 1 root root 24 Jun 30 09:03 /root/mnt/h06/bin/netcat -> /etc/alternatives/netcat
>>
>> root@h06 /root # ls -ld /gluster/mnt/20/bin/netcat
>> ---------- 1 root root 0 Jul 14 06:37 /gluster/mnt/20/bin/netcat
>>
>> A scan with md5sum on the original and the copy in gluster shows only
>> these links as being different. All normal files checksum the same.
>>
>> The mirrored gluster filesystem on h65 (no surprise) shows the identical
>> result - some symbolic links have been changed into empty files.
>>
>> To the best of my knowledge the gluster filesystems are identical on the
>> two nodes but differ from the original.
>>
>> To me it appears that the command:
>>
>> (cd source && tar -c *)|(cd gluster && tar -x)
>>
>> has changed some symbolic links in 'source' into empty files in
>> 'gluster'.
>
> This seems related to the way tar extracts symbolic links. In a nutshell
> the following steps are performed by tar for creation of symbolic links
> on the destination:
>
> a) Create an empty regular placeholder file with permission bits set to
> 0 and the name being that of the symlink source file.
>
> b) Record the device, inode numbers and the mtime of the placeholder
> file through stat.
>
> c) After the first pass of extraction is complete, there is a second
> pass involved to set right symbolic links. In this phase a stat is
> performed on the placeholder file. If all attributes recorded in b) are
> in sync with the latest information from stat buf, only then the
> placeholder is unlinked and a new symbolic link is created. If any
> attribute is out of sync, the unlink and creation of symbolic link do
> not happen.
>
> With gluster's replication, the mtimes can vary across the nodes during
> the creation of placeholder files. If the stat calls in steps b) and c)
> land on different nodes, then there is a very good likelihood (due to
> different mtimes) that tar would skip creation of symbolic links and
> leave behind the placeholder file.
>
> A little more about this particular implementation of symlinks for tar
> can be found here:
>
> http://lists.debian.org/debian-user/2003/03/msg03249.html
>
> To overcome this behavior, we can make use of -P option with tar during
> extraction. This will create the link file directly and not involve the
> 2 phased approach outlined above.
>
> In addition to this, using an option like hashed read-child (available
> with GlusterFS 3.4) can ensure that read calls for an inode/file land on
> the same node always. With that, tar should not get varying mtimes
> across calls and the placeholder file should get replaced with the
> actual symlink.
>
> -Vijay
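
For anyone checking a copy for the same symptom, here is a minimal sketch
that lists every path which is a symbolic link in the source tree but an
empty regular file in the copy, i.e. a placeholder that tar never replaced.
The function name find_leftover_placeholders is hypothetical (it is not part
of tar or GlusterFS), and the sketch assumes file names contain no newlines.

```shell
#!/bin/sh
# Hypothetical helper, not part of tar or GlusterFS.
# Usage: find_leftover_placeholders SRC_DIR DST_DIR
# Prints each path (relative to SRC_DIR) that is a symlink in SRC_DIR but
# an empty regular file in DST_DIR.
# Assumption: file names do not contain newline characters.
find_leftover_placeholders() {
    src=$1
    dst=$2
    # Enumerate every symlink in the source tree, relative to its root.
    (cd "$src" && find . -type l) | while IFS= read -r rel; do
        copy="$dst/$rel"
        # -L is tested first because -f and -s follow symlinks.
        if [ ! -L "$copy" ] && [ -f "$copy" ] && [ ! -s "$copy" ]; then
            printf '%s\n' "$rel"
        fi
    done
}
```

With the paths from this thread it would be called as
find_leftover_placeholders /root/mnt/h06 /gluster/mnt/20, and the netcat
example above should then show up as ./bin/netcat in the output.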