On 07/14/2013 02:34 PM, Allan Latham wrote:
> Hi all
>
> I'm running some initial sanity and performance checks on this:
>
> root@h06 /root # glusterd -V
> glusterfs 3.4.0beta4 built on Jul 10 2013 15:14:50
>
> The setup is a two node replicating cluster (h06 and h65) the client is
> on h06.
>
> The network is limited to 100Mbits/second and rtt is 0.6ms so
> performance is not spectacular but that is not the problem which
> currently concerns me.
>
> The test was to take a typical linux rootfs with 52K files of various
> sizes totaling 1.9GByte and copy it to a gluster mount using tar:
>
> root@h06 /root/mnt/h06 # sync;time (tar -c *|(cd /gluster/mnt/20 && tar
> -x);sync)
>
> real 20m24.836s
> user 0m3.517s
> sys 0m39.814s
>
> Times are not brilliant but it will be OK for the usage scenario.
>
> Extracts from 'mount':
>
> /dev/mapper/vs-h06 on /root/mnt/h06 type ext4
> (rw,relatime,barrier=1,data=ordered)
>
> /var/lib/glusterd/vols/gl/gl-fuse.vol on /gluster/mnt type
> fuse.glusterfs
> (rw,nosuid,nodev,noatime,user_id=0,group_id=0,default_permissions,allow_other,max_read=131072)
>
> Among other tests I wanted to see that the copy in /gluster/mnt/20 was
> identical to the original in /root/mnt/h06
>
> The original has 5261 symbolic links:
>
> root@h06 /root/mnt/h06 # find -type l |wc
> 5261 5261 182824
>
> The copy has only 4089:
>
> root@h65 /gluster/mnt/20 # find -type l |wc
> 4089 4089 141644
>
> Here is an example:
>
> root@h06 /root # ls -ld ~/mnt/h06/bin/netcat
> lrwxrwxrwx 1 root root 24 Jun 30 09:03 /root/mnt/h06/bin/netcat ->
> /etc/alternatives/netcat
>
> root@h06 /root # ls -ld /gluster/mnt/20/bin/netcat
> ---------- 1 root root 0 Jul 14 06:37 /gluster/mnt/20/bin/netcat
>
> A scan with md5sum on the original and the copy in gluster shows only
> these links as being different. All normal files checksum the same.
>
> The mirrored gluster filesystem on h65 (no surprise) shows the identical
> result - some symbolic links have been changed in empty files.
>
> To the best of my knowledge the gluster filesystems are identical on the
> two nodes but differ from the original.
>
> To me it appears that the command:
>
> (cd source && tar -c *)|(cd gluster && tar -x)
>
> has changed some symbolic links in 'source' into empty files in 'gluster'.

This seems related to the way tar extracts symbolic links. In a
nutshell, tar performs the following steps when creating symbolic links
at the destination:

a) Create an empty regular placeholder file with all permission bits set
to 0, under the name the symbolic link is to have.

b) Record the device number, inode number and mtime of the placeholder
file through stat.

c) After the first pass of extraction is complete, a second pass sets
the symbolic links right. In this phase a stat is performed on the
placeholder file. Only if all attributes recorded in b) match the latest
information in the stat buffer is the placeholder unlinked and the
symbolic link created. If any attribute is out of sync, the unlink and
the creation of the symbolic link do not happen.

With gluster's replication, the mtime of a placeholder file can differ
across the nodes. If the stat calls in steps b) and c) land on different
nodes, it is very likely (because of the differing mtimes) that tar
skips the creation of the symbolic link and leaves the placeholder file
behind.

A little more about this particular implementation of symlinks in tar
can be found here:

http://lists.debian.org/debian-user/2003/03/msg03249.html

To overcome this behavior, we can use the -P (--absolute-names) option
with tar during extraction. This creates the symbolic link directly and
does not involve the two-phase approach outlined above.
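
As a rough sketch of the -P workaround (I have only written this against
throwaway local directories, not a gluster mount; SRC and DST are
hypothetical stand-ins for /root/mnt/h06 and /gluster/mnt/20), the copy
can be repeated with -P on the extracting tar and the symlink counts
compared afterwards:

```shell
#!/bin/sh
# Sketch only: SRC and DST are temporary stand-ins for the original rootfs
# and the gluster mount from the report; nothing here touches gluster.
set -eu

WORK=$(mktemp -d)
SRC="$WORK/src"
DST="$WORK/dst"
mkdir -p "$SRC/bin" "$DST"

# An absolute symlink like the netcat example from the report, plus one
# regular file. The link target does not need to exist.
ln -s /etc/alternatives/netcat "$SRC/bin/netcat"
echo data > "$SRC/bin/file"

# Same pipeline as in the report, with -P added on the extracting side so
# the symlink is created directly rather than through a placeholder.
(cd "$SRC" && tar -c -f - .) | (cd "$DST" && tar -x -P -f -)

# Count symlinks on both sides; tr strips padding some wc versions add.
SRC_LINKS=$(cd "$SRC" && find . -type l | wc -l | tr -d ' ')
DST_LINKS=$(cd "$DST" && find . -type l | wc -l | tr -d ' ')
LINK_TARGET=$(readlink "$DST/bin/netcat")

echo "source links: $SRC_LINKS, copy links: $DST_LINKS"
echo "bin/netcat -> $LINK_TARGET"

rm -rf "$WORK"
```

On the real setup the same comparison would be run with the source
directory and the gluster mount in place of SRC and DST; equal counts
from find -type l on both sides would indicate that no placeholder files
were left behind.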

In addition to this, using an option like hashed read-child (available
with GlusterFS 3.4) can ensure that read calls for an inode/file always
land on the same node. With that, tar should not see varying mtimes
across calls, and the placeholder file should get replaced with the
actual symlink.

-Vijay