Re: Questions about gluster reblance

"Paul Guo" <bigpaulguo@xxxxxxxxxxx> · Fri, 12 Sep 2014 11:55:59 +0800

Hello Shyam. Thanks for the reply. Please see my replies below, each starting with [paul:].

Please add me to the recipient list, in addition to gluster-users, when replying.
That makes it easier for me to reply, since I am subscribed to gluster-users in
digest mode (there was no other choice, if I remember correctly).

Date: Wed, 10 Sep 2014 10:36:41 -0400
From: Shyam <srangana@xxxxxxxxxx>
To: gluster-users@xxxxxxxxxxx
Subject: Re: [Gluster-users] Questions about gluster reblance
Message-ID: <541061F9.7000800@xxxxxxxxxx>
Content-Type: text/plain; charset=UTF-8; format=flowed

On 09/10/2014 03:27 AM, Paul Guo wrote:
> Hello,
>
> Recently I spent a bit of time understanding rebalance, since I want to know
> how it performs as more and more bricks are added to my glusterfs volume and
> as the number of files and directories in the existing volume keeps growing.
> During the test I saw some behavior that I'm
> really confused about.
>
> Steps:
>
> SW versions: glusterfs 3.4.4 + centos 6.5
> Initial configuration: replica 2, lab1:/brick1 + lab2:/brick1
>
> fuse_mount it on /mnt
> cp -rf /sbin /mnt (300+ files under /sbin)
> add two more bricks: lab1:/brick2 + lab2:/brick2.
> run gluster rebalance.
>
> 1) fix-layout only (e.g. gluster volume rebalance g1 fix-layout start)
>
> After rebalance is done (observed via "gluster volume rebalance g1
> status"),
> I found there is no file under lab1:/brick2/sbin. The hash ranges of
> new brick lab1:/brick2/sbin and old brick lab1:/brick1/sbin appear to
> be ok.
>
> [root@lab1 Desktop]# getfattr -dm. -e hex /brick2/sbin
> getfattr: Removing leading '/' from absolute path names
> # file: brick2/sbin
> trusted.gfid=0x35976c2034d24dc2b0639fde18de007d
> trusted.glusterfs.dht=0x00000001000000007fffffffffffffff
>
> [root@lab1 Desktop]# getfattr -dm. -e hex /brick1/sbin
> getfattr: Removing leading '/' from absolute path names
> # file: brick1/sbin
> trusted.gfid=0x35976c2034d24dc2b0639fde18de007d
> trusted.glusterfs.dht=0x0000000100000000000000007ffffffe
>
> The question is: AFAIK, fix-layout would create "linkto" files
> (files that carry the "linkto" xattr and have only the sticky bit set)
> for those files whose hash values belong
> to the new subvol. So there should have been some "linkto" files
> under lab1:/brick2, but there are none now. Why?

fix-layout only fixes the layout, i.e. it spreads the layout to the newer
bricks (or bricks previously not participating in the layout). It would
not create the linkto files.

Post fix-layout, if one were to perform a lookup on a file that should
have belonged to the newer brick as per the layout and the hash of that file
name, one would then see the linkto file present on that brick.

Hope this explains (1).

[paul:]

After fix-layout is complete, I mount the volume on /mnt, then
run "ls -l /mnt/sbin/*" and "file /mnt/sbin/*".
I found that only a few linkto files are created, while most of the
linkto files that should have been created under the new brick (i.e. brick2)
are not.
[root@lab1 ~]# ls -l /brick2/sbin
total 0
---------T 2 root root 0 Sep 12 09:26 dmraid
---------T 2 root root 0 Sep 12 09:26 initctl
---------T 2 root root 0 Sep 12 09:26 ip6tables-multi
---------T 2 root root 0 Sep 12 09:26 portreserve
---------T 2 root root 0 Sep 12 09:26 reboot
---------T 2 root root 0 Sep 12 09:26 swapon
[root@lab1 ~]# getfattr -dm. -e hex /brick2/sbin
getfattr: Removing leading '/' from absolute path names
# file: brick2/sbin
trusted.gfid=0x94bc07cd18914a91ab12fbe931c63431
trusted.glusterfs.dht=0x00000001000000007fffffffffffffff
[root@lab1 ~]# ./gethash.py reboot
0xd48b11f6L
[root@lab1 ~]# ./gethash.py swapon
0x93129578L

The hash values of reboot & swapon are in the range of
/brick2/sbin (i.e. 0x7fffffff - 0xffffffff), so the linkto files
for these two binaries are expected. But other expected linkto
files are missing, e.g. the one for xfsdump:
[root@lab1 ~]# ./gethash.py xfsdump
0xc17ff86bL

Even after I umount /mnt, stop and then start the volume,
restart glusterd, remount /mnt and repeat the experiment,
I still find no more linkto files under /brick2/sbin.
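
The gethash.py helper used above is not included in this message. A minimal
Python 3 sketch of such a helper follows, assuming it wraps gf_dm_hashfn()
from libglusterfs through ctypes, as the blog script referenced further down
does. The library name "libglusterfs.so.0", the C signature
(const char *, int) -> uint32_t, and the names load_hashfn/dht_hash are
assumptions for illustration, not something confirmed in this thread.

```python
import ctypes


def load_hashfn(libname="libglusterfs.so.0"):
    """Return a ctypes wrapper for gf_dm_hashfn(), or None if unavailable.

    Assumption: the library exports
        uint32_t gf_dm_hashfn(const char *msg, int len);
    """
    try:
        lib = ctypes.CDLL(libname)
        fn = lib.gf_dm_hashfn
    except (OSError, AttributeError):
        # Library not installed, or symbol not exported.
        return None
    fn.argtypes = [ctypes.c_char_p, ctypes.c_int]
    fn.restype = ctypes.c_uint32
    return fn


def dht_hash(name, hashfn=None):
    """Hash a file name (not a full path); None if the library is missing."""
    if hashfn is None:
        hashfn = load_hashfn()
    if hashfn is None:
        return None
    data = name.encode("utf-8")
    return hashfn(data, len(data))


for name in ("reboot", "swapon", "xfsdump"):
    value = dht_hash(name)
    print(name, "unavailable" if value is None else hex(value))
```

Only the base name is hashed, which is why the commands above pass "reboot"
rather than "/mnt/sbin/reboot".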

>
> 2) fix-layout + data_migrate (e.g. gluster volume rebalance g1 start)
>
> After migration is done, I saw linkto files under brick2/sbin.
> There are 300+ files in total under the system /sbin. Under brick2/sbin,
> I found that all 300+ files are there, either migrated or linkto-ed.
>
> -rwxr-xr-x 2 root root   17400 Sep 10 12:02 vmcore-dmesg
> ---------T 2 root root       0 Sep 10 12:03 weak-modules
> ---------T 2 root root       0 Sep 10 12:03 wipefs
> -rwxr-xr-x 2 root root  295656 Sep 10 12:02 xfsdump
> -rwxr-xr-x 2 root root  510000 Sep 10 12:02 xfs_repair
> -rwxr-xr-x 2 root root  348088 Sep 10 12:02 xfsrestore
>
> And under brick1/sbin, those migrated files are gone as expected.
> There are close to 150 files under brick1/sbin.
>
> This confuses me since creating those linkto files seems to
> be unnecessary, at least for files whose hash values do not belong
> to the subvol. (My understanding is that if a file's hash value is
> in the range of a subvol then it will be stored in that subvol.)

Can you check if a lookup of the file post rebalance clears up these 
_stale_ linkto files?

[paul:] No. Those linkto files are still there after I run "file /mnt/sbin/*".
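
To make "those linkto files are still there" easy to re-check after each
experiment, here is a small sketch that splits the entries of a brick
directory into linkto-looking files and data files. It relies only on what is
visible in the listings in this thread (regular file, mode ---------T, size
0). Reading the trusted.glusterfs.dht.linkto xattr is attempted as an extra;
it normally needs root on the brick host. The function names are
hypothetical.

```python
import os
import stat

LINKTO_XATTR = "trusted.glusterfs.dht.linkto"


def looks_like_linkto(path):
    """True for a zero-byte regular file whose only mode bit is sticky."""
    st = os.lstat(path)
    return (stat.S_ISREG(st.st_mode)
            and stat.S_IMODE(st.st_mode) == stat.S_ISVTX
            and st.st_size == 0)


def linkto_target(path):
    """Return the subvolume named in the linkto xattr, or None."""
    try:
        raw = os.getxattr(path, LINKTO_XATTR)
    except (OSError, AttributeError):
        # Not set, not permitted, or os.getxattr missing on this platform.
        return None
    return raw.rstrip(b"\0").decode("utf-8", "replace")


def split_brick_dir(directory):
    """Return (linkto_names, data_names) for regular files in directory."""
    linkto, data = [], []
    for entry in sorted(os.listdir(directory)):
        full = os.path.join(directory, entry)
        if not stat.S_ISREG(os.lstat(full).st_mode):
            continue  # skip subdirectories, symlinks, etc.
        (linkto if looks_like_linkto(full) else data).append(entry)
    return linkto, data
```

For example, split_brick_dir("/brick2/sbin") run as root on lab1 would give
the two lists to compare before and after a lookup from the client mount.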

How did you compute the hash of these files and decide that they do not
belong to the new brick (i.e. brick2)? I did them on my end and you are
right (based on the layout you presented above), but I am curious as to
how you arrived at the same conclusion.

[paul:]
The hash-calculating script comes from
http://joejulian.name/blog/dht-misses-are-expensive/
Through a language binding, it ultimately calls gf_dm_hashfn() in the
glusterfs C code. I added debug code to glusterfs
and double-checked that the script works correctly.
See more analysis below:
[root@lab1 ~]# ls -l /brick2/sbin
....
-rwxr-xr-x 2 root root   16576 Sep 12 11:26 wipefs
---------T 2 root root       0 Sep 12 11:27 xfsdump
[root@lab1 ~]# ./gethash.py xfsdump
0xc17ff86bL
[root@lab1 ~]# ./gethash.py wipefs
0x6afa24a9L
[root@lab1 ~]# getfattr -dm. -e hex /brick2/sbin
getfattr: Removing leading '/' from absolute path names
# file: brick2/sbin
trusted.gfid=0xddd06defaf1242b4b8ec5d41fdaa01e3
trusted.glusterfs.dht=0x00000001000000007fffffffffffffff

[paul:] It means that wipefs belongs to /brick1/sbin
and xfsdump belongs to /brick2/sbin, but under /brick2/sbin,
wipefs is migrated and xfsdump is linkto-ed. This does not
make sense. Or did I do something wrong?
This is another issue, in addition to the unnecessary linkto file issue.
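
The range check behind this conclusion can be written down as a short
sketch. It decodes the trusted.glusterfs.dht values printed by getfattr in
this thread and reports which brick's range contains each hash. It assumes
the value is four big-endian 32-bit words, of which the last two are the
start and end of the hash range (this matches the 0x7fffffff - 0xffffffff
range quoted earlier). The first two words are not interpreted here. The
function names are hypothetical.

```python
import struct


def parse_dht_layout(hex_value):
    """Return (start, stop) from a getfattr '-e hex' dht xattr value."""
    text = hex_value[2:] if hex_value.lower().startswith("0x") else hex_value
    raw = bytes.fromhex(text)
    if len(raw) != 16:
        raise ValueError("expected 16 bytes, got %d" % len(raw))
    # Assumption: only the last two words (range start/stop) are used.
    _word0, _word1, start, stop = struct.unpack(">IIII", raw)
    return start, stop


def owner(hash_value, layouts):
    """Return the name of the brick whose range contains hash_value."""
    for brick, (start, stop) in layouts.items():
        if start <= hash_value <= stop:
            return brick
    return None


# Layout xattrs and hashes copied from the outputs in this thread.
layouts = {
    "brick1": parse_dht_layout("0x0000000100000000000000007ffffffe"),
    "brick2": parse_dht_layout("0x00000001000000007fffffffffffffff"),
}
hashes = {
    "reboot": 0xd48b11f6,
    "swapon": 0x93129578,
    "xfsdump": 0xc17ff86b,
    "wipefs": 0x6afa24a9,
}

# wipefs falls in brick1's range; the other three fall in brick2's.
for name, value in hashes.items():
    print(name, hex(value), owner(value, layouts))
```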

Rebalance could choose to not move files but just create the linkto 
files based on space usage between the source and target bricks etc. Not 
stating this is what happened here, but a possibility.

[paul:] All the bricks are empty before testing. They are virtual disk partitions
in virtualbox centos 6.5 guests.

>
> I quickly looked at the code. gf_defrag_start_crawl() appears to
> be the function for this operation. I do see code that does file migration
> in that code path, but debugging code shows that those "linkto" files
> do not seem to be created by gf_defrag_start_crawl(). I'm not that familiar
> with the code details and the theory, so I'm not sure what creates those
> "linkto" files or why they are created.

I am going to leave this part at: dht_linkfile_create does this, and it
would mostly happen during lookup.

[paul:] I added debug code in dht_linkfile_create(). It appears that
it is called for those migrated files only.
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-users