Ben,

On the number of threads: I sent a throttle patch to limit the thread count here: http://review.gluster.org/#/c/10526/ [not merged yet]. In the current model the rebalance process spawns 20 threads, and in addition to that there can be at most 16 syncop threads; those two pools account for the 36 rebalance threads you observed.

Crash: the crash should be fixed by this: http://review.gluster.org/#/c/10459/.

Rebalance time is a function of the number of files and their size. The higher the rate at which files are added to the global queue [on which the migrator threads act], the faster the rebalance. I suspect what we are mostly seeing here is the effect of the local crawl, since only 81GB is migrated out of 500GB. (A rough sketch of this crawler/queue/migrator model is appended at the very end of this mail.)

Thanks,
Susant

----- Original Message -----
> From: "Benjamin Turner" <bennyturns@xxxxxxxxx>
> To: "Vijay Bellur" <vbellur@xxxxxxxxxx>
> Cc: "Gluster Devel" <gluster-devel@xxxxxxxxxxx>
> Sent: Monday, May 4, 2015 5:18:13 PM
> Subject: Re: Rebalance improvement design
>
> Thanks Vijay! I forgot to upgrade the kernel (thinp 6.6 perf bug, gah) before I
> created this data set, so it's a bit smaller:
>
> total threads = 16
> total files = 7,060,700 (64 kb files, 100 files per dir)
> total data = 430.951 GB
> 88.26% of requested files processed, minimum is 70.00
> 10101.355737 sec elapsed time
> 698.985382 files/sec
> 698.985382 IOPS
> 43.686586 MB/sec
>
> I updated everything and ran the rebalance on
> glusterfs-3.8dev-0.107.git275f724.el6.x86_64:
>
> [root@gqas001 ~]# gluster v rebalance testvol status
> Node                                Rebalanced-files  size    scanned  failures  skipped  status     run time in secs
> ---------                           -----------       ------  -------  --------  -------  ---------  ----------------
> localhost                           1327346           81.0GB  3999140  0         0        completed  55088.00
> gqas013.sbu.lab.eng.bos.redhat.com  0                 0Bytes  1        0         0        completed  26070.00
> gqas011.sbu.lab.eng.bos.redhat.com  0                 0Bytes  0        0         0        failed     0.00
> gqas014.sbu.lab.eng.bos.redhat.com  0                 0Bytes  0        0         0        failed     0.00
> gqas016.sbu.lab.eng.bos.redhat.com  1325857           80.9GB  4000865  0         0        completed  55088.00
> gqas015.sbu.lab.eng.bos.redhat.com  0                 0Bytes  0        0         0        failed     0.00
> volume rebalance: testvol: success:
>
> A couple of observations:
>
> I am seeing lots of threads / processes running:
>
> [root@gqas001 ~]# ps -eLf | grep glu | wc -l
> 96    <- 96 gluster threads
> [root@gqas001 ~]# ps -eLf | grep rebal | wc -l
> 36    <- 36 rebal threads
>
> Is this tunable? Is there a use case where we would need to limit this? Just
> curious, how did we arrive at 36 rebal threads?
>
> # cat /var/log/glusterfs/testvol-rebalance.log | wc -l
> 4,577,583
> [root@gqas001 ~]# ll /var/log/glusterfs/testvol-rebalance.log -h
> -rw------- 1 root root 1.6G May 3 12:29 /var/log/glusterfs/testvol-rebalance.log
>
> :) How big is this going to get when I do the 10-20 TB? I'll keep tabs on
> this; my default test setup only has:
>
> [root@gqas001 ~]# df -h
> Filesystem                        Size  Used  Avail  Use%  Mounted on
> /dev/mapper/vg_gqas001-lv_root     50G  4.8G    42G   11%  /
> tmpfs                              24G     0    24G    0%  /dev/shm
> /dev/sda1                         477M   65M   387M   15%  /boot
> /dev/mapper/vg_gqas001-lv_home    385G   71M   366G    1%  /home
> /dev/mapper/gluster_vg-lv_bricks  9.5T  219G   9.3T    3%  /bricks
>
> Next run I want to fill up a 10TB cluster and double the number of bricks to
> simulate running out of space and then doubling capacity. Any other fixes or
> changes that need to go in before I try a larger data set? Before that I may
> run my performance regression suite against a system while a rebalance is in
> progress and check how it affects performance.
> I'll turn both these cases into perf regression tests that I run with iozone,
> smallfile, and such. Any other use cases I should add? Should I add hard /
> soft links / whatever else to the data set?
>
> -b
>
>
> On Sun, May 3, 2015 at 11:48 AM, Vijay Bellur <vbellur@xxxxxxxxxx> wrote:
>
> On 05/01/2015 10:23 AM, Benjamin Turner wrote:
>
> Ok I have all my data created and I just started the rebalance. One thing to
> note: in the client log I see the following spamming:
>
> [root@gqac006 ~]# cat /var/log/glusterfs/gluster-mount-.log | wc -l
> 394042
>
> [2015-05-01 00:47:55.591150] I [MSGID: 109036]
> [dht-common.c:6478:dht_log_new_layout_for_dir_selfheal] 0-testvol-dht:
> Setting layout of
> /file_dstdir/gqac006.sbu.lab.eng.bos.redhat.com/thrd_05/d_001/d_000/d_004/d_006
> with [Subvol_name: testvol-replicate-0, Err: -1 , Start: 0 , Stop:
> 2141429669 ], [Subvol_name: testvol-replicate-1, Err: -1 , Start:
> 2141429670 , Stop: 4294967295 ],
> [2015-05-01 00:47:55.596147] I
> [dht-selfheal.c:1587:dht_selfheal_layout_new_directory] 0-testvol-dht:
> chunk size = 0xffffffff / 19920276 = 0xd7
> [2015-05-01 00:47:55.596177] I
> [dht-selfheal.c:1626:dht_selfheal_layout_new_directory] 0-testvol-dht:
> assigning range size 0x7fa39fa6 to testvol-replicate-1
>
> I also noticed the same set of excessive logs in my tests. Have sent across a
> patch [1] to address this problem.
>
> -Vijay
>
> [1] http://review.gluster.org/10281
>
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel
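
For reference, below is a minimal sketch of the producer/consumer model Susant refers to at the top of this mail: a crawler thread feeds file entries into a single global queue, and a pool of migrator threads drains it. All names, structures, and constants in the sketch are illustrative assumptions only; this is not the actual gluster DHT rebalance code.

/*
 * Minimal illustrative sketch (NOT the actual gluster DHT rebalance code):
 * one crawler thread pushes file entries onto a global queue, and a fixed
 * pool of migrator threads pops entries off and migrates them.  All names
 * and numbers here are hypothetical.
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_MIGRATORS 20        /* cf. the 20 threads mentioned above */
#define NUM_FILES     100       /* tiny demo workload */

typedef struct file_entry {
    char path[256];
    struct file_entry *next;
} file_entry_t;

static file_entry_t *queue_head, *queue_tail;
static int crawl_done;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

/* Producer: the local crawl.  If this runs slowly, the queue stays empty
 * and the migrator threads spend their time waiting instead of migrating. */
static void *crawler(void *arg)
{
    (void)arg;
    for (int i = 0; i < NUM_FILES; i++) {
        file_entry_t *e = calloc(1, sizeof(*e));
        snprintf(e->path, sizeof(e->path), "/bricks/testvol/file_%d", i);

        pthread_mutex_lock(&lock);
        if (queue_tail)
            queue_tail->next = e;
        else
            queue_head = e;
        queue_tail = e;
        pthread_cond_signal(&cond);      /* wake one waiting migrator */
        pthread_mutex_unlock(&lock);
    }

    pthread_mutex_lock(&lock);
    crawl_done = 1;
    pthread_cond_broadcast(&cond);       /* let idle migrators exit */
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* Consumer: a migrator thread pops the next entry and "migrates" it. */
static void *migrator(void *arg)
{
    long id = (long)arg;

    for (;;) {
        pthread_mutex_lock(&lock);
        while (!queue_head && !crawl_done)
            pthread_cond_wait(&cond, &lock);
        if (!queue_head) {               /* queue drained and crawl finished */
            pthread_mutex_unlock(&lock);
            break;
        }
        file_entry_t *e = queue_head;
        queue_head = e->next;
        if (!queue_head)
            queue_tail = NULL;
        pthread_mutex_unlock(&lock);

        printf("migrator %ld: migrating %s\n", id, e->path);
        free(e);
    }
    return NULL;
}

int main(void)
{
    pthread_t crawl_thread, migrators[NUM_MIGRATORS];

    pthread_create(&crawl_thread, NULL, crawler, NULL);
    for (long i = 0; i < NUM_MIGRATORS; i++)
        pthread_create(&migrators[i], NULL, migrator, (void *)i);

    pthread_join(crawl_thread, NULL);
    for (int i = 0; i < NUM_MIGRATORS; i++)
        pthread_join(migrators[i], NULL);
    return 0;
}

In this model the migrator pool can only move data as fast as the crawler fills the queue, so a slow local crawl shows up as migrator threads blocked in pthread_cond_wait() rather than as high aggregate MB/sec.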