I'm running 3.3b3 on a 5-brick Ubuntu 10.04.4 system with mixed IPoIB/GbE. It's behaving well apart from the current problem. The gluster filesystem is live and lightly used by our cluster. Note that the gli volume has two bricks on pbs2ib; I'm trying to clear the smaller one in preparation for replacing its disks with larger ones.

=====
root at pbs1:/var/log/glusterfs# gluster volume info

Volume Name: gli
Type: Distribute
Volume ID: 76cc5e88-0ac4-42ac-a4a3-31bf2ba611d4
Status: Started
Number of Bricks: 5
Transport-type: tcp,rdma
Bricks:
Brick1: pbs1ib:/bducgl
Brick2: pbs2ib:/bducgl     <--- to remain
Brick3: pbs2ib:/bducgl1    <--- to be removed
Brick4: pbs3ib:/bducgl
Brick5: pbs4ib:/bducgl
Options Reconfigured:
performance.io-cache: on
performance.quick-read: on
performance.io-thread-count: 64
=====

'df' reports the brick (on /bducgl1) as using 1265060072 KB:

=====
root at pbs2:~# df
Filesystem     1K-blocks        Used   Available  Use%  Mounted on
/dev/sdb1       28959736    15104536    12384124   55%  /
..
/dev/md0      8788707776  1524178616  7264529160   18%  /bducgl
/dev/sda      1952129740  1265060072   687069668   65%  /bducgl1
                          ^^^^^^^^^^
=====

(Incidentally, this 'Used' figure of 1265060072 does not change as files are removed - i.e., files that the log says have been migrated are no longer visible on the brick filesystem, yet df stays the same - is this expected?)

However, the remove-brick operation has been running for about a day and reports having moved 1,369,285,939,442 bytes:

=====
root at pbs1: # gluster volume remove-brick gli pbs2ib:/bducgl1 status
     Node   Rebalanced-files           size     scanned       status
---------   ----------------   ------------   ---------  -----------
localhost                 90         189616       87639  not started
   pbs4ib                  0              0           0  not started
   pbs3ib                  0              0           0  not started
   pbs2ib             861704  1369285939442     2941430  in progress
                              ^^^^^^^^^^^^^
=====

This is more than even 1265060072*1024 = 1.29542151373e+12 bytes, so I'm wondering when/if this process is going to end...?
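As an aside on estimating completion: one rough approach (not gluster tooling, just a sketch) is to take the 'size' column from two status runs, divide the delta by the interval to get a byte rate, and extrapolate over the bytes remaining per df. The figures below are the two samples from this note (the second assumes the run-together status columns parse as 878009 files and 1,379,699,182,236 bytes); notably, with these numbers the reported size already exceeds the brick's df usage, so the extrapolation bottoms out at zero - which is exactly the inconsistency described here.

```python
# Rough ETA for a remove-brick migration, by linear extrapolation
# from two samples of the 'size' column in `gluster volume
# remove-brick ... status` output. Sample values are from this note.

def eta_seconds(size_t0, size_t1, interval_s, total_bytes):
    """Remaining bytes divided by the observed migration rate."""
    rate = (size_t1 - size_t0) / interval_s  # bytes per second
    if rate <= 0:
        raise ValueError("no progress observed between samples")
    remaining = total_bytes - size_t1
    return max(remaining, 0) / rate

# Two status samples roughly 10 minutes (600 s) apart:
t0 = 1369285939442          # bytes moved at first sample
t1 = 1379699182236          # bytes moved ~10 min later (assumed parse)
total = 1265060072 * 1024   # df 'Used' KB for /bducgl1, in bytes

secs = eta_seconds(t0, t1, 600, total)
print(f"rate ~ {(t1 - t0) / 600 / 1e6:.1f} MB/s, ETA ~ {secs / 3600:.1f} h")
```

With these inputs the rate works out to roughly 17 MB/s, which is at least plausible for a ~1 file/sec migration of mixed file sizes; the zero ETA just restates that the counters don't add up.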
If I examine the 'gli-rebalance.log', I am still getting log entries like this (at about 1/sec - I would have expected considerably faster):

[2012-05-21 14:27:31.629995] I [dht-rebalance.c:854:dht_migrate_file] 0-gli-dht: completed migration of /alamng/Research/Scheraga/F8i1m/set15/Fig8_int1_template_Hamil_set15.dat from subvolume gli-client-2 to gli-client-1

so migration does appear to be happening, and the numbers change across repeated 'status' runs. But why is the gluster byte count so different from the 'df' figure? And is there any way to get an idea of when the process will end? The 'scanned' column is also still increasing, so it's evidently not the total number of files to be moved.

After writing most of this note, this is the status about 10 minutes later:

# gluster volume remove-brick gli pbs2ib:/bducgl1 status
     Node   Rebalanced-files           size     scanned       status
---------   ----------------   ------------   ---------  -----------
localhost                 90         189616       87639  not started
   pbs4ib                  0              0           0  not started
   pbs3ib                  0              0           0  not started
   pbs2ib             878009  1379699182236     2994733  in progress

--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
--