Non-progressing, Unstoppable rebalance on 3.3

hjmangalam at gmail.com (Harry Mangalam) · Thu, 23 Aug 2012 11:23:02 -0700

Following an interchange with Jeff Darcy and Shishir Gowda, I started
a rebalance of my cluster (3.3 on Ubuntu 10.04.4).
Note: shortly after it started, 3/4 of the glusterfsd's shut down
(which was exciting..).  I stopped and restarted glusterd, the
glusterfsd's restarted in turn, and all seemed well. However, the
interruption may have caused a problem with the rebalance:

After 2 days of waiting (I was distracted by other things), the
rebalance has apparently done nothing and presents the same values as
it had originally:

Thu Aug 23 10:35:11 [0.00 0.00 0.00]  root at pbs1:/var/log/glusterfs
770 $ gluster volume rebalance gli status
     Node Rebalanced-files          size       scanned      failures         status
---------      -----------   -----------   -----------   -----------   ------------
localhost                0            0            0            0    in progress
   pbs4ib                0            0            0            0    not started
   pbs2ib             1380    547324969         7686            3      completed
   pbs3ib                0            0            0            0    not started

(The above has the leading 32 blanks trimmed from the output. Is
there a reason for including those in the output?)
The above implies that it is at least partially "in progress", but
after stopping it:

Thu Aug 23 10:53:26 [0.00 0.00 0.00]  root at pbs1:/var/log/glusterfs
774 $ gluster volume rebalance gli stop
     Node Rebalanced-files          size       scanned      failures         status
---------      -----------   -----------   -----------   -----------   ------------
localhost                0            0            0            0    in progress
   pbs4ib                0            0            0            0    not started
   pbs2ib             1380    547324969         7686            3      completed
   pbs3ib                0            0            0            0    not started
Stopped rebalance process on volume gli

it still seems to be going:
Thu Aug 23 10:53:28 [0.00 0.00 0.00]  root at pbs1:/var/log/glusterfs
775 $ gluster volume rebalance gli status
     Node Rebalanced-files          size       scanned      failures         status
---------      -----------   -----------   -----------   -----------   ------------
localhost                0            0            0            0    in progress
   pbs4ib                0            0            0            0    not started
   pbs2ib             1380    547324969         7686            3      completed
   pbs3ib                0            0            0            0    not started

Examining the server nodes, only pbs1 (localhost in the above output)
had a glusterfs process running. It may have been 'orphaned' when I
had the glusterfsd hiccups and been hanging since that time.
However, when I killed it, nothing changed: gluster still reports
that the rebalance is in progress (even though no glusterfs processes
are running on any of the nodes).

If I try to reset it with a 'start force':
Thu Aug 23 11:14:39 [0.06 0.04 0.00]  root at pbs1:/var/log/glusterfs
789 $ gluster volume rebalance gli start force
Rebalance on gli is already started

and the status remains exactly as above.
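
For anyone who wants to check for the same stall without eyeballing the
tables, here is a minimal Python sketch. It is not part of gluster; the
names parse_status and stalled_nodes are made up for illustration, and it
assumes each data row has the shape shown above (node name, four integer
counters, then the status text). It parses two status snapshots and lists
the nodes that claim to be "in progress" while none of their counters
have moved. The two snapshots should be taken well apart in time; the same
text is used twice here only because the outputs above are identical.

```python
import re

# One data row: node name, four integer counters, then free-text status.
# Header, dashed, and "Stopped rebalance..." lines do not match and are skipped.
ROW = re.compile(r"^\s*(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(.+?)\s*$")


def parse_status(text):
    """Return {node: (rebalanced, size, scanned, failures, status)}."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            node, *counts, status = m.groups()
            rows[node] = (*map(int, counts), status)
    return rows


def stalled_nodes(before, after):
    """Nodes reported 'in progress' whose rows are identical in both snapshots."""
    return sorted(
        node
        for node, row in after.items()
        if row[4] == "in progress" and before.get(node) == row
    )


# Data rows copied from the status output in this post.
SNAPSHOT = """\
localhost                0            0            0            0    in progress
   pbs4ib                0            0            0            0    not started
   pbs2ib             1380    547324969         7686            3      completed
   pbs3ib                0            0            0            0    not started
"""

status = parse_status(SNAPSHOT)
print(stalled_nodes(status, status))  # prints ['localhost']
```

In practice the text would come from saving the output of
'gluster volume rebalance gli status' at two different times and passing
each saved copy through parse_status.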