----- Original Message -----

> On 22/05/2014, at 1:34 PM, Kaushal M wrote:
> > Thanks Justin, I found the problem. The VM can be deleted now.
>
> Done. :)
>
> > Turns out, there was more than enough time for the rebalance to
> > complete. But we hit a race, which caused a command to fail.
> >
> > The particular test that failed waits for rebalance to finish. It does
> > this by running a 'gluster volume rebalance <> status' command and
> > checking the result. The EXPECT_WITHIN function runs this command until
> > we have a match, the command fails, or the timeout expires.
> >
> > For a rebalance status command, glusterd sends a request to the
> > rebalance process (as a brick_op) to get the latest stats. It had done
> > the same in this case as well. But while glusterd was waiting for the
> > reply, the rebalance completed and the process stopped itself. This
> > closed the RPC connection between glusterd and the rebalance process,
> > which caused all pending requests to be unwound as failures, which in
> > turn led to the command failing.
> >
> > I cannot think of a way to avoid this race from within glusterd. For
> > this particular test, we could avoid using the 'rebalance status'
> > command if we directly checked the rebalance process state using its
> > pid etc. I don't particularly approve of this approach, as I think I
> > used the 'rebalance status' command for a reason. But I currently
> > cannot recall the reason, and if I cannot come up with it soon, I
> > wouldn't mind changing the test to avoid 'rebalance status'.
>

I think it's the rebalance daemon's life cycle that is problematic. It
makes it inconvenient, if not impossible, for glusterd to gather
progress/status deterministically. The rebalance process could wait for
the rebalance-commit subcommand to terminate. No other daemon managed by
glusterd has this kind of life cycle. I don't see any good reason why
rebalance should kill itself on completion of data migration.

Thoughts?

~Krish

> Hmmm, is it the kind of thing where the "rebalance status" command
> should retry if its connection gets closed by a just-completed
> rebalance (as happened here)?
>
> Or would that not work as well?
>
> + Justin
>
> --
> Open Source and Standards @ Red Hat
>
> twitter.com/realjustinclift

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
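
For context, a minimal sketch of the kind of poll the failing test
performs, in the bash style of the Gluster test framework. The helper
name rebalance_status_field and the timeout value here are illustrative;
the real framework defines its own versions:

    # Hypothetical helper: pull the status field out of the CLI output.
    rebalance_status_field () {
        gluster volume rebalance "$1" status | \
            grep -o -m1 -E 'completed|in progress|failed'
    }

    # EXPECT_WITHIN re-runs the check until the output matches
    # "completed", the command fails, or the timeout (seconds) expires.
    EXPECT_WITHIN 300 "completed" rebalance_status_field $V0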
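
Kaushal's pid-based alternative might look something like the sketch
below. The pid-file path is an assumption for illustration; the real
location depends on the glusterd working directory:

    # Hypothetical pid-file path; the actual layout under
    # /var/lib/glusterd may differ.
    pidfile="/var/lib/glusterd/vols/$V0/rebalance.pid"

    # kill -0 sends no signal; it only checks that the process exists.
    # Once it fails, the rebalance daemon has exited.
    while kill -0 "$(cat "$pidfile" 2>/dev/null)" 2>/dev/null; do
        sleep 1
    done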
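
And a sketch of the retry idea Justin raises: wrap the status command so
that a failure (such as one caused by the daemon exiting while a request
is in flight) is retried a few times before being treated as a real
error. The retry count and delay are arbitrary:

    # Retry 'rebalance status' on failure; a request unwound because the
    # just-completed daemon closed the connection should succeed (or
    # report 'completed') on a later attempt.
    rebalance_status_with_retry () {
        local vol=$1 attempts=3
        while (( attempts-- > 0 )); do
            gluster volume rebalance "$vol" status && return 0
            sleep 1
        done
        return 1
    }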