----- Original Message -----

> On 22/05/2014, at 1:34 PM, Kaushal M wrote:
> > Thanks Justin, I found the problem. The VM can be deleted now.
>
> Done. :)
>
> > Turns out, there was more than enough time for the rebalance to
> > complete. But we hit a race, which caused a command to fail.
> >
> > The particular test that failed waits for rebalance to finish. It does
> > this by running a 'gluster volume rebalance <> status' command and
> > checking the result. The EXPECT_WITHIN function runs this command until
> > we have a match, the command fails, or the timeout expires.
> >
> > For a rebalance status command, glusterd sends a request to the
> > rebalance process (as a brick_op) to get the latest stats. It had done
> > the same in this case as well. But while glusterd was waiting for the
> > reply, the rebalance completed and the process stopped itself. This
> > closed the RPC connection between glusterd and the rebalance process,
> > which caused all pending requests to be unwound as failures, which in
> > turn led to the command failing.
> >
> > I cannot think of a way to avoid this race from within glusterd. For
> > this particular test, we could avoid using the 'rebalance status'
> > command if we directly checked the rebalance process state using its
> > pid etc. I don't particularly approve of this approach, as I think I
> > used the 'rebalance status' command for a reason. But I currently
> > cannot recall the reason, and if I cannot come up with it soon, I
> > wouldn't mind changing the test to avoid 'rebalance status'.
>

I think it's the rebalance daemon's life cycle that is problematic. It
makes it inconvenient, if not impossible, for glusterd to gather
progress/status deterministically. The rebalance process could wait for
the rebalance-commit subcommand to terminate. No other daemon managed by
glusterd has this kind of life cycle. I don't see any good reason why
rebalance should kill itself on completion of data migration.

Thoughts?

~Krish

> Hmmm, is it the kind of thing where the "rebalance status" command
> should retry if its connection gets closed by a just-completed
> rebalance (as happened here)?
>
> Or would that not work as well?
>
> + Justin
>
> --
> Open Source and Standards @ Red Hat
>
> twitter.com/realjustinclift

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-devel
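
For context, a minimal sketch of the kind of poll the failing test
performs, in the bash style of the Gluster test framework. The helper
name rebalance_status_field and the timeout value here are illustrative;
the real framework defines its own versions:

    # Hypothetical helper: pull the status field out of the CLI output.
    rebalance_status_field () {
        gluster volume rebalance "$1" status | \
            grep -o -m1 -E 'completed|in progress|failed'
    }

    # EXPECT_WITHIN re-runs the check until the output matches
    # "completed", the command fails, or the timeout (seconds) expires.
    EXPECT_WITHIN 300 "completed" rebalance_status_field $V0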
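
Kaushal's pid-based alternative might look something like the sketch
below. The pid-file path is an assumption for illustration; the real
location depends on the glusterd working directory:

    # Hypothetical pid-file path; the actual layout under
    # /var/lib/glusterd may differ.
    pidfile="/var/lib/glusterd/vols/$V0/rebalance.pid"

    # kill -0 sends no signal; it only checks that the process exists.
    # Once it fails, the rebalance daemon has exited.
    while kill -0 "$(cat "$pidfile" 2>/dev/null)" 2>/dev/null; do
        sleep 1
    done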
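
And a sketch of the retry idea Justin raises: wrap the status command so
that a failure (such as one caused by the daemon exiting while a request
is in flight) is retried a few times before being treated as a real
error. The retry count and delay are arbitrary:

    # Retry 'rebalance status' on failure; a request unwound because the
    # just-completed daemon closed the connection should succeed (or
    # report 'completed') on a later attempt.
    rebalance_status_with_retry () {
        local vol=$1 attempts=3
        while (( attempts-- > 0 )); do
            gluster volume rebalance "$vol" status && return 0
            sleep 1
        done
        return 1
    }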