Nigel, Shwetha,
The latest Glusto run [a], started by Nigel after the prior timeout
issue was fixed, failed again (though much later in the run this time).
I took a look at the logs, and my analysis is here [b].
@atin, @kaushal, @ppai, can you take a look and see if the analysis is
correct?
In short, glusterd hit an error when checking the rebalance stats from
one of the nodes:
"Received commit RJT from uuid: 6f9524e6-9f9e-44aa-b2f4-393404adfd9d"
The rebalance daemon on the node with that UUID was not really ready to
serve requests when the status check was made, hence I am assuming this
timing is what causes the error. But it needs a once-over by one of you
folks.
@Shwetha, can we add a further delay (or a retry) between starting the
rebalance and checking its status, just so that we avoid this timing
issue on these nodes?
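If it helps, here is a minimal sketch of the kind of poll-with-timeout
I have in mind. It is plain Python; get_rebalance_status in the usage
comment is a hypothetical stand-in for whatever helper the test library
actually provides:

    import time

    def poll(check, timeout=60, interval=5):
        # Call check() every `interval` seconds until it returns a
        # truthy value or `timeout` seconds have passed; returns the
        # last result seen.
        deadline = time.time() + timeout
        result = check()
        while not result and time.time() < deadline:
            time.sleep(interval)
            result = check()
        return result

    # Hypothetical usage in the test (get_rebalance_status is a
    # stand-in name, not a real library call):
    # status = poll(lambda: get_rebalance_status(mnode, volname))
    # assert status, "rebalance daemon never became ready to answer"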
Thanks,
Shyam
[a] glusto run: https://ci.centos.org/view/Gluster/job/gluster_glusto/377/
[b] analysis of the failure:
https://paste.fedoraproject.org/paste/mk6ynJ0B9AH6H9ncbyru5w
On 08/25/2017 04:29 PM, Shyam Ranganathan wrote:
Nigel was kind enough to kick off a Glusto run on the 3.12 head a
couple of days back. The status can be seen here [1].
The run failed, but it managed to get further than Glusto currently
gets on master (see [2]). Not that this is a consolation, but just
stating the fact.
The run [1] failed at:
17:05:57
functional/bvt/test_cvt.py::TestGlusterHealSanity_dispersed_glusterfs::test_self_heal_when_io_in_progress
FAILED
The test case failed due to:
17:10:28 E AssertionError: ('Volume %s : All process are not
online', 'testvol_dispersed')
The test case can be seen here [3]. The reason for the failure is that
Glusto did not wait long enough for the down brick to come up: it
waited for 10 seconds, but the brick came up only after about 12
seconds, i.e., within the same second as the check for it being up.
The log snippets pointing to this problem are here [4]. In short, no
real bug or issue has been found to have caused the failure as yet.
Glusto as a gating factor for this release was desirable, but having got
this far on 3.12 does help.
@nigel, we could increase the timeout between bringing the brick up
and checking whether it is up, and then try another run. Let me know if
that works, and what is needed from me to get this going.
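To make that concrete, here is a minimal sketch of replacing the fixed
10-second wait with a poll; are_bricks_online is a hypothetical helper
name used for illustration, not necessarily the real library call:

    import time

    def wait_for_bricks_online(check_online, timeout=60, interval=2):
        # Poll check_online() instead of sleeping a fixed 10 seconds,
        # since the brick in run [1] came up only after ~12 seconds.
        deadline = time.time() + timeout
        while time.time() < deadline:
            if check_online():
                return True
            time.sleep(interval)
        return False

    # Hypothetical usage (are_bricks_online is a stand-in helper):
    # ok = wait_for_bricks_online(
    #          lambda: are_bricks_online(mnode, "testvol_dispersed"))
    # assert ok, "brick did not come back online within 60 seconds"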
Shyam
[1] Glusto 3.12 run:
https://ci.centos.org/view/Gluster/job/gluster_glusto/365/
[2] Glusto on master:
https://ci.centos.org/view/Gluster/job/gluster_glusto/360/testReport/functional.bvt.test_cvt/
[3] Failed test case:
https://ci.centos.org/view/Gluster/job/gluster_glusto/365/testReport/functional.bvt.test_cvt/TestGlusterHealSanity_dispersed_glusterfs/test_self_heal_when_io_in_progress/
[4] Log analysis pointing to the failed check:
https://paste.fedoraproject.org/paste/znTPiFLrc2~vsWuoYRToZA
"Releases are made better together"
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-devel