Re: Brick-Mux tests failing for over 11+ weeks

On 05/14/2018 08:35 PM, Shyam Ranganathan wrote:
> Further to the mail below,
> 
> 1. Test bug-1559004-EMLINK-handling.t possibly just needs a larger
> script timeout in mux based testing. I can see no errors in the 2-3
> times that it has failed, other than taking over 1000 seconds. Further
> investigation on normal non-mux regression also shows that this test
> takes 850-950 seconds to complete at times. I assume increasing the
> timeout will fix the failures due to this.
> 
> 2. We still need answers for the following
> - add-brick-and-validate-replicated-volume-options.t
> 
> Details on where it is failing are given in the mail below (point (2),
> which points to a possible glusterd issue again). This does not seem to
> correlate to the other glusterd stale port map information (as glusterd
> is restarted in this case), so we possibly need to narrow this down
> further. Help appreciated!

Looks like (2) above is fixed and has not reoccurred in the last 8+
runs, the fix being https://review.gluster.org/#/c/19924/

Can we get some more details on the fix, as to why the port mapper had a
stale port and for which brick? Because glusterd is restarted midway
through this test, the stale port map issue present in other cases does
not apply, hence I would like more data in the bug or here.

> 
> Thanks,
> Shyam
> On 05/13/2018 06:56 PM, Shyam Ranganathan wrote:
>> Hi,
>>
>> Nigel pointed out that the nightly brick-mux tests have now been failing
>> for about 11 weeks and we have not had a clean run in that time.
>>
>> Spent some time on Friday collecting what tests failed and to an extent
>> why, and filed bug https://bugzilla.redhat.com/show_bug.cgi?id=1577672
>>
>> Asks: Whoever has cycles, please look into these failures ASAP, as these
>> failing tests are blockers for the 4.1 release, and overall the state of
>> master (and hence the 4.1 release branch) is not clean while these tests
>> have been failing for over 11 weeks.
>>
>> Most of the tests fail if run on a local setup as well, so debugging
>> them should be easier than requiring the mux or regression setup. Just
>> ensure that mux is turned on, either by default in the code base you are
>> testing or by adding the line
>> `TEST $CLI volume set all cluster.brick-multiplex on`
>> to the test case after any cleanup and after starting glusterd.
>>
>> 1) A lot of test cases time out, of which the following 2 have the most
>> failures and hence can possibly help with debugging the root cause
>> faster. Request the Glusterd and bitrot teams to look at this, as the
>> failures do not seem to be in the replicate or client-side layers (at
>> present).
>>
>> (number in brackets is # times this failed in the last 13 instances of
>> mux testing)
>> ./tests/basic/afr/entry-self-heal.t (4)
>> ./tests/bitrot/br-state-check.t (8)
>>
>> 2)
>> ./tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t (7)
>>
>> The above test constantly fails at this point:
>> ------------
>> 16:46:28 volume add-brick: failed: /d/backends/patchy3 is already part
>> of a volume
>> 16:46:28 not ok 25 , LINENUM:47
>> 16:46:28 FAILED COMMAND: gluster --mode=script --wignore volume
>> add-brick patchy replica 3 builder104.cloud.gluster.org:/d/backends/patchy3
>> ------------
>>
>> From the logs the failure is occurring from here:
>> ------------
>> [2018-05-03 16:47:12.728893] E [MSGID: 106053]
>> [glusterd-utils.c:13865:glusterd_handle_replicate_brick_ops]
>> 0-management: Failed to set extended attribute trusted.add-brick :
>> Transport endpoint is not connected [Transport endpoint is not connected]
>> [2018-05-03 16:47:12.741438] E [MSGID: 106073]
>> [glusterd-brick-ops.c:2590:glusterd_op_add_brick] 0-glusterd: Unable to
>> add bricks
>> ------------
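
For anyone triaging similar logs, here is a minimal sketch of pulling the error-level entries out of a glusterd log. The single-line layout it assumes is inferred only from the two entries quoted above (which the mail client wrapped across lines); the regular expression, the `error_entries` helper and the `SAMPLE` list are illustrative names, not part of GlusterFS or its test framework.

```python
import re

# Assumed single-line layout of a glusterd log entry, inferred from the two
# entries quoted in this thread. Other entries may not follow this shape.
LOG_LINE = re.compile(
    r"^\[(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\] "
    r"(?P<level>[A-Z]) "
    r"\[MSGID: (?P<msgid>\d+)\] "
    r"\[(?P<source>[^:\]]+):(?P<line>\d+):(?P<function>[^\]]+)\] "
    r"(?P<domain>\S+): "
    r"(?P<message>.*)$"
)


def error_entries(lines):
    """Return one dict per line that matches LOG_LINE and has level 'E'."""
    entries = []
    for raw in lines:
        match = LOG_LINE.match(raw.strip())
        if match and match.group("level") == "E":
            entries.append(match.groupdict())
    return entries


# The two entries from this mail, rejoined into single lines.
SAMPLE = [
    "[2018-05-03 16:47:12.728893] E [MSGID: 106053] "
    "[glusterd-utils.c:13865:glusterd_handle_replicate_brick_ops] "
    "0-management: Failed to set extended attribute trusted.add-brick : "
    "Transport endpoint is not connected [Transport endpoint is not connected]",
    "[2018-05-03 16:47:12.741438] E [MSGID: 106073] "
    "[glusterd-brick-ops.c:2590:glusterd_op_add_brick] 0-glusterd: "
    "Unable to add bricks",
]

if __name__ == "__main__":
    for entry in error_entries(SAMPLE):
        print(entry["timestamp"], entry["msgid"], entry["function"],
              entry["message"])
```

Lines that do not match the assumed layout are skipped rather than reported, so this is only a first-pass filter before reading the full log.
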
>>
>> It seems the added brick is not accepting connections.
>>
>> 3) The following tests also show similar behaviour to (2), where the AFR
>> checks for brick up fail after the timeout, as the brick is not accepting
>> connections.
>>
>> ./tests/bugs/replicate/bug-1363721.t (4)
>> ./tests/basic/afr/lk-quorum.t (5)
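
To decide which failure to chase first, the counts quoted in points (1) to (3) can be ranked by failure rate. This is a small sketch under the assumption that every bracketed count refers to the same 13 mux runs mentioned in point (1); `RUNS`, `FAILURES` and `rank_by_failures` are illustrative names only.

```python
RUNS = 13  # mux runs the counts are taken from (assumed to apply to all tests)

# Failure counts exactly as quoted in this mail.
FAILURES = {
    "./tests/basic/afr/entry-self-heal.t": 4,
    "./tests/bitrot/br-state-check.t": 8,
    "./tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t": 7,
    "./tests/bugs/replicate/bug-1363721.t": 4,
    "./tests/basic/afr/lk-quorum.t": 5,
}


def rank_by_failures(failures, runs):
    """Return (test, count, rate) tuples, most frequent failure first.

    Ties are broken alphabetically by test path so the order is stable.
    """
    ranked = sorted(failures.items(), key=lambda item: (-item[1], item[0]))
    return [(name, count, count / runs) for name, count in ranked]


if __name__ == "__main__":
    for name, count, rate in rank_by_failures(FAILURES, RUNS):
        print(f"{count:2d}/{RUNS} ({rate:.0%}) {name}")
```
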
>>
>> I would suggest someone familiar with mux process and also brick muxing
>> look at these from the initialization/RPC/socket front, as these seem to
>> be bricks that do not show errors in the logs but are failing connections.
>>
>> As we find different root causes, we may want different bugs than the
>> one filed. Please file them and post patches in an effort to move this
>> forward.
>>
>> Thanks,
>> Shyam
>> _______________________________________________
>> Gluster-devel mailing list
>> Gluster-devel@xxxxxxxxxxx
>> http://lists.gluster.org/mailman/listinfo/gluster-devel
>>
