Hi,

Nigel pointed out that the nightly brick-mux tests have now been failing for about 11 weeks, and we have not had a clean run in that time.

I spent some time on Friday collecting which tests failed and, to an extent, why, and filed bug https://bugzilla.redhat.com/show_bug.cgi?id=1577672

Asks: Whoever has cycles, please look into these failures ASAP. These failing tests are blockers for the 4.1 release, and overall the state of master (and hence the 4.1 release branch) is not clean while these tests have been failing for over 11 weeks.

Most of the tests also fail when run on a local setup, so debugging them should be easier than requiring the mux or regression setup. Just ensure that mux is turned on, either by default in the code base you are testing or by adding the line `TEST $CLI volume set all cluster.brick-multiplex on` to the test case, after any cleanup and after starting glusterd.

1) A lot of test cases time out. Of these, the following 2 have the most failures, and hence can possibly help with debugging the root cause faster. Request the Glusterd and bitrot teams to look at this, as the failures do not seem to be in the replicate or client-side layers (at present).
(The number in brackets is the number of times the test failed in the last 13 instances of mux testing.)

./tests/basic/afr/entry-self-heal.t (4)
./tests/bitrot/br-state-check.t (8)

2) ./tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t (7)

The above test consistently fails at this point:
------------
16:46:28 volume add-brick: failed: /d/backends/patchy3 is already part of a volume
16:46:28 not ok 25 , LINENUM:47
16:46:28 FAILED COMMAND: gluster --mode=script --wignore volume add-brick patchy replica 3 builder104.cloud.gluster.org:/d/backends/patchy3
------------

From the logs, the failure originates here:
------------
[2018-05-03 16:47:12.728893] E [MSGID: 106053] [glusterd-utils.c:13865:glusterd_handle_replicate_brick_ops] 0-management: Failed to set extended attribute trusted.add-brick : Transport endpoint is not connected [Transport endpoint is not connected]
[2018-05-03 16:47:12.741438] E [MSGID: 106073] [glusterd-brick-ops.c:2590:glusterd_op_add_brick] 0-glusterd: Unable to add bricks
------------

It seems the added brick is not accepting connections.

3) The following tests show behaviour similar to (2): the AFR checks for brick-up fail after a timeout, as the brick is not accepting connections.

./tests/bugs/replicate/bug-1363721.t (4)
./tests/basic/afr/lk-quorum.t (5)

I would suggest that someone familiar with the mux process and brick muxing look at these from the initialization/RPC/socket front, as these seem to be bricks that show no errors in the logs but are failing connections.

As we find different root causes, we may want separate bugs from the one filed; please file them and post patches in an effort to move this forward.

Thanks,
Shyam