Hi,

After the fix provided by Atin here [1] for the issue reported below, we ran
7-8 runs of the brick mux regression against this fix, and have had only about
1 in 3 runs successful (even those had some tests retried). The run links are
in the review at [1].

The failures are as below, sorted in descending order of frequency. Requesting
the respective component owners/peers to take a stab at root-causing these, as
the current pass rate is not sufficient to qualify the release (or master) as
stable.

1) ./tests/bitrot/br-state-check.t
(bitrot folks, please take a look; this has the maximum number of failure
instances, including a core in run [2])

2) ./tests/bugs/replicate/bug-1363721.t
(Replicate component owners, please note there are some failures in GFID
comparison that seem to occur outside of mux cases as well)

3) ./tests/bugs/distribute/bug-1543279.t (Distribute)
./tests/bugs/index/bug-1559004-EMLINK-handling.t
(I think we need to up the SCRIPT timeout on this; if someone can confirm by
looking at the runs and failures, it would help determine the same)

------ We can possibly wait to analyze things below this line as the instance count is 2 or less ------

4) ./tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
./tests/bugs/snapshot/bug-1482023-snpashot-issue-with-other-processes-accessing-mounted-path.t
./tests/bugs/quota/bug-1293601.t

5) ./tests/bugs/distribute/bug-1161311.t
./tests/bitrot/bug-1373520.t

Thanks,
Shyam

[1] Review containing the fix and the regression run links for logs:
https://review.gluster.org/#/c/20022/3
[2] Test with core:
https://build.gluster.org/job/regression-on-demand-multiplex/20/

On 05/14/2018 08:31 PM, Shyam Ranganathan wrote:
> *** Calling out to Glusterd folks to take a look at this ASAP and
> provide a fix. ***
>
> Further to the mail sent yesterday, work done during my day with Johnny
> (RaghuB) points to a problem with the glusterd rpc port map having stale
> entries for certain bricks as the cause of connection failures when
> running in multiplex mode.
>
> It seems like this problem has been partly addressed in this bug:
> https://bugzilla.redhat.com/show_bug.cgi?id=1545048
>
> What is occurring now is that glusterd retains older ports in its
> mapping table against bricks that have recently terminated; when a
> volume is stopped and restarted, this leads to connection failures from
> clients as there are no listeners on the now stale port.
>
> The test case as in [1], when run on my F27 machine, fails 1 in 5 times
> with the said error.
>
> The above does narrow down the failures in these tests:
> - lk-quorum.t
> - br-state-check.t
> - entry-self-heal.t
> - bug-1363721.t (possibly)
>
> The failure can be seen in the client mount logs as the wrong port
> number being used, in messages like "[rpc-clnt.c:2069:rpc_clnt_reconfig]
> 6-patchy-client-2: changing port to 49156 (from 0)"; when there are
> failures, the real port for the brick-mux process is different.
>
> We also used gdb to inspect the glusterd pmap registry and found that
> older, stale port map data is present (in the function
> pmap_registry_search, as clients invoke a connection).
>
> Thanks,
> Shyam
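
For anyone trying to catch this on their own setup, below is a rough sketch of
the cross-check described above. It is a sketch only: the volume name (patchy),
log glob and grep patterns are illustrative, and the gdb step assumes glusterfs
debug symbols are installed.

------------
# Port glusterd advertises for each brick (the pmap view)
gluster volume status patchy

# Ports the multiplexed brick process is actually listening on
ss -ltnp | grep glusterfsd

# Port the client was last told to reconnect to (from the mount logs)
grep -h "changing port to" /var/log/glusterfs/*.log | tail -n 5

# Optionally, watch glusterd hand out the (stale) port as a client connects
gdb -p $(pidof glusterd) -ex 'break pmap_registry_search' -ex 'continue'
------------

When the bug hits, the port in the "changing port to" message does not match
any listener of the brick-mux process.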

> On 05/13/2018 06:56 PM, Shyam Ranganathan wrote:
>> Hi,
>>
>> Nigel pointed out that the nightly brick-mux tests have now been failing
>> for about 11 weeks, and we do not have a clean run of the same.
>>
>> I spent some time on Friday collecting which tests failed and, to an
>> extent, why, and filed bug
>> https://bugzilla.redhat.com/show_bug.cgi?id=1577672
>>
>> Asks: whoever has cycles, please look into these failures ASAP, as these
>> failing tests are blockers for the 4.1 release, and overall the state of
>> master (and hence the 4.1 release branch) is not clean when these tests
>> have been failing for over 11 weeks.
>>
>> Most of the tests fail if run on a local setup as well, so debugging them
>> should be easier than requiring the mux or regression setup; just ensure
>> that mux is turned on (either by default in the code base you are
>> testing, or in the test case by adding the line `TEST $CLI volume set all
>> cluster.brick-multiplex on` after any cleanup and post starting glusterd).
>>
>> 1) A lot of test cases time out, of which the following 2 have the most
>> failures, and hence can possibly help with debugging the root cause
>> faster. Requesting the Glusterd and bitrot teams to look at these, as the
>> failures do not seem to be in the replicate or client side layers (at
>> present).
>>
>> (the number in brackets is the # of times this failed in the last 13
>> instances of mux testing)
>> ./tests/basic/afr/entry-self-heal.t (4)
>> ./tests/bitrot/br-state-check.t (8)
>>
>> 2)
>> ./tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t (7)
>>
>> The above test constantly fails at this point:
>> ------------
>> 16:46:28 volume add-brick: failed: /d/backends/patchy3 is already part
>> of a volume
>> 16:46:28 not ok 25 , LINENUM:47
>> 16:46:28 FAILED COMMAND: gluster --mode=script --wignore volume
>> add-brick patchy replica 3 builder104.cloud.gluster.org:/d/backends/patchy3
>> ------------
>>
>> From the logs, the failure is occurring here:
>> ------------
>> [2018-05-03 16:47:12.728893] E [MSGID: 106053]
>> [glusterd-utils.c:13865:glusterd_handle_replicate_brick_ops]
>> 0-management: Failed to set extended attribute trusted.add-brick :
>> Transport endpoint is not connected [Transport endpoint is not connected]
>> [2018-05-03 16:47:12.741438] E [MSGID: 106073]
>> [glusterd-brick-ops.c:2590:glusterd_op_add_brick] 0-glusterd: Unable to
>> add bricks
>> ------------
>>
>> It looks like the added brick is not accepting connections.
>>
>> 3) The following tests also show behaviour similar to (2), where the AFR
>> check for the brick being up fails after a timeout, as the brick is not
>> accepting connections.
>>
>> ./tests/bugs/replicate/bug-1363721.t (4)
>> ./tests/basic/afr/lk-quorum.t (5)
>>
>> I would suggest someone familiar with the mux process and also brick
>> muxing look at these from the initialization/RPC/socket front, as these
>> seem to be bricks that do not show errors in the logs but are failing
>> connections.
>>
>> As we find different root causes, we may want different bugs than the
>> one filed; please file those and post patches in an effort to move this
>> forward.
>>
>> Thanks,
>> Shyam
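
For reference, the local reproduction described in the quoted 05/13 mail above
boils down to a prologue along these lines at the top of the failing .t file.
This is a sketch only; it assumes the usual include.rc/volume.rc helpers and a
test that lives two directory levels below tests/, so adjust the relative
paths as needed.

------------
#!/bin/bash
# Standard test harness helpers (TEST, EXPECT, cleanup, $CLI, ...)
. $(dirname $0)/../../include.rc
. $(dirname $0)/../../volume.rc

cleanup;

TEST glusterd
TEST pidof glusterd

# Turn brick multiplexing on before any volumes are created, as noted above
TEST $CLI volume set all cluster.brick-multiplex on

# ... remainder of the original test case unchanged ...
------------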