Some further analysis, based on what Mohit commented on the patch:

1) gf_attach, which is used to kill a brick, is taking more time and
causing timeouts in tests, mainly br-state-check.t. This usually happens
when there are back-to-back kill_bricks in the test.

2) The problem in ./tests/bugs/replicate/bug-1363721.t seems to be that
kill_brick has not completed before an attach request arrives, causing the
attach to be treated as a duplicate and hence dropped/ignored (speculation).
I am writing a test case to see if this is reproducible in isolation; a
rough sketch of the pattern is included below, after the P.S.

The above replicate test also seems to have a different issue when it
compares the md5sums towards the end of the test (visible in the console
logs), which appears to be unrelated to brick-mux (see
https://build.gluster.org/job/centos7-regression/853/console for an
example). It would be nice if someone from the replicate team took a look
at this one.

3) ./tests/bugs/index/bug-1559004-EMLINK-handling.t seems to be a timeout
in most (if not all) cases, stuck in the last iteration.

I will be modifying the patch (discussed in this thread) to add more time
for the failures in (1) and (3), and will fire off a few more regressions
while I try to reproduce (2).

Shyam

P.S: If work is happening on these issues, please post the data/analysis
to the lists; it reduces rework!
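For (2), this is roughly the kill-then-reattach pattern I intend to try.
It is only a sketch: it assumes the usual test preamble (include.rc,
volume.rc, cleanup, glusterd started) and a running replica 3 volume $V0
with brick-multiplex enabled, and it reuses the existing kill_brick and
brick_up_status helpers from tests/volume.rc; the loop count and the
brick chosen are arbitrary, not the final test.

------------
for i in $(seq 1 5); do
        # Kill one brick; in mux runs this goes through gf_attach, per (1)
        TEST kill_brick $V0 $H0 $B0/${V0}1
        # Immediately ask glusterd to bring it back, which in mux mode
        # means re-attaching it to the surviving brick process
        TEST $CLI volume start $V0 force
        # If the attach is silently dropped as a duplicate, this check
        # should time out with the brick staying down
        EXPECT_WITHIN $PROCESS_UP_TIMEOUT "1" brick_up_status $V0 $H0 $B0/${V0}1
done
------------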
On 05/15/2018 09:10 PM, Shyam Ranganathan wrote:
> Hi,
>
> After the fix provided by Atin here [1] for the issue reported below, we
> ran 7-8 runs of the brick mux regressions against it, and only about 1 in
> 3 runs was successful (even those had some tests retried). The run links
> are in the review at [1].
>
> The failures are listed below, sorted in descending order of frequency.
> Requesting the respective component owners/peers to take a stab at root
> causing these, as the current pass rate is not sufficient to qualify the
> release (or master) as stable.
>
> 1) ./tests/bitrot/br-state-check.t (bitrot folks, please take a look;
> this has the maximum instances of failures, including a core in the run
> [2])
>
> 2) ./tests/bugs/replicate/bug-1363721.t (Replicate component owners,
> please note that there are some failures in GFID comparison that seem to
> be outside of the mux cases as well)
>
> 3) ./tests/bugs/distribute/bug-1543279.t (Distribute)
>
> ./tests/bugs/index/bug-1559004-EMLINK-handling.t (I think we need to up
> the SCRIPT timeout on this; if someone can confirm by looking at the runs
> and failures, that would help)
>
> ------ We can possibly wait to analyze things below this line, as the
> instance count is 2 or less ------
>
> 4) ./tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
>
> ./tests/bugs/snapshot/bug-1482023-snpashot-issue-with-other-processes-accessing-mounted-path.t
> ./tests/bugs/quota/bug-1293601.t
>
> 5) ./tests/bugs/distribute/bug-1161311.t
> ./tests/bitrot/bug-1373520.t
>
> Thanks,
> Shyam
>
> [1] Review containing the fix and the regression run links for logs:
> https://review.gluster.org/#/c/20022/3
>
> [2] Test with core:
> https://build.gluster.org/job/regression-on-demand-multiplex/20/
>
> On 05/14/2018 08:31 PM, Shyam Ranganathan wrote:
>> *** Calling out to Glusterd folks to take a look at this ASAP and
>> provide a fix. ***
>>
>> Further to the mail sent yesterday, work done during the day with
>> Johnny (RaghuB) points to a problem in the glusterd rpc port map having
>> stale entries for certain bricks as the cause of connection failures
>> when running in multiplex mode.
>>
>> It seems like this problem has been partly addressed in this bug:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1545048
>>
>> What is occurring now is that glusterd retains older ports in its
>> mapping table against bricks that have recently terminated; when a
>> volume is stopped and restarted, this leads to connection failures from
>> clients, as there are no listeners on the now stale port.
>>
>> The test case as in [1], when run on my F27 machine, fails 1 in 5 times
>> with the said error.
>>
>> The above does narrow down the failures in these tests:
>> - lk-quorum.t
>> - br-state-check.t
>> - entry-self-heal.t
>> - bug-1363721.t (possibly)
>>
>> The failure can be seen in the client mount logs as the wrong port
>> number being used, in messages like "[rpc-clnt.c:2069:rpc_clnt_reconfig]
>> 6-patchy-client-2: changing port to 49156 (from 0)"; when there are
>> failures, the real port of the brick-mux process is different.
>>
>> We also used gdb to inspect the glusterd pmap registry and found that
>> older, stale port map data is present (in function pmap_registry_search,
>> as clients invoke a connection).
>>
>> Thanks,
>> Shyam
>>
>> On 05/13/2018 06:56 PM, Shyam Ranganathan wrote:
>>> Hi,
>>>
>>> Nigel pointed out that the nightly brick-mux tests have now been
>>> failing for about 11 weeks, and we do not have a clean run of them.
>>>
>>> I spent some time on Friday collecting which tests failed and, to an
>>> extent, why, and filed bug
>>> https://bugzilla.redhat.com/show_bug.cgi?id=1577672
>>>
>>> Asks: whoever has cycles, please look into these failures ASAP, as the
>>> failing tests are blockers for the 4.1 release, and overall the state
>>> of master (and hence the 4.1 release branch) is not clean when these
>>> tests have been failing for over 11 weeks.
>>>
>>> Most of the tests fail on a local setup as well, so debugging them
>>> should be easier than requiring the mux regression setup; just ensure
>>> that mux is turned on, either by default in the code base you are
>>> testing, or by adding the line `TEST $CLI volume set all
>>> cluster.brick-multiplex on` to the test case after any cleanup and
>>> after starting glusterd.
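For reference, a minimal sketch of the kind of preamble described above
(the include.rc/volume.rc path depth depends on where the .t file lives,
and the replica 3 volume layout is just an example):

------------
#!/bin/bash
. $(dirname $0)/../../include.rc
. $(dirname $0)/../../volume.rc

cleanup;

TEST glusterd
TEST pidof glusterd

# Turn multiplexing on right after glusterd is up, before any volumes
# are created
TEST $CLI volume set all cluster.brick-multiplex on

TEST $CLI volume create $V0 replica 3 $H0:$B0/${V0}{0,1,2}
TEST $CLI volume start $V0

# ... rest of the test case ...

cleanup;
------------

The volume set line is the only mux-specific bit; the rest is the usual
harness boilerplate.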
>>> 1) A lot of test cases time out; of these, the following two have the
>>> most failures and hence can possibly help in getting to the root cause
>>> faster. Requesting the Glusterd and bitrot teams to look at this, as
>>> the failures do not seem to be in the replicate or client side layers
>>> (at present).
>>>
>>> (The number in brackets is the # of times the test failed in the last
>>> 13 instances of mux testing.)
>>> ./tests/basic/afr/entry-self-heal.t (4)
>>> ./tests/bitrot/br-state-check.t (8)
>>>
>>> 2)
>>> ./tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t (7)
>>>
>>> The above test consistently fails at this point:
>>> ------------
>>> 16:46:28 volume add-brick: failed: /d/backends/patchy3 is already part
>>> of a volume
>>> 16:46:28 not ok 25 , LINENUM:47
>>> 16:46:28 FAILED COMMAND: gluster --mode=script --wignore volume
>>> add-brick patchy replica 3 builder104.cloud.gluster.org:/d/backends/patchy3
>>> ------------
>>>
>>> From the logs, the failure is occurring here:
>>> ------------
>>> [2018-05-03 16:47:12.728893] E [MSGID: 106053]
>>> [glusterd-utils.c:13865:glusterd_handle_replicate_brick_ops]
>>> 0-management: Failed to set extended attribute trusted.add-brick :
>>> Transport endpoint is not connected [Transport endpoint is not connected]
>>> [2018-05-03 16:47:12.741438] E [MSGID: 106073]
>>> [glusterd-brick-ops.c:2590:glusterd_op_add_brick] 0-glusterd: Unable to
>>> add bricks
>>> ------------
>>>
>>> This suggests that the added brick is not accepting connections.
>>>
>>> 3) The following tests show behaviour similar to (2), where the AFR
>>> check for the brick being up fails after the timeout, as the brick is
>>> not accepting connections:
>>>
>>> ./tests/bugs/replicate/bug-1363721.t (4)
>>> ./tests/basic/afr/lk-quorum.t (5)
>>>
>>> I would suggest that someone familiar with the brick process and brick
>>> muxing look at these from the initialization/RPC/socket front, as these
>>> seem to be bricks that do not show errors in the logs but are failing
>>> connections.
>>>
>>> As we find different root causes, we may want separate bugs from the
>>> one filed; please do file them and post the patches, to keep moving
>>> this forward.
>>>
>>> Thanks,
>>> Shyam

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-devel