Re: Brick-Mux tests failing for 11+ weeks

Some further analysis, based on Mohit's comments on the patch:

1) gf_attach, which is used to kill a brick, is taking more time than
expected, causing timeouts in tests, mainly br-state-check.t. This
usually happens when there are back-to-back kill_brick calls in the test.

2) The problem in ./tests/bugs/replicate/bug-1363721.t seems to be that
kill_brick has not completed before an attach request arrives, causing
the attach to be treated as a duplicate and hence dropped/ignored?
(speculation)

I am writing a test case to see if this is reproducible in a short,
isolated case!
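
A rough sketch of what I have in mind is below. This is a hypothetical
.t fragment, not the actual test; it assumes the usual
include.rc/volume.rc helpers (kill_brick, brick_up_status,
$PROCESS_UP_TIMEOUT) and illustrative volume/brick names:
------------
#!/bin/bash
. $(dirname $0)/../../include.rc
. $(dirname $0)/../../volume.rc

cleanup;

TEST glusterd
TEST $CLI volume set all cluster.brick-multiplex on
TEST $CLI volume create $V0 replica 3 $H0:$B0/${V0}{0,1,2}
TEST $CLI volume start $V0

# kill (detach) a brick and immediately force a start, so that the
# attach request races with the possibly still in-progress kill
TEST kill_brick $V0 $H0 $B0/${V0}0
TEST $CLI volume start $V0 force

# if the attach was dropped as a duplicate, this check should time out
EXPECT_WITHIN $PROCESS_UP_TIMEOUT "1" brick_up_status $V0 $H0 $B0/${V0}0

cleanup;
------------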

The above replicate test also seems to have a different issue when it
compares md5sums towards the end of the test (this can be seen in the
console logs), which appears to be unrelated to brick-mux (see
https://build.gluster.org/job/centos7-regression/853/console for an
example). It would be nice if someone from the replicate team took a
look at this one.

3) ./tests/bugs/index/bug-1559004-EMLINK-handling.t seems to be a
timeout in most (if not all) cases, stuck in the last iteration.

I will be modifying the patch (discussed in this thread) to allow more
time for the failures in (1) and (3), and fire off a few more
regressions, as I try to reproduce (2).

Shyam
P.S: If work is already happening on these issues, please post the
data/analysis to the lists; it reduces rework!

On 05/15/2018 09:10 PM, Shyam Ranganathan wrote:
> Hi,
> 
> After the fix provided by Atin here [1] for the issue reported below, we
> ran 7-8 brick-mux regression runs against it, and only about 1 in 3 runs
> were successful (even those had some tests retried). The run links are
> in the review at [1].
> 
> The failures are as below, sorted in descending order of frequency.
> Requesting respective component owners/peers to take a stab at root
> causing these, as the current pass rate is not sufficient to qualify the
> release (or master) as stable.
> 
> 1) ./tests/bitrot/br-state-check.t (bitrot folks please take a look,
> this has the maximum instances of failures, including a core in the run [2])
> 
> 2) ./tests/bugs/replicate/bug-1363721.t (Replicate component owners
> please note, there are some failures in GFID comparison that seem to be
> outside of mux cases as well)
> 
> 3) ./tests/bugs/distribute/bug-1543279.t (Distribute)
> 
> ./tests/bugs/index/bug-1559004-EMLINK-handling.t (I think we need to up
> the SCRIPT timeout on this; if someone can confirm by looking at the runs
> and failures, that would help)
> 
> ------ We can possibly wait to analyze things below this line as the
> instance count is 2 or less ------
> 
> 4)  ./tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t
> 
> ./tests/bugs/snapshot/bug-1482023-snpashot-issue-with-other-processes-accessing-mounted-path.t
>     ./tests/bugs/quota/bug-1293601.t
> 
> 5)  ./tests/bugs/distribute/bug-1161311.t
>     ./tests/bitrot/bug-1373520.t
> 
> Thanks,
> Shyam
> 
> [1] Review containing the fix and the regression run links for logs:
> https://review.gluster.org/#/c/20022/3
> 
> [2] Test with core:
> https://build.gluster.org/job/regression-on-demand-multiplex/20/
> On 05/14/2018 08:31 PM, Shyam Ranganathan wrote:
>> *** Calling out to Glusterd folks to take a look at this ASAP and
>> provide a fix. ***
>>
>> Further to the mail sent yesterday, work done during the day with Johnny
>> (RaghuB) points to a problem in the glusterd rpc port map having stale
>> entries for certain bricks as the cause of the connection failures seen
>> when running in multiplex mode.
>>
>> It seems like this problem has been partly addressed in this bug:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1545048
>>
>> What is occurring now is that glusterd retains older ports in its
>> mapping table against bricks that have recently terminated. When a
>> volume is stopped and restarted, this leads to connection failures from
>> clients, as there are no listeners on the now-stale port.
>>
>> The test case in [1], when run on my F27 machine, fails 1 in 5 times
>> with the said error.
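>>
>> The sketch below only illustrates the stop/restart-then-mount pattern,
>> it is not the actual test in [1]; it uses the usual include.rc/volume.rc
>> helpers ($GFS, $M0, etc.):
>> ------------
>> TEST $CLI volume stop $V0
>> TEST $CLI volume start $V0
>>
>> # with a stale port left in glusterd's pmap, the client is redirected
>> # to the old port and the mount/IO fails intermittently
>> TEST $GFS -s $H0 --volfile-id $V0 $M0
>> TEST touch $M0/testfile
>> ------------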
>>
>> The above does narrow down failures in tests:
>> - lk-quorum.t
>> - br-state-check.t
>> - entry-self-heal.t
>> - bug-1363721.t (possibly)
>>
>> The failure can be seen in the client mount logs as the use of a wrong
>> port number, in messages like "[rpc-clnt.c:2069:rpc_clnt_reconfig]
>> 6-patchy-client-2: changing port to 49156 (from 0)"; when there are
>> failures, the real port for the brick-mux process is different.
>>
>> We also used gdb to inspect the glusterd pmap registry and found that
>> older, stale port map data is present (inspected in the function
>> pmap_registry_search as clients invoke a connection).
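>>
>> Roughly, the inspection was of this form (from memory, so frame and
>> variable names may differ slightly in your build):
>> ------------
>> # attach to the running glusterd
>> gdb -p $(pidof glusterd)
>> (gdb) break pmap_registry_search
>> (gdb) continue
>> # trigger a client (re)connect, then at the breakpoint examine the
>> # brick being looked up and the registry entries glusterd holds
>> (gdb) print brickname
>> (gdb) print *pmap_registry_get(this)
>> ------------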
>>
>> Thanks,
>> Shyam
>>
>> On 05/13/2018 06:56 PM, Shyam Ranganathan wrote:
>>> Hi,
>>>
>>> Nigel pointed out that the nightly brick-mux tests have now been failing
>>> for about 11 weeks, and we do not have a clean run of the same.
>>>
>>> Spent some time on Friday collecting what tests failed and to an extent
>>> why, and filed bug https://bugzilla.redhat.com/show_bug.cgi?id=1577672
>>>
>>> Ask: Whoever has cycles, please look into these failures ASAP; these
>>> failing tests are blockers for the 4.1 release, and the overall state of
>>> master (and hence the 4.1 release branch) is not clean when these tests
>>> have been failing for over 11 weeks.
>>>
>>> Most of the tests fail when run on a local setup as well, so debugging
>>> them should be easier and does not require the mux regression setup;
>>> just ensure that mux is turned on, either by default in the code base
>>> you are testing, or by adding the line `TEST $CLI volume set all
>>> cluster.brick-multiplex on` to the test case after any cleanup and after
>>> starting glusterd.
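>>>
>>> As an example, the preamble of such a test would look roughly like this
>>> (sketch only; the include paths depend on where the test lives):
>>> ------------
>>> . $(dirname $0)/../../include.rc
>>> . $(dirname $0)/../../volume.rc
>>>
>>> cleanup;
>>>
>>> TEST glusterd
>>> TEST pidof glusterd
>>> # enable brick multiplexing before creating/starting any volumes
>>> TEST $CLI volume set all cluster.brick-multiplex on
>>> ------------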
>>>
>>> 1) A lot of test cases time out; of these, the following 2 have the most
>>> failures, and hence can possibly help with debugging the root cause
>>> faster. Requesting the Glusterd and bitrot teams to look at this, as the
>>> failures do not seem to be in the replicate or client-side layers (at
>>> present).
>>>
>>> (the number in parentheses is the number of times the test failed in the
>>> last 13 instances of mux testing)
>>> ./tests/basic/afr/entry-self-heal.t (4)
>>> ./tests/bitrot/br-state-check.t (8)
>>>
>>> 2)
>>> ./tests/bugs/glusterd/add-brick-and-validate-replicated-volume-options.t (7)
>>>
>>> The above test consistently fails at this point:
>>> ------------
>>> 16:46:28 volume add-brick: failed: /d/backends/patchy3 is already part
>>> of a volume
>>> 16:46:28 not ok 25 , LINENUM:47
>>> 16:46:28 FAILED COMMAND: gluster --mode=script --wignore volume
>>> add-brick patchy replica 3 builder104.cloud.gluster.org:/d/backends/patchy3
>>> ------------
>>>
>>> From the logs, the failure is occurring here:
>>> ------------
>>> [2018-05-03 16:47:12.728893] E [MSGID: 106053]
>>> [glusterd-utils.c:13865:glusterd_handle_replicate_brick_ops]
>>> 0-management: Failed to set extended attribute trusted.add-brick :
>>> Transport endpoint is not connected [Transport endpoint is not connected]
>>> [2018-05-03 16:47:12.741438] E [MSGID: 106073]
>>> [glusterd-brick-ops.c:2590:glusterd_op_add_brick] 0-glusterd: Unable to
>>> add bricks
>>> ------------
>>>
>>> This suggests that the added brick is not accepting connections.
>>>
>>> 3) The following tests show behaviour similar to (2): the AFR check for
>>> the brick being up fails after a timeout, as the brick is not accepting
>>> connections (see the sketch after the list below).
>>>
>>> ./tests/bugs/replicate/bug-1363721.t (4)
>>> ./tests/basic/afr/lk-quorum.t (5)
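>>>
>>> (The check that times out is typically of this form, using the helpers
>>> from volume.rc; the exact child index and timeout vary per test:)
>>> ------------
>>> EXPECT_WITHIN $CHILD_UP_TIMEOUT "1" afr_child_up_status $V0 0
>>> ------------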
>>>
>>> I would suggest that someone familiar with the mux process and brick
>>> muxing look at these from the initialization/RPC/socket front, as these
>>> seem to be bricks that do not show errors in the logs but are failing
>>> connections.
>>>
>>> As we find different root causes, we may want to file bugs separate from
>>> the one above; please do so, and post patches, in an effort to move this
>>> forward.
>>>
>>> Thanks,
>>> Shyam
>>> _______________________________________________
>>> Gluster-devel mailing list
>>> Gluster-devel@xxxxxxxxxxx
>>> http://lists.gluster.org/mailman/listinfo/gluster-devel
> _______________________________________________
> Gluster-devel mailing list
> Gluster-devel@xxxxxxxxxxx
> http://lists.gluster.org/mailman/listinfo/gluster-devel
> 
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-devel


