On Wed, Oct 3, 2018 at 3:26 PM Deepshikha Khandelwal <dkhandel@xxxxxxxxxx> wrote:
Hello folks,
Distributed-regression job[1] is now a part of Gluster's
nightly-master build pipeline. The following are the issues we have
resolved since we started working on this:
1) Collecting gluster logs from servers.
2) Tests that failed due to infra-related issues have been fixed.
3) The time taken to run regression testing has been reduced to ~50-60 minutes.
Getting the time down to 40 minutes needs your help!
Currently, there is a test that is failing:
tests/bugs/glusterd/optimized-basic-testcases-in-cluster.t
This needs fixing first.
Where can I get the logs of this test case? In https://build.gluster.org/job/distributed-regression/264/console I see this test case failed and was re-attempted, but I couldn't find the logs.
There's a test that takes 14 minutes to complete -
`tests/bugs/index/bug-1559004-EMLINK-handling.t`. A single test taking
14 minutes is not something we can distribute. Can we look at how to
speed it up[2]? When this test fails, it is re-attempted,
further increasing the time. This happens in the regular
centos7-regression job as well.
If you see any other issues, please file a bug[3].
[1]: https://build.gluster.org/job/distributed-regression
[2]: https://build.gluster.org/job/distributed-regression/264/console
[3]: https://bugzilla.redhat.com/enter_bug.cgi?product=glusterfs&component=project-infrastructure
Thanks,
Deepshikha Khandelwal
On Tue, Jun 26, 2018 at 9:02 AM Nigel Babu <nigelb@xxxxxxxxxx> wrote:
>
>
>
> On Mon, Jun 25, 2018 at 7:28 PM Amar Tumballi <atumball@xxxxxxxxxx> wrote:
>>
>>
>>
>>> There are currently a few known issues:
>>> * Not collecting the entire logs (/var/log/glusterfs) from servers.
>>
>>
>> If I look at the activities involved with regression failures, this can wait.
>
>
> Well, we can't debug the current failures without having the logs. So this has to be fixed first.
>
>>
>>
>>>
>>> * A few tests fail due to infra-related issues like geo-rep tests.
>>
>>
>> Please open bugs for this, so we can track them, and take it to closure.
>
>
> These are failing due to infra reasons. Most likely subtle differences in the setup of these nodes vs our normal nodes. We'll only be able to debug them once we get the logs. I know the geo-rep ones are easy to fix. The playbook for setting up geo-rep correctly just didn't make it over to the playbook used for these images.
>
>>
>>
>>>
>>> * Takes ~80 minutes with 7 distributed servers (targeting 60 minutes)
>>
>>
>> The time can change as more tests are added; also, please plan to support the number of servers ranging from 1 to n.
>
>
> While n is configurable, it will be fixed to a single-digit number for now. We will need to place *some* limit somewhere, or else we won't be able to control our cloud bills.
>
>>
>>
>>>
>>> * We've only tested plain regressions. ASAN and Valgrind are currently untested.
>>
>>
>> It would be great to have it running not 'per patch', but nightly, or weekly to start with.
>
>
> This is currently not targeted until we phase out current regressions.
>
>>>
>>>
>>> Before bringing it into production, we'll run this job nightly and
>>> watch it for a month to debug the other failures.
>>>
>>
>> I would say bring it to production sooner, say in 2 weeks, and also plan to keep the current regression as-is, triggered by a special Gerrit command like 'run regression in-one-machine' (or something similar) with voting rights, so we can fall back to that method if something is broken in parallel testing.
>>
>> I have seen that regardless of how much time we spend testing scripts, the day we move to production something will break. So let that happen earlier rather than later, so it helps with branching out the next release. We don't want branching to be blocked by infra failures.
>
>
> Having two regression jobs that can vote is going to cause more confusion than it's worth. There are a couple of intermittent memory issues with the test script that we need to debug and fix before I'm comfortable making this a voting job. We've worked around these problems for now, but they still pop up now and again. The fact that things break often is not an excuse to permit avoidable failures. The one-month timeline was chosen with all these factors in consideration. The 2-week timeline is a no-go at this point.
>
> When we are ready to make the switch, we won't be switching the job over 100% at once. We'll start with a sliding scale so that we can monitor failures and machine creation adequately.
>
> --
> nigelb
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel
Thanks,
Sanju