Re: POC- Distributed regression testing framework

Amar Tumballi <atumball@xxxxxxxxxx> · Mon, 25 Jun 2018 19:28:04 +0530

On Mon, Jun 25, 2018 at 7:17 PM, Deepshikha Khandelwal <dkhandel@xxxxxxxxxx> wrote:
Hello folks,

For the last few months, I've been working on bringing distributed
regression testing to production. Our regression framework takes
more than 4 hours to run on a single machine. To reduce the waiting
time, Facebook contributed a distributed test runner.

The solution supports the following:

1) Shares a worker pool across different testers.

2) Retries a failing test up to 3 times, on 3 different machines, before calling it a failure.

3) Supports running ASAN, Valgrind, and ASAN without leaks.

4) Stores the failed test logs on a centralized server[1].
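
To illustrate point 2, here is a minimal sketch of retrying a failed test on up to 3 different machines. It is not the runner's actual code: the function name `run_with_retries`, the `run_on` callback, and the worker names are hypothetical, and the real framework may pick machines differently.

```python
from typing import Callable, Sequence


def run_with_retries(test: str,
                     workers: Sequence[str],
                     run_on: Callable[[str, str], bool],
                     max_attempts: int = 3) -> bool:
    """Return True if `test` passes on any of up to `max_attempts` distinct workers.

    `run_on(worker, test)` is a hypothetical callback that runs one test on
    one machine and returns True on success.
    """
    # Drop duplicate worker names while keeping order, so that every retry
    # lands on a machine that has not already failed this test.
    distinct = list(dict.fromkeys(workers))
    for worker in distinct[:max_attempts]:
        if run_on(worker, test):
            return True
    # Only after failing on every attempted machine is the test a failure.
    return False
```

Retrying on a different machine each time helps separate a genuinely broken test from a problem with one particular worker.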

distributed-regression[2] is the Jenkins job for this distributed
regression testing.

There are currently a few known issues:

* Not collecting the entire logs (/var/log/glusterfs) from servers.

If I look at the activities involved with regression failures, this can wait.

* A few tests, such as the geo-rep tests, fail due to infra-related issues.

Please open bugs for these, so we can track them and take them to closure.

* Takes ~80 minutes with 7 distributed servers (targeting 60 minutes).

The time can change as more tests are added. Also, please plan to support any number of servers, from 1 to n.

* We've only tested plain regressions. ASAN and Valgrind are currently untested.

It would be great to have these running, not 'per patch', but nightly or weekly to start with.
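
On the point above about supporting anywhere from 1 to n servers, a minimal sketch of splitting a test list across a configurable number of servers might look like the following. The function name `split_tests` and the round-robin strategy are assumptions made for illustration; the actual runner may balance work differently, for example by test duration.

```python
from typing import List, Sequence


def split_tests(tests: Sequence[str], num_servers: int) -> List[List[str]]:
    """Distribute `tests` round-robin over `num_servers` chunks.

    With num_servers == 1 the single chunk is the full list, which matches
    an ordinary single-machine run.
    """
    if num_servers < 1:
        raise ValueError("num_servers must be at least 1")
    chunks: List[List[str]] = [[] for _ in range(num_servers)]
    for index, test in enumerate(tests):
        # Test i goes to server i mod n, so chunk sizes differ by at most 1.
        chunks[index % num_servers].append(test)
    return chunks
```

Round-robin keeps the chunk sizes even but ignores how long each test takes, so the slowest chunk still determines the total run time.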

Before bringing it into production, we'll run this job nightly and
watch it for a month to debug the other failures.

I would say bring it to production sooner, say in 2 weeks, and also plan to keep the current regression as is, triggered by a special command like 'run regression in-one-machine' in Gerrit (or something similar), with voting rights, so we can fall back to this method if something is broken in parallel testing.

I have seen that regardless of how much time we spend testing scripts, something breaks the day we move to production. So let that happen sooner rather than later, so that it helps the next release branch out. We don't want to be stuck on branching due to infra failures.

Regards,
Amar

Please let us know if you find any issues.

[1] https://ci-logs.gluster.org

[2] https://build.gluster.org/job/distributed-regression

Regards,

Deepshikha Khandelwal

_______________________________________________

Gluster-devel mailing list

Gluster-devel@xxxxxxxxxxx

http://lists.gluster.org/mailman/listinfo/gluster-devel

-- 
Amar Tumballi (amarts)
