Re: Difference in bad_tests count in mainline vs 3.7 branch

Raghavendra Talur <rtalur@xxxxxxxxxx> · Tue, 8 Sep 2015 01:55:22 +0530

On Fri, Sep 4, 2015 at 12:56 PM, Raghavendra Talur <rtalur@xxxxxxxxxx> wrote:

Maintainers - can you please take stock of this and ensure sanity of your components before merging patches that do not fix a failing test?

Here is my proposal to get this fixed.

This weekend, 5th September 0400 UTC, I will start a jenkins run on the master and 3.7 branches.
- It will be rebased onto the latest code just before it is run, so all patches merged by 4th September will be tested.
- It will run each test 10 times in succession. Why 10?
  - I hope to find tests that fail only occasionally.
  - If a test fails only on the 1st run, it could very well be a cleanup issue with the test that ran before it.
  - Failures that follow a pattern within the 10 runs are again indicative of some cleanup/timeout error.
- It will run all tests and not stop at the first failure.
- I will modify the scripts to get maximum data from the logs. (They will still be INFO level logs.)
- After the run completes, I will file a bug against the component of each .t test that fails in this run and immediately add the test to the bad tests list.
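As a rough illustration of the "run each test 10 times and look at the pattern" idea, here is a minimal Python sketch. It is not how run-tests.sh works; the function names, the category labels, and the command passed to `run_repeatedly` are all assumptions made up for this example.

```python
import subprocess
import sys
from typing import List


def run_repeatedly(cmd: List[str], times: int = 10) -> List[bool]:
    """Run cmd `times` times in succession; True means exit status 0.

    Every run is executed even if an earlier one failed.
    """
    outcomes = []
    for _ in range(times):
        proc = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        outcomes.append(proc.returncode == 0)
    return outcomes


def classify_runs(passed: List[bool]) -> str:
    """Label the pass/fail pattern of repeated runs (labels are hypothetical)."""
    failures = [i for i, ok in enumerate(passed) if not ok]
    if not failures:
        return "stable"
    if len(failures) == len(passed):
        return "always-fails"
    if failures == [0]:
        # Only the first run failed: suspect leftovers from the previous test.
        return "first-run-only"
    gaps = {b - a for a, b in zip(failures, failures[1:])}
    if len(failures) >= 3 and len(gaps) == 1:
        # Evenly spaced failures: suspect a cleanup or timeout problem.
        return "periodic"
    return "intermittent"


if __name__ == "__main__":
    # Stand-in command; a real run would invoke a .t test instead.
    outcomes = run_repeatedly([sys.executable, "-c", "raise SystemExit(0)"], 3)
    print(classify_runs(outcomes))
```

A "first-run-only" or "periodic" result would point at cleanup or timeouts, while "intermittent" points at a genuinely flaky test.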
What should the maintainers do after that?
If a bug is filed against your component, please spend some time on Monday and root-cause the issue by Monday EOD.
If the root cause proves that the bug is in the .t file:
- It would mostly be because:
  - The timeouts are not always long enough. Change the EXPECT_WITHIN values and check.
  - The test is not deterministic enough; some of the assumptions that the test makes might not always be true. For example, a SIGTERM followed by a TEST which assumes that the process is definitely killed is a wrong assumption. Use SIGKILL in such cases. (I know SIGKILL may not work either if the process is in D state, but it's a good enough example.)
- It is easier to fix bugs in a .t file once the root cause is found. Please fix the issue and remove the test from the bad tests list. Use the bug filed against this .t file.
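To make the SIGTERM point concrete, here is a minimal Python sketch (POSIX only) of the pattern: send SIGTERM, poll with a deadline until the process is really gone, and only then escalate to SIGKILL. The helper names `wait_until` and `stop_process` are hypothetical; the polling helper is only in the spirit of EXPECT_WITHIN and is not how include.rc implements it.

```python
import signal
import subprocess
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool], timeout: float,
               interval: float = 0.1) -> bool:
    """Poll predicate until it returns True or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def stop_process(proc: subprocess.Popen, grace: float = 2.0) -> str:
    """Stop proc without assuming that SIGTERM alone was enough."""
    proc.send_signal(signal.SIGTERM)
    # Do not assume the process is gone; check, with a deadline.
    if wait_until(lambda: proc.poll() is not None, grace):
        return "terminated"
    proc.kill()  # SIGKILL on POSIX
    # Raises subprocess.TimeoutExpired if the process still has not exited,
    # for example when it is stuck in uninterruptible (D state) sleep.
    proc.wait(timeout=grace)
    return "killed"
```

A test that asserts on state after stopping a daemon would call something like `stop_process` first and proceed only once it returns, rather than sending a signal and checking immediately.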
If the root cause proves that the bug is in Gluster code:
- If the bug is in the same component as the .t file:
  - In this case you are the component owner. Change the description and summary of the bug filed to indicate the actual issue.
  - If the time required to fix the issue in Gluster code is non-minimal:
    - Put a workaround in the .t file, with a comment clearly stating the number of the bug that will later fix it, and remove the test from the bad tests list.
    - If a workaround is not possible, let the test remain in the bad tests list.
- If the bug is not in the same component as the .t file:
  - Update the bug with details which prove that the bug is not in the same component, and change the component accordingly.
  - It is the new owner's responsibility to provide a workaround for all .t files hit by the issue and to fix the code.
Note to all maintainers:
I would request everyone to resist merging patches this weekend unless critically required. It would help us in debugging on Monday.

I did try this over the weekend. Refer to the patch at http://review.gluster.org/#/c/12109/.

However, I discovered that once certain tests failed in a run, the tests after them kept failing as well,
which indicates that our cleanup function is not sufficient/complete.
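One way to spot this kind of cascade in a run's results is to look for an unbroken streak of failures at the end of the run. The Python sketch below assumes the results are available as an ordered list of (test name, passed) pairs; that format, the function name, and the test names are all made up for illustration.

```python
from typing import List, Tuple


def trailing_failure_streak(results: List[Tuple[str, bool]],
                            min_len: int = 3) -> List[str]:
    """Return the tests in the unbroken failure streak that ends the run.

    An empty list means the run did not end in at least min_len
    consecutive failures.
    """
    streak: List[str] = []
    for name, passed in results:
        if passed:
            streak = []  # a passing test breaks the streak
        else:
            streak.append(name)
    return streak if len(streak) >= min_len else []


if __name__ == "__main__":
    run = [("a.t", True), ("b.t", False), ("c.t", True),
           ("d.t", False), ("e.t", False), ("f.t", False)]
    # The first name in the streak is the test to inspect for missing cleanup.
    print(trailing_failure_streak(run))
```

The first test in the returned streak is the likeliest place where cleanup left something behind; the later failures may just be fallout.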

I will be working on fixing a few functions in run-tests.sh and include.rc before coming back to this next weekend.

Let's hope that when we do a similar jenkins run next weekend, September 12th, we don't find any failures.

Suggestions welcome for any changes in the above plan.

Thanks,
Raghavendra Talur
