Re: good job on fixing heavy hitters in spurious regressions

On 05/08/2015 01:27 PM, Pranith Kumar Karampuri wrote:

On 05/08/2015 06:45 PM, Shyam wrote:
On 05/08/2015 08:16 AM, Jeff Darcy wrote:
Here are some of the things that I can think of: 0) Maintainers
should also maintain tests that are in their component.

It is not possible for me as glusterd co-maintainer to 'maintain'
tests that are added under tests/bugs/glusterd. Most of them don't
test core glusterd functionality.  They are almost always tied to a
particular feature whose implementation had bugs in its glusterd code.
I would expect the test authors (esp. the more recent ones) to chip
in.

Good point.  Nobody should be penalized for having code that everyone
else touches (or rewarded for having code that nobody dares to).
First responsibility for debugging a regression-test failure lies
with the owner of the patch that failed.  If they determine that the
failure is spurious - which is easy if it's already on a list - then
responsibility falls to the owner of the test.  Either should be able
to draw on the expertise of others in the group, but that doesn't
shift *responsibility*.  Only when a problem has been tracked down to
a particular piece of production code should responsibility move
again - either to the person whose earlier patch caused the breakage,
or to the subsystem maintainer.

+1, could not have said it better, but chipping in my vote for the order:
owner of patch first -> owner of test -> maintainer of module, with the
responsibilities as described above.
I don't know, this order failed me so many times in the past with spurious
failures, both for 3.6.0 (I sent around 15 patches at that time to fix
spurious failures) and now for 3.7.0. I also had to spend a reasonable
amount of time, including the long weekend, away from ec deliverables for
this release. I am very happy that a lot more people helped solve the
problems this time around, because I feel they experienced the pain of
screwed-up regressions and realize the importance of fixing them.
I'll tell you what experiences led me to suggest that the maintainer
take this responsibility.

Say I submit a patch for a new component, or one that changes the log level
of a message from INFO to DEBUG in a function that has not a single caller,
so the changed code is not going to be executed at all. Yet the regressions
will fail. I am 100% sure it has nothing to do with my patch. I have neither
the time nor the expertise to debug a test I have no clue about, so the
least I can do is notify the people who may be able to do something about
it, i.e. the owner of the test or the maintainer of the module. You feel we
should ask the owner of the test what the problem is, but the owner of the
test has moved on to a different component and is busy with their own work.
So you are left with going to the maintainer, who tells you exactly what the
problem is and why as soon as you show them the test name, and you end up
wondering why you didn't ask him/her first.

I agree on the experience and have no questions/comments on that.

The deal is, though, we at least had people (including myself) ignoring spurious failures and re-triggering jobs just to get that +1 Verified and move on. That causes issues, as the failures could at least have been flagged so that others are aware of them and can take them forward. That is why the patch submitter is the first to take a stab at correcting the problem.

We can always mail the owner next and then move on to the maintainer (which is what you did).

I would say we stop any merge whose history shows no analysis of regression failures (assuming regression failures occurred); enforcing this is the maintainer's role. (Changing tracks: let's please be more vocal in Gerrit. We can mark addressed issues as "Done" and also write a few lines about what is going on in terms of tests done, etc.)

On people moving on and having no time for older areas/code: I guess we have to be stronger maintainers, as the maintainer is the only one left to deal with the problem then. In that case, we should deal with this sooner rather than later, mandating _quality_ test cases for changes rather than accepting things on the face of the code. Of course this is not correctable in retrospect now, so it is something for the future.



Mostly this is just common sense.  Perhaps the change that's needed
is to make the fixing of likely-spurious test failures a higher
priority than adding new features.  That has to be reflected not
only in Bugzilla, but also in how we schedule individual developers'
time and evaluate their progress toward goals.
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel
