Re: Analyzing the nightlies

Gregory Farnum <greg@xxxxxxxxxxx> · Mon, 11 May 2015 15:50:21 -0700

On Sat, May 9, 2015 at 2:31 AM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
> Hi Yuri,
>
> It would be useful to add more information about how the nightlies are analyzed at
>
>    http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO_monitor_the_automated_tests_AKA_nightlies
>
> At this point my understanding is that you look over all of them and you carry the burden of
>
> * sorting out the environmental noise
> * creating new bugs for errors for which there is no match in the tracker
> * adding a link to the failed job in pre-existing issues found in the tracker (useful to figure out the frequency, and it helps with debugging when there are multiple outputs / logs)
>
> You do so by using tools such as https://github.com/jcsp/scrape/blob/master/scrape.py and maybe others. You also format your mail messages so that they can be parsed by a program (such a program does not exist yet, but it could go over all the messages you sent and build a database from them).
>
> In the http://lists.ceph.com/private.cgi/ceph-qa-ceph.com/ archives, I see that Greg also regularly goes over the errors, as do other developers. What I'm not sure about is whether it's best effort. Is there a time, like bug scrubbing or sprint planning, when developers say "Let's analyze QA results and dig into bugs"?

I know that Yuri looks at some nightlies but I'm not sure which ones
he's responsible for — I think it's the upgrade suites?
In general analyzing the nightlies is (unfortunately? maybe
positively) the team lead's responsibility to make happen. Right now I
think that means we each pretty much go over (or ignore) the tests
covering our area as it suits us; I send emails because I find it
convenient but I know Sam mostly just makes bugs. Now that things have
settled down some in the labs I'm planning to set up a rotation
amongst my team to cover them (it is *not* a small time commitment,
sadly).

The most annoying part of the job is when the lab breaks — realizing
that this means we can or cannot ignore such-and-such a set of
symptoms, making sure that it's not something new in Ceph or the test
that we changed, and then adjudicating responsibility for the fix
between the very nebulous group of people whose fault or
responsibility it might be. (We have a lot of hands in teuthology and
a lot in the lab, not in an entirely overlapping set, and any one of
them can cause breakage.) When I see a batch of runs that failed over
the weekend, it's not always clear whether the problem has been
resolved yet. :(
-Greg