On Sat, May 9, 2015 at 2:31 AM, Loic Dachary <loic@xxxxxxxxxxx> wrote:
> Hi Yuri,
>
> It would be useful to add more information about how the nightlies are analyzed at
>
> http://tracker.ceph.com/projects/ceph-releases/wiki/HOWTO_monitor_the_automated_tests_AKA_nightlies
>
> At this point my understanding is that you look over all of them and you carry the burden of
>
> * sorting out the environmental noise
> * creating new bugs for errors for which there is no match in the tracker
> * adding a link to the failed job in pre-existing issues found in the tracker (useful to figure out the frequency and helps with debugging when there are multiple outputs / logs)
>
> You do so by using tools such as https://github.com/jcsp/scrape/blob/master/scrape.py and maybe others and you also format your mail messages so that they can be parsed by a program (although such a program does not exist yet, it could go over all your messages and build a database from the mails you sent).
>
> In the http://lists.ceph.com/private.cgi/ceph-qa-ceph.com/ archives, I see that Greg also regularly goes over the errors and other developers also do. What I'm not sure about is if it's best effort? Is there a time like bug scrubbing or sprint planning when developers say "Let's analyze QA results and dig bugs"?

I know that Yuri looks at some nightlies but I'm not sure which ones he's responsible for; I think it's the upgrade suites? In general, making sure the nightlies get analyzed is (unfortunately? maybe positively) the team lead's responsibility. Right now I think that means we each pretty much go over (or ignore) the tests covering our area as it suits us; I send emails because I find it convenient, but I know Sam mostly just makes bugs. Now that things have settled down some in the labs, I'm planning to set up a rotation amongst my team to cover them (it is *not* a small time commitment, sadly).
The most annoying part of the job is when the lab breaks: realizing that this means we can or cannot ignore such-and-such a set of symptoms, making sure that it's not something new in Ceph or the test that we changed, and then adjudicating responsibility for the fix between the very nebulous group of people whose fault or responsibility it might be. (We have a lot of hands in teuthology and a lot in the lab, not in an entirely overlapping set, and any one of them can cause breakage.) When I see a batch of runs that failed over the weekend, it's not always clear whether the problem has been resolved yet. :(
-Greg