Re: Telemetry crashes integration with Redmine

Casey Bodley <cbodley@xxxxxxxxxx> · Thu, 31 Mar 2022 11:38:47 -0400

hi Yaarit,

i've still been going through the latest batch of telemetry crashes
under the rgw project at https://tracker.ceph.com/projects/rgw/issues.
we recently discussed some ideas about duplicate detection, so i
wanted to share https://tracker.ceph.com/issues/51927 as an extreme
example. this latest batch of telemetry reports included 18 new
tracker issues with this same root cause, in addition to the 7
existing issues from the earlier batch

for context about the crash itself, this was fixed in
https://github.com/ceph/ceph/pull/43581. the root cause was the early
release of a mutex, which results in racing reads/writes to entries in
the ObjectCache

many of the crashes happen inside of ObjectCache::get() itself, and
look a lot like the original backtrace in
https://tracker.ceph.com/issues/51927. others like
https://tracker.ceph.com/issues/54917 crash just after
ObjectCache::get() returns due to corrupted memory

so the majority of the backtraces look like one of these two, at least
at/above the ObjectCache part. but below that, the backtraces diverge
quite a bit because there are several different code paths that call
into ObjectCache
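
To picture the grouping idea above, here is a minimal sketch. It assumes each backtrace is a list of frame names with the crash point first. It is not what the telemetry bot does today. The function names, the sample frames, and the issue labels A/B/C are all hypothetical. Real frames also carry offsets and addresses that would need to be normalized first. The sketch truncates each backtrace at the first frame mentioning an anchor function and groups reports by that truncated prefix, so the diverging callers below the anchor are ignored.

```python
from collections import defaultdict


def truncated_key(frames, anchor):
    """Return the frames from the crash point down to and including the
    first frame that mentions `anchor`, or None if no frame mentions it."""
    for i, frame in enumerate(frames):
        if anchor in frame:
            return tuple(frames[: i + 1])
    return None


def group_by_anchor(reports, anchor):
    """Group issue labels whose backtraces match down to the anchor frame.

    `reports` maps an issue label to its list of frames (crash point first).
    Returns (groups, ungrouped): groups maps a truncated prefix to the issue
    labels sharing it; ungrouped lists labels with no anchor frame at all.
    """
    groups = defaultdict(list)
    ungrouped = []
    for issue, frames in reports.items():
        key = truncated_key(frames, anchor)
        if key is None:
            ungrouped.append(issue)
        else:
            groups[key].append(issue)
    return dict(groups), ungrouped


# Hypothetical, simplified backtraces (not copied from any tracker issue).
reports = {
    "A": ["memcpy", "ObjectCache::get", "read_path_one", "main"],
    "B": ["memcpy", "ObjectCache::get", "read_path_two", "worker", "main"],
    "C": ["abort", "unrelated_function", "main"],
}

groups, ungrouped = group_by_anchor(reports, "ObjectCache::get")
for prefix, issues in groups.items():
    print(" <- ".join(prefix), issues)
print("no anchor frame:", ungrouped)
```

A prefix match like this would only catch the first kind of backtrace, where the crash happens inside ObjectCache::get() itself. The second kind, which crashes just after the call returns, may not contain the anchor frame at all and would still need manual triage.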

On Tue, Jul 27, 2021 at 12:09 AM Yaarit Hatuka <yhatuka@xxxxxxxxxx> wrote:
>
> Hi Matt,
>
> Many thanks for your feedback.
> In the long run we want Redmine to be the source of truth for all reported crashes of supported versions. The main challenge we are trying to solve now is how to prioritize opening this massive number of Redmine issues.
> All crash signatures can be viewed via a dedicated dashboard [1], so signatures with a single crash event are not filtered out indefinitely; developers can always access them. And of course, once new crash events are reported for any of these ~1,400 signatures, their priority changes. When the bot opens / updates an issue, it includes a link to the dashboard, which holds dynamic statistics about the signature, including the number of clusters affected and graphs of crash occurrences over time and across versions.
>
> I think it is not straightforward to decide with certainty which signatures are of higher priority, but we are aiming for a best effort in the initial Redmine setup.
>
> [1] http://telemetry.front.sepia.ceph.com:4000/d/GiO_B8bMz
>
>
> On Mon, Jul 26, 2021 at 5:05 PM Matthias Muench <mmuench@xxxxxxxxxx> wrote:
>>
>> Hi,
>>
>> from my past experience, I remember that even a small number of crashes can flag significant issues that occur only under certain conditions. Those conditions might not be widespread among the clusters reporting telemetry but might affect numerous others. So opening an issue only after a set number of events might not be the best approach and might hide important dependencies. Some of these issues might only manifest under additional conditions but could shed light on others.
>> Couldn't we just tag those new issues with a "low number" flag instead of ignoring them, and change it to "profound" later on? This would perhaps give a better understanding of how things relate to the number of clusters reporting, and perhaps illustrate a timeline, too?
>> I'm not a developer, so this might not be well thought through.
>>
>> Thanks a lot,
>> -matt
>>
>> On 26.07.21 21:40, Yaarit Hatuka wrote:
>>
>> Hi everyone,
>>
>> tl;dr:  We wish to open / update issues in tracker.ceph.com for each crash signature received via telemetry. There are ~2.5K signatures. We wish to do it in a way which makes sense to developers. Please share your suggestions.
>>
>> Users who have opted-in to telemetry, and specifically its ‘crash’ channel, send daily anonymized information about the crashes that occurred within their clusters. This information includes the crashed daemon name, its version, the backtrace, the crash’s signature (a fingerprint which represents similar crash events), the assert function and condition (if applicable), etc.
>>
>> Our goal is to make these telemetry crash reports available and actionable to developers, and to be able to track their statuses. For this we need to have an associated Redmine issue for each crash signature.
>>
>> Currently there are ~2,500 signatures that should be tracked. An integration bot [1] can open / update corresponding Redmine issues [2, 3], but we do not want to overwhelm developers with a massive number of new issues all at once.
>>
>> In the CLT meeting Ilya suggested having a crash count threshold, so we only open issues for signatures with at least 2 crash events; or even to combine this with the number of clusters affected by the crash signature. Neha suggested that we include signatures of recent releases, regardless of the number of clusters affected by them.
>>
>> There are ~1,400 signatures with only one crash event so far. See [4] for a breakdown by version.
>> This leaves us with ~1,100 signatures (plus 61 of version 15.2.13 and 6 of 16.2.5, plus future signatures). Should we handle X of them every week? For instance, open 100 new issues per week? Should we prioritize these by versions and number of clusters affected? What cadence would make the most sense, bug-scrub-wise?
>>
>> We will discuss this topic at our next CDM; please join.
>>
>> Thanks!
>> Yaarit
>>
>>
>> [1] https://pad.ceph.com/p/telemetry-redmine-bot
>> [2] https://tracker.ceph.com/issues/51756
>> [3] https://tracker.ceph.com/issues/49666
>> [4] Count of signatures with a single crash event, by version:
>> {15.2.8} 319
>> {15.2.5} 159
>> {16.2.4} 148
>> {15.2.7} 146
>> {15.2.9} 122
>> {15.2.4} 115
>> {15.2.10} 79
>> {15.2.13} 61
>> {15.2.11} 39
>> {16.2.0} 36
>> {15.2.6} 30
>> {16.2.1} 28
>> {15.2.3} 26
>> {15.2.1} 26
>> {16.2.3} 12
>> {15.2.12} 9
>> {15.2.0} 9
>> {15.1.0} 6
>> {16.2.5} 6
>> {15.2.2} 5
>> {15.0.0} 4
>> {16.1.0} 2
>> {16.0.0} 2
>> {16.2.2} 1
>>
>>
>> _______________________________________________
>> Dev mailing list -- dev@xxxxxxx
>> To unsubscribe send an email to dev-leave@xxxxxxx
>>
>>