Re: Telemetry crashes integration with Redmine

Matthias Muench <mmuench@xxxxxxxxxx> · Mon, 26 Jul 2021 23:04:44 +0200



    Hi,

    
    From past experience, I remember that even a small number of
    crashes can flag significant issues that occur only under certain
    conditions. Those conditions might be rare among the telemetry
    subscribers but could still affect numerous other users. So opening
    an issue only once a set number of events is reached might not be
    the best approach, and could hide important dependencies. Some of
    these issues may manifest only under additional conditions, yet
    they could shed better light on other issues.

    Couldn't we just tag those new issues with a "low number" flag
    instead of ignoring them, and re-tag them as "profound" issues
    later on? This might give a better understanding of how the issues
    relate to the number of clusters reporting, and perhaps illustrate
    a timeline, too.
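
    As a rough illustration of the tagging idea (a sketch only: the tag
    names, the `Signature` fields and the cut-off of 2 events are
    assumptions for the example, not anything the integration bot or
    Redmine actually defines):

```python
from dataclasses import dataclass

# Hypothetical tag names and cut-off; neither comes from the bot or from Redmine.
LOW_NUMBER = "low number"
PROFOUND = "profound"
PROMOTE_AT_EVENTS = 2  # assumed, mirroring the "at least 2 crash events" suggestion


@dataclass
class Signature:
    sig: str       # crash signature (fingerprint)
    events: int    # total crash events reported so far
    clusters: int  # distinct clusters reporting this signature


def tag_for(s: Signature, promote_at: int = PROMOTE_AT_EVENTS) -> str:
    """Every signature gets an issue; the tag only records how often it was seen."""
    return PROFOUND if s.events >= promote_at else LOW_NUMBER


def retag(current_tag: str, s: Signature) -> str:
    """Promote when more reports arrive, but never demote an issue."""
    return PROFOUND if current_tag == PROFOUND else tag_for(s)
```

    With this, a signature seen once is still tracked (tagged "low
    number") and is promoted as soon as a second event is reported,
    rather than being invisible until it crosses the threshold.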

    I'm not a developer, so this might not be well thought through.

    
    Thanks a lot,

    -matt

    
    On 26.07.21 21:40, Yaarit Hatuka wrote:

    
      Hi everyone,

        
        tl;dr:  We wish to open / update issues in tracker.ceph.com for each crash
        signature received via telemetry. There are ~2.5K signatures. We
        wish to do it in a way which makes sense to developers. Please
        share your suggestions.
        

          Users who have opted-in to telemetry, and specifically its
          ‘crash’ channel, send daily anonymized information about the
          crashes that occurred within their clusters. This information
          includes the crashed daemon name, its version, the backtrace,
          the crash’s signature (a fingerprint which represents similar
          crash events), the assert function and condition (if
          applicable), etc.

          
          Our goal is to make these telemetry crash reports available
          and actionable to developers, and to be able to track their
          statuses. For this we need to have an associated Redmine issue
          for each crash signature.

          
          Currently there are ~2,500 signatures that should be tracked.
          An integration bot [1] can open / update corresponding Redmine
          issues [2, 3], but we wish not to overwhelm developers with a
          massive amount of new issues all at once.

          
          In the CLT meeting Ilya suggested having a crash count
          threshold, so we only open issues for signatures with at least
          2 crash events; or even to combine this with the number of
          clusters affected by the crash signature. Neha suggested that
          we include signatures of recent releases, regardless of the
          number of clusters affected by them.

          
          There are about 1,400 signatures with only one crash event so
          far. See [4] for breakdown by version.

          This leaves us with ~1,100 signatures (plus 61 of version
          15.2.13 and 6 of 16.2.5, plus future signatures). Should we
          handle X of them every week? For instance, open 100 new issues
          per week? Should we prioritize these by versions and number of
          clusters affected? What cadence would make the most sense,
          bug-scrub-wise?
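
          One way to picture the weekly-batch option is sketched below.
          It is only an illustration: the record fields, the sort order
          (newest version first, then most clusters affected) and the
          batch size of 100 are assumptions drawn from the questions
          above, not a decided policy.

```python
from typing import NamedTuple


class Sig(NamedTuple):
    sig: str       # crash signature (fingerprint)
    version: str   # Ceph version string, e.g. "16.2.5"
    clusters: int  # number of clusters affected by this signature


def version_key(v: str) -> tuple:
    """Turn "15.2.13" into (15, 2, 13) so versions compare numerically."""
    return tuple(int(part) for part in v.split("."))


def weekly_batches(sigs, per_week=100):
    """Order by version (newest first), then clusters affected, and cut into weekly chunks."""
    ordered = sorted(
        sigs,
        key=lambda s: (version_key(s.version), s.clusters),
        reverse=True,
    )
    return [ordered[i:i + per_week] for i in range(0, len(ordered), per_week)]
```

          At 100 issues per week, ~1,100 signatures would take roughly
          11 weeks to work through, not counting new signatures.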

          
          We will discuss this topic at our next CDM; please join.

          
          Thanks!

          Yaarit

          
          [1] https://pad.ceph.com/p/telemetry-redmine-bot

          [2] https://tracker.ceph.com/issues/51756

          [3] https://tracker.ceph.com/issues/49666

          [4] Count of signatures with a single crash event, by version:

          {15.2.8} 319
          {15.2.5} 159
          {16.2.4} 148
          {15.2.7} 146
          {15.2.9} 122
          {15.2.4} 115
          {15.2.10} 79
          {15.2.13} 61
          {15.2.11} 39
          {16.2.0} 36
          {15.2.6} 30
          {16.2.1} 28
          {15.2.3} 26
          {15.2.1} 26
          {16.2.3} 12
          {15.2.12} 9
          {15.2.0} 9
          {15.1.0} 6
          {16.2.5} 6
          {15.2.2} 5
          {15.0.0} 4
          {16.1.0} 2
          {16.0.0} 2
          {16.2.2} 1
        

      _______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx

    
    -- 
——————————————————
Matthias Muench
Principal Specialist Solution Architect
EMEA Storage Specialist
matthias.muench@xxxxxxxxxx
Phone: +49-160-92654111

Red Hat GmbH
Werner-von-Siemens-Ring 14
85630 Grasbrunn
Germany
_______________________________________________________________________
Red Hat GmbH, http://www.de.redhat.com · Registered seat: Grasbrunn, Commercial register: Amtsgericht Muenchen
HRB 153243 · Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill
  
