Re: mutable health warnings

On Thu, 13 Jun 2019, Neha Ojha wrote:
> Hi everyone,
> 
> There has been some interest in a feature that helps users to mute
> health warnings. There is a trello card[1] associated with it and
> we've had some discussion[2] in the past in a CDM about it. In
> general, we want to understand a few things:
> 
> 1. what is the level of interest in this feature
> 2. for how long should we mute these warnings - should the period be
> decided by us or the user
> 3. possible misuse of this feature and negative impacts of muting some warnings
> 
> Let us know what you think.
> 
> [1] https://trello.com/c/vINMkfTf/358-mute-health-warnings
> [2] https://pad.ceph.com/p/cephalocon-usability-brainstorming

What if we start with something like:

- a 'mute' targets a specific warning code (e.g., OSD_DOWN)
  e.g., 'ceph health mute OSD_DOWN'
- the mute matches the alert code and the short description (e.g., "2 osds 
  down")
  - this could be more specific, like matching the detail items too
  - or, it could be less specific, so that e.g., an OSD_DOWN going from 2 
    to 1 osd won't unmute
  - or, individual detail items could be the things that get muted
  -> we might need to make alerts include more structured fields (besides 
     a summary string and vector<string> of details) in order to make this 
     work perfectly... but we can start simple (with just the 
     summary string match?).
- the mute goes away if
  - the description changes
  - the alert resolves
  - the TTL/expiration time is reached
  - the user unmutes (a specific mute with 'ceph health unmute <code>', or 
    all mutes with 'ceph health unmute')
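
To make the "mute goes away if" rules concrete, here is a rough Python
sketch of the simple variant (code + summary string match). It is only an
illustration, not actual Ceph code: the Mute class, prune_mutes(), and the
dict-based representation of active alerts are all hypothetical names made
up for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Mute:
    """Hypothetical record of one mute (not a real Ceph structure)."""
    code: str                           # alert code, e.g. "OSD_DOWN"
    summary: str                        # summary string at the time of muting
    expires_at: Optional[float] = None  # absolute expiry time; None = no TTL


def prune_mutes(mutes: Dict[str, Mute],
                active: Dict[str, str],
                now: float) -> Dict[str, Mute]:
    """Return only the mutes that should survive.

    mutes:  alert code -> Mute
    active: alert code -> current summary string of each raised alert
    now:    current time, in the same units as Mute.expires_at
    """
    kept = {}
    for code, mute in mutes.items():
        if code not in active:
            continue  # the alert resolved
        if active[code] != mute.summary:
            continue  # the description changed
        if mute.expires_at is not None and now >= mute.expires_at:
            continue  # the TTL/expiration time was reached
        kept[code] = mute
    return kept
```

An explicit unmute would simply delete the entry (or all entries) from the
mutes dict. Note that with an exact summary match, "2 osds down" changing
to "1 osds down" drops the mute, which is the behavior the less specific
matching option above would avoid.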

- 'ceph -s' will say HEALTH_OK (if all alerts are muted), but *also* say 
  how many muted alerts there are, e.g.

  cluster:
    id:     28f7427e-5558-4ffd-ae1a-51ec3042759a
    health: HEALTH_OK
            2 muted alerts: OSD_DOWN, TOO_MANY_PGS

  services:
    ...

- 'ceph health' will say HEALTH_OK (if all alerts are muted)
- 'ceph health detail' will say HEALTH_OK (if all alerts are muted), but 
  will *also* show all of the muted alerts in a separate section (along 
  with the mute TTL/expiration)
- the dashboard would show HEALTH_OK, plus some clear visual 
  indication that there are one or more mutes, with an easy UI to 
  mute/unmute
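
The reporting rules above could be sketched roughly as follows, again in
Python and again only as an illustration. overall_status() and
health_lines() are hypothetical helpers, and representing alerts as a
code -> severity dict is an assumption made for this example.

```python
from typing import Dict, List, Set, Tuple

# Severities from least to most serious.
SEVERITY_ORDER = ["HEALTH_OK", "HEALTH_WARN", "HEALTH_ERR"]


def overall_status(alerts: Dict[str, str],
                   muted: Set[str]) -> Tuple[str, List[str]]:
    """Compute the reported status while ignoring muted alerts.

    alerts: alert code -> severity of each currently raised alert
    muted:  set of muted alert codes
    Returns (status, sorted list of raised alerts that are muted).
    """
    unmuted = [sev for code, sev in alerts.items() if code not in muted]
    # The worst unmuted severity wins; HEALTH_OK if nothing is left.
    status = max(unmuted, key=SEVERITY_ORDER.index, default="HEALTH_OK")
    muted_active = sorted(code for code in alerts if code in muted)
    return status, muted_active


def health_lines(alerts: Dict[str, str], muted: Set[str]) -> List[str]:
    """Build the 'health:' lines in the style of the 'ceph -s' example."""
    status, muted_active = overall_status(alerts, muted)
    lines = [f"health: {status}"]
    if muted_active:
        lines.append(
            f"{len(muted_active)} muted alerts: {', '.join(muted_active)}")
    return lines
```

With OSD_DOWN and TOO_MANY_PGS both raised and both muted, health_lines()
gives "health: HEALTH_OK" followed by "2 muted alerts: OSD_DOWN,
TOO_MANY_PGS". If only one of them is muted, the status stays at the
severity of the other.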

sage