Re: Genart last call review of draft-ietf-ccamp-alarm-module-07

Dan Romascanu <dromasca@xxxxxxxxx> · Tue, 19 Mar 2019 20:54:08 +0200

Hi Stefan, 

Thank you for your answer and for addressing my concerns. I am comfortable with your proposals. If your AD agrees, I would include these in a revised version before submission to the IESG for approval.

Regards,

Dan

On Tue, Mar 19, 2019 at 5:11 PM stefan vallin <stefan@xxxxxxxxx> wrote:
Hi Dan!

Thanks for your review, an honour to have RFC 3877 in the loop :)

See inline

br Stefan

> 

> 

> Major issues:

> 

> 1. The definition of Alarm is key for the whole model. It reads like this:

> Alarm (the general concept): An alarm signifies an undesirable state in a

> resource that requires corrective action.

> 

> However, RFC 3877 already defined a number of concepts including:

>  Error

>      A deviation of a system from normal operation.

> 

>   Fault

>      Lasting error or warning condition.

> 

>   ....

> 

>   Alarm

>      Persistent indication of a fault.

> 

> I believe that there is a need to show why the model defined by RFC 3877 needs

> to be changed, and why the difference that RFC 3877 was making between a Fault

> and an Alarm is no longer needed.

Good comment, you are right, and we need to keep the distinction between fault and alarm.

That distinction is used in X.733, 3GPP IRP and others. The general pattern is that “fault”

refers to what is really broken, and the alarm is the manifestation of that underlying cause.

There is not a simple 1-1 relationship between a fault and an alarm:

* 1 fault may have many alarms due to limited root cause capabilities of the system

* There might be no underlying fault to an alarm, consider a non-optimal QoS configuration 

  which gives bad quality in VOIP calls. Certainly a MOS alarm from the VOIP probe, but there

  is no “fault” as such (if you do not consider a non-optimal config as a fault)
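
A minimal sketch of this non-1-1 relationship, for illustration only. The class and field names are hypothetical and not taken from the draft, and the "fibre cut" fault is an invented example; only the MOS alarm case comes from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Fault:
    """Underlying cause of an undesired behaviour (hypothetical type)."""
    description: str


@dataclass
class Alarm:
    """Manifestation reported by the system; the fault may be absent."""
    resource: str
    alarm_type: str
    fault: Optional[Fault] = None


def alarms_for(fault: Fault, alarms: List[Alarm]) -> List[Alarm]:
    """Return all alarms that are manifestations of the given fault."""
    return [a for a in alarms if a.fault is fault]


# One fault, several alarms (limited root-cause capabilities).
fibre_cut = Fault("fibre cut")
alarms = [
    Alarm("if-1", "link-down", fibre_cut),
    Alarm("if-2", "link-down", fibre_cut),
    # An alarm with no underlying fault: non-optimal QoS configuration.
    Alarm("voip-probe-1", "low-mos"),
]
```

The optional `fault` reference is one way to express that an alarm can exist without anything being broken.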

For comparison, X.733 and 3GPP define fault as follows:

X.733 fault: The physical or algorithmic cause of a malfunction

3GPP fault: a deviation of a system from normal operation, which may result in the loss of operational capabilities of the element or the loss of redundancy in case of a redundant configuration

I suggest we add the following to terminology:

Fault: the underlying cause of an undesired behaviour

If we then turn to the term “alarm”, I have added two aspects to its definition:

An alarm signifies an *undesirable state* in a resource that *requires corrective action*.

Mostly based on the alarm standardization work in the process industry (see draft references).

1) Rather than “deviation from normal”, we say “undesirable”, a subtle difference.

  In IT environments it is easier to define what is normal, for example a normal load on a web server.

  And any deviation from that normal load could be an alarm.

  In networking, things are more dynamic, and deviation from normal might be the desired state.

  So the definition stresses the fact that it is an undesired state, not just deviation from normal.

2) Adding the requirement that an alarm per definition should require an action. This is a sound

  requirement that constrains what qualifies as an alarm and limits the number of alarms.

  (See for example the EEMUA and ISA182 references in the draft). The 3GPP Alarm standard

  also added this to its definition in later revisions to address the alarm overload problem.

> Also, RFC 3877 defined in Section 3 a

> Framework and an Architecture that was consistent with X.733. This document has

> no such section, and while acknowledging the need for a mapping to X.733 it

> states as a goal:

> Mapping to X.733, which is a requirement for some alarm systems. Still, keep

> some of the X.733 concepts out of the core model in order to make the model

> small and easy to understand

> 

> More details about what is left out and why these are not needed would help.

The alarm YANG model does not *require* the X.733 parameter

definitions of, for example, probable-cause enum values. Today, most networking devices

and management systems do not rely on those enumerations.

Those are defined in the X.733 augmentation module in order to keep the core model as

small and useful as possible. X.733 requirements come more often from telecom environments.

> 

> Minor issues:

> 

> 1. Section 2 makes a statement that includes

> ... While IETF has not really addressed alarm management

> 

> This is actually not accurate. RFC 3877 addressed Alarm Management. Maybe

> there is a need to revise that approach, but this should be done explicitly,

> not by stating that it did not exist.

Correct, bad wording.

OLD TEXT:

Address alarm usability requirements, see Appendix G.  While IETF

      has not really addressed alarm management, telecom standards has

      addressed it purely from a protocol perspective.  The process

      industry has published several relevant standards addressing

      requirements for a useful alarm interface; [EEMUA], [ISA182].

      This alarm module defines usability requirements as well as a YANG

      data model.

SUGGESTION:

Address alarm usability requirements, see Appendix G.  While IETF

      and telecom standards have addressed alarms mostly from a 

      protocol perspective, the process industry has published 

      several relevant standards addressing requirements for a useful 

      alarm interface; [EEMUA], [ISA182].

      This alarm module defines usability requirements as well as a YANG

      data model.

> 

> 2. Section 3.5:

> Closing an alarm implies that the operator considers the corrective action

> performed.

> 

> Is this always true? The undesirable state may have been cancelled by some

> other event than corrective action, for example the resource is no longer used,

> or the time elapsed may have made the undesirable state irrelevant.

I think it is important to keep the two perspectives in mind. An operator closing an

alarm is only a flag from the operations team that the alarm does not need an action.

It might be cleared or not cleared by the system.

So in your first example, the alarm is probably cleared by the instrumentation, 

correlating “the other event”.

If the resource is no longer used, a shelf should be created.

If time has passed, it depends, ….

> 

> 3. In section 3.5.1:

> Alarms are not cleared by operators, only the underlying instrumentation can

> clear an alarm.  Operators can close alarms.

> 

> So, the document makes a distinction between clearing an alarm and closing an

> alarm. It may be good to define the two concepts to make the distinction clear.

Good point!

Suggested terminology additions:

* Cleared alarm: a cleared alarm is an alarm where the system/server considers the

undesired state to be cleared. Operators cannot clear alarms; clearance is managed

by the system. A linkUp notification can be considered a clear condition for a linkDown state.

* Closed alarm: operators can close alarms irrespective of the alarm being cleared or not.

A closed alarm indicates that the alarm does not need attention, either because the corrective

action has been taken or because it can be ignored for other reasons.
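
As an illustration, the two definitions above could be sketched as two independent flags. This is a minimal sketch with hypothetical names (`is_cleared`, `is_closed`, `system_clear`, `operator_close`), not the draft's YANG definitions.

```python
from dataclasses import dataclass


@dataclass
class Alarm:
    resource: str
    alarm_type: str
    is_cleared: bool = False  # set only by the system/instrumentation
    is_closed: bool = False   # set only by operators


def system_clear(alarm: Alarm) -> None:
    """The system considers the undesired state cleared (e.g. on linkUp)."""
    alarm.is_cleared = True


def operator_close(alarm: Alarm) -> None:
    """An operator closes the alarm, whether or not it is cleared."""
    alarm.is_closed = True


def needs_attention(alarm: Alarm) -> bool:
    """A closed alarm does not need attention, regardless of clearance."""
    return not alarm.is_closed


link_down = Alarm("if-1", "link-down")
operator_close(link_down)   # closed by the operator, still not cleared
```

Keeping the flags separate means an alarm can be closed but not cleared, or cleared but not yet closed, which matches the two perspectives described above.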

> 

> 4. Appendix F.1:

> The alarm MIB is state oriented rather than notification oriented, an alarm

> is a "lasting condition", not a discrete notification reporting about a

> condition state change.

Good catch, will rephrase. The alarm MIB and the alarm YANG model both have a stateful view

of alarms, not notification-focused.

Suggested change:

OLD

RFC 3877 defines alarm referring back to "a deviation from normal operation". This is

problematic, since this might not require an  operator action. The alarm MIB is state 

oriented rather than notification oriented,  an alarm is a "lasting  condition", not a 

discrete notification reporting about a condition state change.

NEW:

RFC 3877 defines alarm referring back to "a deviation from normal operation". The Alarm YANG

model adds the requirement that it should require a corrective action and should be undesired,

not only a deviation from normal. The alarm MIB is state oriented in the same way as the Alarm YANG;

it focuses on the "lasting condition", not the individual notifications.

> 

> I am not sure that I understand this comment. Alarm states are defined also in

> this document, and Alarms as defined here are also different than ' a discrete

> notification reporting about a condition state change'. So, what does this

> comment really try to say?

> 

> Nits/editorial comments:

> 

>