Re: [Last-Call] Secdir last call review of draft-ietf-opsawg-service-assurance-architecture-11

Jean Quilbeuf <jean.quilbeuf=40huawei.com@xxxxxxxxxxxxxx> · Wed, 23 Nov 2022 16:19:20 +0000

Hi Christian,
Thanks for your review. We tried to address the comments, see our answers inline.

The diff of the whole draft is here: https://www.ietf.org/rfcdiff?url2=draft-ietf-opsawg-service-assurance-architecture-12 

Best,
Jean

> -----Original Message-----
> From: Christian Huitema via Datatracker [mailto:noreply@xxxxxxxx]
> Sent: Sunday 20 November 2022 23:47
> To: secdir@xxxxxxxx
> Cc: draft-ietf-opsawg-service-assurance-architecture.all@xxxxxxxx; last-
> call@xxxxxxxx; opsawg@xxxxxxxx
> Subject: Secdir last call review of draft-ietf-opsawg-service-assurance-
> architecture-11
> 
> Reviewer: Christian Huitema
> Review result: Has Nits
> 
> I have reviewed this document as part of the security directorate's ongoing
> effort to review all IETF documents being processed by the IESG. These
> comments were written primarily for the benefit of the security area
> directors. Document editors and WG chairs should treat these comments just
> like any other last call comments.
> 
> This document proposes an architecture implementing Service Assurance for
> Intent-Based Networking (SAIN). The architecture defines a "service
> assurance graph", which is decomposed in components. The graph is a
> directed graph, in which the root is the service to assure, and edges lead to
> the components or subservices on which a service or a component depends.
> The stated goal is to efficiently verify whether a service is working as
> intended by following the graph and examining the state of each
> dependency. The graph is not guaranteed to be free of cycles or "circular
> dependencies", which the document proposes to manage by promoting
> each cycle to a virtual component, and replacing edges between cycle
> components by edges starting at the virtual component. The document
> defines operation on the graph, maintenance of component states, and how
> to mark components as unavailable during maintenance. The operations
> assume that components have synchronized clocks.
> 
> Writing security considerations for an architecture like this is challenging,
> because the architecture itself is rather abstract. The figure 1 describes
> multiple SAIN agents each managing components and collecting metrics,
> obtaining configuration data from a SAIN orchestrator, feeding health status
> to a SAIN collector, with the collector providing data to the Service
> orchestrator, and the service orchestrator interacting with the SAIN
> orchestrator and with the network itself. In theory, each of the edges of the
> graph in figure 1 could be subject to attacks, such as denial of service,
> spoofing, etc. For example, network components could deliver incorrect
> metrics to the SAIN agents, the SAIN agents could report incorrect statuses,
> the configurations managed by the orchestrator could be wrong, the
> communication lines between components may be severed, etc. All these
> potential threats have different possible consequences.
>  At this level of abstraction, the recommendations will have to be high level,
> but they should provide enough guidance for the developers of the various
> modules.
> 
> The security consideration section of this document makes a series of
> recommendations:
> 
> * securing the various SAIN agents, because a compromised agent could
> inject false information in the system.
> * using SSH or TLS when updating the configuration of devices.
> * balancing the risk of exposing too much configuration information and
> enabling third parties to understand and "efficiently attack" the system,
> versus not exposing enough and being unable to address some issues.
> * acknowledging that "a lying device or compromised agent could trigger
> partial reconfiguration of the service or network".
> 
> On the first point, the document says that "the SAIN agents must be
> secured", but does not say how. It would be nice if this was developed.
> 

Restricted to YANG and referred to the companion draft for more detail.

> On the second point, mentioning SSH or TLS is nice but very generic. What
> kind of credentials should SAIN agents provide or check? What kind of
> permissions should they be granted?

Added
 " Devices should be configured so that agents have their own credentials with write access only for the YANG nodes configuring the telemetry."

> 
> The third point is a recurring issue with automation of management,
> diagnostic, etc. Management is easier if there is enough data available to
> describe and understand a whole system, but the same data could be used
> by attackers to understand how to efficiently sabotage that system. There
> are various kind of plausible mitigations. For example, it could be argued that
> some data is already public, available for example in user manuals of network
> components, and that codifying it will improve management without
> increasing the attack surface. But that's not always the case, and there are
> other cases in which fully exposing configuration details will definitely
> facilitate attacks. There may be other mitigations, such as access control on
> configuration data. It would be very nice if the architecture document
> provided clear guidance for future deployments.
> 

Added a paragraph about configuration from the service orchestrator and tried to give guidelines.

> The fourth point boils down to throwing in the towel, as in "[if devices lie] The
> SAIN architecture neither augments nor reduces this risk." The service
> assurance, at a minimum, could detect anomalies, as in "service X depends
> on devices Y and Z; the service X is not functional, yet Y and Z both report
> correct behavior; hence, one or several of those devices may be in a bad
> state." This may well be some form of future work, but flagging the issue
> would be useful.

Added: 
          A potential improvement is to use the SAIN architecture to detect discrepancies between symptoms reported by different agents and thus detect anomalies if an agent or a device is lying.
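
To illustrate the kind of discrepancy check mentioned above, here is a minimal sketch. It is not taken from the draft: the function name, the data layout, and the node names (service-X, device-Y, device-Z, borrowed from the reviewer's example) are all hypothetical, and health is reduced to a boolean for simplicity.

```python
from typing import Dict, List


def find_suspect_dependencies(
    dependencies: Dict[str, List[str]],
    healthy: Dict[str, bool],
) -> Dict[str, List[str]]:
    """Map each unhealthy node to its dependencies when all of them report healthy.

    Such a node is inconsistent with its dependencies: either the node has a
    fault of its own, or at least one dependency is misreporting its state.
    """
    suspects: Dict[str, List[str]] = {}
    for node, deps in dependencies.items():
        # Only nodes that actually have dependencies can reveal a discrepancy.
        if deps and not healthy[node] and all(healthy[d] for d in deps):
            suspects[node] = list(deps)
    return suspects


# Hypothetical assurance graph: service X depends on devices Y and Z.
dependencies = {
    "service-X": ["device-Y", "device-Z"],
    "device-Y": [],
    "device-Z": [],
}
# X is not functional, yet Y and Z both report correct behavior.
healthy = {"service-X": False, "device-Y": True, "device-Z": True}

suspects = find_suspect_dependencies(dependencies, healthy)
```

Here `suspects` flags device-Y and device-Z as candidates for closer inspection. It does not prove that either one is lying.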

> 
> Reading the document, I found other issues that might affect security of
> operation. The operation requires receiving streams of metric values, or
> repeated polling for these values. What happens if DOS attacks slow down or
> prevent the arrival of metric data? Section 3 mentions that "The SAIN
> architecture requires time synchronization, with Network Time Protocol
> (NTP) [RFC5905] as a candidate, between all elements". What happens if the
> network time service is compromised?
> 

Added:
  If NTP service goes down, the devices' clocks might lose their synchronization.
          In that case, correlating information from different devices, such as symptoms about a link or correlation of symptoms from different devices, will give inaccurate results.

> Finally, a consideration based on experience with the Windows Diagnostic
> system, which was similarly using graphs of dependencies to answer
> questions like "why is my Wi-Fi not connecting" or "why can I not read this
> web site?"
> The system would conduct series of tests based on dependency analysis,
> very much as what is envisaged here. It was an improvement over the
> previous state of error diagnostic, but it was not perfect. Such systems can
> fail in frustrating ways if part of the automation is missing, when some tests
> are not available, when some metric data cannot be collected, or when the
> description of dependencies is incomplete. They can also become very slow
> if the description of dependencies is too extensive, leading to too many tests
> lasting too long.
> The dependency graph needs to be curated over time, and that curation
> probably should be described in the architecture.
> 
> 
> 
Subservices are independent and can be executed in parallel, so a long list of dependencies does not necessarily mean a long time.
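
A rough sketch of that point, under stated assumptions: `check_subservice` is a hypothetical stand-in for a real probe (it only sleeps), and a thread pool is just one possible way to run independent checks concurrently. With this setup the total time is close to the slowest check rather than the sum of all checks.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Tuple


def check_subservice(name: str, delay: float) -> Tuple[str, bool]:
    """Hypothetical probe: sleeps to simulate work, then reports healthy."""
    time.sleep(delay)
    return name, True


def assess_in_parallel(subservices: Dict[str, float]) -> Dict[str, bool]:
    """Run one check per subservice concurrently and collect the results."""
    # One worker per subservice; max(1, ...) keeps an empty input valid.
    with ThreadPoolExecutor(max_workers=max(1, len(subservices))) as pool:
        futures = [
            pool.submit(check_subservice, name, delay)
            for name, delay in subservices.items()
        ]
        return dict(f.result() for f in futures)


# Three hypothetical subservices, each taking 0.1 s to check.
subservices = {"interface": 0.1, "tunnel": 0.1, "device": 0.1}

start = time.monotonic()
results = assess_in_parallel(subservices)
elapsed = time.monotonic() - start
```

Run sequentially, these three checks would take about 0.3 s. Run concurrently, `elapsed` should be close to 0.1 s.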

However, I retained the idea of curation and added at the end of section 3.2:

The assurance graph, or more precisely the subservices and
   dependencies that a SAIN orchestrator can instantiate, should be
   curated.  The organization of such a process is out-of-scope for this
   document and should aim to:

   o  Ensure that existing subservices are reused as much as possible.

   o  Avoid circular dependencies.
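
As an illustration of the second goal, a curation step could check a set of subservice dependencies for cycles before they are made available. The following is only a sketch using a standard depth-first search. The function name and the dictionary-based graph layout are assumptions, not something the draft defines.

```python
from typing import Dict, List, Optional


def find_cycle(dependencies: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of nodes, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        path.append(node)
        for dep in dependencies.get(node, []):
            state = color.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path, so we closed a cycle.
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in dependencies:
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical subservice dependencies.
acyclic = {"service": ["tunnel"], "tunnel": ["interface"], "interface": []}
cyclic = {"a": ["b"], "b": ["c"], "c": ["a"]}

acyclic_result = find_cycle(acyclic)
cyclic_result = find_cycle(cyclic)
```

For the acyclic example the function returns `None`. For the cyclic one it returns the offending path, which a curator could use to decide which dependency to remove.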

-- 
last-call mailing list
last-call@xxxxxxxx
https://www.ietf.org/mailman/listinfo/last-call