Hi Christian,

Thanks for your review. We tried to address the comments; see our answers inline.

The diff of the whole draft is here:
https://www.ietf.org/rfcdiff?url2=draft-ietf-opsawg-service-assurance-architecture-12

Best,
Jean

> -----Original Message-----
> From: Christian Huitema via Datatracker [mailto:noreply@xxxxxxxx]
> Sent: Sunday 20 November 2022 23:47
> To: secdir@xxxxxxxx
> Cc: draft-ietf-opsawg-service-assurance-architecture.all@xxxxxxxx;
> last-call@xxxxxxxx; opsawg@xxxxxxxx
> Subject: Secdir last call review of
> draft-ietf-opsawg-service-assurance-architecture-11
>
> Reviewer: Christian Huitema
> Review result: Has Nits
>
> I have reviewed this document as part of the security directorate's
> ongoing effort to review all IETF documents being processed by the IESG.
> These comments were written primarily for the benefit of the security
> area directors. Document editors and WG chairs should treat these
> comments just like any other last call comments.
>
> This document proposes an architecture implementing Service Assurance
> for Intent-Based Networking (SAIN). The architecture defines a "service
> assurance graph", which is decomposed into components. The graph is a
> directed graph, in which the root is the service to assure, and edges
> lead to the components or subservices on which a service or a component
> depends. The stated goal is to efficiently verify whether a service is
> working as intended by following the graph and examining the state of
> each dependency. The graph is not guaranteed to be free of cycles or
> "circular dependencies", which the document proposes to manage by
> promoting each cycle to a virtual component, and replacing edges between
> cycle components by edges starting at the virtual component. The
> document defines operations on the graph, maintenance of component
> states, and how to mark components as unavailable during maintenance.
> The operations assume that components have synchronized clocks.
>
> Writing security considerations for an architecture like this is
> challenging, because the architecture itself is rather abstract.
> Figure 1 describes multiple SAIN agents each managing components and
> collecting metrics, obtaining configuration data from a SAIN
> orchestrator, feeding health status to a SAIN collector, with the
> collector providing data to the service orchestrator, and the service
> orchestrator interacting with the SAIN orchestrator and with the network
> itself. In theory, each of the edges of the graph in figure 1 could be
> subject to attacks, such as denial of service, spoofing, etc. For
> example, network components could deliver incorrect metrics to the SAIN
> agents, the SAIN agents could report incorrect statuses, the
> configurations managed by the orchestrator could be wrong, the
> communication lines between components may be severed, etc. All these
> potential threats have different possible consequences.
> At this level of abstraction, the recommendations will have to be high
> level, but they should provide enough guidance for the developers of the
> various modules.
>
> The security consideration section of this document makes a series of
> recommendations:
>
> * securing the various SAIN agents, because a compromised agent could
>   inject false information in the system.
> * using SSH or TLS when updating the configuration of devices.
> * balance the risk of exposing too much configuration information and
>   enabling third parties to understand and "efficiently attack" the
>   system, versus not exposing enough and being unable to address some
>   issues.
> * acknowledge that "a lying device or compromised agent could trigger
>   partial reconfiguration of the service or network".
>
> On the first point, the document says that "the SAIN agents must be
> secured", but does not say how. It would be nice if this was developed.

Restricted to YANG and referred to the companion draft for more detail.

> On the second point, mentioning SSH or TLS is nice but very generic.
> What kind of credentials should SAIN agents provide or check? What kind
> of permissions should they be granted?

Added: "Devices should be configured so that agents have their own credentials with write access only for the YANG nodes configuring the telemetry."

>
> The third point is a recurring issue with automation of management,
> diagnostic, etc. Management is easier if there is enough data available
> to describe and understand a whole system, but the same data could be
> used by attackers to understand how to efficiently sabotage that system.
> There are various kinds of plausible mitigations. For example, it could
> be argued that some data is already public, available for example in
> user manuals of network components, and that codifying it will improve
> management without increasing the attack surface. But that's not always
> the case, and there are other cases in which fully exposing
> configuration details will definitely facilitate attacks. There may be
> other mitigations, such as access control on configuration data. It
> would be very nice if the architecture document provided clear guidance
> for future deployments.
>

Added a paragraph about configuration from the service orchestrator, and tried to give guidelines.

> The fourth point boils down to throwing in the towel, as in "[if devices
> lie] The SAIN architecture neither augments nor reduces this risk." The
> service assurance, at a minimum, could detect anomalies, as in "service
> X depends on devices Y and Z; the service X is not functional, yet Y and
> Z both report correct behavior; hence, one or several of those devices
> may be in a bad state." This may well be some form of future work, but
> flagging the issue would be useful.

Added: A potential improvement is to use the SAIN architecture to detect discrepancies between symptoms reported by different agents and thus detect anomalies if an agent or a device is lying.
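
To illustrate the kind of discrepancy check mentioned above, here is a minimal Python sketch. It is not part of the draft: the function name `find_discrepancies`, the dictionary-based graph, and the boolean health values are all assumptions made for this example, and a real SAIN deployment would work on health scores and symptoms rather than plain booleans. The sketch flags any node that is reported unhealthy while every one of its dependencies reports healthy, which is the "X is not functional, yet Y and Z both report correct behavior" situation from the review.

```python
from typing import Dict, List


def find_discrepancies(
    dependencies: Dict[str, List[str]],
    healthy: Dict[str, bool],
) -> Dict[str, List[str]]:
    """Return unhealthy nodes whose dependencies all report healthy.

    Hypothetical helper for illustration only. `dependencies` maps a node
    to the nodes it depends on; `healthy` maps a node to its reported state.
    """
    suspects: Dict[str, List[str]] = {}
    for node, deps in dependencies.items():
        if not deps:
            continue  # leaf nodes have nothing to cross-check against
        if healthy.get(node, True):
            continue  # node is fine (or has no report): nothing to flag
        # A dependency without a report does not count as "reports healthy".
        if all(healthy.get(dep, False) for dep in deps):
            suspects[node] = list(deps)
    return suspects


# Example mirroring the review: X depends on Y and Z.
dependencies = {
    "service-x": ["device-y", "device-z"],
    "device-y": [],
    "device-z": [],
}
healthy = {"service-x": False, "device-y": True, "device-z": True}

print(find_discrepancies(dependencies, healthy))
# prints {'service-x': ['device-y', 'device-z']}
```

A flagged entry only says that the reports are inconsistent. It does not say which of the listed dependencies, if any, is misreporting, and the service itself may also be degraded for a reason the graph does not capture.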

>
> Reading the document, I found other issues that might affect security of
> operation. The operation requires receiving streams of metric values, or
> repeated polling for these values. What happens if DoS attacks slow down
> or prevent the arrival of metric data? Section 3 mentions that "The SAIN
> architecture requires time synchronization, with Network Time Protocol
> (NTP) [RFC5905] as a candidate, between all elements". What happens if
> the network time service is compromised?
>

Added: If NTP service goes down, the devices' clocks might lose their synchronization. In that case, correlating information from different devices, such as symptoms about a link or correlation of symptoms from different devices, will give inaccurate results.

> Finally, a consideration based on experience with the Windows Diagnostic
> system, which was similarly using graphs of dependencies to answer
> questions like "why is my Wi-Fi not connecting" or "why can I not read
> this web site?"
> The system would conduct series of tests based on dependency analysis,
> very much as what is envisaged here. It was an improvement over the
> previous state of error diagnostic, but it was not perfect. Such systems
> can fail in frustrating ways if part of the automation is missing, when
> some tests are not available, when some metric data cannot be collected,
> or when the description of dependencies is incomplete. They can also
> become very slow if the description of dependencies is too extensive,
> leading to too many tests lasting too long.
> The dependency graph needs to be curated over time, and that curation
> probably should be described in the architecture.
>
>
>

Subservices are independent and can be executed in parallel, so a long list of dependencies does not necessarily mean a long evaluation time.
However, I retained the idea of curation and added at the end of section 3.2:

   The assurance graph, or more precisely the subservices and
   dependencies that a SAIN orchestrator can instantiate, should be
   curated. The organization of such a process is out-of-scope for
   this document and should aim to:

   o  Ensure that existing subservices are reused as much as possible.

   o  Avoid circular dependencies.

-- 
last-call mailing list
last-call@xxxxxxxx
https://www.ietf.org/mailman/listinfo/last-call