Dear Christopher,

Thank you for your questions and detailed comments. We are working on a new version to address all the inputs received during the Last Call. Please find my answers inline tagged as [GF].

Regards,

Giuseppe

-----Original Message-----
From: Christopher Wood via Datatracker <noreply@xxxxxxxx>
Sent: Wednesday, June 16, 2021 1:16 AM
To: secdir@xxxxxxxx
Cc: draft-ietf-6man-ipv6-alt-mark.all@xxxxxxxx; ipv6@xxxxxxxx; last-call@xxxxxxxx
Subject: Secdir last call review of draft-ietf-6man-ipv6-alt-mark-06

Reviewer: Christopher Wood
Review result: Has Issues

General comments:

I don't quite understand the need for this mechanism -- why would one use these markings instead of transport-layer signals a la ECN? -- so I've constrained my comments to the mechanical details.

My only high-level comment pertains to the threat model and value of these metrics. In particular, it's not clear to me how an operator would distinguish actual operational problems causing loss or delay from an attacker that's modifying marking flags to give the appearance of loss or delay. In untrusted domains, how are these markings expected to be used reliably? (I guess I just don't understand the threat model well enough, and I couldn't glean it from the security considerations.)

[GF]: The Alternate-Marking methodology, described in RFC 8321 and RFC 8889, is different from ECN. It is an on-path telemetry technique and permits very detailed packet loss, delay and delay variation measurements, both hop-by-hop and end-to-end. So you can get much more information than the end-to-end notification of network congestion. The technique consists of synchronizing the measurements at different points of a network by switching the value of a marking bit, thereby dividing the packet flow into batches. Each batch represents a measurable entity unambiguously recognizable by all network nodes along the path.
By counting the number of packets in each batch and comparing the values measured by different nodes, it is possible to measure packet loss precisely. In a similar way, the alternation of the values of the marking bits can be used as a time reference to calculate the delay and delay variation. The value for an operator is the ability to locate issues in the network exactly.

Regarding the threat model, the possibility of an attacker modifying the flags to give the appearance of loss or delay is an issue common to all on-path telemetry techniques (e.g. In-situ OAM). The only definitive solution is that this methodology MUST be applied in a controlled domain, as also mentioned in RFC 8799. In addition, application to untrusted domains is NOT RECOMMENDED. We will highlight this strong requirement in the next version.

Specific comments:

Section 2.

o In case of Hop-by-Hop Option Header carrying Alternate Marking bits, it is not inserted or deleted, but can be read by any node along the path. The intermediate nodes may be configured to support this Option or not and the measurement can be done only for the nodes configured to read the Option. Anyway this should not affect the traffic throughput on nodes that do not recognize the Option, as further discussed in Section 4.

A couple questions come to mind when reading this. In no particular order:

- What stops a hop along the path from inserting or deleting these markings? What is affected if that happens?

[GF]: The source node is the only one that writes the Option Header to alternately mark the flow (for both Hop-by-Hop and Destination Option). The intermediate nodes and the destination node must only read the marking values of the Option without modifying the Option Header. Of course, an attacker can modify, insert or delete these markings, and if that happens it affects the results of the measurements, causing, for example, an intervention where it is not necessary or vice versa.
In my opinion, an attacker that can modify packets may have additional malicious purposes that are more harmful than affecting only the performance results. In any case, the requirement of the controlled domain mitigates this kind of attack. I will include more details on this in the next version.

- Does it affect throughput on nodes that _do_ recognize the option?

[GF]: In theory, it should not affect the throughput. But, of course, there is a difference between the theory and the implementation and, in the draft, we also highlighted that packets with Hop-by-Hop Options can be forced onto the slow path. This is a general issue, and in V6OPS and 6MAN there are drafts trying to address this problem (e.g. draft-peng-v6ops-hbh, draft-hinden-6man-hbh-processing,...)

While the threat model (monitoring within a controlled domain) seems to rule out these issues, the implications of alterations, even if accidental, seem worth elaborating upon.

[GF]: Agreed, we can add more considerations on that in the security section.

Flow Label and FlowMonID within the same packet have different scope, identify different flows, and are intended for different use cases.

Is the set of packets defined by a FlowMonID a subset of those defined by a Flow Label, do they have some overlap, or are they completely disjoint? (Writing out the relationship in more detail might help clarify why a new label is indeed needed for non-experts.) It seems like a shame to redefine yet another flow field.

[GF]: Yes, Flow Label and FlowMonID are totally disjoint. Indeed, the FlowMonID also enables a finer granularity for the flow definition, while Flow Label is used for ECMP. We can explain by adding some examples.

As a nit, given the relation to and possible confusion with Flow Label, perhaps we could rename FlowMonID to something like TraceID?

[GF]: Good point. This can be something to consider.
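For readers following the thread, the batch-counting loss measurement described in the first [GF] answer can be illustrated with a minimal sketch. This is not taken from the draft or from RFC 8321: the function names are hypothetical, the source is assumed to flip the L bit after a fixed number of packets, and the counting assumes in-order delivery with no batch lost in its entirety (real deployments would rely on the timing considerations in RFC 8321 instead).

```python
def mark_l_bits(num_packets, batch_size):
    """Source node (hypothetical): L bit per packet, flipped every batch_size packets."""
    return [(seq // batch_size) % 2 for seq in range(num_packets)]


def count_per_batch(l_bits):
    """Measurement node: count packets in each run of identical L-bit values.

    Assumes packets arrive in order and that no batch is lost completely,
    so every change of the L bit marks the start of a new batch.
    """
    counts = []
    previous = None
    for bit in l_bits:
        if bit != previous:
            counts.append(0)  # L bit changed: a new batch starts
            previous = bit
        counts[-1] += 1
    return counts


def batch_loss(upstream_counts, downstream_counts):
    """Per-batch loss between two nodes: difference of their batch counters."""
    return [up - down for up, down in zip(upstream_counts, downstream_counts)]


# Example: 12 packets in batches of 4; packets 5 and 9 are dropped in transit.
sent = mark_l_bits(12, 4)
received = [bit for seq, bit in enumerate(sent) if seq not in {5, 9}]
print(batch_loss(count_per_batch(sent), count_per_batch(received)))
```

Only the source writes the L bit here; the two measurement points merely read it and count, which is the property discussed above for intermediate and destination nodes.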
So, for the purposes of this document, both IP addresses and Flow Label should not change in flight and, in some cases, they could be considered together with the FlowMonID for disambiguation.

The restrictions of a controlled domain, wherein there is assumed to be no attacker that can modify these fields, are probably worth noting here. They are in Section 2.1 and the security considerations, in the "harm to measurements" section, but that is somewhat buried at this point in the document and perhaps worth promoting to an earlier point.

[GF]: Agreed, we can make the requirement of the controlled domain clearer in the document, and it makes sense to mention it earlier as well.

Section 2.1.

This should probably point to the security considerations for more information about controlled domains.

[GF]: Sure. Will do.

Section 3.1.

o Opt Data Len: The length of the Option Data Fields of this Option in bytes.

Are there requirements for how long the reserved field in the option data is supposed to be? It seems that this field must consist of all zeroes, but that it can be up to 255 bytes long. Given that the data consists of a FlowMonID (20 bits) and two flags (2 bits), would it be useful to recommend (or require) a size for this?

[GF]: It makes sense. We can assign the value based on the design of the Option.

Section 5.

It is important to highlight that the definition of the Hop-by-Hop Options in this document SHOULD NOT affect the throughput on nodes that do not recognize the Option.

This is an interesting requirement. Surely a node that processes the option does more work before forwarding a packet, which seems like it would affect throughput, even if that impact is negligible. Perhaps "SHOULD NOT affect the throughput" could be rephrased as "is designed to minimize throughput impact on nodes that do not support the option"?

[GF]: Yes, thanks for the suggestion. I will replace that sentence.

Section 5.1.
The measurement of the packet loss is really straightforward. The packets of the flow are grouped into batches, and all the packets within a batch are marked by setting the L bit (Loss flag) to a same value.

Does this require nodes to batch packets in memory before forwarding? (As written, that seems to be the case, which seems odd.)

[GF]: No. As noted above, the source node is the only one that marks the packets to create the batches. The intermediate nodes only read the marking values. I will modify this sentence to avoid confusion.

The source node can switch the value of the L bit between 0 and 1 after a fixed number of packets or according to a fixed timer, and this depends on the implementation.

Using a timer for this seems like a very error-prone or noisy implementation approach. Beyond having tightly synchronized clocks, which is already a challenging requirement, is the idea that using a counter is somehow more complex than a timer? (If there's no benefit to using a timer, and it only introduces operational challenges, I'd recommend just removing the suggestion altogether, but I may be missing something.)

[GF]: Both can be used. As explained in RFC 8321, using a fixed timer for the switching offers better control over the method: the length of the batches can be chosen large enough to simplify the collection and the comparison of the measurements taken by different network nodes. Section 3.2 of RFC 8321 also highlights that you do not necessarily need tightly synchronized clocks to apply the methodology.

In a few words this implies that the length of the batches MUST be chosen large enough so that the method is not affected by those factors.

There does not seem to be enough guidance here to enforce this MUST, especially given the different factors that affect batch size. What happens if this MUST is violated? (Perhaps downgrading to a SHOULD would be better.)

[GF]: Agreed.
I will also add a pointer to Section 3.2 of RFC 8321, where it is possible to find the mathematical formulation for this.

Section 5.2.

How do nodes know if they should measure delay using the single- or double-marking methodology? Is that determined by some per-domain policy?

[GF]: Yes, we are working on companion documents on the control plane mechanisms, e.g. draft-ietf-idr-sr-policy-ifit, draft-chen-pce-pcep-ifit.

The most efficient and robust mode is to select a single double-marked packet for each batch, in this way there is no time gap to consider between the double-marked packets to avoid their reorder.

I'm having a hard time understanding this guidance. How exactly does one select a single packet? Is it done at random, or is there another way? (The figures seem to suggest that the packet is picked from the "middle" of a batch.)

[GF]: Yes, it is usually in the middle of a batch. In Section 3.2 of RFC 8321 this is called the "available counting interval" of a batch. I think we can add more details in the next version.

Section 5.3.

The FlowMon identifier field is to uniquely identify a monitored flow within the measurement domain. The field is set at the source node. The FlowMonID can be uniformly assigned by the central controller or algorithmically generated by the source node. The latter approach cannot guarantee the uniqueness of FlowMonID but it may be preferred for local or private network, where the conflict probability is small due to the large FlowMonID space.

What happens when all values in the FlowMonID space are consumed? Are old flows discarded or overwritten? I would imagine there's some way IDs are recycled given the finite 2^20 space, but that's not discussed.

[GF]: Agreed, this is a consideration we can add. A centralized controller can keep track of the assigned values, while this is harder if they are pseudo-randomly generated by the source. We can add more considerations on this.

Section 5.3.1.
This seems like text that should be moved to the security considerations. In doing so, it can also be trimmed. (I would claim that the 32-bit FlowMonID example is irrelevant given that these labels are 20 bits long, for example.)

[GF]: Yes, that makes sense. Of course, I will remove the statement about the 32-bit FlowMonID.

Section 6.

Moreover, Alternate Marking should usually be applied in a controlled domain and this also helps to limit the problem.

Does this mean to suggest that Alternate Marking can be used in networks where attackers exist? If so, comments above regarding the integrity of these fields should be addressed, I think.

[GF]: We will definitely revise the security section. The precondition for the application of Alternate Marking is that it MUST be applied in a controlled domain.

The privacy concerns of network measurement are limited because the method only relies on information contained in the Option Header without any release of user data. Although information in the Option Header is metadata that can be used to compromise the privacy of users, the limited marking technique seems unlikely to substantially increase the existing privacy risks from header or encapsulation metadata.

The QUIC working group spent a _long_ time trying to understand the privacy implications of a single latency bit. I'd encourage the authors here to review the history of that discussion, and then revisit this paragraph. While privacy implications may not seem obvious, I think it's a mistake to say that it is unlikely to introduce any new sort of attack vector.

[GF]: Sure, I know the discussion on the QUIC Spin Bit since I am also active on that. I will improve this part and revise this paragraph. The strong requirement of the controlled domain also helps to mitigate the privacy concerns.

The Alternate Marking application described in this document relies on a time synchronization protocol.
Thus, by attacking the time protocol, an attacker can potentially compromise the integrity of the measurement.

This seems somewhat buried, and probably worth promoting to the introduction.

[GF]: Ok will do.

Editorial comments:

- Some language is a bit informal, e.g., "Anyway, ...". I recommend removing such phrasings throughout.

[GF]: Ok

- "Alternate Marking" and "alternate marking" are inconsistently capitalized. Is that intentional?

[GF]: Ok. We will use a consistent notation.

- OAM is undefined in Section 4 -- perhaps we can spell it out? (I assume it's Operations, Administration, and Maintenance.)

[GF]: Ok will do.

--
last-call mailing list
last-call@xxxxxxxx
https://www.ietf.org/mailman/listinfo/last-call