[Last-Call] Secdir last call review of draft-ietf-ippm-ioam-data-integrity-12

Benjamin Kaduk via Datatracker <noreply@xxxxxxxx> · Sat, 28 Sep 2024 15:17:08 -0700

Reviewer: Benjamin Kaduk
Review result: Has Issues

# secdir review of draft-ietf-ippm-ioam-data-integrity-12
CC kaduk

I have reviewed this document as part of the security directorate's ongoing
effort to review all IETF documents being processed by the IESG. These comments
were written primarily for the benefit of the security area directors. Document
editors and WG chairs should treat these comments just like any other last call
comments.

This is a re-review; I had previously reviewed version -07, and mostly am
looking at the diff from that revision (though in some sections that got many
changes I just read the current version without trying to ascertain what
exactly changed).

I see now as I am completing this review that Justin had responded to my
previous review comments back in January
(https://mailarchive.ietf.org/arch/msg/ippm/32pZbJc-8aEf6_MpH_2Sr4-sBSc/) but I
did not respond at that time.  I am not entirely sure what happened there; my
apologies for the lack of timely response (I will go ahead and reply now to
those items that still need a response).

The changes from -07 to -12 have made big improvements on a number of the
points raised by my initial review, but there are a few issues that are
critical to the security properties of the overall system but do not seem to be
completely addressed yet (I suspect that some of these were topics where text
changes were pending me responding to the email thread from the previous
review).

Of particular note (these topics will be covered in more detail below), that
response does clarify that we don't want to trust nodes (other than verifiers)
ver much and seems to support the idea that each node will manage their own
nonces (vs reusing the nonce from a received packet).  However, this does not
seem to have been fully reflected in the document yet.  (The proposal to
include the previous tag in the next node's AAD is sound.)

In several places in this review I will quote portions from my original review
of the -07.  This is just intended to set some context for my subsequent
comments; in nearly all cases the comment made on the -07 no longer applies
as-is to the -12.

## Discuss

### Key management

When reviewing the -07, I wrote:

> While the specifics of key distribution and management will inherently be out
of scope for this specification, I think we do need to settle a core question:
will each IOAM Node have its own unique key for signature generation, or do we
expect some level of key reuse across machines within a given domain?  The
latter scenario intuitively seems like it would make for a simpler deployment,
but would place some quite stringent limitations on what cryptographic
mechanisms we can use at runtime.  NIST's key usage limitations for GCM, in
particular, might actually prove prohibitive for the "one key shared across
nodes" option in practice.

The responses from Justin (both on the ability to use the deterministic nonce
construction and on the trust model) seem to confirm that we expect each node
to have their own unique key (shared between it and any Validators in the
domain but not to any other nodes).  This uniqueness of key (in terms of nodes
that are populating ICVs in packets) is vital to the safety of the system and
needs to be clearly stated as an explicit requirement for using the mechanism. 
Currently I see some brief mentions in §5.4 ("assumes that symmetric keys have
been distributed to the Validator only") and §7 ("need to be exchanged in a
secure way between the nodes involved with integrity protection of
IOAM-Data-Fields").  The former weakly implies, by using "keys" plural, that
each node gets their own key, but the latter implies an exchange of keying
material between multiple nodes that are applying integrity protection, which
(unless it's just between each node and the Validator(s)) is exactly what we
don't want (in particular, it seems that it would violate the "don't trust the
nodes" goal, because if two non-Validator nodes share a key one can
successfully impersonate the other to the Validator).

So, I propose a new subsection for "key and nonce management" that covers the
requirements and assumptions while avoiding any discussion of the specific key
exchange method that might be used.  This would probably look something like:

> The Integrity Protection Method defined in this document leverages symmetric
keys.  In order to use this mechanism, it is required that each encapsulating
node and each transit node that updates the ICV has its own unique symmetric
key for applying integrity protection.  Additional assumptions about these keys
apply: - that each key is securely generated - that each key is securely
distributed to only the corresponding encapsulating/transit node and any
Validator that needs to validate messages protected by that key > > The details
of key generation and distribution are outside the scope of this document. > >
In addition to key management, the per-message nonces used with GMAC need to be
managed to prevent (key,nonce)-pair reuse.  Since reuse of a nonce with a given
key allows forgery of arbitrary ciphertexts with valid authentication tag, it
is extremely important to have high confidence in nonce non-reuse.  While this
document does not prescribe specific nonce-generation mechanisms, it does
mention a non-exhaustive list of possibilities that includes: an integer
counter, a linear feedback shift register, and a 64-bit timestamp.  With either
of the former two mechanisms, there is some current state (counter or feedback
register state) that must be retained on durable storage such that a power
interruption or system crash does not lead to nonce reuse.  (For efficiency,
implementations sometimes commit to durable storage a range of nonce values
that can be used without a blocking write to durable storage, to amortize the
cost of the write, which results in some nonce values going unused after a
crash or power interruption but not a catastrophic loss of security.)  When
timestamps are used for nonce generation, the system requires a time source
that is strictly monotonic (i.e., never jumps backwards or repeats a value even
in the face of leap seconds, daylight-saving time, or other potential
disruptions).  If a remote time source is used, it must be authenticated in
order to provide these properties.  Note that the time source does not need to
be at all accurate in order to provide these properties! > > The integrity
protection mechanism defined in this document relies only on the uniqueness of
the nonce input to GMAC to provide its security properties.  However, use of a
counter or timestamp does reveal some information about the rate and/or times
at which packets are protected, and the considerations of RFC 9416 apply; in
particular, some scenarios may benefit from using an unpredictable nonce, as is
provided by the linear feedback shift register. > > The details of nonce
management are outside the scope of this document. > > With the deterministic
nonce construction used by the Integrity Protection Method defined in this
document, a given key can be used to protect up to 2^64 messages before
encountering the catastrophic (key,nonce) reuse problem.  Accordingly, the key
generation/distribution mechanism will need to be able to rotate any keys that
approach this limit before the limit is reached.  This protocol does not define
any specific "key identifier" field, but a seamless key rotation might be
achieved by, for example, associating a new node identifier to the new key and
informing consumers of node identifiers that the identifiers in question
reflect the same underlying device.

This subsection might fit as the first subsection of §5, or perhaps be a
subsection under §7 (which might then motivate further division of the security
considerations content into subsections).  It seems to have expanded a bit
since I first started putting it together, which probably impacts where it fits
best.

Note that this also addresses some lingering aspects of my previous comment
about "GCM key usage limitations" (in particular, specifically documenting that
there is a potential need for key rotation).

### Signature as nonce for transit nodes

In my review of the -07 I wrote:

> The specification of the actual cryptographic protection algorithm in §5
includes a provision (step 2) for a transit node to compute a new signature
that accounts for the additions it has made to the IOAM data.  It seems to be
saying that the corresponding cryptographic computation involves using the
received signature value as the nonce input for producing this new signature.
While SP800-38D does seem to permit using multiple IV lengths with a given key,
using the received signature as a nonce seems to set us up for using nonces of
length other than 96 bits, which (per {{GCM-Key-usage-limitations}}) strongly
limits how much traffic we can send in a given key.  It also would entail using
data received from the network as the nonce for a given node's private key,
which seems like it would make it easy for an attacker to induce (key,nonce)
reuse.  (This might be mitigated if the transit nodes diligently validate the
received signature prior to using it as a nonce, but even that would not
obivate the need for a fully reliable replay detection mechanism for the
lifetime of the key, which seems prohibitively expensive.) > [... more about
validation being expensive and a potential alternate option with diferent
tradeoffs]

In §5.2 where we discuss transit node behavior, we now say that the received
ICV field is used in the input AAD for the transit node's GMAC calculation,
which is a good way to ensure that the encapsulating (or previous transit)
node's integrity protection tag gets validated by the ultimate validator of the
packet.  But we seem to say that the "received Nonce field [is] provided to
GMAC" for the transit node's GMAC invocation, and we even go further and say
that "a transit node MUST NOT modify [...] the Nonce-Length field, and the
Nonce field in the IOAM Integrity Protection header."  As an implementor, the
only way I could plausibly interpret that combination of statements is that I
should have the transit node use the nonce value from the received packet as
the input nonce for GMAC, but that is trivially insecure unless the transit
node is validating the received ICV before applying its own GMAC (since an
attacker could just force a specific nonce value and thus nonce reuse on the
transit node), and in general remains insecure even in the face of such
validation absent other measures (since an attacker would be able to replay a
valid packet/ICV under conditions that cause the transit node to generate a
different AAD input for the (key,nonce) pair).  (For clarity: I think there are
reasonable solutions that do not involve the transit node validating the
received ICV and only applying checks to the received nonce, but I needed to
qualify the "trivially insecure" statement somehow.) The most secure option is
to have the transit node generate and use a new nonce in the same manner as an
encapsulating node, but then we have the problem that both the nonces (from
encapsulating node and transit node) need to be included in the packet in order
for the final ICV to be validatable.  I don't know if there's scope for such
expansion in the conveyed nonce (one might potentially be able to increase the
Nonce-Length and concatenate them, if we can hold to the invariant that nonces
are 96-bits).

If we want to have the transit nodes use the nonce from the received packet
then we need to lay out some specific requirements that make that safe, which
would include at a minimum that the encapsulating and transit nodes have
distinct keys, that the scheme used to create the invocation field of the nonce
is known to the transit node, and that the transit node applies some validation
checks on the received nonce in order to avoid nonce reuse.  The nature of such
checks depend on the nonce-generation scheme, of course, but for a timestamp or
counter would include enforcing strict monotonicity (with optional
lookahead/behind window to specifically track a fixed number of possible nonce
values and tolerate some level of packet reordering)

I will note that a scheme based on HMAC or KMAC rather than GMAC would not be
as impacted by this issue, since nonce reuse is not fatal to those MACs in the
same way (we could probably just use the encapsulating node's nonce and
tolerate some degree of replay), but that is a big change and would make it
impossible to benefit from hardware GCM implementations, so we may be stuck
working around GMAC's limitations.

### Bit-level hash inputs (Protected flags for trace and DEX option-types)

In my review of the -07 I wrote:
> I think we need to provide more clear guidance on how to handle the protected
flags for the trace and DEX option-types.  First, a note in passing that
interoperability does require locking in which flag bits are (not) covered by
the MAC at the time the protected option type is specified (i.e., now), which
blocks off future extensions since any new flag bits cannot be integrity
protected using this mechanism.  That said, it would be possible to divide the
unallocated flags range into a protected and unprotected portion, to leave a
little bit of flexibility for the future. > My main concern here, though, is to
concretely specify, at the bit level, what the input to the MAC (or hash)
function is -- do we mask out the unprotected bits (if so, to 0s or 1s?), or do
we literally just extract the two bits in question and make the bitstream input
to the MAC (or hash) be not byte aligned?  I strongly suggest the former, since
implementation handling for non-octet inputs to hash functions is very poor,
but reading the current text I would conclude that I must attempt to implement
the latter.  (I note on re-read that SP800-38D does require the inputs to have
bit lengths be a multiple of 8, so if we decide to use GMAC directly rather
than hash-and-MAC, we would need to specify padding if we opt for the latter
approach.)

Justin's reply indicated that the intention was to mask out (to zero) bits not
covered by the integrity protection, which is fine from a perspective of fully
specifying what we want to see and avoiding non-byte-aligned cipher inputs, but
does not seem to be reflected in the -12.  Maybe this is because protection of
flags was removed entirely and thus the behavior does not need to be specified
at all?  Though I do not see any results in the mail archive for a search on
"draft-ietf-ippm-ioam-data-integrity flags" that would line up with the right
timeline for that.  So I'm curious about why/whether we can get away with not
protecting any flags (including ones yet to be defined).  The following
paragraphs are predicated on an assumption that we do integrity protect at
least some flag bits, and are irrelevant if no flag bits will need protection.

The topic of how to handle flag bits that are currently undefined but get
defined in the future also remains as an item to highlight in my current
review.  I see Justin's indication that it's a hard problem (and to large
extent agree with the analysis), but leaving things undefined here still feels
like it's setting us up for bigger problems down the road.

Thinking through that a little more now, I suppose that it would technically be
possible to define a new flag, roll out implementation support to all machines
in the domain but without making use of the behavior that sets bits to one, and
then start to gradually enable the other behavior while monitoring for
breakage, but that's a very heavy deployment burden and requires all systems to
be updated before any systems can make use of the new feature, whereas defining
now which bits are/aren't covered would let integrity protection continue to
work with only the machines that need to use the new feature needing to get
updates.

I have extremely high confidence that the IESG will ask about protocol updates
and interoperability if we don't have some text in place by the time they see
the document, so I would suggest putting in some text with at least your
analysis of the issues and motivation for taking the current approach, even if
you don't want to adopt my proposal.  (In particular, the current state of
affairs is implicitly applying a requirement that future protocol flags cannot
provide context or meaning to the IOAM-Data-Fields, which seems like a fairly
strong limitation on protocol evolution and would force definition of a new
Option-Type in order to allow such protocol behavior.)

## Comments

### Threat model

In my review of the -07 I wrote:
> §1 lays out the scenario as having an IOAM-Domain that's under a single
administrative control but invokes the possibility of data collected in
untrused or semi-trusted environments as a motivation for integrity protection.
 Is this just a risk that nodes which are supposed to be under the domain's
administrative control get compromised, or are we intending to consider a
subtler scenario with semi-trusted entities being authorized parts of the
administrative domain directly, or something else?

The situation has improved since then, with §5.4 gaining a discussion of "Zero
Trust" and that the Validator is the only point of trust, but I don't think
this text is fully aligned with general usage of "trust" or "zero trust". 
(Also, I think "Zero Trust" is not a good name for the Integrity Protection
Method Suite, since it's not something unique about this mechanism.)  In
particular, I think that we still trust encapsulating and transit nodes to
behave properly while applying ICVs, but that we do not rely on their behavior
for anything other than validating the ICVs they produce, and we do not have to
trust other entities in the system "at all" (not technically true but
informally okay for this discussion).  So I would propose something more like:

% The Validator is the key trusted entity in this system, as it has access to
all of the keying material in use and makes the final determination of whether
an ICV is valid.  A misbehaving Validator would be able to use those keys to
forge or modify IOAM data as if it passed through the encapsulating or transit
nodes in question, and could render incorrect assessments of an ICV's
(in)validity, and we have no recourse or means to detect such misbehavior.  In
contrast, we only trust non-IOAM nodes to convey packets without manipulation
and can detect such modification by validating received ICV values; and we
trust IOAM encapsulating and transit nodes to honestly apply integrity
protection to packets they process.  Misbehaving encapsulating or transit nodes
could forge or drop packets they process but cannot impersonate other IOAM
nodes or modify integrity-protected IOAM data produced by other nodes without
being detected by the Validator.  In particular, a transit node is limited in
what forgery can be made without detection because the Validator will validate
the encapuslating node's ICV as part of validating the final ICV, so
modification to content protected by the encapsulating node would be detected
at the time of validation.

### Separation of layers

In my review of the -07, I wrote:
> While it's generally reasonable (as §3 does) to require the lower layer
protocol to handle threats at their own layer, I would probably call out that
since IOAM is defined as data fields rather than a dedicated packet structure,
we also rely on the lower layer to provide integrity protection for whch data
fields (that is, IOAM Option-Types) are present in a given packet.

but I'm not sure my intent came through properly.  Basically, this is just
saying that since each integrity-protected IOAM Option-Type is a distinct
IOAM-Option with its own ICV, there is no protection for the set of which IOAM
Option-Types are expected to be present in a packet -- an attacker could just
strip one out or replace it with an unprotected analogue but the unmodified
Option-Types would still validate.  So we rely on a different layer (possibly
out-of-band communication between encapsulator and Validator) to ensure that
the set of IOAM Option-Types which are present get some integrity protection. 
We could say as much in the security considerations, e.g.:

% Because each integrity-protected IOAM Option-Type is a distinct IOAM-Option
with its own ICV, there is no mapping or listing of which IOAM Option-Types are
expected to be present in a packet.  Accordingly, an attacker with ability to
modify the packet contents could just remove an integrity-protected IOAM
Option-Type or replace it with an unprotected analogue, but the unmodified
integrity-protected Option-Types would still validate successfully.  If there
is a need to provide integrity protection for which IOAM Option-Types are
present in a packet, that protection must be supplied at a different layer.

### Mutable fields to skip for signing

In my review of the -07, I wrote:
>  The discussion in §5 very quickly glosses over "IOAM-Data-Fields supposed to
be modified by other IOAM nodes on the path MUST be excluded from the
signature".  This is actually a critical point for constructing and validating
ciphertexts, and seems like it would merit a longer treatment.  Most notably, I
would probably want to have a central table (or maybe add to the IANA
registry?) to indicate "does this field get skipped for integrity protection:
Y/N" to try to leave out any guesswork by the implementor as to what is mutable
or not.

and Justin proposed to add a table to this effect.  It looks like maybe he was
waiting for a reply from me about whether to just do the table or also to
shuffle around or remove some other content.  Some of that content has been
relocated from -07 to -12 already, though, so I am not entirely sure of the
current expected status.  I think the current content in the -12 is mostly
talking about which header fields are included in the integrity protection,
whereas I wanted to both list the header fields and the specific data fields
that are (not) covered by the integriity protection (i.e., those that are
mutable or written by other nodes, but to specifically list them out in a table
rather than leaving it implicit).

### Take action based on IOAM data

In the -07 we had some fairly strong language about "Each node that takes
actions triggered by fields in the IOAM Integrity Protected Option-Type header
MUST act as a Validator" that doesn't seem to have an analogue in the -12.  But
that may well be ok, as it's up to the consumer of the data whether they need
the protection provided by the integrity protection method or are willing to
treat it as functionally unprotected for the purpose to which it is being put. 
We do, however, probably want to say (in the security considerations section)
that nodes which consume the IOAM data without validating the ICV do not get
the security benefit that the integrity protection mechanism provides.

### Selection of header fields

While I agree that NodeLen can be computed from the IOAM-Trace-Type, adding an
extra byte of AAD input has an extremely low computational cost, no wire
protocol overhead, and substantially improves the robustness of implementations
(by forcing them to directly verify the computed length cryptographically
instead of hoping that different implementations implement the right
computation from the (cryptographically validated) trace type.  I see that, as
a five-bit field, adding it to the ADD would require specifying the bit-level
padding, but IMO that is a price worth paying in order to get the property that
"you have to have the right value in order to cryptographically interoperate"
(a philosophy that has served modern do novo security protocol designs like
QUIC very well).

### Decapsulation

In §5.3 we have some text:

> The decapsulating node MUST NOT add the IOAM Integrity Protection header.
Also, the decapsulating node MUST NOT modify the Method-ID field, the
Nonce-Length field, and the Nonce field in the IOAM Integrity Protection header.

which is roughly analogous to what we say about transit nodes.  But some of
these listed actions don't seem to make any sense for a decapsulating node and
thus may not need to be listed?  E.g., if the decapsulating node is removing
the whole header then what is the concern in potential modification to those
fields?

### All or none for IOAM Integrity Protection

Since the -07 we have some new text in the security considerations:

> When enabled, the integrity protection MUST be applied to the entire IOAM
set, not a subset.

The motivation for having this strict of a requirement is only partially clear
to me.  At the protocol level, the integrity protected Option-Types are
distinct from the unprotected ones and thus it is easy to identify which
packets/options to validate or not validate upon receipt.  Now, as I mentioned
previously there is no integrity protection for which option-types are present
in a given packet, so the integrity-protected option-types could be stripped
entirely or even replaced with the unprotected analogue (though of course with
the quoted MUST only the "stripping" attack is possible, not replacement), and
any Validator would need to have advance knowledge of which option-types should
be present in order to detect such modification.  If the Validator is already
going to need that kind of knowledge, it doesn't seem like a huge stretch for
it to also know about a set of Option-Types that should and a set that should
not be integrity protected, and such a scheme would have the same level of
protection against tampering for the integrity-protected Option-Types that the
current state of affairs has.  Following this "MUST" directive, we do go on and
give some partial description of the problem it is trying to solve (in
particular, to the subset of issues where a transit node could pose as an
encapsulating node) and propose the mitigation of having the Validator know
which Namespace-IDs get integrity protection.  However, this is both an
incomplete description of the problem and an incomplete mitigation -- the
Validator would also need to know which Option-Types are supposed to be added
by which encapsulating node in order to be able to detect all the different
modification scenarios.

Having said all that, it seems that there is some possibility for having an
even more flexible policy still: if an encapsulating node is applying a known
policy to only integrity protect a subset of traffic, adherence to that policy
can be statistically verified at the Validator, and similarly statistical
analysis can be performed to attempt to detect whether the protected and
unprotected data are qualitatively different, as would happen if the IOAM data
was tampered with but only the unprotected packets were tampered with.  So I
think we need to have both a more thorough description of the problem statement
in this space and some more exposition around the justification for imposing
the chosen solution (whether it ends up remaining this "all or nothing"
requirement or some more flexible option) -- from where I sit we would get more
value out of putting IOAM on all packets and only putting integrity protection
on some of them (or some of the Option-Types within them), than we would by
only putting IOAM on some packets and integrity protecting all of those that we
do.  The corresponding cost in terms of a weakening of what statements the
Validator can make about traffic non-modification may be tolerable.

### Weaken recommendation to use 64-bit timestamp as nonce

In §7 (in the context of replay protection) we say: "a suggestion would be to
put a 64-bit timestamp in the Invocation field of the nonce", which is probably
a stronger recommendation than is defensible (see my discussion in the DISCUSS
point); I'd go with just "a possibility" here.

## NITS

###

In §5.1 I'd s/mandatory requirements/the mandatory requirements/ in the first
line.

### What is meaning?

As a general remark, I'd try to avoid saying that (absent some other
context/metadata) a field is "meaningless" since there's probably enough
lingering context to imply a "well, it's either X or Y but we're not sure" type
scenario.  My preference would be for a phrasing like "cannot be reliably
interpreted".  But that's largely a matter of style.

###

In §5.1.2, antepenultimate ¶, s/a E2E/an E2E/ -- whether you read it as "end to
end" or "ee two ee", it's still a vowel sound and needs the 'n'.

## Notes

This review is formatted in the "IETF Comments" Markdown format, see
https://github.com/mnot/ietf-comments.

-- 
last-call mailing list -- last-call@xxxxxxxx
To unsubscribe send an email to last-call-leave@xxxxxxxx