Re: [Last-Call] [MBONED] Tsvart last call review of draft-ietf-mboned-driad-amt-discovery-09

"Holland, Jake" <jholland@xxxxxxxxxx> · Tue, 3 Dec 2019 17:20:04 +0000

Hi Bernard,

Thanks for your comments.  I have a few responses and a few
clarifying questions inline, marked with [JH].

On 2019-11-30, 17:42, "Bernard Aboba via Datatracker" <noreply@xxxxxxxx> wrote:
> This draft is ready for publication from a transport point of view, with
> the exception of a few (relatively minor) issues: 
> 
> Section 2.5.4.1
> 
> "   The RECOMMENDED timeout is a random value in the range
>    [initial_timeout, MIN(initial_timeout * 2^retry_count,
>    maximum_timeout)], with a RECOMMENDED initial_timeout of 4 seconds
>    and a RECOMMENDED maximum_timeout of 120 seconds.
> "
> 
> [BA] The draft provides a justification for the initial_timeout value
> of 4 seconds, but not for the maximum_timeout value of 120 seconds, 
> which seems somewhat high.  It is my suspicion that the value is set
> this high to allow for robustness in dealing with potential routing 
> transients. It would be helpful to state the reasoning. 

[JH] I can add the text from Section 5.2.3.4.3 of RFC 7450 (referenced
from the next paragraph), which contains a similar equation with that
justification for the 120 second timer:
https://tools.ietf.org/html/rfc7450#section-5.2.3.4.3
"
   a RECOMMENDED maximum_timeout of 120 seconds (which is the
   recommended minimum NAT mapping timeout described in [RFC4787]).
"

Will that address this concern?

Note the same maximum appears in section 2.7, and the reasoning is
similar, since this is all part of the AMT discovery process, and
thus subject to similar reasoning as the discovery process in
RFC 7450.

Do you think the same text is necessary in both places? (Or
necessary at all, given the reference to a very similar equation
in the following paragraph?)

I've provisionally added it to both spots in my local copy, but
please let me know if you think it should be different.
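
As a rough illustration (not text from the draft), the timeout rule
quoted above could be sketched in Python as follows.  The function
name, the parameter names, and the choice of a uniform distribution
are assumptions; the draft only says "a random value in the range".

```python
import random

INITIAL_TIMEOUT = 4.0    # seconds, RECOMMENDED initial_timeout
MAXIMUM_TIMEOUT = 120.0  # seconds, RECOMMENDED maximum_timeout


def discovery_timeout(retry_count,
                      initial_timeout=INITIAL_TIMEOUT,
                      maximum_timeout=MAXIMUM_TIMEOUT,
                      rng=random):
    """Pick a timeout in
    [initial_timeout, MIN(initial_timeout * 2^retry_count, maximum_timeout)].

    Hypothetical helper: uniform sampling is an assumption, not
    something the draft specifies.
    """
    upper = min(initial_timeout * 2 ** retry_count, maximum_timeout)
    return rng.uniform(initial_timeout, upper)
```

With the recommended values, the upper bound doubles per retry
(4, 8, 16, 32, 64) and is capped at 120 seconds from the fifth retry
onward, since 4 * 2^5 = 128 exceeds the maximum.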

> Section 2.5.4.2
>
> "  In some gateway deployments, it is also feasible to monitor the
>    health of traffic flows through the gateway, for example by detecting
>    the rate of packet loss by communicating out of band with receivers,
>    or monitoring the packets of known protocols with sequence numbers.
>    Where feasible, it's encouraged for gateways to use such traffic
>    health information to trigger a restart of the discovery process
>    during event #3 (before sending a new Request message).
>
>    However, to avoid synchronized rediscovery by many gateways
>    simultaneously after a transient network event upstream of a relay
>    results in many receivers detecting poor flow health at the same
>    time, it's recommended to add a random delay before restarting the
>    discovery process in this case.
>
>    The span of the random portion of the delay should be no less than 10
>    seconds by default, but may be administratively configured to support
>    different performance requirements."
>
> [BA] There is good reason to be concerned about causing synchronized
> rediscovery as a result of a transient network event, if "poor flow health"
> is diagnosed too readily. As a result it would be useful to have more
> specific advice on the definition of "poor flow health" as well as 
> how to calculate the "random delay". 
>
> My assumption is that we are talking about *major* and *sustained*
> loss here (e.g. a period larger than most routing transients), as well 
> as a *substantial* delay (to avoid instability). 

[JH] I agree with this in principle and tried to fix it in a rev several
versions ago, but I ended up deciding to leave it this way, somewhat
reluctantly.  I think the right answer depends too strongly on the
specifics of the situation to provide much in the way of concrete
advice, at least that I could think of, beyond a rough pointer to the
problem.

I think even "major and sustained" might be too situational, because
I think it depends on the network and the service (for example, even
minor and sustained would in some cases be worth changing relays,
especially if there's a history suggesting something better is
expected).

I agree that the text is a bit weak here, and that suggests it
should be possible to improve, but I never was happy with any of the
ideas I came up with--nothing I could find seemed both generic enough
to be generally applicable and specific enough to be useful.

If you think it's helpful, I can add something like "The specifics
of the health monitoring logic are out of scope for this document.",
or I'd be happy to accept text here if anyone has better suggestions.
But nothing I came up with seemed to make any material improvement,
so I concluded that shorter is better.

(I also thought it might be best to just cut this section, but decided
against that because I thought it better to acknowledge and encourage
this where it's feasible.  Maybe that's a mistake?  My not-very-firm
judgement call was that leaving this in is better than nothing, but
I'll take advice here.)

Anyway, I haven't made any changes to my local copy yet to address
this point.  Hopefully this response lays out my current position.

Please let me know if you have any further comments about this.  I'd
be happy to see it improve and grateful for suggestions on how to do
so, but am willing to ship it as it stands, absent a more specific
suggestion or a better understanding of the problem that needs solving
in the text.
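
For reference, the random delay in the quoted text from 2.5.4.2 could
look something like the Python sketch below.  This is only an
illustration under stated assumptions: the names, the optional fixed
base delay, and the uniform distribution are not from the draft,
which only says the span of the random portion should be no less
than 10 seconds by default and may be administratively configured.

```python
import random

DEFAULT_DELAY_SPAN = 10.0  # seconds; default span of the random portion


def rediscovery_delay(span=DEFAULT_DELAY_SPAN, base_delay=0.0, rng=random):
    """Return how long to wait before restarting discovery.

    Hypothetical helper: 'span' is the width of the random portion,
    'base_delay' is an optional fixed component (an assumption).
    """
    if span < 0 or base_delay < 0:
        raise ValueError("span and base_delay must be non-negative")
    # Spread restarts across the span so gateways that detect poor
    # flow health at the same moment do not rediscover in lockstep.
    return base_delay + rng.uniform(0.0, span)
```

What counts as poor flow health, and therefore when this delay is
triggered, is deliberately left out of the sketch.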

> Concerns unrelated to Transport
>
> Security
>
> Section 6.2
>
>    "There must be a trust relationship between the end consumer of this
>    resource record and the DNS server.  This relationship may be end-to-
>    end DNSSEC validation, a TSIG [RFC2845] or SIG(0) [RFC2931] channel
>    to another secure source, a secure local channel on the host, DNS
>    over TLS [RFC7858] or HTTPS [RFC8484], or some other secure
>    mechanism."
>
> [BA] This paragraph is mixing e2e security mechanisms (DNSSEC) with
> mechanisms such as DoT and DoH. The threats addressed by each mechanism
> are different (e.g. RR modification versus snooping) so it would be helpful
> to be clear about what the threat model is.  Is there a privacy concern
> relating to unauthorized snooping of AMTRELAY RRs? Or is the issue more
> modification of the RRs?  

[JH] The issue is modification of the RRs.  (I assume an adversary
who can observe the DNS request and poses a privacy threat is also
likely positioned to observe the AMT traffic and its embedded
subscriptions, which is already a worse privacy problem than the
source-specific discovery request and is a pre-existing issue when
using AMT, not added by this doc.)

The next paragraph in the same section (I thought) explained the threat
model that this section was trying to address:
"  If an AMT gateway accepts a maliciously crafted AMTRELAY record, the
   result could be a Denial of Service, or receivers processing
   multicast traffic from a source under the attacker's control."

Do you have a suggestion for improving on that explanation?  I'm not
sure where this fell short.  Do I need to spell out more about the
possible consequences of accepting traffic from a source under an
attacker's control?

> Overall utility
>
> [BA] It is not clear to me why the AMTRELAY RR is needed, given that
> Section 2.3.1 makes it clear that querying this record is a last
> resort: 
>
> ... <cut: quote of 5 preconditions from 2.3.1> ...
>
> In particular, DNS-SD RRs can easily be added with DNS service 
> providers, while this is not necessarily the case for a new
> AMTRELAY RR.  So are there really situations in which it was not
> feasible to add DNS-SD RRs, but using the AMTRELAY RR is more
> convenient/easier to deploy? 

[JH] I believe this is the typical case today, and is the core
motivation for writing this doc in the first place.  I'm a bit
troubled that the rest of the doc didn't get this point across,
because I believed it to be a central theme of several of the
existing sections, most particularly sections 2.1 and 2.2, as
well as section 1.

The core issue is that the sending networks (for example those
listed in section 3.2) know about provisioned AMT relays that can
forward their traffic, but the receiving networks (for example
those listed in section 3.1) don't know about those relays without
a new discovery mechanism (currently provided only by this new
AMTRELAY record).

In particular: the DNS-SD service is not source-specific.  It should
be preferred where available, for the reasons given in section 2.3.1,
but any network that can supply a valid relay via DNS-SD (one that
can receive and forward multicast traffic from the given source)
either has native multicast connectivity to the source (for example,
if the receiving network is directly connected to the sending
network, rather than only reachable across the internet), or has an
upstream AMT ingest point that relies on the AMTRELAY discovery.
Today the latter would be almost all networks that are not walled
gardens, with the exception of i2.  I had thought this explanation
was more or less covered by section 2.2.

One day, I do hope the AMTRELAY record can be abandoned because
there will be a native multicast backbone available everywhere.

However, as a transition technology until that time, some mechanism
for automatically connecting the multicast-enabled receiver islands
to the multicast-enabled sender islands in a source-dependent way is
necessary, which is what this document is trying to define, and which
has previously been missing.

I hope that clarifies things, and please let me know if you can
suggest any place to add text that would have made this more clear on
the first reading.

Thanks and regards,
Jake
