Re: [Last-Call] Opsdir last call review of draft-ietf-v6ops-slaac-renum-03

Fernando Gont <fgont@xxxxxxxxxxxxxxx> · Thu, 10 Sep 2020 06:00:02 -0300

Hi, Jürgen,

Thanks a lot for your comments! In-line....

On 9/9/20 19:05, Jürgen Schönwälder via Datatracker wrote:
[....]

Perhaps indicate a bit earlier what unacceptably long means, i.e. we
are talking about days and weeks.

This is a bit subjective. If I'm sitting on my computer doing e.g. 
video-conferencing (i.e., anything interactive), probably anything over 
a few minutes would be unacceptable. In a more general case, what's 
acceptable is a function of how often the problem happens and whether 
there's any ongoing interactive usage -- and that's still subjective.

The scenarios described read a bit
like somewhat rare events and hence it is useful for the reader to
have an idea what unacceptably long means in such events.

I'm wondering whether adding something like:
" Any definition of what is considered 'acceptable' here would be 
subjective, and would probably also depend on how often these 
flash-renumbering events occur, whether the affected hosts are employing 
any interactive applications, and other parameters. However, one rough 
estimate would be that hosts should be able to deal with 
flash-renumbering events about as quickly as they can deal with 
failing default routers."

would help?

(BTW, I find
the scenario not described at the beginning where a router announces
SLAAC lifetimes that are not synchronized with obtained prefix
lifetimes operationally the more tricky problem since this can lead to
regular failures.)

Fair enough. How about adding this to the bulleted list:

" o A router (e.g., a Customer Edge router) may advertise autoconfiguration 
prefixes corresponding to prefixes learned via DHCPv6-PD with constant 
PIO lifetimes that are not synchronized with the DHCPv6-PD lease time 
(as Section 6.3 of [RFC8415] requires them to be). While this behavior 
violates the aforementioned requirement from [RFC8415], it is not 
unusual, particularly when, e.g., DHCPv6-PD is implemented in a 
different software module from the SLAAC router component.".

?
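To make the proposed bullet a bit more concrete, here is a minimal Python sketch of what "synchronized" could look like on a CE router: instead of advertising constant PIO lifetimes, the router caps each advertised lifetime by what remains of the DHCPv6-PD lease. The dataclass, the function name, and the choice to also cap against the RFC 4861 defaults (604800 s preferred, 2592000 s valid, i.e., the 7 days / 30 days mentioned further down) are assumptions for illustration; this is not text from RFC 8415 or from the draft.

```python
from dataclasses import dataclass

# RFC 4861 default PIO lifetimes, in seconds (7 days / 30 days).
DEFAULT_PREFERRED_LIFETIME = 604_800
DEFAULT_VALID_LIFETIME = 2_592_000


@dataclass
class DelegatedPrefix:
    """Hypothetical record of a prefix learned via DHCPv6-PD."""
    prefix: str
    preferred_until: float  # absolute time (s) at which the PD preferred lifetime ends
    valid_until: float      # absolute time (s) at which the PD valid lifetime ends


def pio_lifetimes(pd: DelegatedPrefix, now: float) -> tuple[int, int]:
    """Return the (preferred, valid) lifetimes to place in a PIO at time `now`.

    Each advertised lifetime is capped by what remains of the DHCPv6-PD
    lease, so the advertised prefix never outlives the delegation it
    was derived from.
    """
    remaining_preferred = max(0, int(pd.preferred_until - now))
    remaining_valid = max(0, int(pd.valid_until - now))
    preferred = min(DEFAULT_PREFERRED_LIFETIME, remaining_preferred)
    valid = min(DEFAULT_VALID_LIFETIME, remaining_valid)
    # RFC 4861 says the Preferred Lifetime must not exceed the Valid Lifetime.
    return min(preferred, valid), valid


if __name__ == "__main__":
    pd = DelegatedPrefix("2001:db8:1::/64", preferred_until=3_600, valid_until=7_200)
    print(pio_lifetimes(pd, now=0))      # (3600, 7200): capped by the PD lease
    print(pio_lifetimes(pd, now=7_200))  # (0, 0): the lease is over
```

A router with the behavior described in the bullet would, in effect, skip the lease-based caps and always advertise the constants, which is how the advertised prefix can outlive the delegation.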

Section 2.2 seems to confuse soft-state (this is what a learned IPv6
prefix is for me) with certain protocol timers. There are many places
where protocols use soft-state and implementations use timers to purge
or refresh soft-state. That timers generally do not go off in normal
conditions is not really correct in this context, DHCP leases are
renewed when their lifetime expires, a normal operation. 

Normally, you renew the lease before the lease expires.

IP address
mappings to Ethernet addresses expire when their lifetime timer goes
off. 

This one is not necessarily the best example ;-) (while RFC1122 
requires that, IIRC in many implementations the entry is refreshed when 
referenced, and it only expires when it is not referenced/refreshed 
frequently enough).

But I do see where you are going and I realize that the text is a bit 
sloppy in this respect. How about tweaking the text as follows:

---- cut here ----
   Many protocols, from different layers, normally employ timers for
   fault isolation/recovery.  The general logic is as follows:

   o  A timer is set with a value such that, under normal conditions,
      the timer does *not* go off.

   o  Whenever a fault condition arises, the timer goes off, and the
      protocol can perform fault recovery.

   For example, when implementing reliability mechanisms, a timer is
   normally set when a packet is transmitted and, unless a response is
   received before the timer goes off, a fault recovery action (such as
   packet re-transmission) is triggered.
---- cut here ----

?
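The fault-recovery logic in the proposed text can be illustrated with a small, deterministic Python simulation of a retransmission timer. It is only a sketch under simplifying assumptions (no real network I/O, a fixed timeout, a hypothetical `send_reliably` helper, and a reply arriving after the timer has fired is simply treated as lost), not a model of any particular protocol.

```python
from typing import Optional, Sequence


def send_reliably(response_delays: Sequence[Optional[float]],
                  timeout: float, max_attempts: int = 3) -> tuple[int, bool]:
    """Simulate a sender that arms a retransmission timer per transmission.

    response_delays[i] is how long the reply to transmission i takes to
    arrive, or None if that transmission (or its reply) is lost.
    Returns (number of transmissions used, whether a reply was received).
    """
    for attempt in range(max_attempts):
        delay = response_delays[attempt] if attempt < len(response_delays) else None
        if delay is not None and delay < timeout:
            # Normal case: the reply beats the timer, so the timer never goes off.
            return attempt + 1, True
        # Fault case: the timer goes off, triggering recovery (a retransmission).
    return max_attempts, False
```

For example, `send_reliably([0.05], timeout=1.0)` returns `(1, True)` (the timer never fires), while `send_reliably([None, 0.05], timeout=1.0)` returns `(2, True)` (one loss, recovered by one retransmission).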

As you correctly point out, though, one might also look at this same 
issue as the timer implying a sensible period of time within which 
information should be refreshed.

(I guess the only difference is that when looking at this from the 
soft-state angle, you're mostly considering the case where information 
changes, whereas when looking at this from the fault-recovery point of 
view, you're mostly thinking about failures, rather than updates).

Switches purge forwarding state regularly when forwarding entries
expire. Cached DNS name to IP resolutions expire. The only problem
here seems to be that a lifetime of 7 days / 30 days is a bit
ridiculous.

Agreed.

Is anyone shipping the RFC 4861 defaults? 

Yes, unfortunately. Some implementations override the RFC4861 defaults. 
Still, RFC4861 defaults are extremely common and widespread.

The few implementations I have seen do use a bit more reasonable
defaults.  I think this section should be rewritten to replace the
"timer going off is associated with a failure" text with a discussion
of soft-state in other protocols. (Section 2.2 is why I ticked 'has
issues'.)

As a second alternative to what I've suggested above:

---- cut here ----
   Many protocols, from different layers, normally employ timers for a
   variety of purposes, such as in fault isolation/recovery mechanisms,
   and in the maintenance of data structures that contain bindings of
   some sort (e.g., the IPv6 Neighbor Cache [RFC4861]).

   In the case of fault recovery/isolation, the general logic is as
   follows:

   o  A timer is set with a value such that, under normal conditions,
      the timer does *not* go off.

   o  Whenever a fault condition arises, the timer goes off, and the
      protocol can perform fault recovery.

   For example, when implementing reliability mechanisms, a timer is
   normally set when a packet is transmitted and, unless a response is
   received before the timer goes off, a fault recovery action (such as
   packet re-transmission) is triggered.

   On the other hand, when maintaining bindings in data structures,
   timers are usually selected such that any bindings that become stale
   are updated in a timely manner.
---- cut here ----

?
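The "bindings" case in this second alternative is essentially soft-state: an entry stays only as long as it keeps being refreshed. A minimal Python sketch of a generic soft-state table follows; the class and method names are hypothetical, time is passed in explicitly to keep it deterministic, and it deliberately does not model Neighbor Unreachability Detection or the actual Neighbor Cache state machine of RFC 4861.

```python
from typing import Optional


class SoftStateTable:
    """Hypothetical soft-state table: bindings expire unless refreshed."""

    def __init__(self, lifetime: float) -> None:
        self.lifetime = lifetime
        self._entries: dict[str, tuple[str, float]] = {}  # key -> (value, expiry)

    def refresh(self, key: str, value: str, now: float) -> None:
        # Each fresh announcement (re)installs the binding and restarts its timer.
        self._entries[key] = (value, now + self.lifetime)

    def lookup(self, key: str, now: float) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expiry = entry
        if now >= expiry:
            # Timer went off: nobody refreshed the binding, so purge it as stale.
            del self._entries[key]
            return None
        return value
```

With `SoftStateTable(lifetime=30)`, a binding refreshed at t=0 and again at t=20 is still returned at t=45 but is gone at t=60. The `lifetime` value bounds how long a stale binding can linger, which is exactly why a 30-day valid lifetime for a PIO is problematic.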

Isn't a part of the solution (other than moving to less ridiculous
default) that SLAAC hosts experiencing connectivity problems should
try to validate the prefix that they have learned (and if the
validation fails move to a newly learned prefix)?

Yes, indeed. That's what we are pursuing in draft-ietf-6man-slaac-renum 
(see Section 4 of this document, draft-ietf-v6ops-slaac-renum-03).

draft-ietf-v6ops-slaac-renum-03 contains the problem statement and 
*operational* mitigations only.

Involving the hosts in a resolution of the problem may be more robust
than expecting that something in the network takes care of
invalidating stale soft-state.

I agree 100%. That has, indeed, been the motivation for pursuing 
draft-ietf-6man-slaac-renum.
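For illustration only, the host-side idea discussed above (keep the current prefix while it validates; if validation fails, move to a newer prefix that does validate) could be sketched in Python as follows. The `validate` callback is a hypothetical stand-in for whatever check a host might use (for instance, a reachability probe sourced from an address in that prefix), the names are made up, and this is not taken from draft-ietf-6man-slaac-renum.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class LearnedPrefix:
    """Hypothetical record of a SLAAC prefix as seen by a host."""
    prefix: str
    last_advertised: float  # time (s) the prefix was last seen in an RA


def select_prefix(current: LearnedPrefix,
                  candidates: Sequence[LearnedPrefix],
                  validate: Callable[[str], bool]) -> Optional[LearnedPrefix]:
    """Keep `current` if it still validates; otherwise fall back to the most
    recently advertised other prefix that does validate (None if none does)."""
    if validate(current.prefix):
        return current
    others = [c for c in candidates if c.prefix != current.prefix]
    for cand in sorted(others, key=lambda c: c.last_advertised, reverse=True):
        if validate(cand.prefix):
            return cand
    return None
```

If a host holds a stale `2001:db8:1::/64` and has since learned `2001:db8:2::/64`, and only the latter passes `validate`, `select_prefix` returns the newer prefix. The point is simply that the host reacts to its own connectivity problems rather than waiting for the network to invalidate stale state.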

Thanks!

Regards,
--
Fernando Gont
SI6 Networks
e-mail: fgont@xxxxxxxxxxxxxxx
PGP Fingerprint: 6666 31C6 D484 63B2 8FB1 E3C4 AE25 0D55 1D4E 7492

--
last-call mailing list
last-call@xxxxxxxx
https://www.ietf.org/mailman/listinfo/last-call