Re: Call for action vs. lost opportunity (Was: Re: Renumbering)

Mark Andrews <Mark_Andrews@xxxxxxx> · Fri, 14 Sep 2007 14:06:39 +1000



> 
> >>> To my small mind, forcing a new DNS lookup in the event of a
> >>> TCP session failure and restart would be a good thing.
> >>>       
> >> perhaps, but it won't work reliably as long as there can be more than
> >> one host associated with a DNS name, nor will it work as long as DNS
> >> name-to-address mapping is used to distribute load over a set of hosts.
> >>     
> >
> > 	We already have the DNS hooks to distinguish services from
> > 	hosts.  We have had them for the last 8 years.
> >   
> Yes but SRV records weren't really meant to handle this case either. 
> And they actually can make applications less reliable because they
> introduce a new dependency on DNS (another lookup that can fail, in a
> different zone and potentially on a different server, another piece of
> configuration data that can be incorrect.)  What we'd really need is a
> RR type specifically intended to map service names onto instance
> ID+address pairs, and also a special query type that wasn't defined to
> return all of the matching RR records, but would instead return a random
> subset or a subset based on heuristics, and finally an instance ID to
> address mapping service.  But arguably DNS isn't the right place to do
> that at all - there should instead be a generic referral service at
> layer 3 or 4.
> 
> Of course, part of the reason that people started using A records to
> refer to multiple hosts was that a number of applications "just worked"
> when they did that.  And I remember when people used to object loudly to
> such things, and insist that a DNS name and a host name had to be the
> same thing.  Anyway, this kind of overloading of A records has been such
> a widespread practice for so long that I don't see it changing.  And
> it's not as if we came up with a better way of doing things for IPv6
> addresses.
> 
> >> in other words, doing another DNS lookup of the original DNS name only
> >> looks like a good way to solve the problem if you don't look very deep.
> >>  
> >> now if you somehow got a host-specific (or narrower) identifier as a
> >> result of setting up the initial connection (maybe via a TCP option),
> >> and you had a way to map that host-specific identifier to its current IP
> >> address (assume for now that you're using DNS, though there are still
> >> other problems with that) - then you could do a different kind of lookup
> >> to get the new IP address and use that to do a restart.
> >>
> >> even then, it wouldn't help the numerous applications which don't have a
> >> way to cleanly recover from dropped TCP connections.  (remember,  TCP
> >> was supposed to make sure data were retransmitted as necessary and that
> >> duplicated data were sorted out, provide a clean close, that sort of
> >> thing.   once you expect apps to handle dropped connections they have to
> >> re-implement TCP functionality at a higher layer.)
> >>     
> >
> > 	Applications need to deal with TCP connections breaking for
> > 	all sorts of reasons.  Renumbering should be a relatively
> > 	infrequent event compared to all the other possible ways a
> > 	TCP connection can fail.
> >   
> Mumble.  Seems like the whole point of TCP was to recover from such
> failures at a lower level.  And I remember how people used to say that
> TCP was better than X.25 VCs (in part) because TCP would recover from
> temporary network outages that would cause hangups in X.25.
> 
> I also don't have a lot of faith in "should be", not when I've seen DHCP
> servers routinely refuse to renew leases after very short times, nor
> when I've heard people say that a site should be able to renumber every
> day.  

	So, someone misconfigured something.  Such misconfigurations
	usually get fixed fast.

	Getting the automation to the state where a daily renumber
	is possible is an achievable goal.  If we were doing that,
	the long-running apps would have been fixed long ago.  The
	fact that they haven't been is more a matter of a lack of
	pressure than anything else.  That's why I started with a
	long period when I suggested that router and firewall vendors
	actually renumber themselves periodically.  It was to keep
	the problem in the management space rather than the application
	space.

	Having each vendor work on their part of the problem is the
	way to go.
 
> I used to try to get people to specify a minimum amount of time that a
> non-deprecated address should be expected to be valid - say a day.  Then
> application writers and application protocol designers would have an
> idea about whether they needed a strategy for recovery from a
> renumbering event, and what kind of strategy they needed.  But the only
> people who seemed to like this idea were application area people.
> 
> > 	Until applications deal nicely with the other failure modes,
> > 	complaints about renumbering causing problems at the
> > 	application level are just noise.
> >   
> in other words, one design error can be used to justify another?  sort
> of like the blind leading the blind?

	No. People should work on making renumbering work efficiently.

	Using TCP failures at the application level as an excuse
	not to pursue making renumbering work cleanly is just that,
	an excuse.
 
> I see a significant difference between a design flaw in a particular
> application that cripples that application, and a design flaw in a lower
> layer that cripples all applications.

	Reconnect is a reasonable strategy for most applications.

	Holding a TCP session open in the presence of ICMP
	host/net unreachable is also a reasonable strategy.
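
	As a rough illustration of the reconnect strategy, here is a
	minimal Python sketch that looks the name up again before
	every attempt rather than reusing the address it had before.
	The function name, the parameters and the retry policy are
	all made up for this example; it is not taken from any
	particular application, and it does nothing about the
	multiple-hosts-per-name problem raised earlier in the thread.

```python
import socket
import time


def connect_with_fresh_lookup(host, port, attempts=3, delay=1.0,
                              resolve=socket.getaddrinfo,
                              connect=socket.create_connection):
    """Connect to host:port, re-resolving the name before each attempt.

    `resolve` and `connect` default to the standard library calls and are
    parameters only so the behaviour can be exercised without a network.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            # Ask for the name again instead of trusting an old answer,
            # so a renumbered host can be found at its new address.
            infos = resolve(host, port, type=socket.SOCK_STREAM)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            last_error = exc
            infos = []
        for _family, _socktype, _proto, _canonname, sockaddr in infos:
            try:
                # sockaddr[:2] is (address, port) for both IPv4 and IPv6;
                # this simplification drops any IPv6 scope id.
                return connect(sockaddr[:2], timeout=5.0)
            except OSError as exc:
                last_error = exc
        if attempt + 1 < attempts:
            time.sleep(delay)
    raise ConnectionError(f"could not reach {host}:{port}") from last_error
```

	An application would call this once at start-up and again
	whenever its connection breaks.  Anything it had in flight
	still has to be resent or resynchronised by the application
	itself, which is the part that long-running apps tend to get
	wrong.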

> Keith
-- 
Mark Andrews, ISC
1 Seymour St., Dundas Valley, NSW 2117, Australia
PHONE: +61 2 9871 4742                 INTERNET: Mark_Andrews@xxxxxxx
