Re: Call for action vs. lost opportunity (Was: Re: Renumbering)

Greg Skinner <gds@xxxxxxxx> · Fri, 14 Sep 2007 20:34:34 +0000

On Fri, Sep 14, 2007 at 07:48:45AM -0400, Keith Moore wrote:
> [sorry, lost attribution here]
> > TCP protects you from lots of stuff, but it doesn't really let you
> > recover from the remote endpoint rebooting, for example... 
> well, duh.   if the endpoint fails then all of the application-level
> state goes away.  TCP can't be responsible for recovering from the loss
> of higher-level state.  but we're not talking about endpoint failures,
> we're talking about the failure of the network.  TCP is supposed to
> recover from transient network failures.  it wasn't designed to cope
> with endpoint address changes, of course, because the network as
> designed wasn't expected to fail in that way.

When I was first learning about networking back in the mid-1980s, I
worked on a project involving mobile hosts.  The hosts were permitted
to change their IP addresses, but TCP-level connectivity needed to
remain intact.  The loss of a route to some network (or host within
that network) might trigger an ICMP unreachable, but the applications
(e.g. telnet, ftp) needed to be rewritten not to close in such a
situation.

It seemed like a reasonable thing to do to treat something like a net
or host unreachable as a transient condition, and allow the
application to proceed as if nothing serious had happened.  When
routing connectivity could be restored quickly, the maintained state
at both ends of the TCP connection would allow the application to
proceed normally.  However, this practice doesn't seem to have made it
into the application-writing community at large, because lots of
applications fail for just this reason.  I wonder whether writing a
BCP about this even makes sense at this point, because the application
writers (or the authors of the references they use) may never see the
draft, or even be aware that it's something they should check for.
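
To make the idea concrete, here is a minimal sketch (not code from the
project described above) of an application treating net/host
unreachable errors as transient and retrying instead of closing.  The
names TRANSIENT_ERRNOS, is_transient and retry_transient, and the
attempt count and delay, are assumptions made up for illustration.
Whether and when the OS surfaces these errors on an established
connection varies by stack.

```python
import errno
import time

# Errors a routing outage typically surfaces.  Treating them as
# transient is the assumption under discussion, not a universal rule.
TRANSIENT_ERRNOS = {errno.ENETUNREACH, errno.EHOSTUNREACH}


def is_transient(exc):
    """Return True if exc looks like a net/host unreachable condition."""
    return isinstance(exc, OSError) and exc.errno in TRANSIENT_ERRNOS


def retry_transient(op, attempts=5, delay=1.0, sleep=time.sleep):
    """Call op(); on a transient error, wait and retry instead of closing.

    Any other error, or a transient error on the final attempt, is
    re-raised so the caller can decide to close the connection.
    """
    for attempt in range(attempts):
        try:
            return op()
        except OSError as exc:
            if not is_transient(exc) or attempt == attempts - 1:
                raise
            sleep(delay)
```

An application might wrap a single socket call this way, e.g.
retry_transient(lambda: sock.send(chunk)), keeping the connection
state intact while routing recovers.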

> > (And something that's common in today's IPv4 deployments: NAT
> > timeouts. I got bitten by that in Chicago, I think they were only a
> > few minutes in my hotel, drove me insane because anything other than
> > HTTP didn't work for long.)
> given that NATs violate the most fundamental assumption behind IP (that
> an address means the same thing everywhere in the network), it's hardly
> surprising that they break TCP.

After installing a NAT firewall/router, I noticed my ssh connections
would drop when left idle for a while.  That never happened before -- I
could go away from my machine for hours, and as long as client and
server machines were up, with no network dynamics, everything would
work fine when I returned.  But is it TCP itself that's failing, or
ssh interpreting the timeout as a non-transient condition, and telling
TCP to close?
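
For illustration only: a NAT that expires idle mappings typically does
so silently, so neither end learns of it until it next sends.  One
common mitigation is to have TCP send keepalive probes more often than
the NAT's idle timeout.  The sketch below assumes a Python application;
the function name enable_keepalive and the timer values are
hypothetical, and the TCP_KEEP* options are only set where the
platform exposes them.

```python
import socket


def enable_keepalive(sock, idle=60, interval=15, count=4):
    """Turn on TCP keepalives so an idle connection still carries traffic.

    idle/interval/count are illustrative values; they would need to be
    shorter than whatever idle timeout the NAT in the path applies.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    tuning = (
        ("TCP_KEEPIDLE", idle),       # seconds idle before first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # failed probes before giving up
    )
    for name, value in tuning:
        opt = getattr(socket, name, None)  # not defined on every platform
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock
```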

I think a reasonable compromise, for application writers who are
concerned about tying up resources on connections that might really
need to close (e.g. because the remote end really did crash, or there
was a really long timeout), is to let the user specify what the
application should do when a layer 3 error condition occurs.
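
A rough sketch of what such a user-selectable policy could look like
follows.  The names OnUnreachable and should_close, the three policy
values, and the 300-second default limit are all hypothetical, chosen
only to illustrate the idea.

```python
import errno
from enum import Enum


class OnUnreachable(Enum):
    """User-selected behavior when a net/host unreachable error occurs."""
    CLOSE = "close"            # give up immediately
    WAIT = "wait"              # keep the connection open indefinitely
    WAIT_UNTIL = "wait-until"  # keep it open up to a time limit


def should_close(exc, policy, elapsed=0.0, limit=300.0):
    """Decide whether the application should close the connection.

    exc is the OSError reported by the socket layer, elapsed is how
    long (seconds) the outage has lasted, limit applies to WAIT_UNTIL.
    """
    unreachable = exc.errno in (errno.ENETUNREACH, errno.EHOSTUNREACH)
    if not unreachable:
        return True   # not a routing problem; the policy doesn't apply
    if policy is OnUnreachable.CLOSE:
        return True
    if policy is OnUnreachable.WAIT:
        return False
    return elapsed >= limit
```

The policy value could come from a command-line option or a
configuration file, so the default stays conservative while users on
flaky networks can opt in to waiting.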

--gregbo

_______________________________________________

Ietf@xxxxxxxx
https://www1.ietf.org/mailman/listinfo/ietf