Re: Protocol for TCP heartbeats?

Martin Sustrik <sustrik@xxxxxxxxxx> · Wed, 14 Jul 2010 21:23:15 +0200

Ted,

>>>> The obvious problem is that heartbeats can thus sit in the transmit
>>>> buffer waiting to be delivered. They can even be retransmitted. Etc.
>>>> In any case the functionality they are supposed to provide is pretty
>>>> heavily distorted.
>>>
>>> FWIW, I don't think it matters if the keepalives are stuck in a TCP
>>> transmit buffer or in a multi-continent routing loop.  If the
>>> application needs to hear from its peer every N seconds and it doesn't,
>>> they're disconnected.
>>
>> Yes. That's true for the dumb keepalive algorithm as described in the
>> previous email.
>>
>> However, if you want something more sensible (presumably something like
>> SCTP's heartbeats) you need to take the current RTO into account. That's
>> something you can't do on top of TCP.

> You're missing my point, perhaps because I'm being unclear.  Let me try
> it this way:
>
> If an application needs a heartbeat, it almost always needs to be an
> application to application (layer 7 to layer 7) heartbeat.
>
> Imagine you have a perfect TCP heartbeat algorithm: it detects the
> existence of a running TCP instance on both endpoints perfectly
> accurately at the timescales you care about.  Now one of your
> application endpoints deadlocks itself - it hangs, spins, whatever - but
> the process is alive and the TCP connections are open.  The application
> is not responding.  The TCP timeout won't help you at all; the TCP
> connection is fine.
>
> Of course the way to deal with that is a layer 7 heartbeat.
>
> My point is that if you need that layer 7 heartbeat, the layer 4 (TCP)
> one doesn't help much.  I can't think of an application that needs the
> TCP heartbeat and not the application heartbeat.  (There probably is
> one; my point is that needing both is the common case.)

Right. Layer 7 heartbeats are definitely needed to detect whether the
application is hung.

However, detecting an application hang is a problem orthogonal to
detecting the unavailability of the network peer. Being able to detect
network unavailability is valuable in itself (e.g. the application can
start a failover procedure in a timely fashion).

Those that need hangup detection can obviously implement heartbeats on 
layer 7, but that's beside the point here.
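
For reference, the layer 7 side can be as small as the sketch below. It is only an illustration: the class name, the timeout and the injected clock are made up for this example and do not come from any existing protocol.

```python
import time


class HeartbeatMonitor:
    """Tracks when the peer *application* last proved it was responsive."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock            # injectable so the logic can be tested
        self.last_seen = clock()

    def on_heartbeat(self):
        # Call this only when the peer's application code answers,
        # not when its TCP stack merely ACKs a segment.
        self.last_seen = self.clock()

    def peer_hung(self):
        # True once the application has been silent for longer than timeout_s.
        return self.clock() - self.last_seen > self.timeout_s
```

Note that this says nothing about why the peer went silent; a hung process and an unreachable host look the same from here.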

> So, TCP designers could create a highly parameterized heartbeat timer
> (every application has its own idea what a timeout is) and put all that
> complexity into the TCP protocol.

No complexity is needed IMO. Consider the following:

1. The keepalives are already defined in RFC 1122 (4.2.3.6).

2. There are no interoperability issues. With an SCTP-like heartbeat
mechanism each peer manages its failure detection itself and
no extra effort on behalf of the other side is needed. Thus
implementations with failure detection would work perfectly well with
implementations with no failure detection.

3. There are no congestion control issues. The keepalives are data and
thus they should adhere to the TCP congestion control mechanism. When the
peer is unreachable the keepalives would back off gracefully.
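
To make points 2 and 3 concrete, here is a rough sketch of what a purely local, RTO-aware failure detector could look like. The smoothing constants follow the usual RTO computation (RFC 6298: alpha = 1/8, beta = 1/4, K = 4). The class name, the default limits and the miss threshold are assumptions for illustration only, and clock granularity is ignored.

```python
class RtoHeartbeat:
    """Local failure detector: RTO-based timeout with exponential backoff."""

    ALPHA, BETA, K = 1 / 8, 1 / 4, 4

    def __init__(self, min_rto=1.0, max_rto=60.0, max_misses=5):
        self.min_rto = min_rto
        self.max_rto = max_rto
        self.max_misses = max_misses      # hypothetical failure threshold
        self.srtt = None                  # smoothed round-trip time
        self.rttvar = None                # round-trip time variation
        self.rto = min_rto                # current heartbeat timeout
        self.misses = 0                   # consecutive unanswered heartbeats

    def on_ack(self, rtt):
        # Heartbeat was acknowledged: update the estimator, clear the misses.
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        self.rto = min(self.max_rto, max(self.min_rto, self.srtt + self.K * self.rttvar))
        self.misses = 0

    def on_timeout(self):
        # Heartbeat went unanswered: count it and back off exponentially.
        self.misses += 1
        self.rto = min(self.max_rto, self.rto * 2)

    @property
    def peer_unreachable(self):
        return self.misses >= self.max_misses
```

Nothing here requires cooperation from the other side beyond acknowledging what it receives, and the doubling timeout is what keeps the probes from hammering an unreachable peer.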

AFAICS the only thing preventing specification of an optional TCP
heartbeat mechanism is the set of artificial restrictions in RFC 1122
4.2.3.6, such as the "no less than two hours" rule.

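As a side note, RFC 1122 words the two-hour figure as a floor on the default interval and requires the interval to be configurable, and some stacks expose per-socket knobs for it. The sketch below assumes Linux option names (TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT) and skips any that the platform does not provide; the helper name and the default values are made up for illustration.

```python
import socket


def enable_fast_keepalive(sock, idle_s=10, interval_s=5, probes=3):
    """Turn on TCP keepalives and, where supported, shorten their timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These option names are platform specific (present on Linux);
    # options missing on this platform are silently skipped.
    for name, value in (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # unanswered probes before giving up
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
```

These are fixed timers, so this still does not take the current RTO into account.
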
It's interesting to look at the rationale in RFC 1122:

"The TCP specification does not include a keep-alive mechanism because 
it could:  (1) cause perfectly good connections to break during 
transient Internet failures; (2) consume unnecessary bandwidth ("if no 
one is using the connection, who cares if it is still good?"); and (3) 
cost money for an Internet path that charges for packets."

(1) That is exactly what you require for high-availability solutions.
(2) A high-availability solution does care.
(3) True, but those that need it are happy to pay the extra cost.

Martin
_______________________________________________
Ietf mailing list
Ietf@xxxxxxxx
https://www.ietf.org/mailman/listinfo/ietf