Re: Last Call: draft-ietf-sipping-overload-reqs (Requirements for Management of Overload in the Session Initiation Protocol) to Informational RFC

Matt Mathis <mathis@xxxxxxx> · Thu, 8 May 2008 17:44:44 -0400 (EDT)

I reviewed draft-ietf-sipping-overload-reqs-02 at the request of the transport 
area directors.  Note that my area of expertise is TCP, congestion control and 
bulk data transport.  I am not a SIP expert, and have not been following the 
SIP documents.

I have serious concerns about this document because it explicitly excludes the 
only approach for coping with overload that is guaranteed to be robust under 
all conditions.  Although I know it is considered bad form to describe 
solutions while debating requirements, I think a sketch of a solution will 
greatly clarify the discussion of the requirements.

The only robust overload signal is the natural implicit signal - silently 
discarding excess requests.  Explicit overload messages (code 503) should be 
optional, and must have an explicit rate limit.  The error message may be 
cached (e.g. in proxies), but caching must not be required.  All 
retransmissions in all parts of the protocol must back off exponentially 
(which I am told is already true for SIP).
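
As a rough illustration of the sketch above (not a proposed mechanism), the 
Python below splits one interval's offered requests into accepted, explicitly 
rejected with 503, and silently dropped, and lists retransmission intervals 
that double up to a cap.  The names OverloadPolicy, classify and 
backoff_schedule, and all numeric values, are assumptions made for this 
example; they come from neither the draft nor SIP.

```python
from dataclasses import dataclass


def backoff_schedule(base: float, cap: float, attempts: int) -> list[float]:
    """Retransmission intervals that double each time, up to a cap."""
    return [min(base * 2 ** i, cap) for i in range(attempts)]


@dataclass
class OverloadPolicy:
    """Hypothetical per-interval policy: serve, send a few 503s, drop the rest."""
    capacity_per_tick: int   # requests the server can fully process per tick
    max_503_per_tick: int    # explicit rate limit on 503 responses

    def classify(self, offered: int) -> dict[str, int]:
        """Split one tick's offered requests into accepted / 503 / dropped."""
        accepted = min(offered, self.capacity_per_tick)
        excess = offered - accepted
        sent_503 = min(excess, self.max_503_per_tick)
        # Anything beyond the 503 budget is discarded without a reply.
        return {"accepted": accepted, "sent_503": sent_503,
                "dropped": excess - sent_503}


if __name__ == "__main__":
    policy = OverloadPolicy(capacity_per_tick=100, max_503_per_tick=10)
    print(policy.classify(500))
    print(backoff_schedule(base=0.5, cap=4.0, attempts=6))
```

The point of the split is that the number of 503s is bounded no matter how 
large the offered load becomes; the senders' backoff timers do the rest.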

Sending additional messages to explicitly indicate overload is intrinsically 
fragile.  If the overload management mechanism consumes any shared resource 
that might be needed to complete other calls, then there exists some operating 
point where any additional requests will cause a decline in the number of 
successfully completed calls.  This is likely to be regenerative, with each 
successive error using more resources and preventing more calls, until the 
throughput crashes to zero.  This phenomenon was readily apparent in all of the 
plots shown in the tsvwg meeting at IETF 71.
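
A toy model makes the shape of this argument concrete.  It is not the model 
behind the plots mentioned above; it assumes a single server with a fixed work 
budget per interval, where completing a call costs one unit and sending an 
explicit rejection costs reject_cost units from the same budget.  The function 
name and every parameter are hypothetical.

```python
from typing import Optional


def completed_calls(offered: float, capacity: float, reject_cost: float,
                    max_rejects: Optional[float] = None) -> float:
    """Calls completed per interval when rejections share the work budget.

    Assumes 0 <= reject_cost < 1.  max_rejects=None means every request that
    is not served gets an explicit rejection; max_rejects=0 means pure silent
    discard, which is assumed to cost nothing.
    """
    if offered <= capacity:
        return float(offered)
    # Every unserved request is rejected: a + reject_cost * (offered - a) = capacity
    unlimited = (capacity - reject_cost * offered) / (1.0 - reject_cost)
    if max_rejects is None:
        return max(0.0, unlimited)
    # At most max_rejects rejections are sent; the remainder is dropped for free.
    limited = capacity - reject_cost * max_rejects
    return max(0.0, unlimited, limited)


if __name__ == "__main__":
    for load in (100, 200, 400, 800):
        print(load,
              round(completed_calls(load, 100, 0.25), 1),       # reject everything
              round(completed_calls(load, 100, 0.25, 20), 1),   # rate-limited 503s
              round(completed_calls(load, 100, 0.25, 0), 1))    # silent discard
```

Under these assumptions the unlimited-rejection curve has a negative slope and 
reaches zero once offered load hits capacity / reject_cost, while the 
rate-limited and silent-discard curves stay flat.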

Note that if the explicit overload management mechanism is very complicated, 
the situation that triggers this failure might also be very complicated. 
Asserting that this hazard does not exist is probably equivalent to proving 
that explicit overload notifications never cause additional calls to fail, for 
all combinations of implementations under all operating conditions.  It would 
not be an easy task to prove that the standards are sufficient to guarantee 
this for all possible implementations.

My specific objections to the document are as follows: Requirement 6 calls for 
explicit overload messages and forbids silently discarding requests, on the 
grounds that a silent discard is ambiguous.  Requirement 15 seems to provide a 
loophole (allowing complete failures) but seems to forbid using it as the 
preferred mechanism.  Requirement 8 does not make sense without explicit 
notification.  Requirements 7, 8 and 9 should note that they can be (are 
already?)  equivalently satisfied by properly structured exponential 
retransmission backoff timers in SIP itself.

I would like to point out that TCP, IP and several other transport protocols 
have evolved in the same direction as I am advocating for SIP: the only robust 
indication that an error has occurred is connection failure.  Error messages 
are cached and sometimes accelerate timers (e.g. retransmit now, or go to the 
next IP address now), but do not change basic protocol behavior.  Error 
messages are most often rate limited at the sender and the saved error codes 
are used to provide a clue why something failed, but the fact that it failed 
most likely comes from a timer, not the message itself.  The number of error 
messages that are required for correct operation is declining (note that RFC 
4821 makes ICMP "can't fragment" messages optional), and may be zero.
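
The division of labor described above, where the timer decides that something 
failed and the cached error only says why, can be sketched as follows.  The 
Attempt class, its fields and the error string are hypothetical names invented 
for this illustration; they are not taken from TCP, IP or SIP.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Attempt:
    deadline: float                    # when the timer declares failure
    next_retry: float                  # when the next retransmission is due
    last_error: Optional[str] = None   # cached advisory error, if any

    def on_error(self, code: str, now: float) -> None:
        """Cache the error and retry sooner; never fail on the message alone."""
        self.last_error = code
        self.next_retry = min(self.next_retry, now)

    def status(self, now: float) -> str:
        """Only the timer can turn an attempt into a failure."""
        if now < self.deadline:
            return "pending"
        return f"failed ({self.last_error or 'timeout, no error seen'})"


if __name__ == "__main__":
    attempt = Attempt(deadline=32.0, next_retry=4.0)
    attempt.on_error("host unreachable", now=1.0)
    print(attempt.status(now=2.0))    # still pending; the error was advisory
    print(attempt.status(now=32.0))   # the timer fails it; the error explains why
```

Because a forged or lost error message changes at most the timing and the 
reported reason, correct operation never depends on the message arriving.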

Rate limiting all error messages and treating them as advisory improves 
robustness in several ways: fraudulent messages have less impact, error 
messages cannot be used as DDoS attack magnifiers, and overload is addressed 
implicitly by silently discarding requests.

Note that the normal, non-crisis, behavior has not changed significantly: 
error messages are sent, cached and reported to the application.  However, in a 
crisis, the error reporting degrades gracefully, while the throughput goes 
flat, without any negative slope.  This is where SIP (and all other protocols) 
should strive to be.

Treating all errors as soft should have been an Internet Architectural 
Principle.

Thanks,
--MM--
-------------------------------------------
Matt Mathis     http://staff.psc.edu/mathis
Work:412.268.3319    Home/Cell:412.654.7529
-------------------------------------------
