Reviews of draft-ietf-bmwg-sip-bench-term-08 and
draft-ietf-bmwg-sip-bench-meth-08
Summary: These drafts are not ready for publication as RFCs.

First, some of the text in these documents shows signs of being old, and the working group may have been staring at them so long that they've become hard to see. The terminology document says "The issue of overload in SIP networks is currently a topic of discussion in the SIPPING WG." (SIPPING was closed in 2009). The methodology document suggests a "flooding" rate that is orders of magnitude below what simple devices achieve at the moment. That these survived working group last call indicates a different type of WG review may be needed to groom other bugs out of the documents.

Who is asking for these benchmarks, and are they (still) participating in the group? The measurements defined here are very simplistic and will provide limited insight into the relative performance of two elements in a real deployment. The documents should be clear about their limitations, and it would be good to know that the community asking for these benchmarks is getting tools that will actually be useful to them. The crux of these two documents is in the last paragraph of the introduction to the methodology doc: "Finally, the overall value of these tests is to serve as a comparison function between multiple SIP implementations". The documents punt on providing any comparison guidance, but even if we assume someone can figure that out, do these benchmarks provide something actually useful as inputs?

It would be good to explain how these documents relate to RFC6076.

The terminology document tries to refine the definition of session, but the definition provided, "The combination of signaling and media messages and processes that support a SIP-based service", doesn't answer what's in one session vs. another. Trying to generically define session has been hard, and several working groups have struggled with it (see INSIPID for a current version of that conversation). This document doesn't _need_ a generic definition of session - it only needs to define the set of messages that it is measuring. It would be much clearer to say "for the purposes of this document, a session is the set of SIP messages associated with an INVITE-initiated dialog and any Associated Media, or a series of related SIP MESSAGE requests". (And looking at the benchmarks, you aren't leveraging related MESSAGE requests - they all appear to be completely independent.)

Introducing the concepts of INVITE-initiated sessions and non-INVITE-initiated sessions doesn't actually help define the metrics. When you get to the metrics, you can speak concretely in terms of a series of INVITEs, REGISTERs, and MESSAGEs. Doing that, and providing a short introduction that helps folks with PSTN backgrounds relate these to "Session Attempts", will be clearer. To be clear, I strongly suggest a fundamental restructuring of the document to describe the benchmarks in terms of dialogs and transactions, and to remove the IS and NS concepts completely.

The INVITE-related tests assume no provisional responses, leaving out the effect on a device's memory when the state machines it is maintaining transition to the proceeding state. Further, by not including provisionals, and by building the tests to search for Timer B firing, the tests ensure there will be multiple retransmissions of the INVITE (when using UDP) that the device being tested has to handle. The traffic an element has to handle, and likely the memory it will consume, will be very different with even a single 100 Trying, which is the more usual case in deployed networks.
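To make the retransmission point concrete, here is a minimal sketch of the UDP retransmission schedule, assuming the RFC 3261 default T1 of 500 ms and a device under test that never sends a provisional response:

    # Sketch: INVITE retransmissions over UDP when no provisional response is
    # ever received, per the RFC 3261 INVITE client transaction timers.
    # Assumes the default T1 = 500 ms; Timer B = 64 * T1.

    T1 = 0.5                       # seconds (RFC 3261 default)
    TIMER_B = 64 * T1              # transaction timeout: 32 s

    t, interval, retransmissions = 0.0, T1, 0
    while True:
        t += interval              # Timer A fires
        if t >= TIMER_B:
            break                  # Timer B fires; the transaction times out
        retransmissions += 1       # another copy of the INVITE goes on the wire
        interval *= 2              # Timer A doubles after each firing

    print(retransmissions)         # 6 retransmissions, i.e. 7 copies of each INVITE

So each attempt that is allowed to run all the way to Timer B puts seven copies of the INVITE on the wire; a device that returns even a single 100 Trying within the first T1 sees a very different load.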
The document should be clear _why_ it chose the test model it did and why it left out metrics that take provisional responses into account. Similarly, you are leaving out the delayed-offer INVITE transactions used by 3pcc, and it should be more obvious that you are doing so.

Likewise, the media-oriented tests take a very basic approach to simulating media. It should be explicitly stated that you are simulating the effects of a codec like G.711 and that you are assuming an element would only be forwarding packets and has to do no transcoding work. It's not clear from the documents whether the EA is generating actual media or dummy packets. If it's actual media, the test parameters that assume constant-sized packets at a constant rate will not work well for video (and I suspect endpoints, like B2BUAs, will terminate your call early if you send them garbage).
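For reference, a minimal sketch of what a constant-size, constant-rate media model amounts to for a G.711-like audio stream; the numbers assume 20 ms packetization and IPv4/UDP/RTP overhead and are illustrative, not taken from the drafts:

    # Sketch: what "constant packet size at a constant rate" reduces to for a
    # G.711-like codec. Assumes 20 ms packetization, IPv4/UDP/RTP headers, and
    # an element that only forwards packets (no transcoding or inspection).

    CODEC_RATE_BPS = 64_000        # G.711 payload bit rate
    PTIME_S        = 0.020         # 20 ms of audio per packet
    RTP, UDP, IPV4 = 12, 8, 20     # header sizes in bytes

    payload_bytes = int(CODEC_RATE_BPS / 8 * PTIME_S)   # 160 bytes of payload
    packet_bytes  = payload_bytes + RTP + UDP + IPV4    # 200 bytes at the IP layer
    packets_per_s = 1 / PTIME_S                         # 50 packets/s per stream
    ip_kbps       = packet_bytes * 8 * packets_per_s / 1000

    print(payload_bytes, packet_bytes, packets_per_s, ip_kbps)   # 160 200 50.0 80.0

A variable-rate video codec does not reduce to two constants like this, which is why these parameters describe audio simulation but not video.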
The sections on a series of INVITEs are fairly clear that you mean each of them to have different dialog identifiers. I don't see any discussion of varying the To: URI. If you don't, what's going to keep a gateway or B2BUA from rejecting all but the first with something like Busy? Similarly, I'm not finding where you talk about how many AoRs you are registering against in the registration tests. I think, as written, someone could implement this so that all the REGISTERs affect only one AoR.

The methodology document calls Stress Testing out of scope, but the very nature of the benchmarking algorithm is a stress test. You are iteratively pushing to see at what point something fails, _exactly_ by finding the rate of attempted sessions per second that the thing under test would consider too high.

Now to specific issues in document order, starting with the terminology document (nits are separate and at the end):

*T (for Terminology document): The title and abstract are misleading - this is not general benchmarking for SIP performance. You have a narrow set of tests, gathering metrics on a small subset of the protocol machinery. Please (as RFC 6076 did) look for a title that matches the scope of the document. For instance, someone testing a SIP Events server would be ill-served with the benchmarks defined here.

*T, section 1: RFC5393 should be a normative reference. You probably also need to pull in RFCs 4320 and 6026 in general - they affect the state machines you are measuring.

*T, 3.1.1: As noted above, this definition of session is not useful. It doesn't provide any distinction between two different sessions. I strongly disagree that SIP reserves "session" to describe services analogous to telephone calls on a switched network - please provide a reference. SIP INVITE transactions can pend forever - it is only the limited subset of the use of the transactions (where you don't use a provisional response) that keeps this communication "brief". In the normal case, an INVITE and its final response can be separated by an arbitrary amount of time. Instead of trying to tweak this text, I suggest replacing all of it with simpler, more direct descriptions of the sequence of messages you are using for the benchmarks you are defining here.

*T, 3.1.1: How is this vector notion (and graph) useful for this document? I don't see that it's actually used anywhere in the documents. Similarly, the arrays don't appear to be actually used (though you reference them from some definitions) - what would be lost from the document if you simply removed all this text?

*T, 3.1.5, Discussion, last sentence: Why is it important to say "For UA-type of network devices such as gateways, it is expected that the UA will be driven into overload based on the volume of media streams it is processing."? It's not clear that's true for all such devices. How is saying anything here useful?

*T, 3.1.6: This definition says an outstanding BYE or CANCEL is a Session Attempt. Why not just say INVITE? You aren't actually measuring "session attempts" for INVITEs or REGISTERs - you have separate benchmarks for them.

*T, 3.1.7: It needs to be explicit that these benchmarks are not accounting for/allowing early dialogs.

*T, 3.1.8: The words "early media" appear here for the first time. Given the way the benchmarks are defined, does it make sense to discuss early media in these documents at all (beyond noting you do not account for it)? If so, there needs to be much more clarity. (By the way, this Discussion will be much easier to write in terms of dialogs.)

*T, 3.1.9, Discussion point 2: What does "the media session is established" mean? If you leave this written as a generic definition, then is this when an MSRP connection has been made? If you simplify it to the simple media model currently in the document, does it mean an RTP packet has been sent? Or does it have to be received? For the purposes of the benchmarks defined here, it doesn't seem to matter, so why have this as part of the discussion anyway?

*T, 3.1.9, Definition: A series of CANCELs meets this definition.

*T, 3.1.10, Discussion: This doesn't talk about 3xx responses, and they aren't covered elsewhere in the document.

*T, 3.1.11, Discussion: Isn't the MUST in this section methodology? Why is it in this document and not -meth-?

*T, 3.1.11, Discussion, next to last sentence: "measured by the number of distinct Call-IDs" means you are not supporting forking, or you would not count answers from more than one leg of the fork as different sessions, as you should. Or are you intending that there would never be an answer from more than one leg of a fork? If so, the documents need to be clearer about the methodology and what's actually being measured.

*T, 3.2.2, Definition: There's something wrong with this definition. For example, proxies do not create sessions (or dialogs). Did you mean "forwards messages between"?

*T, 3.2.2, Discussion: This is definition by enumeration since it uses a MUST, and it is exclusive of any future things that might sit in the middle. If that's what you want, make this the definition. The MAY seems contradictory unless you are saying a B2BUA or SBC is just a specialized User Agent Server. If so, please say it that way.

*T, 3.2.3: This seems out of place or under-explored. You don't appear to actually _use_ this definition in the documents. You declare these things in scope, but the only consequence is the line in this section about not lowering the performance benchmarks when they are present. Consider making that part of the methodology of a benchmark and removing this section. If you think it's essential, please revisit the definition - you may want to generalize it into _anything_ that sits on the path and may affect SIP processing times (otherwise, what's special about this either being SIP Aware, or being a Firewall)?

*T, 3.2.5, Definition: This definition just obfuscates things. Point to 3261's definition instead. How is TCP a measurement unit? Does the general terminology template include "enumeration" as a type? Do you really want to limit this enumeration to the set of currently defined transports?
Will you never run these benchmarks for SIP over websockets?

*T, 3.3.2, Discussion: Again, there needs to be clarity about what it means to "create" a media session. This description differentiates attempt vs. success, so what is it exactly that makes a media session attempt successful? When you say number of media sessions, do you mean the number of m lines or the total number of INVITEs that have SDP with m lines?

*T, 3.3.3: This would be much clearer written in terms of transactions and dialogs (you are already diving into transaction state machine details). This is a place where the document needs to point out that it is not providing benchmarks relevant to environments where provisionals are allowed to happen and INVITE transactions are allowed to pend.

*T, 3.3.4: How does this model (a single session duration separate from the media session hold time) produce useful benchmarks? Are you using it to allow media to go beyond the termination of a call? If not, then do you have media only for the first part of a call? What real world thing does this reflect? Alternatively, what part of the device or system being benchmarked does this provide insight into?

*T, 3.3.5: The document needs to be honest about the limits of this simple model of media (see the G.711 sketch above). It doesn't account for codecs that do not have constant packet sizes. The benchmarks that use the model don't capture differences based on the content of the media being sent - a B2BUA or gateway will behave differently if it is transcoding or doing content processing (such as DTMF detection) than it will if it is just shoveling packets without looking at them.

*T, 3.3.6: Again, the model here is that any two media packets present the same load to the thing under test. That's not true for transcoding, mixing, or analysis (such as for DTMF detection). It's not clear whether, if you have two streams, each stream has its own "constant rate". You call out having one audio and one video stream - how do you configure different rates for them?

*T, 3.3.7: This document points to the methodology document for indicating whether streams are bi-directional or uni-directional. I can't find where the methodology document talks about this (the string 'direction' does not occur in that document).

*T, 3.3.8: This text is old - it was probably written pre-RFC5393. If you fork, loop detection is not optional. This, and the methodology document, should be updated to take that into account.

*T, 3.3.9: Clarify whether more than one leg of a fork can be answered successfully and update 3.1.11 accordingly. Talk about how this affects the success benchmarks (how will the other legs getting failure responses affect the scores?)

*T, 3.3.9, Measurement units: There is confusion here. The unit is probably "endpoints". This section talks about two things: that, and the type of forking. How is "type of forking" a unit, and are these templates supposed to allow more than one unit for a term?

*T, 3.4.2, Definition: It's not clear what "successfully completed" means. Did you mean "successfully established"? This is a place where speaking in terms of dialogs and transactions rather than sessions will be much clearer.

*T, 3.4.3: This benchmark metric is underdefined. I'll focus on that in the context of the methodology document (where the docs come closer to defining it). This definition includes a variable T but doesn't explain it - you have to read the methodology to know what T is all about. You might just say "for the duration of the test" or whatever is actually correct.
*T, 3.4.3, Discussion: "Media Session Hold Time MUST be set to infinity". Why? The argument you give in the next sentence just says the media session hold time has to be at least as long as the session duration. If they were equal, and finite, the test result would not change. What's the utility of the infinity concept here?

*T, 3.4.4: "until it stops responding". Any non-200 response is still a response, and if something sends a 503 or 4xx with a Retry-After (which is likely when it's truly saturating) you've hit the condition you are trying to find. The notion that the Overload Capacity is measurable by not getting any responses at all is questionable. This discussion has a lot of methodology in it - why isn't that (only) in the methodology document?

*T, 3.4.5: A normal, fully correct system that challenged requests and performed flawlessly would have a 0.5 Session Establishment Performance score. Is that what you intended? The SHOULD in this section looks like methodology. Why is this a SHOULD and not a MUST (the document should be clearer about why sessions remaining established is important)? Or wait - is this what Note 2 in section 5.1 of the methodology document (which talks about reporting formats) is supposed to change? If so, that needs to be moved to the actual methodology and made _much_ clearer.
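To spell out the arithmetic behind that 0.5 - a minimal sketch, assuming the EA counts the digest-challenged INVITE as a failed session attempt and the credentialed retry as a successful one:

    # Sketch of the Session Establishment Performance arithmetic when the
    # device under test digest-challenges (401/407) every initial INVITE.
    # Assumes each challenged INVITE counts as a failed session attempt and
    # each credentialed retry counts as a successful one.

    sessions_wanted = 1000
    attempts        = 2 * sessions_wanted   # original INVITE + credentialed retry
    successes       = sessions_wanted       # only the retried INVITE establishes

    print(successes / attempts)             # 0.5 for a flawlessly behaving system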
*T, 3.4.6: You talk of the first non-INVITE in an NS. How are you distinguishing subsequent non-INVITEs in this NS from requests in some other NS? Are you using dialog identifiers or something else? Why do you expect that to matter? (Why is the notion of a sequence of related non-INVITEs useful from a benchmarking perspective? There isn't state kept in intermediaries because of them - what will make this metric distinguishable from a metric that just focuses on the transactions?)

*T, 3.4.7: What's special about MESSAGE? Why aren't you focusing on INFO or some other end-to-end non-INVITE? I suspect it's because you are wanting to focus on a simple non-INVITE transaction (which is why you are leaving out SUBSCRIBE/NOTIFY). MESSAGE is good enough for that, but you should be clear that's why you chose it. You should also talk about whether the payloads of all of the MESSAGE requests are the same size and whether that size is a parameter to the benchmark. (You'll likely get very different behavior from a MESSAGE that fragments.)

*T, 3.4.7: The definition says "messages completed" but the discussion talks about "definition of success". Does success mean an IM transaction completed successfully? If so, the definition of success for a UAC has a problem. As written, it describes a binary outcome for the whole test, not how to determine the success of an individual transaction - how do you get from what it describes to a rate?

*T, Appendix A: The document should better motivate why this is here. Why does it mention SUBSCRIBE/NOTIFY when the rest of the document(s) are silent on them? The discussion says you are _selecting_ a Session Attempts Arrival Rate distribution. It would be clearer to say you are selecting the distribution of messages sent from the EA. It's not clear how this particular metric will benefit from different sending distributions.

Now the Methodology document (comments prefixed with an M):

*M, Introduction: Can the document say why the subset of functionality benchmarked here was chosen over other subsets? Why was SUBSCRIBE/NOTIFY or INFO not included (or INVITEs with MSRP, or even simple early media, etc.)?

*M, Introduction paragraph 4: This points to section 4 and section 2 of the terminology document for configuration options. Section 4 is the IANA considerations section (which has no options). What did you mean to point to?

*M, Introduction paragraph 4, last sentence: This seems out of place - why is it in the introduction and not in a section on that specific methodology?

*M, 4.1: It's not clear here, or in the methodology sections, whether the tests allow the transport to change as you go across an intermediary. Do you intend to be able to benchmark a proxy that has TCP on one side and UDP on the other?

*M, 4.2: This is another spot where pointing to the updates to 3261 that change the transaction state machines is important.

*M, 4.4: Did you really mean RTSP? Maybe you meant MSRP or something else? RTSP is not, itself, a media protocol.

*M, 4.9: There's something wrong with this sentence: "This test is run for an extended period of time, which is referred to as infinity, and which is, itself, a parameter of the test labeled T in the pseudo-code". What value is there in giving some finite parameter T the name "infinity"?

*M, 4.9: Where did 100 (as an initial value for s) come from? Modern devices process at many orders of magnitude higher rates than that. Do you want to provide guidance instead of an absolute number here?

*M, 4.9: In the pseudo-code, you often say "the largest value". It would help to say the largest value of _what_.

*M, 4.9: What is the "steady_state" function that is called in the pseudo-code?

*M, 6.3, Expected Results: The EA will have different performance characteristics depending on whether you have them sending media or not. That could cause this metric to be different from session establishment without media.

*M, 6.5: This section should call out that loop detection is not optional when forking. The Expected Results description is almost tautological - could it instead say how having this measurement is _useful_ to those consuming this benchmark?

*M, 6.8, Procedure: Why is "May need to run for each transport of interest." in a section titled "Session Establishment Rate with TLS Encrypted SIP"?

*M, 6.10: This document doesn't define Flooding. What do you mean? How is this different from "Stress test" as called out in section 4.8? Where does 500 come from? (Again, I suspect that's a very old value - and you should be providing guidance rather than an absolute number.) But it's not clear how this isn't just the session establishment rate test started with a bigger number. What is it actually trying to report on that's different from the session establishment rate test, and how is the result useful?

*M, 6.11: Is each registration going to a different AoR? (You must be, or the re-registration test makes no sense.) You might talk about configuring the registrar and the EA so they know what to use.

*M, 6.12, Expected Results: Where do you get the idea that re-registration should be faster than initial registration? How is knowing the difference (or even that there is a difference) between this and the registration metric likely to be useful to the consumer?

*M, 6.14: Session Capacity, as defined in the terminology doc, is a count of sessions, not a rate. This section treats it as a rate and says it can be interpreted as "throughput". I'm struggling to see what it is actually measuring. The way your algorithm is defined in section 4.9, I find s before I use T.
Let's say I've got a box where the value of s that's found is 10000, and I've got enough memory that I can deal with several large values of T. If I run this test with a T of 50000, my benchmark result is 500,000,000. If I run with a T of 100000, my benchmark result is 1,000,000,000. How are those numbers telling me _anything_ about session capacity? That the _real_ session capacity is at least that much? Is there some part of this methodology that has me hunt for a maximal value of T? Unless I've missed something, this metric needs more clarification to not be completely misleading. Maybe instead of "Session Capacity" you should simply be reporting "Simultaneous Sessions Measured".

*M, 8: "and various other drafts" is not helpful - if you know of other important documents to point to, point to them.

Nits:

T: The definitions of Stateful Proxy and Stateless Proxy copied the words "defined by this specification" from RFC3261. This literal copy introduces confusion. Can you make it more visually obvious you are quoting? And even if you do, could you replace "by this specification" with "by [RFC3261]"?

T, Introduction, 2nd paragraph, last sentence: This rules out stateless proxies.

T, Section 3: In the places where this template is used, you are careful to say None under Issues when there aren't any, but not so careful to say None under See Also when there isn't anything. Leaving them blank makes some transitions hard to read - they read like you are saying see also (whatever the next section heading is).

T, 3.1.6, Discussion: s/tie interval/time interval/

M, Introduction, paragraph 2: You say "any [RFC3261] conforming device", but you've ruled endpoint UAs out in other parts of the documents.

M, 4.9: You have comments explaining send_traffic the _second_ time you use it. They would be better positioned at the first use.

M, 5.2: This is the first place the concept of re-Registration is mentioned. A forward pointer to what you mean, or an introduction before you get to this format, would be clearer.

On 1/16/13 3:48 PM, The IESG wrote:

The IESG has received a request from the Benchmarking Methodology WG (bmwg) to consider the following document:
- 'Terminology for Benchmarking Session Initiation Protocol (SIP) Networking Devices' <draft-ietf-bmwg-sip-bench-term-08.txt> as Informational RFC

The IESG plans to make a decision in the next few weeks, and solicits final comments on this action. Please send substantive comments to the ietf@xxxxxxxx mailing lists by 2013-01-30. Exceptionally, comments may be sent to iesg@xxxxxxxx instead. In either case, please retain the beginning of the Subject line to allow automated sorting.

Abstract

This document provides a terminology for benchmarking the SIP performance of networking devices. The term performance in this context means the capacity of the device- or system-under-test to process SIP messages. Terms are included for test components, test setup parameters, and performance benchmark metrics for black-box benchmarking of SIP networking devices. The performance benchmark metrics are obtained for the SIP signaling plane only. The terms are intended for use in a companion methodology document for characterizing the performance of a SIP networking device under a variety of conditions. The intent of the two documents is to enable a comparison of the capacity of SIP networking devices.
Test setup parameters and a methodology document are necessary because SIP allows a wide range of configuration and operational conditions that can influence performance benchmark measurements. A standard terminology and methodology will ensure that benchmarks have consistent definition and were obtained following the same procedures.

The file can be obtained via
http://datatracker.ietf.org/doc/draft-ietf-bmwg-sip-bench-term/

IESG discussion can be tracked via
http://datatracker.ietf.org/doc/draft-ietf-bmwg-sip-bench-term/ballot/

No IPR declarations have been submitted directly on this I-D.