Hello IETF colleagues,

The IETF has been working on L4S, to reduce packet loss and delay on the Internet, which should be a great benefit for delay-sensitive applications like video conferencing. In a companion project, we are working on a network measurement tool that reports meaningful delay measurements, to validate whether L4S deployments and other similar technologies are actually delivering what they promise.

We are seeking expert feedback on this Internet Draft:

<https://datatracker.ietf.org/doc/html/draft-ietf-ippm-responsiveness>

It will be discussed next Monday at the IETF meeting in Dublin:

<https://datatracker.ietf.org/meeting/121/materials/agenda-121-ippm>

The purpose of this work is to create a repeatable analytical test that can be run to assess how well a network will support delay-sensitive applications like video conferencing. For the test to be useful, the results it reports need to correlate with subjective user experience. I worry that we have not validated this aspect of the test enough.

My understanding is that video conferencing applications accumulate received packets in a playback buffer (to smooth out delay variation), and then determine a time at which those packets are decoded to display a frame. Setting the playback buffer too deep results in conversational delay that impacts user experience. Setting the playback buffer too shallow results in lower delay, but risks displaying a frame before all the necessary packets have been received, degrading image (and audio) quality. Thus the playback buffer needs to adjust dynamically to network conditions, balancing playing early enough to keep conversational delay low against playing late enough that a sufficient percentage of packets have arrived by the playback time.

How does a video conferencing application compute this ideal playback delay? Is the delay set so that we expect 90% of the necessary packets to have been received? 95%? 99%?

The draft has been through a series of revisions with input from multiple people. It has currently arrived at an algorithm that samples the application-layer round-trip delay over a period of about ten seconds, discards the worst 5% of those measurements, and reports the arithmetic mean of the best 95%. Is this a good predictor of video conferencing performance?

I fear that our current test may be measuring the exact opposite of what video conferencing cares about. Mean and median mean nothing to video conferencing. If the median round-trip delay is just 1 ms then that's awesome, but it does a video conferencing application no good to decode a frame when it's got only half the packets (that's what median means). If the 90th-percentile round-trip delay is 500 ms, and the application needs 90% of the packets before it can usefully decode a frame, then the application needs to wait that long before decoding a frame. It doesn't matter that half the packets arrive really early if the remaining necessary packets arrive late. It is the latecomers that determine the playback delay, not the early packets. (The small sketch below illustrates how far these two statistics can diverge.)

Does my reasoning make sense here? What metric would video conferencing applications like to see reported? 90th percentile? 95th percentile? 99th percentile? Something else?

I want to make sure that when we publish this Internet Draft as an IETF RFC it serves its purpose of motivating vendors and operators to tune their networks so that delay-sensitive applications work well.
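To make the concern concrete, here is a small Python sketch. It is illustrative only: the delay samples are invented, and the trimmed_mean function is my reading of the statistic in the current draft, not code taken from it. It compares that statistic with the high-percentile delay that, by the reasoning above, an adaptive playback buffer actually has to cover.

    #!/usr/bin/env python3
    # Illustrative comparison: a trimmed-mean statistic versus the
    # high-percentile delay that sets the playback buffer depth.
    # The delay samples are invented; real traffic will differ.

    import random

    random.seed(1)

    # Hypothetical per-packet round-trip delays, in milliseconds:
    # most packets arrive quickly, but a few percent are badly delayed.
    samples = ([20.0 + random.uniform(-2.0, 2.0) for _ in range(940)]
               + [400.0 + random.uniform(-20.0, 20.0) for _ in range(60)])
    random.shuffle(samples)

    def trimmed_mean(delays, keep_fraction=0.95):
        # The statistic as I understand the current draft: discard the
        # worst 5% of samples and average the best 95%.
        kept = sorted(delays)[:int(len(delays) * keep_fraction)]
        return sum(kept) / len(kept)

    def percentile(delays, p):
        # Nearest-rank percentile: the delay below which p% of samples fall.
        # A playback buffer that wants p% of packets in hand before decoding
        # must be at least this deep.
        ordered = sorted(delays)
        rank = max(1, round(p / 100 * len(ordered)))
        return ordered[rank - 1]

    print(f"Trimmed mean (best 95%): {trimmed_mean(samples):6.1f} ms")
    for p in (50, 90, 95, 99):
        print(f"P{p} delay              : {percentile(samples, p):6.1f} ms")

With 6% of packets badly delayed in this invented sample set, the trimmed mean comes out around 24 ms, while the 95th and 99th percentiles are several hundred milliseconds, which is the delay a playback buffer needing 95% (or 99%) of its packets would actually have to absorb. If my reasoning is right, the reported statistic and the figure that matters to the application can differ by more than an order of magnitude.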
If the test measures the wrong thing, then it motivates vendors and operators to optimize the wrong thing, and that doesn't help delay-sensitive applications like video conferencing work better.

Please send comments to ippm@xxxxxxxx, or attend IPPM in Dublin to share your thoughts in person.

Stuart Cheshire