Re: Comments: [AVT] Last Call: RTP Payload for Comfort Noise to Proposed Standard

On Tue, 30 Apr 2002 James_Renkel@3com.com wrote:

> The problem in the above described situation is that the
> *receiver* won't know this until it receives the packet after the gap,
> which could be a long time, well longer than the depth of the receiver's
> jitter buffer. So, when the receiver's jitter buffer underflows, it has
> no way of distinguishing between:
> 1. the transmitter detected silence and just didn't bother to send any
> packets, and the receiver should play out silence; and
> 2. the network is congested, packets are getting lost, and the receiver
> should interpolate audio in an attempt to preserve audio quality.

By definition, interpolation can only occur across a gap between two
received packets, not between one received packet and nothing.  And,
in practice, interpolation is only effective across relatively short
gaps.  Therefore, if there is a gap that is shorter than the jitter
buffer, you have the packet before and after the gap, you can tell
that one or a few packets were lost by the sequence number difference,
and you can interpolate.
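
As a rough sketch of that check (Python; `classify_gap` and the fixed
`samples_per_packet` are hypothetical names chosen for illustration, and
the code assumes packets are examined in order with a constant packet
duration):

```python
SEQ_MOD = 1 << 16  # RTP sequence numbers are 16 bits and wrap around
TS_MOD = 1 << 32   # RTP timestamps are 32 bits and wrap around


def classify_gap(prev_seq, prev_ts, next_seq, next_ts, samples_per_packet):
    """Describe the gap between two packets received in order.

    Returns (lost_packets, silent_samples):
      lost_packets   - packets missing according to the sequence numbers
      silent_samples - timestamp advance not explained by sent packets,
                       i.e. time during which the sender sent nothing
    """
    lost = (next_seq - prev_seq - 1) % SEQ_MOD
    ts_jump = (next_ts - prev_ts) % TS_MOD
    # The lost packets plus the newly received one account for this much time.
    explained = (lost + 1) * samples_per_packet
    silent = max(0, ts_jump - explained)
    return lost, silent
```

A small sequence gap with no unexplained timestamp advance is loss that
can be interpolated.  Consecutive sequence numbers with a large timestamp
jump mean the sender was silent, but that is only known once the packet
after the gap has arrived.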

If the jitter buffer runs dry, you have a gap too long to interpolate
over.  One technique for that situation is to fade to silence (either
absolute silence or comfort noise).  If the lack of packets was
intentional due to VAD at the sender, then the last packet will be, or
at least should be, very near silence so there is not much fading to
do.  I don't see any different behavior you could take depending on
whether you know the lack of packets was intentional or not.
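
A minimal sketch of such a fade, assuming a linear per-frame gain ramp
(Python; the function name, the five-frame default, and the `floor`
parameter standing in for a comfort-noise level are all hypothetical):

```python
def fade_gains(num_frames, fade_frames=5, floor=0.0):
    """Gain for each frame played after the jitter buffer runs dry.

    The gain ramps linearly from just below 1.0 down to `floor` over
    `fade_frames` frames and then stays at `floor`.  The same ramp is
    used whether the gap came from loss or from VAD at the sender.
    """
    if fade_frames <= 0:
        raise ValueError("fade_frames must be positive")
    step = (1.0 - floor) / fade_frames
    gains = []
    for i in range(num_frames):
        if i >= fade_frames:
            gains.append(floor)
        else:
            gains.append(1.0 - step * (i + 1))
    return gains
```

If the last packet before the gap was already near silence, applying
this ramp changes very little, which is the point: the receiver does not
need to know why the packets stopped.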

Sure, if CN is sent, then you know silence is intended and you know
what level of noise you should produce.  But if CN is not being sent,
then you need to be prepared to handle equally a case where packets
are lost or where a poor-quality sender stops sending when the sound
level has not yet decayed all the way.

> I hope you can all agree with me that action 2., above, is common practice
> whether explicit VAD and CN are being used or not. Beyond that, many would
> say that action 2. is extremely desirable, that the technique used to
> accomplish it is a key differentiator of their product(s), and that for
> the general good of VoIP maybe should be considered a recommended practice.

I fully agree.

> But the general tone of your comment above, and elsewhere in the
> same e-mail, leads me to believe that you (and possibly others) do
> not support this, that you support *always* simply playing out
> silence if a packet is not available for playout at the required
> time (when the jitter buffer underflows).

Absolutely disagree.  I strongly believe that protocol specifications
should not say how to implement the protocol.

> That's fine and dandy as your personal view. But the suggested language of
> the section of the RFC that you wrote would "standardize" this behavior in
> the face of extensive use of exactly the opposite behavior.

I did not intend any words that I wrote to imply that.  Where do you
think the text does so?

> The purpose of the comfort noise coding is *exactly* to allow the receiver
> to distinguish between cases 1. and 2., above.

Yes, and that is why this generic CN payload format has been defined.
This discussion is about whether the lack of use of CN implies that VAD
and DTX will not be used.  I want to state clearly in my message and
in the specification that this is false.

> True, if packets are lost
> they could just as well have been CN packets as not (But if the last packet
> not lost was a CN packet, the receiver would interpolate comfort noise.).
> True, CN packets consume more bandwidth than sending nothing (But less than
> sending CODEC encoded near-silence.). Ya want to eliminate that bandwidth
> at a potential loss of audio quality when packets are lost, fine, don't
> implement or advertise support of CN.

I would assume that most implementations would not send CN in every
frame time, just one at the end of speech.  That CN might occasionally
be lost.  A good implementation should still produce an acceptable
result.

> I think before this RFC can go forward, we need to clear this up. I think
> the best we can and must say is that if packets aren't received in time,
> the result is receiver implementation independent (Interpolate if ya want;
> play silence if ya want; play "Yankee Doodle" if ya want. Let the
> marketplace decide if they like interpolation, silence, or "Yankee Doodle"
> better.). I don't think we can say, or imply, or leave open to
> interpretation sans a statement to the contrary, that the intended action
> when packets are not received in time is to *always* play silence.

I agree.


On Tue, 30 Apr 2002 Leland_Thompson@3com.com wrote:

> It seems that any protocol that actively communicates state transition
> information to a system should theoretically, in general, notify the system
> at the start of the event not at the end of the event.  If a state
> transition has occurred, I may need to take some action or do something
> differently.  With Silence Suppression, this is obviously the case.
>
> For instance, of particular concern is not knowing when transitions
> actually occur, but just as importantly, now having the possibility that
> significant time may elapse without knowing the actual state of the system.
> This last issue can cause other issues.  For instance, delays in accurate
> state information create additional problems if a system can end in a state
> that is not known to all parties.  The possibility of not having a
> transition to speech would cause the state information from the previous
> transmission (the silence transition) to be lost, because your method relies
> on receiving the next Voice packet, which never occurred.

We are not discussing theory.  You may object to the definition of the
Marker bit in RTP indicating the start rather than the end of a
talkspurt.  We debated this question at length when the protocol was
designed years ago.  It is unlikely to be changed now.

The draft in question is defining a CN payload type precisely so that
more information about the silence transition can be conveyed to the
receiver.  Good quality senders will implement CN.  A robust receiver
must behave well without it.

It is very unlikely that you would get agreement from the working
group to change the base specification of RTP to say that VAD and DTX
may not be used without CN.

If necessary, we can have more discussion about signaling so that a
receiver can refuse to accept a call that would use VAD and DTX
without CN.

> What happens during the umpteen frame periods that we didn't correctly
> identify the silence period?  How does it impact the speech signal/voice
> quality?
> How is this error reflected by the system in the form of statistics,
> counters, etc?  Are the statistics accurate anymore?

Yes.  The receiver reports the highest sequence number received.  The
sender knows it sent a higher number.

> If one allows the TimeStamp information along with the Sequence Number to
> together tell an RTP Decoder (Receiver) when a Loss Event is really just a
> Silence Period, one is presented with the following dilemmas.
>
> 1)
>     -What is one to do during the first audio frame time when data is not
> present?  In absence of a valid CN/SID frame, most (some) compliant
> implementations will transition to a Loss State which will cause an
> Interpolation of the Codec's decoder to occur.
>     - There is no reason to believe, yet, that Comfort Noise Generation
> should be activated.
>     - Furthermore, if one were to activate CNG, what is to be generated?
> You don't even have a minimal noise level to try and match just the back
> ground noise level of the channel, let alone the spectral information that
> might be typically present.

I discussed this above.

> 2)
> What happens toward the end of a session where an RTP Encoder (Transmitter)
> has transitioned to silence, however, the RTP Decoder (Receiver) thinks
> this may be a loss event, and the call ends without the RTP Decoder ever
> seeing another RTP Packet, which would have told him "BIG Change in
> TimeStamp, Little change in Seq Num".  The state transition information is
> lost, and now inaccurate statistics could be stored for this call because
> of it.  Would this scenario have a potential impact to perceived Quality of
> Service for this connection?  Absolutely it might!

The quality of this last silence is no different than the quality of a
silence earlier in the call.

> Today there are real implementations of VOIP GWs that operate in real
> Carrier Networks that monitor Quality of Service (QOS) in the form of
> Excessive Packet Loss indicators for TRAPS and Alarms within a Network
> Operations Center (NOC).  It is theoretically very important, therefore, to
> actively and accurately monitor state transitions within the system that
> would possibly cause a fault or alarm.

A complete RTP implementation will also be sending RTCP Sender Reports
that would let the receiver know, during a very long silence, whether
or not some packets had been transmitted and lost.

Even if you have CN at the end of a talkspurt, the receiver has no
idea whether it has lost some packets after that point if it receives
nothing more.  It is impossible to answer the question: "Did you
receive my last packet?"

> Silence Indication Descriptions in
> the form of CN or SID frames are incredibly important in order to robustly
> detect these state transitions at the point (time) of occurrence.

Great!  Use CN.  That is why the format is proposed.

>  I
> strongly recommend we rethink my original statements about RTP Decoding and
> how higher level protocol negotiations (i.e.  SIP - SDP, H.323/H.245, etc)
> really may only make sense in establishing what an RTP Encoder (transmitter
> - to packet network) does.
>
> Therefore, if CN is not negotiated as supported, it should not be activated
> or used.

I agree, CN should (must) not be used if it has not been negotiated.
The only reason why your previous proposal would be possible with CN
is that it has a static payload type.  All new codec assignments have
dynamic payload types, so receiving one of those encodings when its
use has not been negotiated will not work.
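
To illustrate the difference (a Python sketch with hypothetical names;
the static table is a small excerpt, 13 is the static number CN at 8 kHz
ended up with in the published format, and 96-127 is the dynamic range
in the audio/video profile):

```python
STATIC_TYPES = {0: "PCMU", 8: "PCMA", 13: "CN"}  # excerpt, not the full table
DYNAMIC_RANGE = range(96, 128)                   # bound only by signaling


def identify_encoding(pt, negotiated):
    """Return the encoding name for a received payload type, or None.

    negotiated: dict mapping payload type number -> encoding name, as
    agreed in signaling for this session.
    """
    if pt in negotiated:
        return negotiated[pt]
    if pt in DYNAMIC_RANGE:
        # A dynamic number means nothing without a negotiated binding.
        return None
    # A static number can still be recognized, which is the only reason
    # un-negotiated CN could be decoded at all.
    return STATIC_TYPES.get(pt)
```

Being able to recognize payload type 13 does not make it acceptable to
send CN without negotiation; the sketch only shows why the question can
arise for CN and not for encodings with dynamic payload types.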

>  VAD should only be allowed when negotiated as supported and the
> implementation of an IETF - CN (silence indication method) should comply with
> a clearly identifiable transition of state as close to the actual state
> transition as possible while communicating all the relevant information to
> make Comfort Noise Generation (CNG) possible.

Implementation agreements may make this recommendation.  The RTP
protocol specification will not require it.

							-- Steve

