[PATCH] add xrdp sink

patrakov@xxxxxxxxx (Alexander E. Patrakov) · Sat, 31 May 2014 02:05:10 +0600

30.05.2014 18:43, Tanu Kaskinen wrote:
>> 2 - esound sink and source as Alexander suggests(source not complete).
>> 3 - RTP over unix domain socket(module-rtp-send not complete as
>> Laurentiu Nicola says).
>>
>> I'm ok with 2 or 3, but I want to make sure it's the best decision
>> long term.  I think there will be a lot of users using PA this way.
>
> I don't know the details of any of the three protocols (custom xrdp,
> esound or rtp), so I don't have any opinions like "you really should use
> X" or "you really shouldn't use Y".

OK, here are some bad words about the protocols.

The main reason why I am against the current custom protocol is:

Any custom protocol will likely evolve, and, with the current inability 
to build out-of-tree modules, it means that future versions of both xrdp 
and PulseAudio will have to deal somehow with any resulting version 
mismatch. The current protocol doesn't provide any versioning, though, 
and that's a problem _if_ the custom protocol (as opposed to a suitable 
but set-in-stone standard protocol) is accepted as the way forward.

The second reason was (see below for factors that amend it):

The current custom protocol is essentially a copy of the esound protocol 
with minor variations. All criticisms that apply to module-esound-sink 
will also apply to the current module-xrdp-sink. Conversely, if any 
current criticisms of module-esound-sink actually don't apply in this 
use case to module-xrdp-sink, then they are irrelevant for 
module-esound-sink, too.

...which Tanu has worded in a more positive way:

> If the esound protocol "deficiencies" (that I'm not familiar with) don't
> really matter in case of XRDP, and there's not a lot of mandatory extra
> cruft in the protocol that isn't necessary with XRDP, then reusing the
> esound protocol sounds like a good idea.

Note that I don't propose to implement the whole esound protocol - just 
enough to interoperate with PulseAudio and maybe the most common clients.

The claimed deficiencies of the esound sink are high latency and even 
worse latency estimation, i.e. a/v sync issues. However, there is 
something strange (possible bug, found by code inspection, I have not 
tested anything yet) in module-esound-sink.c. It creates a socket, 
relies on the socket buffer becoming full for throttling the sink 
rendering process, but never sets the SO_SNDBUF option, either directly 
or through the helper from pulsecore/socket-util.c. And the default is 
more than 256 KB! So no wonder that the socket accumulates a lot of 
sound data (and thus latency) before throttling.
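
To put the 256 KB figure in perspective, here is a minimal sketch in Python (not PulseAudio code). The sample format (44.1 kHz, stereo, 16-bit) and the 16 KB request are assumptions picked for illustration, and the kernel is free to round or double the requested SO_SNDBUF value.

```python
import socket

def buffered_seconds(nbytes, rate=44100, channels=2, bytes_per_sample=2):
    """Playback time represented by nbytes of PCM data (assumed format)."""
    return nbytes / (rate * channels * bytes_per_sample)

# How much audio fits into a 256 KB socket buffer?
default_latency = buffered_seconds(256 * 1024)

# Ask the kernel for a much smaller send buffer on a local socket pair.
writer, reader = socket.socketpair()
writer.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 16 * 1024)

# Read back what the kernel actually granted; it may differ from the request.
effective = writer.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
bounded_latency = buffered_seconds(effective)

print(f"256 KB buffer: {default_latency:.2f} s of audio")
print(f"{effective} byte buffer: {bounded_latency:.3f} s of audio")

writer.close()
reader.close()
```

With the assumed format, 256 KB is roughly a second and a half of audio, which is consistent with the latency complaints.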

As for the bad latency estimation, I think this applies only to 
networked connections. Indeed, the esound protocol has a request for 
querying the server-internal latency, and PulseAudio issues it. The 
total latency consists of the amount of the sound data buffered in the 
esound server, the network, and locally in the client. The only unknown 
here is the network: the server-internal latency can be queried, and the 
amount of locally-buffered data is known via SIOCOUTQ. But for local 
connections, the amount of data buffered by the network is zero, so this 
criticism also seems unfounded in the XRDP case.
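
The sum described above can be written down as a small sketch. The function names, the sample format and the example numbers are hypothetical; in a real sink the server figure would come from the esound latency request and the queued byte count from SIOCOUTQ.

```python
def bytes_to_usec(nbytes, rate=44100, channels=2, bytes_per_sample=2):
    """Convert a PCM byte count to playback time in microseconds."""
    return nbytes * 1_000_000 // (rate * channels * bytes_per_sample)

def total_latency_usec(server_usec, local_queued_bytes, network_usec=0):
    """Server-internal latency + network + data still queued locally.

    network_usec is the only term that cannot be measured; for a local
    (unix-domain) connection it is taken to be zero.
    """
    return server_usec + network_usec + bytes_to_usec(local_queued_bytes)

# Example: the server reports 20 ms, and 17640 bytes are still queued locally.
estimate = total_latency_usec(server_usec=20_000, local_queued_bytes=17_640)
print(estimate, "usec")
```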

Now let's compare the protocols.

As Tanu has already mentioned, there is an important difference between 
the custom protocol and the esound protocol. Namely, the clock source. 
module-esound-sink uses the remote clock source: it writes to the socket 
as quickly as possible until its buffer fills up, and unblocks when 
esound (or xrdp) reads some data out. module-xrdp-sink uses the local 
clock to move samples to the socket (sleep, write, sleep, write, and so 
on), and assumes that xrdp will read the samples out quickly enough so 
that the writes never block.
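
The "remote clock" behaviour can be illustrated with a socket pair. This is only a sketch in Python, not code from either module: the chunk size is arbitrary, and non-blocking sockets stand in for the blocking write so that the example terminates.

```python
import socket

CHUNK = b"\0" * 4096  # one block of silence; the size is arbitrary

def fill_until_blocked(sock):
    """Write chunks until the socket buffer is full; return bytes accepted."""
    sent = 0
    while True:
        try:
            sent += sock.send(CHUNK)
        except BlockingIOError:
            return sent

def drain(sock):
    """Read everything currently queued; return bytes read."""
    got = 0
    while True:
        try:
            data = sock.recv(65536)
        except BlockingIOError:
            return got
        if not data:
            return got
        got += len(data)

sink, reader = socket.socketpair()
sink.setblocking(False)
reader.setblocking(False)

queued = fill_until_blocked(sink)    # the sink stalls once the buffer is full
consumed = drain(reader)             # the reader "plays" the queued data
resumed = fill_until_blocked(sink)   # only now can the sink write again

print(queued, consumed, resumed)
sink.close()
reader.close()
```

The writer makes progress only when the reader consumes data, so the reader's pace (ultimately the remote sound card) is what clocks the stream.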

I do not know what provides this guarantee. For it to be true, there 
should be "something" somewhere that measures the rate at which the 
sound samples are arriving, and compensates for the clock drift between 
the local system and the remote sound card. I.e. let's suppose that the 
remote system thinks that the fragments being sent out are 29.99 ms 
apart, and not 30 ms as the local system thinks. The difference will 
accumulate, and, unless some samples are dropped or the stream is 
resampled by a factor of 30/29.99, there will be something like a 
blocked socket or overfilled buffer. The same "need to have an adaptive 
resampler" problem applies to RTP or to any other protocol that relies on 
the local clock.
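
The arithmetic behind the 29.99 ms vs 30 ms example, as a small sketch. The variable names are mine, and which side ends up with the surplus depends on the direction of the drift.

```python
local_ms = 30.0    # fragment period according to the local clock
remote_ms = 29.99  # the same period as measured by the remote clock

# Factor by which the stream would have to be resampled to compensate.
resample_ratio = local_ms / remote_ms

# Mismatch that accumulates with every fragment sent.
drift_ms_per_fragment = local_ms - remote_ms

# Fragments (and local seconds) until the mismatch equals one whole fragment.
fragments_until_full = local_ms / drift_ms_per_fragment
seconds_until_full = fragments_until_full * local_ms / 1000.0

print(f"resample ratio: {resample_ratio:.6f}")
print(f"one fragment of drift after about {fragments_until_full:.0f} "
      f"fragments, i.e. about {seconds_until_full:.0f} s")
```

So even this small mismatch adds up to a whole fragment after roughly a minute and a half of playback.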

If the desired semantics are "remote soundcard clock is the master clock", 
then the esound protocol will be suitable. If "local clock is the master 
clock" is actually wanted, then any of the three protocols would work 
after a fashion (with the esound protocol, the local clock would then 
live inside the xrdp server).

Now let's turn to protocol elements.

The custom protocol has an explicit opcode for pausing the stream. This 
was one of the reasons that led to its creation. I don't know yet 
whether PulseAudio would suspend the esound-protocol stream, but if 
necessary, this could be added. The possible implementation alternatives 
are to either disconnect until it has something else to play (which 
PulseAudio certainly does not do), or to simply stop the data flow 
(which I have yet to test). In the second case, xrdp could detect the 
pause by observing that it can read nothing out of the socket for a 
sufficiently long time.
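
A minimal sketch of such timeout-based pause detection on the reading side, in Python rather than xrdp's C. The function name and the 0.2 s threshold are assumptions; a real implementation would also have to tell a pause apart from a closed connection, which shows up as a readable socket returning zero bytes.

```python
import select
import socket

def stream_paused(sock, timeout_s):
    """Return True if no data becomes readable within timeout_s seconds."""
    readable, _, _ = select.select([sock], [], [], timeout_s)
    return not readable

writer, reader = socket.socketpair()

idle = stream_paused(reader, 0.2)        # nothing has been written yet
writer.send(b"\0" * 1024)                # the sink resumes sending audio
active = not stream_paused(reader, 0.2)  # data is waiting to be read

print("paused before write:", idle)
print("active after write:", active)
writer.close()
reader.close()
```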

The esound protocol has only three protocol elements that one would need 
to implement in xrdp: cookie-based authentication, latency request and 
audio stream playback. Cookie-based authentication is stupid but easy, 
so should not be a problem. Latency request is actually a good thing: it 
allows PulseAudio to report to the client how long it would take for the 
last-written sample to reach the playback device. Without this request 
(e.g. with the original custom protocol) or any other way to query or 
influence the latency, a/v synchronization is impossible. And audio 
stream playback means just taking audio samples from the socket when 
they are needed (but not earlier than that). So it should all be quite 
easily doable.

RTP is a unidirectional packet-based protocol. As such, it does not have 
any way to query the latency. It does not have any useful way to 
influence the latency at the receiver, either. As such, PulseAudio does 
not have any means for offering accurate latency reports, and a/v 
synchronization is impossible.

The RTP protocol elements that are not repeated between packets, besides 
the actual audio data, are the packet sequence number and the timestamp. 
In the xrdp case the sequence number is probably not interesting, as it 
just increases for each packet by one. It can be useful for packet loss 
detection, but packets are not lost in a unix-domain socket if they are 
read out of the socket in a timely manner. The timestamp starts from 0 
and is incremented by 1 for each audio sample. It is useful for 
reconstructing the exact duration of silence represented by not 
transmitting any packets. Its relation to the wall clock is conveyed in 
the SDP announced via the SAP port, by means of the NTP-style timestamp 
of the start of the transmission, with one-second precision. So this is 
not useful for determining when exactly, according to the wall clock, 
this packet should be played.
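
For illustration, this is how a receiver could reconstruct the silence between two packets from their timestamps. It is a sketch with hypothetical names, assuming the timestamp counts samples at the stream's sample rate and wraps at 32 bits as RTP timestamps do.

```python
RTP_TS_MODULUS = 1 << 32  # RTP timestamps are 32-bit and wrap around

def silence_seconds(prev_ts, prev_samples, next_ts, rate=44100):
    """Duration of the untransmitted gap between two packets.

    prev_ts:      timestamp of the previous packet
    prev_samples: number of samples that packet carried
    next_ts:      timestamp of the packet that follows
    """
    expected = (prev_ts + prev_samples) % RTP_TS_MODULUS
    gap_samples = (next_ts - expected) % RTP_TS_MODULUS
    return gap_samples / rate

# Back-to-back packets: no gap.
print(silence_seconds(0, 441, 441))
# The next packet starts one second's worth of samples later.
print(silence_seconds(0, 441, 441 + 44100))
```

This gives the length of the silence relative to the previous packet, but still nothing about when, by the wall clock, either packet should be played.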

Based on the above, I think that among the three protocols discussed, 
the esound protocol, if any (this is important!), is the way to go.

-- 
Alexander E. Patrakov