Less CPU overhead with a new protocol channel mechanism

david.henningsson@xxxxxxxxxxxxx (David Henningsson) · Fri, 04 Apr 2014 11:46:19 +0200

In low latency scenarios, PulseAudio uses up quite a bit of CPU. A while
ago I did some profiling and noticed that much of the time was spent
inside the ppoll syscall.

I couldn't let go of that problem, and I think optimising PulseAudio is
a good thing. So I went ahead and did some research, which ended up with
a lock-free ringbuffer in shared memory, combined with eventfds for
notification. That is, I added a new channel, which I called "srchannel",
in addition to the existing "iochannel" that usually uses UNIX sockets.

When running this solution with my low-latency test programs, I ended up
with the following results. The tests were run on my Core i3 laptop from
2010; I just ran top and tried to estimate an approximate average.

Reclatencytest: Recording test program. Asks for 10 ms of latency, ends
up with a new packet every 5 ms.

With iochannel:
Pulseaudio main thread - 2.6% CPU
Alsa-source thread - 1.7% CPU
Reclatencytest - 2.6% CPU
Total: 6.9% CPU

With srchannel:
Pulseaudio main thread - 2.3% CPU
Alsa-source thread - 1.7% CPU
Reclatencytest - 1.7% CPU
Total: 5.3% CPU

I.e., CPU usage was reduced by ~25%.

Palatencytest: Playback test program. Asks for 20 ms of latency (I tried
10 ms, but it was too unstable), ends up with a new packet every 8 ms.

With iochannel:
Pulseaudio main thread - 2.3% CPU
Alsa-sink thread - 2.2% CPU
Palatencytest - 1.3% CPU
Total: 5.8% CPU

With srchannel:
Pulseaudio main thread - 1.7% CPU
Alsa-sink thread - 2.2% CPU
Palatencytest - 1.0% CPU
Total: 4.9% CPU

I.e., CPU usage was reduced by ~15%.
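
For reference, the reductions quoted above are relative to the iochannel
totals. A quick way to redo the arithmetic from the "Total" lines as
stated (the helper name relative_reduction is mine, just for
illustration):

```python
def relative_reduction(before, after):
    """Fraction by which CPU usage dropped, relative to the 'before' value."""
    return (before - after) / before


# Totals taken from the "Total:" lines of the two tests.
recording = relative_reduction(6.9, 5.3)
playback = relative_reduction(5.8, 4.9)

print(f"recording: {recording:.1%}")  # recording: 23.2%
print(f"playback:  {playback:.1%}")   # playback:  15.5%
```

Since the inputs are eyeballed averages from top, the rounded figures are
about as precise as these numbers can be.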

Now, this is not all there is to it. In a future iteration of this
patch, I'd like to investigate whether the client can listen to more
than one ringbuffer, so that we can also set up a ringbuffer directly
between the I/O thread and the client. That should lead to even bigger
savings, and hopefully more stable audio as well (less jitter if we
don't pass through the non-RT main thread).

As for the implementation, I have a hacky/drafty patch which I'm happy
to show to anyone interested. Here's how the patch works:

Setup:

1) A client connects and SHM is enabled like usual. (In case SHM cannot
be enabled, we can't enable the new srchannel either.)
2) The server allocates a new memblock for the two ringbuffers (one in
each direction) and sends this to the client using the iochannel.
3) The server allocates two pa_fdsem objects (these are wrappers around
eventfd).
4) The server prepares an additional packet to the client, with a new
command PA_COMMAND_ENABLE_RINGBUFFER.
5) The server attaches the eventfds to the packet. Much like we do with
pa_creds today, file descriptors can be shared over a socket using the
mechanism described e.g. here [1].
6) The client receives the memblock and then the packet with the eventfds.
7) Both client and server now use the ringbuffer for all packets from
that moment on (except packets that carry additional pa_creds or
eventfds, which still have to be sent over the iochannel).

The shared memblock contains two ringbuffers. There are atomic variables
to control the lock-free ringbuffers, so the memblock has to be writable
by both sides. (As a quick hack, I just enabled both sides to write to
all memblocks.)

The two ringbuffer objects are encapsulated by an srchannel object,
which looks just like the iochannel to the outside world. Writing to an
srchannel first writes to the ringbuffer memory, then increases the
atomic "count" variable, and finally signals the pa_fdsem. That wakes up
the reader, which reads the ringbuffer's memory and decreases "count".

The pstream object has been modified to be able to read from both an
srchannel and an iochannel (in parallel), and writing can go to either
channel depending on circumstances.
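
As I read the rules above, the choice on the write side boils down to
something like the following. This is only my paraphrase in code; the
Packet class, its fields and choose_channel are hypothetical names, not
what the patch uses.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    payload: bytes
    has_creds: bool = False                  # packet carries pa_creds
    fds: list = field(default_factory=list)  # file descriptors to attach


def choose_channel(packet, ringbuffer_enabled):
    """Return the name of the channel the packet should be written to."""
    if not ringbuffer_enabled:
        return "iochannel"   # SHM or srchannel setup did not happen
    if packet.has_creds or packet.fds:
        return "iochannel"   # creds and fds can only travel over the socket
    return "srchannel"


print(choose_channel(Packet(b"audio"), ringbuffer_enabled=True))
print(choose_channel(Packet(b"auth", has_creds=True), ringbuffer_enabled=True))
```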

Okay, so this was a fun project and it seems promising. How do you feel
I should proceed with it? I expect a response from you, perhaps along
some of these lines:

 1) Woohoo, this is great! Just make your patches upstreamable and I
promise I'll review them right away!

 2) Woohoo, this is great! But I don't have any time to review them, so
just finish your patches up, and push them without review!

 3) This is interesting, but I don't have any time to review them, so
put your patches in a drawer for the foreseeable future.

 4) This is interesting, but some reduced CPU usage in low latency
scenarios isn't worth the extra code to maintain. (And the extra 64K per
client, for the ringbuffers.)

 5) I think the entire idea is bad, because...

-- 
David Henningsson, Canonical Ltd.
https://launchpad.net/~diwic

[1] http://keithp.com/blogs/fd-passing/