On Fri, 2014-04-04 at 11:46 +0200, David Henningsson wrote:
> In low-latency scenarios, PulseAudio uses up quite a bit of CPU. A
> while ago I did some profiling and noticed that much of the time was
> spent inside the ppoll syscall.
>
> I couldn't let go of that problem, and I think optimising PulseAudio
> is a good thing. So I went ahead and did some research, which ended up
> with a lock-free ringbuffer in shared memory, combined with eventfds
> for notification. I.e., I added a new channel I called "srchannel" in
> addition to the existing "iochannel" that usually uses UNIX pipes.

What does "sr" in "srchannel" mean?

> When running this solution with my low-latency test programs, I ended
> up with the following results. The tests were done on my Core i3
> laptop from 2010, and I just ran top and tried to take an approximate
> average.
>
> Reclatencytest: recording test program. Asks for 10 ms of latency,
> ends up with a new packet every 5 ms.
>
> With iochannel:
>   PulseAudio main thread - 2.6% CPU
>   Alsa-source thread     - 1.7% CPU
>   Reclatencytest         - 2.6% CPU
>   Total                  - 6.9% CPU
>
> With srchannel:
>   PulseAudio main thread - 2.3% CPU
>   Alsa-source thread     - 1.7% CPU
>   Reclatencytest         - 1.7% CPU
>   Total                  - 5.3% CPU
>
> I.e., CPU usage reduced by ~25%.
>
> Palatencytest: playback test program. Asks for 20 ms of latency (I
> tried 10 ms, but it was too unstable), ends up with a new packet
> every 8 ms.
>
> With iochannel:
>   PulseAudio main thread - 2.3% CPU
>   Alsa-sink thread       - 2.2% CPU
>   Palatencytest          - 1.3% CPU
>   Total                  - 5.8% CPU
>
> With srchannel:
>   PulseAudio main thread - 1.7% CPU
>   Alsa-sink thread       - 2.2% CPU
>   Palatencytest          - 1.0% CPU
>   Total                  - 4.9% CPU
>
> I.e., CPU usage reduced by ~15%.
>
> Now, this is not all there is to it. In a future generation of this
> patch, I'd like to investigate the possibility of having the client
> listen to more than one ringbuffer, so we can set up a ringbuffer
> directly between the I/O thread and the client, too.
> That should lead to even bigger savings, and hopefully more stable
> audio as well (less jitter if we don't pass through the non-RT main
> thread).
>
> As for the implementation, I have a hacky/drafty patch which I'm
> happy to show to anyone interested. Here's how the patch works:
>
> Setup:
>
> 1) A client connects and SHM is enabled as usual. (In case SHM cannot
> be enabled, we can't enable the new srchannel either.)
> 2) The server allocates a new memblock for the two ringbuffers (one
> in each direction) and sends this to the client using the iochannel.
> 3) The server allocates two pa_fdsem objects (these are wrappers
> around eventfd).
> 4) The server prepares an additional packet to the client, with a new
> command PA_COMMAND_ENABLE_RINGBUFFER.

Is this negotiation done in a way that allows us to cleanly drop
support for srchannel later if we want? Let's say that this is
implemented in protocol version 32 and for some reason removed in 33.
If the server uses protocol version 32 and the client uses version 33,
can the client refuse the srchannel feature and fall back to something
else? Or vice versa: if the server uses version 33 and the client uses
version 32, can the server refuse this feature and fall back to
something else? I'm just thinking that this might not be the final
solution for IPC; someone might implement a "kdbus channel", for
example.

> 5) The server attaches the eventfds to the packet. Much like we do
> with pa_creds today, file descriptors can be shared over a socket
> using the mechanism described e.g. here [1].
> 6) The client receives the memblock and then the packet with the
> eventfds.
> 7) Both client and server now enable the ringbuffer for all packets
> from that moment on (assuming they don't need to send additional
> pa_creds or eventfds, which have to be sent over the iochannel).
>
> The shared memblock contains two ringbuffers.
> There are atomic variables to control the lock-free ringbuffer, so
> they have to be writable by both sides. (As a quick hack, I just
> enabled both sides to write on all memblocks.)
>
> The two ringbuffer objects are encapsulated by an srchannel object,
> which looks just like the iochannel to the outside world. Writing to
> an srchannel first writes to the ringbuffer memory, increases the
> atomic "count" variable, and signals the pa_fdsem. On the reader
> side, that wakes up the reader's pa_fdsem; the ringbuffer's memory is
> read and "count" is decreased.

How does rewinding work with the ringbuffers? Is this a relevant
question at all, or is this just a channel for sending packets, like
the iochannel-backed pstream? (I'm not terribly familiar with how the
current iochannel and SHM transport work.)

> The pstream object has been modified to be able to read from both an
> srchannel and an iochannel (in parallel), and writing can go to
> either channel depending on circumstances.

Do you expect that control data would go via the iochannel and audio
data through the srchannel, if you implement streaming directly to the
I/O thread? Or would the srchannel be used for both, with the
iochannel only used when special tricks are needed (e.g. sending
creds)?

> Okay, so this was a fun project and it seems promising. How do you
> feel I should proceed with it? I expect a response from you, perhaps
> along some of these lines:
>
> 1) Woohoo, this is great! Just make your patches upstreamable and I
> promise I'll review them right away!
>
> 2) Woohoo, this is great! But I don't have any time to review them,
> so just finish your patches up and push them without review!
>
> 3) This is interesting, but I don't have any time to review them, so
> put your patches in a drawer for the foreseeable future.

If this doesn't cause significant commitments to supporting this
mechanism forever in the native protocol, I'm fine with finalizing the
patches and posting them to the mailing list.
If there aren't reviews in a while, just push them. (I won't
prioritize these patches over the older queued patches, so I won't get
around to reviewing these any time soon.)

-- 
Tanu