On 2014-11-05 00:25, Peter Meerwald wrote:
>
> preliminary benchmarking on Intel i5-2400S, 64-bit, Linux 3.13:
>
> running 'paplay --latency-msec=10 stereo_48KHz.wav', output on internal
> soundcard (Intel HDA), measuring the maximum CPU% in top for the pulseaudio
> and paplay
>
> code              flags          PA        paplay
> master 6d1fd4d1   -O2            < 14.0%   < 3.7%
> master 6d1fd4d1   -O2 -DNDEBUG   < 13.3%   < 3.3%
> proposed v3       -O2            < 8.3%    < 1.3%
> proposed v3       -O2 -DNDEBUG   < 7.6%    < 1.3%

Cool stuff! Seems we can get the low-latency story somewhat better, and
perhaps even more so if we can communicate directly with the I/O threads
through srbchannels, but that's a different project...

But to focus on the 6.0 release: which of these patches have a large enough
performance impact, relative to the risk of regression, to try to squeeze
into 6.0 rc1, and which ones can just be deferred to the -next branch?

Also, is v4 going to be 90 patches...? ;-)

>
> ARMv7 benchmarking soonish
>
>
> this patch series aims to save memory allocations and some system calls
> related to PA's client/server protocol implementation
>
> v3 adds inlining and saves a snd_pcm_avail(), v2 code is largely unchanged
> (minibuffers are increased and better used)
>
>
> patches 1 to 5 ('tagstruct:') introduce a new tagstruct type _APPENDED
> which can hold tagstruct data up to a certain size; tagstructs are now
> kept in a specific free-list -- this typically replaces two malloc()/free()s
> with one flist push()/pop()
>
> patches 6 to 9 ('packet:') make packets fixed-size (typically); packets are
> kept in a specific free-list -- this replaces one malloc()/free() with one
> flist push()/pop()
>
> patches 10 to 14 ('pstream:') allow sending tagstructs directly to a pstream
> without encapsulation in a packet -- this saves one flist push()/pop()
>
> patches 15 and 16 ('pstream') often save a read() call by reading more than
> just the descriptor (up to 40 bytes, e.g. descriptor (20 bytes) + shm
> info (16 bytes)); the idea is similar to b4342845d, "Optimize write
> of smaller packages", but for read -- this saves a read() at the cost of
> some extra memcpy(); in v3 the buffer size has been increased to 256 bytes
>
> patch 17 ('iochannel') fixes a strange behaviour in iochannel/mainloop that
> deleted the input_event with every read, which caused a rebuild of the pollfds
> for every read()!
>
> patches 18 to 20 ('queue', 'pstream') aim to combine two (v3: or more) write items
> into one minibuffer by peeking ahead in the send queue
>
> patch 21 stops calling mainloop's defer_enable() after queuing a SHMRELEASE; this
> increases the chance that items can be combined (i.e. by patch 20)
>
> patch 22 inlines pa_run_once() as this function came out high in profiling
>
> patches 23 and 24 ('rtpoll') are cleanup
>
> patch 25 ('mainloop') only clears the wakeup pipe when poll() indicates that
> the pipe is readable; if the only ready file descriptor is the wakeup pipe,
> searching io_events can be avoided
>
> patches 26 and 27 ('flist') remove the volatile annotation and make flist_elem attributes
> non-atomic -- needed?
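
For anyone skimming the thread, the free-list idea behind the 'tagstruct:'
and 'packet:' patches can be sketched roughly as below. This is only an
illustration under stated assumptions: every name (sketch_flist,
sketch_packet, the two size constants) is hypothetical, it is not the
pa_flist API, and unlike PA's flist it is single-threaded and makes no
attempt at lock-free access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical names throughout; this is NOT PulseAudio's pa_flist API. */
#define SKETCH_FLIST_MAX   8   /* assumed cap on cached objects */
#define SKETCH_PACKET_SIZE 64  /* assumed fixed payload size */

typedef struct sketch_packet {
    struct sketch_packet *next;  /* link, only meaningful while cached */
    unsigned char data[SKETCH_PACKET_SIZE];
} sketch_packet;

typedef struct sketch_flist {
    sketch_packet *head;  /* most recently returned object */
    size_t n_cached;      /* number of objects currently cached */
} sketch_flist;

/* Pop a cached object if there is one; fall back to malloc() otherwise. */
static sketch_packet *sketch_packet_new(sketch_flist *fl) {
    sketch_packet *p;

    assert(fl);

    if ((p = fl->head)) {
        fl->head = p->next;
        fl->n_cached--;
    } else if (!(p = malloc(sizeof(*p))))
        return NULL;

    p->next = NULL;
    return p;
}

/* Push the object back for reuse; free() it only when the list is full. */
static void sketch_packet_free(sketch_flist *fl, sketch_packet *p) {
    assert(fl);

    if (!p)
        return;

    if (fl->n_cached < SKETCH_FLIST_MAX) {
        p->next = fl->head;
        fl->head = p;
        fl->n_cached++;
    } else
        free(p);
}
```

Once the list has warmed up, a new/free pair in steady state touches only
the list head and never reaches malloc()/free(), which is the effect the
cover letter describes for typical playback. Fixed-size objects are what
make the reuse trivial; variable-size payloads would need a size check on
pop.
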
>
> v3 material:
>
> patches 28 to 31 annotate some branches and save two rtclock() calls
>
> patch 32 ('resampler') is cleanup
>
> patch 33 ('build-sys') adds --disable-statistics to configure
>
> patches 34 to 37 make several hot functions inlineable; API functions in pulse/
> do a lot of error checking that is unnecessary in the core; worse, the checking does NOT
> go away with NDEBUG
>
> patch 38 ('resampler') precomputes the maximum block size in frames
>
> patches 39 to 42 ('mix') make functions inlineable and clean up
>
> patches 43 and 44 make volume-related functions inlineable
>
> patches 45 and 46 ('iochannel', 'asyncmsgq') drop dead code
>
> patch 47 fixes sink_input_pop_cb() to return the entire memchunk (as per specification)
>
> patch 48 saves one call to snd_pcm_avail() by computing left_to_play -- this patch
> has probably THE BIGGEST impact
>
> patches 49 to 51 are cleanup and refactoring
>
>
> summary:
>
> with these patches typical playback (i.e. after setup) runs without any malloc()/free()
> thanks to the use of free-lists; the number of memory management operations is reduced
>
> many hot functions have been made inlineable, redundant checks can be dropped by
> compiling with NDEBUG=1
>
> read() and write() syscalls are saved by combining data into minibuffers
>
> one call to snd_pcm_avail() is saved per mmap_write()
>
>
> Peter Meerwald (51):
>   tagstruct: Distinguish pa_tagstruct_new() use cases
>   tagstruct: Replace dynamic flag with type
>   tagstruct: Get rid of pa_tagstruct_free_data()
>   tagstruct: Add type _APPENDED
>   tagstruct: Use flist to potentially save calls to malloc()/free()
>   packet: Hide internals of pa_packet, introduce pa_packet_data()
>   packet: Make pa_packet_new() create fixed-size packets
>   packet: Introduce pa_packet_new_data() to copy data into a newly
>     created packet
>   packet: Use flist to save calls to malloc()/free()
>   pstream: Unionize item_info
>   pstream: Add pa_pstream_send_tagstruct()
>   pstream: #define PA_PSTREAM_SHM_SIZE
>   pstream: Duplicate assignment, write.data is always NULL
>   pstream: Only reset memchunk if it has been used
>   pstream: Split up do_read()
>   pstream: Use small minibuffer to combine several read()s if possible
>   iochannel: Fix channel enable
>   queue: Add pa_queue_peek() function
>   pstream: Add helper functions reset_descriptor(), shm_descriptor()
>   pstream: Peek into next item on send queue to see if it can be put
>     into minibuffer together with current item
>   pstream: Don't call defer_enable() on SHMRELEASE
>   once: Inline functions
>   rtpoll: Fix condition for DEBUG_TIMING output
>   rtpoll: Drop extra wait_op argument to pa_rtpoll_run()
>   mainloop: Clear wakeup pipe only when necessary
>   flist: Don't use atomic operations to manipulate ptr, next
>   flist: Don't make flist volatile
>   rtpoll: Annotate branches with LIKELY
>   mainloop: Annotate branches with LIKELY
>   alsa: Make rtpoll_run() runtime measurement compile-time code, default
>     off
>   alsa: Annotate branches in ALSA sink/source thread_func() with LIKELY
>   resampler: Drop pointless remix variable
>   build-sys: Add --disable-statistics
>   sample: Make pa_sample_size_table public
>   sample: Make pa_channels_valid() inlineable
>   sample-util: Add inlineable functions
>   core: Make use of use inlineable macros
>   resampler: Precompute maximum block size in frames
>   mix: Make use of pa_cvolume_is_norm/muted() macros
>   mix: Avoid redundant cvolume checks
>   mix: pa_mix() is always called with more than one steam
>   mix: Length over all chunk has already been computed by the caller
>   core: Add volume-util.h
>   core: Make use of volume macros
>   iochannel: Remove unnecessary zero-initialization
>   asyncmsgq: Drop weird assert
>   protocol-native: Make sink_input_pop_cb() return entire chunk
>   alsa-sink: Assume left_to_play can be computed, save one call to
>     snd_pcm_avail()
>   alsa: Refactor computation of sleep usec
>   alsa: Precompute max_frames
>   alsa: Remove redundant sample_spec parameter to reset_watermark()
>     function
>
>  configure.ac | 13 +-
>  src/modules/alsa/alsa-mixer.c | 4 +-
>  src/modules/alsa/alsa-sink.c | 187 +++----
>  src/modules/alsa/alsa-source.c | 135 ++---
>  src/modules/alsa/alsa-util.c | 32 +-
>  src/modules/bluetooth/module-bluez4-device.c | 2 +-
>  src/modules/bluetooth/module-bluez5-device.c | 2 +-
>  src/modules/echo-cancel/module-echo-cancel.c | 42 +-
>  src/modules/echo-cancel/webrtc.cc | 10 +-
>  src/modules/module-card-restore.c | 4 +-
>  src/modules/module-combine-sink.c | 2 +-
>  src/modules/module-device-manager.c | 12 +-
>  src/modules/module-device-restore.c | 16 +-
>  src/modules/module-esound-sink.c | 2 +-
>  src/modules/module-null-sink.c | 2 +-
>  src/modules/module-null-source.c | 2 +-
>  src/modules/module-pipe-sink.c | 2 +-
>  src/modules/module-pipe-source.c | 2 +-
>  src/modules/module-sine-source.c | 2 +-
>  src/modules/module-stream-restore.c | 12 +-
>  src/modules/module-tunnel.c | 54 +-
>  src/modules/oss/module-oss.c | 2 +-
>  src/modules/raop/module-raop-sink.c | 2 +-
>  src/pulse/context.c | 29 +-
>  src/pulse/ext-device-manager.c | 14 +-
>  src/pulse/ext-device-restore.c | 10 +-
>  src/pulse/ext-stream-restore.c | 10 +-
>  src/pulse/introspect.c | 82 +--
>  src/pulse/mainloop.c | 70 +--
>  src/pulse/sample.c | 18 +-
>  src/pulse/sample.h | 4 +-
>  src/pulse/scache.c | 10 +-
>  src/pulse/stream.c | 43 +-
>  src/pulse/subscribe.c | 2 +-
>  src/pulsecore/asyncmsgq.c | 2 -
>  src/pulsecore/flist.c | 14 +-
>  src/pulsecore/flist.h | 2 +-
>  src/pulsecore/iochannel.c | 37 +-
>  src/pulsecore/memblock.c | 15 +
>  src/pulsecore/memblockq.c | 5 +-
>  src/pulsecore/mix.c | 42 +-
>  src/pulsecore/mix.h | 5 +
>  src/pulsecore/once.c | 18 +-
>  src/pulsecore/once.h | 25 +-
>  src/pulsecore/packet.c | 55 +-
>  src/pulsecore/packet.h | 20 +-
>  src/pulsecore/pdispatch.c | 9 +-
>  src/pulsecore/protocol-native.c | 162 +++---
>  src/pulsecore/pstream-util.c | 33 +-
>  src/pulsecore/pstream-util.h | 2 -
>  src/pulsecore/pstream.c | 734 +++++++++++++++++---------
>  src/pulsecore/pstream.h | 2 +
>  src/pulsecore/queue.c | 11 +
>  src/pulsecore/queue.h | 3 +
>  src/pulsecore/resampler.c | 45 +-
>  src/pulsecore/resampler.h | 3 +-
>  src/pulsecore/rtpoll.c | 46 +-
>  src/pulsecore/rtpoll.h | 5 +-
>  src/pulsecore/sample-util.c | 8 +-
>  src/pulsecore/sample-util.h | 53 ++
>  src/pulsecore/sink-input.c | 13 +-
>  src/pulsecore/sink.c | 23 +-
>  src/pulsecore/source-output.c | 9 +-
>  src/pulsecore/source.c | 13 +-
>  src/pulsecore/tagstruct.c | 67 ++-
>  src/pulsecore/tagstruct.h | 4 +-
>  src/pulsecore/volume-util.h | 92 ++++
>  src/tests/rtpoll-test.c | 4 +-
>  src/tests/srbchannel-test.c | 21 +-
>  69 files changed, 1455 insertions(+), 982 deletions(-)
>  create mode 100644 src/pulsecore/volume-util.h
>

-- 
David Henningsson, Canonical Ltd.
https://launchpad.net/~diwic