[PATCH RFCv3 00/51] optimizations

On 2014-11-05 00:25, Peter Meerwald wrote:
>
> preliminary benchmarking on Intel i5-2400S, 64-bit, Linux 3.13:
>
> running 'paplay --latency-msec=10 stereo_48KHz.wav', output on internal
> soundcard (Intel HDA), measuring the maximum CPU% in top for the pulseaudio
> and paplay processes
>
> code             flags          PA       paplay
> master 6d1fd4d1  -O2            < 14.0%  < 3.7%
> master 6d1fd4d1  -O2 -DNDEBUG   < 13.3%  < 3.3%
> proposed v3      -O2             < 8.3%  < 1.3%
> proposed v3      -O2 -DNDEBUG    < 7.6%  < 1.3%

Cool stuff!

Seems we can get the low-latency story somewhat better, and perhaps even 
more so if we can communicate directly with the I/O threads through 
srbchannels, but that's a different project...

But to focus on the 6.0 release: which of these patches have a large enough 
performance impact relative to the risk of regression to try to squeeze them 
into 6.0 rc1, and which ones can just be deferred to the -next branch?

Also, is v4 going to be 90 patches...? ;-)

>
> ARMv7 benchmarking soonish
>
>
> this patch series aims to save memory allocations and some system calls
> related to PA's client/server protocol implementation
>
> v3 adds inlining and saves a snd_pcm_avail(), v2 code is largely unchanged
> (minibuffers are increased and better used)
>
>
> patches 1 to 5 ('tagstruct:') introduce a new tagstruct type _APPENDED
> which can hold tagstruct data up to a certain size; tagstructs are now
> kept in a specific free-list -- this typically replaces two malloc()/free()s
> with one flist push()/pop()
>
> patches 6 to 9 ('packet:') make packets fixed-size (typically); packets are
> kept in a specific free-list -- this replaces one malloc()/free() with one
> flist push()/pop()
>
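
To make the free-list idea above concrete, here is a minimal sketch of
recycling fixed-size packets instead of calling malloc()/free() each time.
It is not the PulseAudio code: the names (packet, packet_new, packet_free,
POOL_CAPACITY, PACKET_SIZE) are made up for illustration, and the list is a
plain single-threaded stack, whereas a real flist would have to be safe
across threads.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_CAPACITY 8   /* hypothetical free-list capacity */
#define PACKET_SIZE 64    /* hypothetical fixed packet size */

typedef struct packet {
    unsigned char data[PACKET_SIZE];
    size_t length;
} packet;

/* Simplified free-list: a bounded stack of recycled packets (not thread-safe). */
static packet *pool[POOL_CAPACITY];
static size_t pool_count = 0;
static unsigned long malloc_calls = 0;   /* counter for illustration only */

/* Pop a recycled packet if one is available, otherwise fall back to malloc(). */
static packet *packet_new(void) {
    packet *p;

    if (pool_count > 0)
        p = pool[--pool_count];
    else {
        p = malloc(sizeof(packet));
        assert(p);
        malloc_calls++;
    }

    p->length = 0;
    return p;
}

/* Push the packet back onto the free-list; free() it only if the list is full. */
static void packet_free(packet *p) {
    assert(p);

    if (pool_count < POOL_CAPACITY)
        pool[pool_count++] = p;
    else
        free(p);
}
```

After the first allocation, a packet_new()/packet_free() pair in steady state
costs one pop and one push and never reaches the allocator.
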
> patches 10 to 14 ('pstream:') allow sending tagstructs directly to a pstream
> without encapsulation in a packet -- this saves one flist push()/pop()
>
> patches 15 and 16 ('pstream') often save a read() call by reading more than
> just the descriptor (up to 40 bytes, e.g. descriptor (20 bytes) + shm
> info (16 bytes)); the idea is similar to b4342845d, "Optimize write
> of smaller packages", but for read -- this trades some extra memcpy() for
> a read(); in v3 the buffer size has been increased to 256 bytes
>
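
The read-side trade-off described above (one larger read() plus a memcpy()
instead of two small read()s) could look roughly like the sketch below. The
reader struct and reader_get() are hypothetical names, not the pstream code;
only the 256-byte buffer size and the 20 + 16 byte example are taken from
the description. EINTR and non-blocking handling are left out on purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

#define MINIBUF_SIZE 256

typedef struct reader {
    unsigned char buf[MINIBUF_SIZE];
    size_t start, end;      /* valid bytes are buf[start..end) */
    unsigned read_calls;    /* counter for illustration only */
} reader;

/* Copy exactly n bytes (n <= MINIBUF_SIZE) into dst.
 * read() is only called when the buffer holds fewer than n bytes, and it
 * always asks for as much as fits, so data following the requested bytes
 * is picked up by the same syscall.
 * Returns 0 on success, -1 on EOF, error or oversized request. */
static int reader_get(reader *r, int fd, void *dst, size_t n) {
    assert(r && dst);

    if (n > MINIBUF_SIZE)
        return -1;

    while (r->end - r->start < n) {
        ssize_t got;

        /* Move leftover bytes to the front to make room at the end. */
        if (r->start > 0) {
            memmove(r->buf, r->buf + r->start, r->end - r->start);
            r->end -= r->start;
            r->start = 0;
        }

        got = read(fd, r->buf + r->end, MINIBUF_SIZE - r->end);
        r->read_calls++;
        if (got <= 0)
            return -1;
        r->end += (size_t) got;
    }

    memcpy(dst, r->buf + r->start, n);
    r->start += n;
    return 0;
}
```

If a 20-byte descriptor and 16 bytes of shm info are already waiting on the
fd, fetching the descriptor pulls in both, and the second reader_get() is
served from the buffer without another syscall.
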
> patch 17 ('iochannel') fixes a strange behaviour in iochannel/mainloop that
> deleted the input_event with every read which caused a rebuild of the pollfds
> for every read()!
>
> patches 18 to 20 ('queue', 'pstream') aim to combine two (v3: or more) write items
> into one minibuffer by peeking ahead in the send queue
>
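
A rough sketch of the peek-ahead idea above: copy as many leading send-queue
items as fit into one minibuffer so they can go out in a single write(). The
item struct and fill_minibuffer() are hypothetical; the real send queue
holds richer item types, and the 256-byte size here is only an assumption
for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MINIBUF_SIZE 256   /* assumed size, for illustration */

typedef struct item {
    const void *data;
    size_t length;
} item;

/* Copy leading queue items into minibuf for as long as the next one fits.
 * Returns the number of items consumed and stores the byte count in *filled.
 * A return value of 0 means the first item is too large for the minibuffer
 * and should be written directly instead. */
static size_t fill_minibuffer(const item *queue, size_t n_items,
                              unsigned char *minibuf, size_t *filled) {
    size_t used = 0, i;

    assert(queue && minibuf && filled);

    for (i = 0; i < n_items; i++) {
        if (queue[i].length > MINIBUF_SIZE - used)
            break;   /* next item does not fit, stop combining */

        memcpy(minibuf + used, queue[i].data, queue[i].length);
        used += queue[i].length;
    }

    *filled = used;
    return i;
}
```

The caller would then issue one write() of *filled bytes instead of one
write() per item.
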
> patch 21 stops calling mainloop's defer_enable() after queuing a SHMRELEASE; this
> increases the chance that items can be combined (i.e. by patch 20)
>
> patch 22 inlines pa_run_once() as this function came out high in profiling
>
> patches 23 and 24 ('rtpoll') are cleanup
>
> patch 25 ('mainloop') only clears the wakeup pipe when poll() indicates that
> the pipe is readable; if the only ready file descriptor is the wakeup pipe,
> searching io_events can be avoided
>
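
The wakeup-pipe shortcut above can be sketched as follows. This is an
illustration only, under the assumption that the wakeup pipe sits in slot 0
of the pollfd array; dispatch(), drain_wakeup_pipe() and the two counters
are hypothetical names, not mainloop.c code.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

static unsigned drain_calls = 0;   /* counters for illustration only */
static unsigned scan_calls = 0;

/* Empty the wakeup pipe. Only called when poll() reported it readable,
 * so the read() cannot block. */
static void drain_wakeup_pipe(int fd) {
    char scratch[16];

    drain_calls++;
    if (read(fd, scratch, sizeof(scratch)) < 0)
        return;   /* nothing sensible to do in this sketch */
}

/* pfds[0] is assumed to be the wakeup pipe; 'ready' is poll()'s return value.
 * Returns the number of other descriptors that need dispatching. */
static int dispatch(struct pollfd *pfds, nfds_t n, int ready) {
    int dispatched = 0;
    nfds_t i;

    assert(pfds && n >= 1);

    if (ready <= 0)
        return 0;

    if (pfds[0].revents & POLLIN) {
        drain_wakeup_pipe(pfds[0].fd);
        if (--ready == 0)
            return 0;   /* only the wakeup pipe fired: skip the scan */
    }

    scan_calls++;
    for (i = 1; i < n && dispatched < ready; i++)
        if (pfds[i].revents)
            dispatched++;   /* a real loop would invoke the I/O callback here */

    return dispatched;
}
```
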
> patches 26 and 27 ('flist') remove the volatile annotation and make flist_elem
> attributes non-atomic -- needed?
>
> v3 material:
>
> patches 28 to 31 annotate some branches with LIKELY and save two rtclock() calls
>
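
For readers unfamiliar with branch annotation, a minimal sketch of the
technique: on GCC-compatible compilers such macros are usually built on
__builtin_expect(). The LIKELY/UNLIKELY macros and the process() loop below
are defined locally for illustration and are assumptions, not the macros or
code paths touched by the patches.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__)
#define LIKELY(x)   (__builtin_expect(!!(x), 1))
#define UNLIKELY(x) (__builtin_expect(!!(x), 0))
#else
/* Fallback: the hint is dropped, behaviour is unchanged. */
#define LIKELY(x)   (x)
#define UNLIKELY(x) (x)
#endif

/* Hypothetical hot loop: negative entries stand for a rare error condition
 * (say, an underrun), everything else is the common case. */
static size_t process(const int *events, size_t n, size_t *errors) {
    size_t handled = 0, i;

    assert(events && errors);
    *errors = 0;

    for (i = 0; i < n; i++) {
        if (UNLIKELY(events[i] < 0)) {
            (*errors)++;   /* rare path, hinted as cold */
            continue;
        }
        handled++;         /* common path */
    }

    return handled;
}
```

The annotation is only a hint to the compiler's code layout; it never
changes the result, so a wrong hint costs speed, not correctness.
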
> patch 32 ('resampler') is cleanup
>
> patch 33 ('build-sys') adds --disable-statistics to configure
>
> patches 34 to 37 make several hot functions inlineable; API functions in pulse/
> do a lot of error checking which is unnecessary in the core; worse, checking does NOT
> go away with NDEBUG
>
> patch 38 ('resampler') precomputes the maximum block size in frames
>
> patches 39 to 42 ('mix') make functions inlineable and clean up
>
> patches 43 and 44 make volume-related functions inlineable
>
> patches 45 and 46 ('iochannel', 'asyncmsgq') drop dead code
>
> patch 47 fixes sink_input_pop_cb() to return the entire memchunk (as per specification)
>
> patch 48 saves one call to snd_pcm_avail() by computing left_to_play -- this patch
> has probably THE BIGGEST impact
>
> patches 49 to 51 are cleanup and refactoring
>
>
> summary:
>
> with these patches typical playback (i.e. after setup) runs without any malloc()/free()
> thanks to the use of free-lists; the number of memory management operations is reduced
>
> many hot functions have been made inlineable; redundant checks can be dropped by
> compiling with NDEBUG=1
>
> read() and write() syscalls are saved by combining data into minibuffers
>
> one call to snd_pcm_avail() is saved per mmap_write()
>
>
> Peter Meerwald (51):
>    tagstruct: Distinguish pa_tagstruct_new() use cases
>    tagstruct: Replace dynamic flag with type
>    tagstruct: Get rid of pa_tagstruct_free_data()
>    tagstruct: Add type _APPENDED
>    tagstruct: Use flist to potentially save calls to malloc()/free()
>    packet: Hide internals of pa_packet, introduce pa_packet_data()
>    packet: Make pa_packet_new() create fixed-size packets
>    packet: Introduce pa_packet_new_data() to copy data into a newly
>      created packet
>    packet: Use flist to save calls to malloc()/free()
>    pstream: Unionize item_info
>    pstream: Add pa_pstream_send_tagstruct()
>    pstream: #define PA_PSTREAM_SHM_SIZE
>    pstream: Duplicate assignment, write.data is always NULL
>    pstream: Only reset memchunk if it has been used
>    pstream: Split up do_read()
>    pstream: Use small minibuffer to combine several read()s if possible
>    iochannel: Fix channel enable
>    queue: Add pa_queue_peek() function
>    pstream: Add helper functions reset_descriptor(), shm_descriptor()
>    pstream: Peek into next item on send queue to see if it can be put
>      into minibuffer together with current item
>    pstream: Don't call defer_enable() on SHMRELEASE
>    once: Inline functions
>    rtpoll: Fix condition for DEBUG_TIMING output
>    rtpoll: Drop extra wait_op argument to pa_rtpoll_run()
>    mainloop: Clear wakeup pipe only when necessary
>    flist: Don't use atomic operations to manipulate ptr, next
>    flist: Don't make flist volatile
>    rtpoll: Annotate branches with LIKELY
>    mainloop: Annotate branches with LIKELY
>    alsa: Make rtpoll_run() runtime measurement compile-time code, default
>      off
>    alsa: Annotate branches in ALSA sink/source thread_func() with LIKELY
>    resampler: Drop pointless remix variable
>    build-sys: Add --disable-statistics
>    sample: Make pa_sample_size_table public
>    sample: Make pa_channels_valid() inlineable
>    sample-util: Add inlineable functions
>    core: Make use of use inlineable macros
>    resampler: Precompute maximum block size in frames
>    mix: Make use of pa_cvolume_is_norm/muted() macros
>    mix: Avoid redundant cvolume checks
>    mix: pa_mix() is always called with more than one steam
>    mix: Length over all chunk has already been computed by the caller
>    core: Add volume-util.h
>    core: Make use of volume macros
>    iochannel: Remove unnecessary zero-initialization
>    asyncmsgq: Drop weird assert
>    protocol-native: Make sink_input_pop_cb() return entire chunk
>    alsa-sink: Assume left_to_play can be computed, save one call to
>      snd_pcm_avail()
>    alsa: Refactor computation of sleep usec
>    alsa: Precompute max_frames
>    alsa: Remove redundant sample_spec parameter to reset_watermark()
>      function
>
>   configure.ac                                 |  13 +-
>   src/modules/alsa/alsa-mixer.c                |   4 +-
>   src/modules/alsa/alsa-sink.c                 | 187 +++----
>   src/modules/alsa/alsa-source.c               | 135 ++---
>   src/modules/alsa/alsa-util.c                 |  32 +-
>   src/modules/bluetooth/module-bluez4-device.c |   2 +-
>   src/modules/bluetooth/module-bluez5-device.c |   2 +-
>   src/modules/echo-cancel/module-echo-cancel.c |  42 +-
>   src/modules/echo-cancel/webrtc.cc            |  10 +-
>   src/modules/module-card-restore.c            |   4 +-
>   src/modules/module-combine-sink.c            |   2 +-
>   src/modules/module-device-manager.c          |  12 +-
>   src/modules/module-device-restore.c          |  16 +-
>   src/modules/module-esound-sink.c             |   2 +-
>   src/modules/module-null-sink.c               |   2 +-
>   src/modules/module-null-source.c             |   2 +-
>   src/modules/module-pipe-sink.c               |   2 +-
>   src/modules/module-pipe-source.c             |   2 +-
>   src/modules/module-sine-source.c             |   2 +-
>   src/modules/module-stream-restore.c          |  12 +-
>   src/modules/module-tunnel.c                  |  54 +-
>   src/modules/oss/module-oss.c                 |   2 +-
>   src/modules/raop/module-raop-sink.c          |   2 +-
>   src/pulse/context.c                          |  29 +-
>   src/pulse/ext-device-manager.c               |  14 +-
>   src/pulse/ext-device-restore.c               |  10 +-
>   src/pulse/ext-stream-restore.c               |  10 +-
>   src/pulse/introspect.c                       |  82 +--
>   src/pulse/mainloop.c                         |  70 +--
>   src/pulse/sample.c                           |  18 +-
>   src/pulse/sample.h                           |   4 +-
>   src/pulse/scache.c                           |  10 +-
>   src/pulse/stream.c                           |  43 +-
>   src/pulse/subscribe.c                        |   2 +-
>   src/pulsecore/asyncmsgq.c                    |   2 -
>   src/pulsecore/flist.c                        |  14 +-
>   src/pulsecore/flist.h                        |   2 +-
>   src/pulsecore/iochannel.c                    |  37 +-
>   src/pulsecore/memblock.c                     |  15 +
>   src/pulsecore/memblockq.c                    |   5 +-
>   src/pulsecore/mix.c                          |  42 +-
>   src/pulsecore/mix.h                          |   5 +
>   src/pulsecore/once.c                         |  18 +-
>   src/pulsecore/once.h                         |  25 +-
>   src/pulsecore/packet.c                       |  55 +-
>   src/pulsecore/packet.h                       |  20 +-
>   src/pulsecore/pdispatch.c                    |   9 +-
>   src/pulsecore/protocol-native.c              | 162 +++---
>   src/pulsecore/pstream-util.c                 |  33 +-
>   src/pulsecore/pstream-util.h                 |   2 -
>   src/pulsecore/pstream.c                      | 734 +++++++++++++++++----------
>   src/pulsecore/pstream.h                      |   2 +
>   src/pulsecore/queue.c                        |  11 +
>   src/pulsecore/queue.h                        |   3 +
>   src/pulsecore/resampler.c                    |  45 +-
>   src/pulsecore/resampler.h                    |   3 +-
>   src/pulsecore/rtpoll.c                       |  46 +-
>   src/pulsecore/rtpoll.h                       |   5 +-
>   src/pulsecore/sample-util.c                  |   8 +-
>   src/pulsecore/sample-util.h                  |  53 ++
>   src/pulsecore/sink-input.c                   |  13 +-
>   src/pulsecore/sink.c                         |  23 +-
>   src/pulsecore/source-output.c                |   9 +-
>   src/pulsecore/source.c                       |  13 +-
>   src/pulsecore/tagstruct.c                    |  67 ++-
>   src/pulsecore/tagstruct.h                    |   4 +-
>   src/pulsecore/volume-util.h                  |  92 ++++
>   src/tests/rtpoll-test.c                      |   4 +-
>   src/tests/srbchannel-test.c                  |  21 +-
>   69 files changed, 1455 insertions(+), 982 deletions(-)
>   create mode 100644 src/pulsecore/volume-util.h
>

-- 
David Henningsson, Canonical Ltd.
https://launchpad.net/~diwic

