Re: [PATCH v7] Documentation: userspace-api: Document perf ring buffer mechanism

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Leo,

Thanks for your work on this!

On Sat, Aug 26, 2023 at 12:09 AM Leo Yan <leo.yan@xxxxxxxxxx> wrote:
>
> In the Linux perf tool, the ring buffer serves not only as a medium for
> transferring PMU event data but also as a vital mechanism for hardware
> tracing using technologies like Intel PT and Arm CoreSight, etc.
>
> Consequently, the ring buffer mechanism plays a crucial role by ensuring
> high throughput for data transfer between the kernel and user space
> while avoiding excessive overhead caused by the ring buffer itself.
>
> This commit documents the ring buffer mechanism in detail.  It explains
> the implementation of both the regular ring buffer and the AUX ring
> buffer.  Additionally, it covers how these ring buffers support various
> tracing modes and explains the synchronization with memory barriers.
>
> Signed-off-by: Leo Yan <leo.yan@xxxxxxxxxx>
> ---

[SNIP]
> +.. _writing_samples_into_buffer:
> +
> +2.3.2 Writing samples into buffer
> +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> +
> +Ring buffers are mapped in either read-write mode or read-only mode to
> +the user space.
> +
> +The ring buffer in the read-write mode is mapped with the property
> +``PROT_READ | PROT_WRITE``.  With the write permission, the perf tool
> +updates the ``data_tail`` to indicate the data start position.  Combining
> +with the head pointer ``data_head``, which works as the end position of
> +the current data, the perf tool can easily know where read out the data
> +from.
> +
> +Alternatively, in the read-only mode, only the kernel keeps to update
> +the ``data_head`` while the user space cannot access the ``data_tail`` due
> +to the mapping property ``PROT_READ``.
> +
> +In addition to the mapping modes, the write direction also matters. The
> +perf tool supports two write directions: forward and backward.  As a

s/perf tool/Linux kernel/

> +result, the matrix below shows the combinations between the mapping
> +modes and write directions.
> +
> +.. list-table::
> +   :widths: 1 1 1
> +   :header-rows: 1
> +
> +   * - Mapping mode
> +     - Forward
> +     - Backward
> +   * - read-write
> +     - Normal ring buffer
> +     - Not supported
> +   * - read-only
> +     - Not supported
> +     - Overwritable ring buffer

In the kernel's point of view, they are supported.  It's just perf tool
not using them.  So I think we should say "Not used".

> +
> +The normal ring buffer uses the read-write mapping with forward writing.
> +It starts to save data from the beginning of the ring buffer and wrap
> +around when overflow, which is used with the read-write mode in the
> +normal ring buffer.  When the consumer doesn't keep up with the
> +producer, it would lose some data, the kernel keeps how many records it
> +lost and generates the ``PERF_RECORD_LOST`` records in the next time
> +when it finds a space in the ring buffer.
> +
> +On the other hand, the overwritable ring buffer uses the backward
> +writing with the read-only mode.  It saves the data from the end of the
> +ring buffer and the ``data_head`` keeps the position of current data,
> +the perf always knows where it starts to read and until the end of the
> +ring buffer, thus it don't need the ``data_tail``.  In this mode, it
> +will not generate the ``PERF_RECORD_LOST`` records.
> +
> +When a sample is taken and saved into the ring buffer, the kernel
> +prepares sample fields based on the sample type; then it prepares the
> +info for writing ring buffer which is stored in the structure
> +``perf_output_handle``.  In the end, the kernel outputs the sample into
> +the ring buffer and updates the head pointer in the user page so the
> +perf tool can see the latest value.
> +
> +The structure ``perf_output_handle`` serves as a temporary context for
> +tracking the information related to the buffer.  The advantages of it is
> +that it enables concurrent writing to the buffer by different events.
> +For example, a software event and a hardware PMU event both are enabled
> +for profiling, two instances of ``perf_output_handle`` serve as separate
> +contexts for the software event and the hardware event respectively.
> +This allows each event to reserve its own memory space for populating
> +the record data.
> +

[SNIP]
> +
> +3. The mechanism of AUX ring buffer
> +===================================
> +
> +In this chapter, we will explain the implementation of the AUX ring
> +buffer.  In the first part it will discuss the connection between the
> +AUX ring buffer and the regular ring buffer, then the second part will
> +examine how the AUX ring buffer co-works with the regular ring buffer,
> +as well as the additional features introduced by the AUX ring buffer for
> +the sampling mechanism.
> +
> +3.1 The relationship between AUX and regular ring buffers
> +---------------------------------------------------------
> +
> +Generally, the AUX ring buffer is an auxiliary for the regular ring
> +buffer.  The regular ring buffer is primarily used to store the event
> +samples and every event format complies with the definition in the
> +union ``perf_event``; the AUX ring buffer is for recording the hardware
> +trace data and the trace data format is hardware IP dependent.
> +
> +The general use and advantage of the AUX ring buffer is that it is
> +written directly by hardware rather than by the kernel.  For example,
> +regular profile samples that write to the regular ring buffer cause an
> +interrupt.  Tracing execution requires a high number of samples and
> +using interrupts would be overwhelming for the regular ring buffer
> +mechanism.  Having an AUX buffer allows for a region of memory more
> +decoupled from the kernel and written to directly by hardware tracing.
> +
> +The AUX ring buffer reuses the same algorithm with the regular ring
> +buffer for the buffer management.  The control structure
> +``perf_event_mmap_page`` extends the new fields ``aux_head`` and ``aux_tail``
> +for the head and tail pointers of the AUX ring buffer.
> +
> +During the initialisation phase, besides the mmap()-ed regular ring
> +buffer, the perf tool invokes a second syscall in the

second mmap syscall

> +``auxtrace_mmap__mmap()`` function for the mmap of the AUX buffer;

with non-zero file offset.

> +``rb_alloc_aux()`` in the kernel allocates pages correspondingly, these
> +pages will be deferred to map into VMA when handling the page fault,
> +which is the same lazy mechanism with the regular ring buffer.
> +
> +AUX events and AUX trace data are two different things.  Let's see an
> +example::
> +
> +        perf record -a -e cycles -e cs_etm/@tmc_etr0/ -- sleep 2
> +
> +The above command enables two events: one is the event *cycles* from PMU
> +and another is the AUX event *cs_etm* from Arm CoreSight, both are saved
> +into the regular ring buffer while the CoreSight's AUX trace data is
> +stored in the AUX ring buffer.
> +
> +As a result, we can see the regular ring buffer and the AUX ring buffer
> +are allocated in pairs.  The perf in default mode allocates the regular
> +ring buffer and the AUX ring buffer per CPU-wise, which is the same as
> +the system wide mode, however, the default mode records samples only for
> +the profiled program, whereas the latter mode profiles for all programs
> +in the system.  For per-thread mode, the perf tool allocates only one
> +regular ring buffer and one AUX ring buffer for the whole session.  For
> +the per-CPU mode, the perf allocates two kinds of ring buffers for CPUs
> +specified by the option ``-C``.

Considering -a option, it can be "ring buffers for selected CPUs".

> +
> +The below figure demonstrates the buffers' layout in the system wide
> +mode; if there are any activities on one CPU, the AUX event samples and
> +the hardware trace data will be recorded into the dedicated buffers for
> +the CPU.
> +
> +::
> +
> +              T1                      T2                 T1
> +            +----+              +-----------+          +----+
> +    CPU0    |xxxx|              |xxxxxxxxxxx|          |xxxx|
> +            +----+--------------+-----------+----------+----+-------->
> +              |                       |                  |
> +              v                       v                  v
> +            +-----------------------------------------------------+
> +            |                  Ring buffer 0                      |
> +            +-----------------------------------------------------+
> +              |                       |                  |
> +              v                       v                  v
> +            +-----------------------------------------------------+
> +            |               AUX Ring buffer 0                     |
> +            +-----------------------------------------------------+
> +
> +                   T1
> +                 +-----+
> +    CPU1         |xxxxx|
> +            -----+-----+--------------------------------------------->
> +                    |
> +                    v
> +            +-----------------------------------------------------+
> +            |                  Ring buffer 1                      |
> +            +-----------------------------------------------------+
> +                    |
> +                    v
> +            +-----------------------------------------------------+
> +            |               AUX Ring buffer 1                     |
> +            +-----------------------------------------------------+
> +
> +                                        T1              T3
> +                                      +----+        +-------+
> +    CPU2                              |xxxx|        |xxxxxxx|
> +            --------------------------+----+--------+-------+-------->
> +                                        |               |
> +                                        v               v
> +            +-----------------------------------------------------+
> +            |                  Ring buffer 2                      |
> +            +-----------------------------------------------------+
> +                                        |               |
> +                                        v               v
> +            +-----------------------------------------------------+
> +            |               AUX Ring buffer 2                     |
> +            +-----------------------------------------------------+
> +
> +                              T1
> +                       +--------------+
> +    CPU3               |xxxxxxxxxxxxxx|
> +            -----------+--------------+------------------------------>
> +                              |
> +                              v
> +            +-----------------------------------------------------+
> +            |                  Ring buffer 3                      |
> +            +-----------------------------------------------------+
> +                              |
> +                              v
> +            +-----------------------------------------------------+
> +            |               AUX Ring buffer 3                     |
> +            +-----------------------------------------------------+
> +
> +            T1: Thread 1; T2: Thread 2; T3: Thread 3
> +            x: Thread is in running state
> +
> +                Figure 8. AUX ring buffer for system wide mode
> +
> +3.2 AUX events
> +--------------
> +
> +Similar to ``perf_output_begin()`` and ``perf_output_end()``'s working for the
> +regular ring buffer, ``perf_aux_output_begin()`` and ``perf_aux_output_end()``
> +serve for the AUX ring buffer for processing the hardware trace data.
> +The structure ``perf_output_handle`` is used as a context to track the AUX
> +buffer’s info.

I feel like this section contains kernel implementation details.
Please focus on the user-visible aspect.

> +
> +``perf_aux_output_begin()`` initializes the structure perf_output_handle.
> +It fetches the AUX head pointer and assigns to ``perf_output_handle::head``,
> +afterwards, the low level driver uses ``perf_output_handle::head`` as the
> +start address for storing hardware trace data.
> +
> +Once the hardware trace data is stored into the AUX ring buffer, the PMU
> +driver will stop hardware tracing by calling the ``pmu::stop()`` callback.
> +Similar to the regular ring buffer, the AUX ring buffer needs to apply
> +the memory synchronization mechanism as discussed in the section
> +:ref:`memory_synchronization`.  Since the AUX ring buffer is managed by the
> +PMU driver, the barrier (B), which is a writing barrier to ensure the trace
> +data is externally visible prior to updating the head pointer, is asked
> +to be implemented in the PMU driver.
> +
> +Then ``pmu::stop()`` can safely call the ``perf_aux_output_end()`` function to
> +finish two things:
> +
> +- It fills an event ``PERF_RECORD_AUX`` into the regular ring buffer, this
> +  event delivers the information of the start address and data size for a
> +  chunk of hardware trace data has been stored into the AUX ring buffer;
> +
> +- Since the hardware trace driver has stored new trace data into the AUX
> +  ring buffer, the argument *size* indicates how many bytes have been
> +  consumed by the hardware tracing, thus ``perf_aux_output_end()`` updates the
> +  header pointer ``perf_buffer::aux_head`` to reflect the latest buffer usage.
> +
> +At the end, the PMU driver will restart hardware tracing.  During this
> +temporary suspending period, it will lose hardware trace data, which
> +will introduce a discontinuity during decoding phase.
> +
> +The event ``PERF_RECORD_AUX`` presents an AUX event which is handled in the
> +kernel, but it lacks the information for saving the AUX trace data in
> +the perf file.  When the perf tool copies the trace data from AUX ring
> +buffer to the perf data file, it synthesizes a ``PERF_RECORD_AUXTRACE``

I think you should mention that AUXTRACE record is not a kernel ABI.
It's defined by perf tool to describe which portion of data in the AUX
ring buffer is saved.

Thanks,
Namhyung


> +event which includes the offest and size of the AUX trace data in the
> +perf file.  Afterwards, the perf tool reads out the AUX trace data from
> +the perf file based on the ``PERF_RECORD_AUXTRACE`` events, and the
> +``PERF_RECORD_AUX`` event is used to decode a chunk of data by correlating
> +with time order.
> +
> +3.3 Snapshot mode
> +-----------------
> +
> +Perf supports snapshot mode for AUX ring buffer, in this mode, users
> +only record AUX trace data at a specific time point which users are
> +interested in.  E.g. below gives an example of how to take snapshots
> +with 1 second interval with Arm CoreSight::
> +
> +  perf record -e cs_etm/@tmc_etr0/u -S -a program &
> +  PERFPID=$!
> +  while true; do
> +      kill -USR2 $PERFPID
> +      sleep 1
> +  done
> +
> +The main flow for snapshot mode is:
> +
> +- Before a snapshot is taken, the AUX ring buffer acts in free run mode.
> +  During free run mode the perf doesn't record any of the AUX events and
> +  trace data;
> +
> +- Once the perf tool receives the *USR2* signal, it triggers the callback
> +  function ``auxtrace_record::snapshot_start()`` to deactivate hardware
> +  tracing.  The kernel driver then populates the AUX ring buffer with the
> +  hardware trace data, and the event ``PERF_RECORD_AUX`` is stored in the
> +  regular ring buffer;
> +
> +- Then perf tool takes a snapshot, ``record__read_auxtrace_snapshot()``
> +  reads out the hardware trace data from the AUX ring buffer and saves it
> +  into perf data file;
> +
> +- After the snapshot is finished, ``auxtrace_record::snapshot_finish()``
> +  restarts the PMU event for AUX tracing.
> +
> +The perf only accesses the head pointer ``perf_event_mmap_page::aux_head``
> +in snapshot mode and doesn’t touch tail pointer ``aux_tail``, this is
> +because the AUX ring buffer can overflow in free run mode, the tail
> +pointer is useless in this case.  Alternatively, the callback
> +``auxtrace_record::find_snapshot()`` is introduced for making the decision
> +of whether the AUX ring buffer has been wrapped around or not, at the
> +end it fixes up the AUX buffer's head which are used to calculate the
> +trace data size.
> +
> +As we know, the buffers' deployment can be per-thread mode, per-CPU
> +mode, or system wide mode, and the snapshot can be applied to any of
> +these modes.  Below is an example of taking snapshot with system wide
> +mode.
> +
> +::
> +
> +                                         Snapshot is taken
> +                                                 |
> +                                                 v
> +                        +------------------------+
> +                        |  AUX Ring buffer 0     | <- aux_head
> +                        +------------------------+
> +                                                 v
> +                +--------------------------------+
> +                |          AUX Ring buffer 1     | <- aux_head
> +                +--------------------------------+
> +                                                 v
> +    +--------------------------------------------+
> +    |                      AUX Ring buffer 2     | <- aux_head
> +    +--------------------------------------------+
> +                                                 v
> +         +---------------------------------------+
> +         |                 AUX Ring buffer 3     | <- aux_head
> +         +---------------------------------------+
> +
> +                Figure 9. Snapshot with system wide mode
> --
> 2.34.1
>



[Index of Archives]     [Kernel Newbies]     [Security]     [Netfilter]     [Bugtraq]     [Linux FS]     [Yosemite Forum]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Samba]     [Video 4 Linux]     [Device Mapper]     [Linux Resources]

  Powered by Linux