Re: Question about bpf perfbuf/ringbuf: pinned in backend with overwriting

On Thu, 21 Dec 2023 21:00:39 +0800
Philo Lu <lulie@xxxxxxxxxxxxxxxxx> wrote:

> Hi Steven,
> 
> Thanks for your explanation about ftrace ring buffer. Also thanks to 
> Shung-Hsi for the discussion.
> 
> Here are some features of the ftrace buffer that I'm not sure I understand
> correctly. Could you please tell me whether my understanding is correct?
> 
> (1) When reading and writing occur concurrently:
>    (a) If the reader is faster than the writer, the reader cannot get the page
> which is still being written, which means that, in the worst case, up to
> one page of data is not immediately available to the reader.

Nope, that's not the case. Otherwise you couldn't do this!

 ~# cd /sys/kernel/tracing
 ~# echo hello world > trace_marker
 ~# cat trace_pipe
           <...>-861     [001] ..... 76124.880943: tracing_mark_write: hello world

Yes, the reader swaps out an active sub-buffer to read it. But it's fine if
the writer is still on that sub-buffer. That's because the sub-buffers are
a linked list and the writer will simply walk off the end of the sub-buffer
and back into the sub-buffers in the active ring buffer.

Note that in this case the ring buffer cannot hand the sub-buffer itself to
the reader to pass to splice, because the reader could then free it while the
writer is still on it. Instead, it copies the data for the reader. It also
keeps track of what it copied so it doesn't copy it again the next time.
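
To make the "keeps track of what it copied" part concrete, here is a minimal
user-space sketch. It is not the kernel code: struct toy_subbuf, toy_write()
and toy_copy_new() are hypothetical names, and the model is single-threaded.
It only shows a reader copying out of a sub-buffer the writer still owns
while remembering how far it got.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a sub-buffer the writer may still be on. */
struct toy_subbuf {
	char   data[64];
	size_t commit;	/* bytes the writer has committed so far */
	size_t read;	/* bytes already copied out for the reader */
};

/* Writer appends to the sub-buffer; returns 0 if there is no room left. */
static int toy_write(struct toy_subbuf *sb, const char *src, size_t len)
{
	if (len > sizeof(sb->data) - sb->commit)
		return 0;
	memcpy(sb->data + sb->commit, src, len);
	sb->commit += len;
	return 1;
}

/* Reader copies only the bytes it has not seen yet and records its position. */
static size_t toy_copy_new(struct toy_subbuf *sb, char *dst, size_t dst_len)
{
	size_t avail;

	assert(sb->read <= sb->commit);
	avail = sb->commit - sb->read;
	if (avail > dst_len)
		avail = dst_len;
	memcpy(dst, sb->data + sb->read, avail);
	sb->read += avail;
	return avail;
}
```

A second toy_copy_new() call with no write in between returns 0, which is
the "doesn't copy it again" behavior in this toy model.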

>    (b) If writer is faster than reader, the only race between them is 
> when reader is doing swap while writer wraps in overwrite mode. But if 
> the reader has finished swapping, the writer can wrap safely, because 
> the reader page is already out of the buffer page list.

Yes, that is the point of contention. But the writer doesn't wait for the
reader. The reader does a cmpxchg loop to make sure it's not conflicting
with the writer. The writer has priority and doesn't loop in this case.
That is, a reader will not slow down the writer except for what the
hardware causes in the contention.
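
The asymmetry can be sketched with C11 atomics. This is a simplified model,
not the kernel implementation, which uses a different mechanism on the
sub-buffer list. Here a plain counter stands in for the head, and struct
toy_ring, toy_writer_push_head() and toy_reader_claim_head() are hypothetical
names. The writer does a single atomic operation and moves on; only the
reader retries.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical model: head counts head moves, head % nr_subbufs is the
 * index of the oldest sub-buffer. */
struct toy_ring {
	atomic_uint  head;
	unsigned int nr_subbufs;
};

/* Writer wrapping in overwrite mode: one atomic op, never a retry loop. */
static void toy_writer_push_head(struct toy_ring *r)
{
	atomic_fetch_add(&r->head, 1);
}

/* Reader: claim the current head sub-buffer, retrying if the writer moved it. */
static unsigned int toy_reader_claim_head(struct toy_ring *r)
{
	unsigned int seen;

	assert(r->nr_subbufs > 0);
	seen = atomic_load(&r->head);
	/* On failure, 'seen' is refreshed with the value the writer left. */
	while (!atomic_compare_exchange_weak(&r->head, &seen, seen + 1))
		;
	return seen % r->nr_subbufs;
}
```

If the writer moves the head between the reader's load and its cmpxchg, the
cmpxchg fails and the reader simply tries again with the new value. The
writer never notices.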

> 
> (2) As the per-cpu buffer list is dynamic with reader page moves, we 
> cannot do mmap to expose the buffer to user. Users can consume at most 
> one page at a time.

The code works with splice. trace-cmd uses the max pipe size and by default
reads 64 KB at a time. The internals swap out one sub-buffer at a time, but
then move them into the pipe with zero copy (if the sub-buffers are full and
the writer is not still on them). The user can see all these sub-buffers in
the pipe at once.

I'm working to have 6.8 remove the limit of "one page" and allow the
sub-buffers to be any order of pages (1,2,4,8,...). I'm hoping to have that
work pushed to linux-next by end of today.

 https://lore.kernel.org/linux-trace-kernel/20231219185414.474197117@xxxxxxxxxxx/

and we are also working on mmapping the ring buffer to user space:

 https://lore.kernel.org/linux-trace-kernel/20231219184556.1552951-1-vdonnefort@xxxxxxxxxx/

That may not make 6.8 but will likely make 6.9 at the latest.

It still requires user space to make an ioctl() system call between
sub-buffers, as the swap logic is still implemented.

The way it will work is that all the sub-buffers will be mmapped to user
space, including the reader page. Metadata will indicate which sub-buffer is
which. When user space calls the ioctl(), it will update which one of the
mapped sub-buffers is the "reader-page" (really "reader-subbuf"), and the
writers will not write to it. When user space is finished reading the data
on the reader-page, it will call the ioctl() again, and the metadata will be
updated to point to the sub-buffer that is now the new "reader-page" for
user space to read.

There are no new allocations needed for the swap. The old reader-subbuf gets
swapped with one of the active sub-buffers and becomes an active sub-buffer
itself. The swapped out sub-buffer becomes the new "reader-page/subbuf".
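
As a rough illustration of that bookkeeping, here is a toy model in plain C.
It is not the proposed kernel interface: struct toy_meta, TOY_NR_ACTIVE and
toy_swap_reader() are hypothetical, and the real metadata layout and ioctl()
are defined by the patch series linked above. The sketch only shows that a
swap exchanges two ids and allocates nothing.

```c
#include <assert.h>

#define TOY_NR_ACTIVE 3	/* hypothetical number of active sub-buffers */

/* Hypothetical metadata: which mapped sub-buffer plays which role. */
struct toy_meta {
	unsigned int reader;			/* id of the reader-subbuf */
	unsigned int ring[TOY_NR_ACTIVE];	/* ids of active sub-buffers, in ring order */
	unsigned int head;			/* ring slot holding the oldest data */
};

/* What the swap amounts to in this model: exchange the reader-subbuf with
 * the sub-buffer at the head. Returns the id of the new reader-subbuf. */
static unsigned int toy_swap_reader(struct toy_meta *m)
{
	unsigned int new_reader;

	assert(m->head < TOY_NR_ACTIVE);
	new_reader = m->ring[m->head];
	m->ring[m->head] = m->reader;	/* old reader-subbuf becomes active */
	m->head = (m->head + 1) % TOY_NR_ACTIVE;
	m->reader = new_reader;		/* writers no longer touch this one */
	return new_reader;
}
```

Starting from ring = {0, 1, 2} and reader = 3, one swap leaves ring =
{3, 1, 2} and reader = 0. The set of sub-buffer ids never changes, only
their roles do.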

> 
> (3) The wake-up behavior is controllable. If there is no waiter at all, 
> no overhead will be induced because of waking up.

Correct. When there's a waiter, a bit is set and an irq_work is called to
wake up the waiter (this is basically the same as what perf does).

You can also set when you want to wake up via the buffer_percent file in
tracefs. If the buffer is not filled to the percentage specified, it will
not wake up the waiters.
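
The threshold check can be pictured as a small predicate. This is only a
sketch: toy_should_wake() and its parameters are hypothetical, the kernel's
actual accounting may differ, and treating 0 as "wake on any data" is an
assumption made here for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: should a waiter be woken, given how many of the
 * 'total' sub-buffers hold unread data and the configured percentage? */
static bool toy_should_wake(size_t dirty, size_t total, unsigned int percent)
{
	assert(percent <= 100 && dirty <= total);
	if (percent == 0)	/* assumption: 0 means wake as soon as there is data */
		return dirty > 0;
	/* Compare dirty/total against percent/100 without division. */
	return dirty * 100 >= (size_t)percent * total;
}
```

With 10 sub-buffers and a setting of 50, 3 dirty sub-buffers would not wake
the waiter in this model, while 5 would.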

-- Steve



