Re: Rotation of instance traces

Steven Rostedt <rostedt@xxxxxxxxxxx> · Fri, 18 Feb 2022 14:22:16 -0500

On Fri, 18 Feb 2022 13:45:44 -0500
Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:

> Hello!
> 
> I am seeing something weird with trace instances, that I don't see
> with regular non-instance traces.
> 
> If I enable some sched events and set the buffer size to something
> tiny, like 1KB, I am expecting the buffer to rotate quickly as it
> overflows, but I don't see that:
> 
>   watch "cat trace | head -n 1" and I see the following
> 
> https://paste.googleplex.com/6259651512369152?raw
> 
> and if I do some heavy workloads, I still see that the top rows are
> not moving and are stuck showing the same thing. I am expecting the
> buffer to drop old events as it is being overwritten/rotation.
> 
>  However if I do a tail on the trace, I do see updates. So tracing is
> in progress...
> 
> I did the exact same experiment on non-instance global tracing and it
> works fine. I confirmed that 'overwrite' is set in the trace_options
> of the instance (and non-instance).
> 
> This could be some trace formatting issue or another bug? Note I am
> reading the tracefs trace file directly and not using trace-cmd for
> this.

It's not a bug it's a feature ;-)

The "trace" file is not a consumer file. That is, it does not clear out the
ring buffer when you read it. If you want that, use trace_pipe.

What you are seeing is the effect of the implementation of the ring buffer.
As the ring buffer is broken up into "sub buffers" which are currently a
page in size. Although Tzvetomir has RFC patches to change that:

  https://patchwork.kernel.org/project/linux-trace-devel/list/?series=594661&state=*

If you pause tracing (echo 0 > tracing_on) you can read the trace file over
and over and still get the same result. That is because it uses an iterator
to read over the ring buffer and does not modify it. But if you were to use
trace_pipe, it consumes the tracing data, and it does so by swapping out
the sub buffers from the ring buffer and giving it to the user.

Now to be able to do things like splice, and zero copy from the ring
buffer, when you read the ring buffer it swaps out a sub buffer from the
writable portion of the ring buffer and makes it into the "reader_page".
The reader page is now own by the reader, and will never be written to.
This allows the reader to take its time to read it and not worry about
being overwritten by the writer.

Now what you are seeing, is that a reader page was swapped out of the ring
buffer. This means that it will never be written to again, as it belongs to
the reader (it's the reader page). And since you are using the "trace" file
and not the "trace_pipe" file, it will not even discard this page. But as
you have not turned off tracing, the rest of the ring buffer is still being
updated.

What happens is, the start of the trace file will show you the reader page
(never be updated, and will always stay the same), and then walk into the
writable section of the ring buffer, which will constantly be updated. This
gives you exactly what you are seeing.

Like I said, it's a feature!

Now, why is it not happening in the top level file? It's because if the
reader page has not be read yet, the iterator may go right into the
writable portion and skip over the reader page as it is empty. If you did
something that causes the reader page to be filled, it will give you what
you see. So this isn't a difference between the instances and top level,
it's a difference between what you have have done to the instances before
reading the trace file, which you did not do for the top level buffer.

They both use the same code, so they should be acting identically.

-- Steve