[RFC] trace: Add option for polling ring buffers

Nicolas Saenz Julienne <nsaenzju@xxxxxxxxxx> · Wed, 19 May 2021 19:57:55 +0200

To minimize trace's effect on isolated CPUs. That is, CPUs were only a
handful or a single, process are allowed to run. Introduce a new trace
option: 'poll-rb'.

This option changes the heuristic used to wait for data on trace
buffers. The default one, based on wait queues, will trigger an IPI[1]
on the CPU responsible for new data, which will take care of waking up
the trace gathering process (generally trace-cmd). Whereas with
'poll-rb' we will poll (as in busy-wait) the ring buffers from the trace
gathering process, releasing the CPUs writing trace data from doing any
wakeup work.

This wakeup work, although negligible in the vast majority of workloads,
may cause unwarranted latencies on systems running trace on isolated
CPUs. This is made worse on PREEMPT_RT kernels, as they defer the IPI
handling into a kernel thread, forcing unwarranted context switches on
otherwise extremely busy CPUs.

To illustrate this, tracing with PREEMPT_RT=y on an isolated CPU with a
single process pinned to it (NO_HZ_FULL=y, and plenty more isolation
options enabled). I see:
  - 50-100us latency spikes with the default trace-cmd options
  - 14-10us latency spikes with 'poll-rb'
  - 11-8us latency spikes with no tracing at all

The obvious drawback of 'poll-rb' is putting more pressure on the
housekeeping CPUs. Wasting cycles. Hence the notice in the documentation
discouraging its use in general.

[1] The IPI, in this case, an irq_work, is needed since trace might run
in NMI context. Which is not suitable for wake-ups.

Signed-off-by: Nicolas Saenz Julienne <nsaenzju@xxxxxxxxxx>
---
 Documentation/trace/ftrace.rst | 10 +++++
 include/linux/ring_buffer.h    |  2 +-
 kernel/trace/ring_buffer.c     | 76 +++++++++++++++++++---------------
 kernel/trace/trace.c           |  2 +-
 kernel/trace/trace.h           |  1 +
 5 files changed, 55 insertions(+), 36 deletions(-)

diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index 62c98e9bbdd9..224ed8652101 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -1248,6 +1248,16 @@ Here are the available options:
 	with the "nop" tracer is the same as just enabling the
 	"branch" tracer.
 
+  poll-rb
+        Use polling instead of waiting for new data in ringbuffers. The wait
+        heuristic will trigger IPIs on the CPU resposible for new data, which
+        can generate false positives when tracing low latency workloads (or the
+        IPIs themselves). This is made worse on PREEMPT_RT kernels, as they
+        defer the IPI handling into a kernel thread, forcing unwarranted
+        context switches on otherwise extremely busy processes.
+
+        When in doubt don't enable this, it'll just waste cpu cycles.
+
 .. tip:: Some tracers have their own options. They only appear in this
        file when the tracer is active. They always appear in the
        options directory.
diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index dac53fd3afea..b492061d369b 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -98,7 +98,7 @@ __ring_buffer_alloc(unsigned long size, unsigned flags, struct lock_class_key *k
 	__ring_buffer_alloc((size), (flags), &__key);	\
 })
 
-int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full);
+int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full, bool poll);
 __poll_t ring_buffer_poll_wait(struct trace_buffer *buffer, int cpu,
 			  struct file *filp, poll_table *poll_table);
 
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 2c0ee6484990..7840489623b2 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -892,14 +892,17 @@ static void rb_wake_up_waiters(struct irq_work *work)
  * @buffer: buffer to wait on
  * @cpu: the cpu buffer to wait on
  * @full: wait until the percentage of pages are available, if @cpu != RING_BUFFER_ALL_CPUS
+ * @poll: don't use wait queues, poll the ring buffer
  *
  * If @cpu == RING_BUFFER_ALL_CPUS then the task will wake up as soon
  * as data is added to any of the @buffer's cpu buffers. Otherwise
  * it will wait for data to be added to a specific cpu buffer.
  */
-int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full)
+int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full, bool poll)
 {
 	struct ring_buffer_per_cpu *cpu_buffer;
+	struct wait_queue_head *wq_head;
+	bool *waiters_pending;
 	DEFINE_WAIT(wait);
 	struct rb_irq_work *work;
 	int ret = 0;
@@ -920,37 +923,41 @@ int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full)
 		work = &cpu_buffer->irq_work;
 	}
 
+	if (full) {
+		wq_head = &work->full_waiters;
+		waiters_pending = &work->full_waiters_pending;
+	} else {
+		wq_head = &work->waiters;
+		waiters_pending = &work->waiters_pending;
+	}
 
 	while (true) {
-		if (full)
-			prepare_to_wait(&work->full_waiters, &wait, TASK_INTERRUPTIBLE);
-		else
-			prepare_to_wait(&work->waiters, &wait, TASK_INTERRUPTIBLE);
+		if (!poll) {
+			prepare_to_wait(wq_head, &wait, TASK_INTERRUPTIBLE);
 
-		/*
-		 * The events can happen in critical sections where
-		 * checking a work queue can cause deadlocks.
-		 * After adding a task to the queue, this flag is set
-		 * only to notify events to try to wake up the queue
-		 * using irq_work.
-		 *
-		 * We don't clear it even if the buffer is no longer
-		 * empty. The flag only causes the next event to run
-		 * irq_work to do the work queue wake up. The worse
-		 * that can happen if we race with !trace_empty() is that
-		 * an event will cause an irq_work to try to wake up
-		 * an empty queue.
-		 *
-		 * There's no reason to protect this flag either, as
-		 * the work queue and irq_work logic will do the necessary
-		 * synchronization for the wake ups. The only thing
-		 * that is necessary is that the wake up happens after
-		 * a task has been queued. It's OK for spurious wake ups.
-		 */
-		if (full)
-			work->full_waiters_pending = true;
-		else
-			work->waiters_pending = true;
+			/*
+			 * The events can happen in critical sections where
+			 * checking a work queue can cause deadlocks.  After
+			 * adding a task to the queue, this flag is set only to
+			 * notify events to try to wake up the queue using
+			 * irq_work.
+			 *
+			 * We don't clear it even if the buffer is no longer
+			 * empty. The flag only causes the next event to run
+			 * irq_work to do the work queue wake up. The worse
+			 * that can happen if we race with !trace_empty() is
+			 * that an event will cause an irq_work to try to wake
+			 * up an empty queue.
+			 *
+			 * There's no reason to protect this flag either, as
+			 * the work queue and irq_work logic will do the
+			 * necessary synchronization for the wake ups. The only
+			 * thing that is necessary is that the wake up happens
+			 * after a task has been queued. It's OK for spurious
+			 * wake ups.
+			 */
+			*waiters_pending = true;
+		}
 
 		if (signal_pending(current)) {
 			ret = -EINTR;
@@ -983,13 +990,14 @@ int ring_buffer_wait(struct trace_buffer *buffer, int cpu, int full)
 				break;
 		}
 
-		schedule();
+		if (!poll)
+			schedule();
+		else
+			schedule_timeout_interruptible(1);
 	}
 
-	if (full)
-		finish_wait(&work->full_waiters, &wait);
-	else
-		finish_wait(&work->waiters, &wait);
+	if (!poll)
+		finish_wait(wq_head, &wait);
 
 	return ret;
 }
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 1c4e702133e8..8408b0f50519 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1887,7 +1887,7 @@ static int wait_on_pipe(struct trace_iterator *iter, int full)
 		return 0;
 
 	return ring_buffer_wait(iter->array_buffer->buffer, iter->cpu_file,
-				full);
+				full, iter->tr->trace_flags & TRACE_ITER_POLL_RB);
 }
 
 #ifdef CONFIG_FTRACE_STARTUP_TEST
diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
index cd80d046c7a5..045bef803549 100644
--- a/kernel/trace/trace.h
+++ b/kernel/trace/trace.h
@@ -1198,6 +1198,7 @@ extern int trace_get_user(struct trace_parser *parser, const char __user *ubuf,
 		C(EVENT_FORK,		"event-fork"),		\
 		C(PAUSE_ON_TRACE,	"pause-on-trace"),	\
 		C(HASH_PTR,		"hash-ptr"),	/* Print hashed pointer */ \
+		C(POLL_RB,		"poll-rb"),		\
 		FUNCTION_FLAGS					\
 		FGRAPH_FLAGS					\
 		STACK_FLAGS					\
-- 
2.31.1