Re: [PATCH v3] Documentation: KVM: Add vPMU implementaion and gap document

Like Xu <like.xu.linux@xxxxxxxxx> · Mon, 21 Aug 2023 14:39:52 +0800

On 10/8/2023 1:45 pm, Xiong Zhang wrote:
+2. Perf Scheduler Basic
+=======================
+
+Perf subsystem users can not get PMU counter or resource directly, user

s/can not get PMU counter or resource/are not expected to access PMU hw counters or resources/

+should create a perf event first and specify event’s attribute which is

"event’s attribute": drop the Unicode character and use a plain ASCII apostrophe.

s/attribute which is/attributes which are/

+used to choose PMU counters, then perf event joins in perf scheduler,
+perf scheduler assigns the corresponding PMU counter to a perf event.

"Counter" is not generic enough for the LBR case.

The number of perf_events does not necessarily map 1:1 to the number of PMU
hardware resources (such as counters) acquired, either.

+
+Perf event is created by perf_event_open() system call::

KVM uses the perf_event_create_kernel_counter() API.

The difference between these two interfaces is worth describing here.
But generic perf scheduler behavior doesn't fit.

+
+    int syscall(SYS_perf_event_open, struct perf_event_attr *,
+		pid, cpu, group_fd, flags)
+    struct perf_event_attr {
+	    ......
+	    /* Major type: hardware/software/tracepoint/etc. */
+	    __u32   type;
+	    /* Type specific configuration information. */
+	    __u64   config;
+	    union {
+		    __u64      sample_period;
+		    __u64      sample_freq;
+	    }
+	   __u64   disabled :1;
+	           pinned   :1;
+		   exclude_user  :1;
+		   exclude_kernel :1;
+		   exclude_host   :1;
+	           exclude_guest  :1;
+	......
+    }
+
+The pid and cpu arguments allow specifying which process and CPU
+to monitor::
+
+  pid == 0 and cpu == -1
+        This measures the calling process/thread on any CPU.
+  pid == 0 and cpu >= 0
+        This measures the calling process/thread only when running on
+	the specified cpu.
+  pid > 0 and cpu == -1
+        This measures the specified process/thread on any cpu.
+  pid > 0 and cpu >= 0
+        This  measures the specified process/thread only when running
+	on the specified CPU.
+  pid == -1 and cpu >= 0
+        This measures all processes/threads on the specified CPU.
+  pid == -1 and cpu == -1
+        This setting is invalid and will return an error.
+
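
To make the table above easier to check, here is a minimal userspace sketch of
the same pid/cpu classification. The enum and function names are hypothetical
(not kernel identifiers), and treating values below -1 as invalid is an
assumption of this sketch, not something the table states.

```c
#include <assert.h>
#include <sys/types.h>

/* Hypothetical names, for illustration only; not kernel identifiers. */
enum perf_target {
	TARGET_INVALID,		/* pid == -1 and cpu == -1 */
	TARGET_TASK_ANY_CPU,	/* pid >= 0 and cpu == -1 */
	TARGET_TASK_ONE_CPU,	/* pid >= 0 and cpu >= 0 */
	TARGET_CPU_WIDE,	/* pid == -1 and cpu >= 0 */
};

/* Map a (pid, cpu) pair to the monitoring scope listed in the table. */
static enum perf_target classify_target(pid_t pid, int cpu)
{
	/* Assumption: anything below -1 is outside the table, so reject it. */
	if (pid < -1 || cpu < -1)
		return TARGET_INVALID;
	if (pid == -1)
		return cpu == -1 ? TARGET_INVALID : TARGET_CPU_WIDE;
	/* pid == 0 means the calling task, pid > 0 a specific task. */
	return cpu == -1 ? TARGET_TASK_ANY_CPU : TARGET_TASK_ONE_CPU;
}
```

For example, classify_target(0, -1) gives TARGET_TASK_ANY_CPU and
classify_target(-1, 0) gives TARGET_CPU_WIDE.
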
+Perf scheduler's responsibility is choosing which events are active at
+one moment and binding counter with perf event. As processor has limited

This is not rigorous at all: perf manages a lot of kernel abstractions,
and the hardware PMU is just one part of it.

+PMU counters and other resource, only limited perf events can be active
+at one moment, the inactive perf event may be active in the next moment,
+perf scheduler has defined rules to control these things.

Developers often ask about the mechanics of event/counter multiplexing,
which is expected to be mentioned here as generic perf behavior.
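
As a rough illustration of what users see from multiplexing, here is a minimal
userspace sketch of the usual count scaling. It assumes the event was opened
with read_format set to PERF_FORMAT_TOTAL_TIME_ENABLED |
PERF_FORMAT_TOTAL_TIME_RUNNING and without PERF_FORMAT_GROUP; the struct and
function names are hypothetical, and the double arithmetic is only an estimate.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Layout returned by read() on the event fd for the read_format assumed
 * above. The struct name is hypothetical.
 */
struct read_values {
	uint64_t value;		/* raw count while the event was on the PMU */
	uint64_t time_enabled;	/* time the event was enabled */
	uint64_t time_running;	/* time the event was actually scheduled in */
};

/* Estimate the count the event would have reached had it run all the time. */
static uint64_t scaled_count(const struct read_values *rv)
{
	if (rv->time_running == 0)
		return 0;		/* never scheduled in: nothing to scale */
	if (rv->time_running >= rv->time_enabled)
		return rv->value;	/* not multiplexed: raw count is exact */
	/* Use double to avoid 64-bit overflow in value * time_enabled. */
	return (uint64_t)((double)rv->value * (double)rv->time_enabled /
			  (double)rv->time_running);
}
```

An event that was on the PMU for half of its enabled time has its raw count
doubled by this estimate.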

+
+Perf scheduler defines four types of perf event, defined by the pid and
+cpu arguments in perf_event_open(), plus perf_event_attr.pinned, their
+schedule priority are: per_cpu pinned > per_process pinned
+> per_cpu flexible > per_process flexible. High priority events can
+preempt low priority events when resources contend.

It's not "per-process"; the four types are:

 *  - CPU pinned (EVENT_CPU | EVENT_PINNED)
 *  - task pinned (EVENT_PINNED)
 *  - CPU flexible (EVENT_CPU | EVENT_FLEXIBLE)
 *  - task flexible (EVENT_FLEXIBLE).

It would be nice to mention here the perf function that handles prioritization:

static void ctx_resched(struct perf_cpu_context *cpuctx,
			struct perf_event_context *task_ctx,
			enum event_type_t event_type)

I wouldn't be surprised if the comment around ctx_resched() is out of date.
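
For the document's purposes, the four-level order could be illustrated with a
small userspace sketch like the one below. It takes the order listed above at
face value and is not a model of ctx_resched(); the SK_* flag values and the
function names are made up for illustration and are not copied from the
kernel's enum event_type_t.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative flag bits only; values are an assumption of this sketch. */
enum sk_event_type {
	SK_EVENT_FLEXIBLE = 1 << 0,
	SK_EVENT_PINNED   = 1 << 1,
	SK_EVENT_CPU      = 1 << 2,
};

/*
 * Lower rank is scheduled first:
 * 0 CPU pinned, 1 task pinned, 2 CPU flexible, 3 task flexible.
 */
static int sched_rank(unsigned int type)
{
	int rank = (type & SK_EVENT_PINNED) ? 0 : 2;

	if (!(type & SK_EVENT_CPU))
		rank += 1;	/* task events come after CPU events */
	return rank;
}

/* qsort() comparator ordering event types by ascending rank. */
static int cmp_rank(const void *a, const void *b)
{
	return sched_rank(*(const unsigned int *)a) -
	       sched_rank(*(const unsigned int *)b);
}

/* Sort an array of event types into the order they would be tried. */
static void sort_by_priority(unsigned int *types, size_t n)
{
	qsort(types, n, sizeof(*types), cmp_rank);
}
```

Sorting { task flexible, CPU pinned, task pinned, CPU flexible } with
sort_by_priority() yields CPU pinned, task pinned, CPU flexible, task flexible.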

+
+perf event type::
+
+  --------------------------------------------------------
+  |                      |   pid   |   cpu   |   pinned  |
+  --------------------------------------------------------
+  | Per-cpu pinned       |   *    |   >= 0   |     1     |
+  --------------------------------------------------------
+  | Per-process pinned   |  >= 0  |    *     |     1     |
+  --------------------------------------------------------
+  | Per-cpu flexible     |   *    |   >= 0   |     0     |
+  --------------------------------------------------------
+  | Per-process flexible | >= 0   |    *     |     0     |
+  --------------------------------------------------------
+
+perf_event abstract::
+
+    struct perf_event {
+	    struct list_head       event_entry;
+	    ......
+	    struct pmu             *pmu;
+	    enum perf_event_state  state;
+	    local64_t              count;
+	    u64                    total_time_enabled;
+	    u64                    total_time_running;
+	    struct perf_event_attr attr;
+	    ......
+    }
+
+For per-cpu perf event, it is linked into per cpu global variable
+perf_cpu_context, for per-process perf event, it is linked into
+task_struct->perf_event_context.
+
+Usually the following cases cause perf event reschedule:
+1) In a context switch from one task to a different task.
+2) When an event is manually enabled.
+3) A call to perf_event_open() with disabled field of the
+perf_event_attr argument set to 0.
+
+When perf_event_open() or perf_event_enable() is called, perf event
+reschedule is needed on a specific cpu, perf will send an IPI to the
+target cpu, and the IPI handler will activate events ordered by event
+type, and will iterate all the eligible events in per cpu gloable
+variable perf_cpu_context and current->perf_event_context.
+
+When a perf event is sched out, this event mapped counter is disabled,
+and the counter's setting and count value are saved. When a perf event
+is sched in, perf driver assigns a counter to this event, the counter's
+setting and count values are restored from last saved.
+
+If the event could not be scheduled because no resource is available for
+it, pinned event goes into error state and is excluded from perf
+scheduler, the only way to recover it is re-enable it, flexible event
+goes into inactive state and can be multiplexed with other events if
+needed.

I highly doubt that all of these are behaviors the perf system is designed to
have: some are, some aren't, and some have exceptions, such as the BTS event.

Obviously this part needs to be reviewed by more perf developers.
Trying to muddle through will only mislead more developers.

I'd much rather see those perf descriptions in the perf comments or man-pages.
From time to time, perf core will refactor or change its internal implementations.