Re: [RFC] perf: proposed perf_event_open() manpage

Namhyung Kim <namhyung@xxxxxxxxxx> · Wed, 24 Oct 2012 15:54:46 +0900

Hi Vince,

Great work!

On Tue, 23 Oct 2012 11:35:13 -0400 (EDT), Vince Weaver wrote:
> Hello
>
> attached is a proposed manpage for the perf_event_open() system call.
>
> I'd appreciate any review or comments, especially for the parts marked
> as FIXME or "[To be documented]"
>
> This system call has a complicated interface and I'm sure I've missed
> or glossed over various important features, so your feedback is needed and 
> appreciated.
>
> The eventual goal is to have this included with the Linux man-pages 
> project.
[snip]
> .BI "int perf_event_open(struct perf_event_attr *" hw_event ,

hw_event?  Looks unusual.. how about 'attr'?

> .BI "                    pid_t " pid ", int " cpu ", int " group_fd ,
> .BI "                    unsigned long " flags  );
> .fi
[snip]
> .SS Arguments
> .P
> The argument
> .I pid
> allows events to be attached to processes in various ways.
> If
> .I pid
> is 0, measurements happen on the current task, if
> .I pid
> is greater than 0, the process indicated by
> .I pid
> is measured, and if
> .I pid
> is less than 0, all processes are counted.

Is that true?  Shouldn't pid be exactly -1 for that, rather than any
negative value?

>
> The
> .I cpu
> argument allows measurements to be specific to a CPU.
> If
> .I cpu
> is greater than or equal to 0,
> measurements are restricted to the specified CPU;
> if
> .I cpu
> is \-1, the events are measured on all CPUs.
> .P
> Note that the combination of
> .IR pid " == \-1"
> and
> .IR cpu " == \-1"
> is not valid.
> .P
> A
> .IR pid " > 0"

s/>/>=/ ?

> and
> .IR cpu " == \-1"
> setting measures per-process and follows that process to whatever CPU the
> process gets scheduled to.
> Per-process events can be created by any user.
> .P
> A
> .IR pid " == \-1"
> and
> .IR cpu " >= 0"
> setting is per-CPU and measures all processes on the specified CPU.
> Per-CPU events need the
> .B CAP_SYS_ADMIN
> capability.

Or a perf_event_paranoid value of less than 1.
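
To make the pid/cpu combinations above concrete, here is a rough sketch
(not taken from the proposed page).  The enum and classify_target() are
hypothetical names of mine, and treating values below -1 as invalid is
an assumption.  The wrapper goes through syscall(2) because, as far as
I know, glibc does not provide one.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical labels for the pid/cpu combinations discussed above. */
enum perf_target {
    PERF_TARGET_INVALID,      /* pid == -1 && cpu == -1, or values < -1 */
    PERF_TARGET_TASK_ANY_CPU, /* pid >= 0, cpu == -1: follows the task */
    PERF_TARGET_TASK_ONE_CPU, /* pid >= 0, cpu >= 0: task, on that CPU only */
    PERF_TARGET_CPU_WIDE      /* pid == -1, cpu >= 0: all tasks on that CPU */
};

static enum perf_target classify_target(pid_t pid, int cpu)
{
    /* Assumption: anything below -1 is treated as invalid here. */
    if (pid < -1 || cpu < -1)
        return PERF_TARGET_INVALID;
    if (pid == -1)
        return cpu == -1 ? PERF_TARGET_INVALID : PERF_TARGET_CPU_WIDE;
    return cpu == -1 ? PERF_TARGET_TASK_ANY_CPU : PERF_TARGET_TASK_ONE_CPU;
}

/* Thin wrapper around the raw system call; returns an fd or -1. */
static inline long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                                   int cpu, int group_fd, unsigned long flags)
{
    assert(attr != NULL);
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}
```

With this, classify_target(0, -1) describes the common "measure myself
on whatever CPU I run on" case, and only the CPU-wide case needs the
extra privilege mentioned above.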

> .TP
> .RB "dynamic PMU"
> Since Linux 2.6.39,
> .BR perf_event_open()
> can support multiple PMUs.
> To enable this, a value exported by the kernel can be used in the
> .I type
> field to indicate which PMU to use.
> The value to use can be found in the sysfs filesystem:
> there is a subdirectory per PMU instance under
> .IR /sys/devices .

/sys/bus/event_source/devices is the right place.

> In each sub-directory there is a
> .I type
> file whose content is an integer that can be used in the
> .I type
> field.
> For instance,
> .I /sys/devices/cpu/type

/sys/bus/event_source/devices/cpu/type

> contains the value for the core CPU PMU, which is usually 4.
> .RE
>
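
A small example might help readers of this part.  The sketch below
reads the type value from the /sys/bus/event_source/devices path
suggested above.  parse_pmu_type() and read_pmu_type() are hypothetical
helper names, and returning -1 on failure is just a convention chosen
for this example.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Parse the integer stored in a PMU "type" file.
 * Returns the value, or -1 if no integer could be read.
 */
static int parse_pmu_type(FILE *f)
{
    int type;

    assert(f != NULL);
    if (fscanf(f, "%d", &type) != 1)
        return -1;
    return type;
}

/*
 * Look up the type of a named PMU, e.g. "cpu".
 * Returns -1 if the file is missing or unreadable.
 */
static int read_pmu_type(const char *pmu)
{
    char path[256];
    FILE *f;
    int type;

    assert(pmu != NULL);
    snprintf(path, sizeof(path),
             "/sys/bus/event_source/devices/%s/type", pmu);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    type = parse_pmu_type(f);
    fclose(f);
    return type;
}
```

A non-negative result from read_pmu_type("cpu") would then be stored in
the type field of struct perf_event_attr.
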
[snip]
> .TP
> .IR sample_period ", " sample_freq
> A "sampling" counter is one that generates an interrupt
> every N events, where N is given by
> .IR sample_period .
> A sampling counter has
> .IR sample_period " > 0."

How about adding this here:

"When an (overflow) interrupt is generated, the requested data (a
sample) is recorded."

> The
> .I sample_type
> field controls what data is recorded on each interrupt.
>
> .I sample_freq
> can be used if you wish to use frequency rather than period.
> In this case you set the
> .I freq
> flag.
> The kernel will adjust the sampling period
> to try and achieve the desired rate.
> The rate of adjustment is a
> timer tick.

Is that true?  I thought it'd be adjusted whenever an overflow occurs.
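
Whatever the answer, an example of selecting the two modes might be
worth having in the page.  A minimal sketch follows.
init_cycles_sampling() is a hypothetical helper, and the choice of the
CPU-cycles event and of the IP and TID sample bits is only for
illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/perf_event.h>

/*
 * Fill in attr for a sampling CPU-cycles event.
 * use_freq == 0: value is a period (one sample every `value` events).
 * use_freq != 0: value is a frequency (target samples per second).
 */
static void init_cycles_sampling(struct perf_event_attr *attr,
                                 int use_freq, uint64_t value)
{
    assert(attr != NULL);
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);
    attr->type = PERF_TYPE_HARDWARE;
    attr->config = PERF_COUNT_HW_CPU_CYCLES;
    attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_TID;

    if (use_freq) {
        attr->freq = 1;             /* read the union as sample_freq */
        attr->sample_freq = value;
    } else {
        attr->sample_period = value;
    }
}
```

Note that sample_period and sample_freq share storage, so only the freq
flag tells the kernel which of the two was meant.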

>
>
> .TP
> .I "sample_type"
> The various bits in this field specify which values to include
> in the overflow packets.

I guess "overflow packets" here means samples.  It would be better to
use one consistent term for the same thing.

> They will be recorded in a ring-buffer,
> which is available to user-space using
> .BR mmap (2).
> The order in which the values are saved in the
> overflow packets as documented in the MMAP Layout subsection below;
> it is not the
> .I "enum perf_event_sample_format"
> order.
> .RS
> .TP
> .B PERF_SAMPLE_IP
> instruction pointer
> .TP
> .B PERF_SAMPLE_TID
> thread id
> .TP
> .B PERF_SAMPLE_TIME
> time
> .TP
> .B PERF_SAMPLE_ADDR
> address
> .TP
> .B PERF_SAMPLE_READ
> [To be documented]

It's for sampling an event group through its leader only: the values of
the other members are read when an interrupt occurs on the leader.

Jiri is working on it.

> .TP
> .B PERF_SAMPLE_CALLCHAIN
> [To be documented]

callchain (or stack backtrace)

> .TP
> .B PERF_SAMPLE_ID
> [To be documented]

unique(?) id for the opened event.

> .TP
> .B PERF_SAMPLE_CPU
> [To be documented]

cpu number

> .TP
> .B PERF_SAMPLE_PERIOD
> [To be documented]

event count

> .TP
> .B PERF_SAMPLE_STREAM_ID
> [To be documented]
> .TP
> .B PERF_SAMPLE_RAW
> [To be documented]

additional data - usually for tracepoint events

> .TP
> .BR PERF_SAMPLE_BRANCH_STACK " (Since Linux 3.4)"
> [To be documented]

requested branch stack - only supported on Intel machines that have the
LBR feature(?).  See branch_sample_type.

> .RE
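
Since sample_type is a bit mask, a short example of combining and
decoding the bits could go with this list.  The sketch below is only an
illustration: the table and describe_sample_type() are hypothetical
names, and the short labels are mine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <linux/perf_event.h>

struct sample_bit {
    uint64_t bit;
    const char *name;
};

/* Listed in enum order, which is not the order used in the records. */
static const struct sample_bit sample_bits[] = {
    { PERF_SAMPLE_IP,           "IP" },
    { PERF_SAMPLE_TID,          "TID" },
    { PERF_SAMPLE_TIME,         "TIME" },
    { PERF_SAMPLE_ADDR,         "ADDR" },
    { PERF_SAMPLE_READ,         "READ" },
    { PERF_SAMPLE_CALLCHAIN,    "CALLCHAIN" },
    { PERF_SAMPLE_ID,           "ID" },
    { PERF_SAMPLE_CPU,          "CPU" },
    { PERF_SAMPLE_PERIOD,       "PERIOD" },
    { PERF_SAMPLE_STREAM_ID,    "STREAM_ID" },
    { PERF_SAMPLE_RAW,          "RAW" },
    { PERF_SAMPLE_BRANCH_STACK, "BRANCH_STACK" },
};

/*
 * Write a space-separated list of the PERF_SAMPLE_* bits set in
 * sample_type into buf.  Returns the number of names written.
 */
static int describe_sample_type(uint64_t sample_type, char *buf, size_t len)
{
    size_t used = 0;
    int count = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(sample_bits) / sizeof(sample_bits[0]); i++) {
        int n;

        if (!(sample_type & sample_bits[i].bit))
            continue;
        n = snprintf(buf + used, len - used, "%s%s",
                     count ? " " : "", sample_bits[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break;  /* out of space: stop, buf may end in a partial name */
        used += (size_t)n;
        count++;
    }
    return count;
}
```

For instance, passing PERF_SAMPLE_IP | PERF_SAMPLE_CPU with a large
enough buffer yields the two names "IP CPU".
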
[snip]
> .SS /proc/sys/kernel/perf_event_paranoid
>
> The
> .I /proc/sys/kernel/perf_event_paranoid
> file can be set to restrict access to the performance counters.
> 2
> means no measurements allowed,

This is not true.  A value of 2 still allows measurements, but only
user-mode ones.

$ cat /proc/sys/kernel/perf_event_paranoid 
2

$ perf stat usleep 1
  Error: You may not have permission to collect stats.
	 Consider tweaking /proc/sys/kernel/perf_event_paranoid or running as root.
Not all events could be opened.

$ perf stat -e cycles:u usleep 1

 Performance counter stats for 'usleep 1':

           253,055 cycles:u                  #    0.000 GHz                    

       0.001988538 seconds time elapsed

> 1
> means normal counter access,

This includes kernel mode measurements.

> 0
> means you can access CPU-specific data, and

But you cannot access raw tracepoint samples.

> \-1
> means no restrictions.
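
To summarize my understanding of the four levels for an unprivileged
user, here is a sketch.  The enum and function names are hypothetical,
and it only models the values discussed here (-1 to 2), so please
correct me if any level is wrong.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical names for the kinds of access discussed above. */
enum perf_access {
    PERF_ACCESS_USER_ONLY,      /* user-mode measurement of own tasks */
    PERF_ACCESS_KERNEL,         /* kernel-mode measurement */
    PERF_ACCESS_CPU_WIDE,       /* per-CPU events covering all tasks */
    PERF_ACCESS_RAW_TRACEPOINT  /* raw tracepoint samples */
};

/*
 * Would an unprivileged user get this access at the given
 * perf_event_paranoid level?  Only levels -1..2 are modeled.
 */
static bool paranoid_allows(int level, enum perf_access access)
{
    switch (access) {
    case PERF_ACCESS_USER_ONLY:      return true;  /* allowed even at 2 */
    case PERF_ACCESS_KERNEL:         return level <= 1;
    case PERF_ACCESS_CPU_WIDE:       return level <= 0;
    case PERF_ACCESS_RAW_TRACEPOINT: return level <= -1;
    }
    return false;
}

/* Read the level from an open file; returns true on success. */
static bool read_paranoid_level(FILE *f, int *level)
{
    assert(f != NULL && level != NULL);
    return fscanf(f, "%d", level) == 1;
}
```

read_paranoid_level() would be given a FILE opened on
/proc/sys/kernel/perf_event_paranoid.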

Thanks,
Namhyung