Hi Vince,

Great work!

On Tue, 23 Oct 2012 11:35:13 -0400 (EDT), Vince Weaver wrote:
> Hello
>
> attached is a proposed manpage for the perf_event_open() system call.
>
> I'd appreciate any review or comments, especially for the parts marked
> as FIXME or "[To be documented]"
>
> This system call has a complicated interface and I'm sure I've missed
> or glossed over various important features, so your feedback is needed and
> appreciated.
>
> The eventual goal is to have this included with the Linux man-pages
> project.

[snip]
> .BI "int perf_event_open(struct perf_event_attr *" hw_event ,

hw_event?  That name looks unusual.  How about 'attr'?

> .BI " pid_t " pid ", int " cpu ", int " group_fd ,
> .BI " unsigned long " flags );
> .fi

[snip]
> .SS Arguments
> .P
> The argument
> .I pid
> allows events to be attached to processes in various ways.
> If
> .I pid
> is 0, measurements happen on the current task, if
> .I pid
> is greater than 0, the process indicated by
> .I pid
> is measured, and if
> .I pid
> is less than 0, all processes are counted.

Is that true?  Shouldn't pid be exactly -1 for that?

>
> The
> .I cpu
> argument allows measurements to be specific to a CPU.
> If
> .I cpu
> is greater than or equal to 0,
> measurements are restricted to the specified CPU;
> if
> .I cpu
> is \-1, the events are measured on all CPUs.
> .P
> Note that the combination of
> .IR pid " == \-1"
> and
> .IR cpu " == \-1"
> is not valid.
> .P
> A
> .IR pid " > 0"

s/>/>=/ ?  (pid == 0, the current task, is also a per-process event.)

> and
> .IR cpu " == \-1"
> setting measures per-process and follows that process to whatever CPU the
> process gets scheduled to.
> Per-process events can be created by any user.
> .P
> A
> .IR pid " == \-1"
> and
> .IR cpu " >= 0"
> setting is per-CPU and measures all processes on the specified CPU.
> Per-CPU events need the
> .B CAP_SYS_ADMIN
> capability.

Or a perf_event_paranoid value of less than 1.

> .TP
> .RB "dynamic PMU"
> Since Linux 2.6.39,
> .BR perf_event_open()
> can support multiple PMUs.
> To enable this, a value exported by the kernel can be used in the
> .I type
> field to indicate which PMU to use.
> The value to use can be found in the sysfs filesystem:
> there is a subdirectory per PMU instance under
> .IR /sys/devices .

/sys/bus/event_source/devices is the right place.

> In each sub-directory there is a
> .I type
> file whose content is an integer that can be used in the
> .I type
> field.
> For instance,
> .I /sys/devices/cpu/type

/sys/bus/event_source/devices/cpu/type

> contains the value for the core CPU PMU, which is usually 4.
> .RE
>

[snip]
> .TP
> .IR sample_period ", " sample_freq
> A "sampling" counter is one that generates an interrupt
> every N events, where N is given by
> .IR sample_period .
> A sampling counter has
> .IR sample_period " > 0."

How about adding this here:

  "When an (overflow) interrupt is generated, the requested data
   (a sample) is recorded."

> The
> .I sample_type
> field controls what data is recorded on each interrupt.
>
> .I sample_freq
> can be used if you wish to use frequency rather than period.
> In this case you set the
> .I freq
> flag.
> The kernel will adjust the sampling period
> to try and achieve the desired rate.
> The rate of adjustment is a
> timer tick.

Is that true?  I thought it was adjusted whenever an overflow occurs.

>
>
> .TP
> .I "sample_type"
> The various bits in this field specify which values to include
> in the overflow packets.

I guess "overflow packets" here means samples.  It would be better to
use one consistent term for the same thing.

> They will be recorded in a ring-buffer,
> which is available to user-space using
> .BR mmap (2).
> The order in which the values are saved in the
> overflow packets as documented in the MMAP Layout subsection below;
> it is not the
> .I "enum perf_event_sample_format"
> order.
> .RS
> .TP
> .B PERF_SAMPLE_IP
> instruction pointer
> .TP
> .B PERF_SAMPLE_TID
> thread id
> .TP
> .B PERF_SAMPLE_TIME
> time
> .TP
> .B PERF_SAMPLE_ADDR
> address
> .TP
> .B PERF_SAMPLE_READ
> [To be documented]

This is for an event group in which only the leader is sampled.  The
values of the other members are read when an interrupt occurs on the
leader.  Jiri is working on it.

> .TP
> .B PERF_SAMPLE_CALLCHAIN
> [To be documented]

callchain (or stack backtrace)

> .TP
> .B PERF_SAMPLE_ID
> [To be documented]

unique(?) id for the opened event.

> .TP
> .B PERF_SAMPLE_CPU
> [To be documented]

cpu number

> .TP
> .B PERF_SAMPLE_PERIOD
> [To be documented]

event count

> .TP
> .B PERF_SAMPLE_STREAM_ID
> [To be documented]
> .TP
> .B PERF_SAMPLE_RAW
> [To be documented]

additional data - usually for tracepoint events

> .TP
> .BR PERF_SAMPLE_BRANCH_STACK " (Since Linux 3.4)"
> [To be documented]

requested branch stack - only supported on Intel machines that have the
LBR feature(?).  See branch_sample_type.

> .RE

[snip]
> .SS /proc/sys/kernel/perf_event_paranoid
>
> The
> .I /proc/sys/kernel/perf_event_paranoid
> file can be set to restrict access to the performance counters.
> 2
> means no measurements allowed,

This is not true.  A value of 2 allows user-mode measurements only.

  $ cat /proc/sys/kernel/perf_event_paranoid
  2
  $ perf stat usleep 1
  Error:
  You may not have permission to collect stats.
  Consider tweaking /proc/sys/kernel/perf_event_paranoid or running as root.
  Not all events could be opened.

  $ perf stat -e cycles:u usleep 1

  Performance counter stats for 'usleep 1':

      253,055 cycles:u    #  0.000 GHz

  0.001988538 seconds time elapsed

> 1
> means normal counter access,

This includes kernel-mode measurements.

> 0
> means you can access CPU-specific data, and

But it does not allow access to raw tracepoint samples.

> \-1
> means no restrictions.
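
To make the four levels easier to compare, here is a minimal Python sketch of the behaviour described in the comments above.  It is only an illustration of this review's reading, not kernel code: the function names, the dictionary keys and the PARANOID_PATH constant are made up for the example, and only the values -1 through 2 discussed here are covered.

```python
from pathlib import Path

# Path quoted in the man page draft; the constant name is hypothetical.
PARANOID_PATH = Path("/proc/sys/kernel/perf_event_paranoid")


def unprivileged_access(level: int) -> dict:
    """Map a perf_event_paranoid level to what an unprivileged user may do.

    Follows the review comments: 2 is user-mode only, 1 adds kernel-mode
    measurements, 0 adds CPU-specific (per-CPU) data, -1 lifts the
    restriction on raw tracepoint samples.  Values outside -1..2 are not
    discussed in the thread and are not modelled here.
    """
    return {
        "user_mode": True,               # allowed at every level discussed
        "kernel_mode": level <= 1,       # needs level 1 or lower
        "per_cpu": level <= 0,           # needs level 0 or lower
        "raw_tracepoint": level <= -1,   # needs level -1
    }


def read_level(path: Path = PARANOID_PATH):
    """Return the current level, or None if the file cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    level = read_level()
    if level is None:
        print("perf_event_paranoid is not readable on this system")
    else:
        print(level, unprivileged_access(level))
```

With level 2 only "user_mode" comes out true, which matches the perf stat runs above: the default event set is refused, while cycles:u is counted.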

Thanks,
Namhyung