It turns out that the perf_event mmap page rdpmc/time setting was
broken, dating back to the introduction of the feature. Due to a
mistake with a bitfield, two different values mapped to the same
feature bit. A new somewhat backwards compatible interface was
introduced in Linux 3.12.

A much longer report on the issue can be found here:
https://lwn.net/Articles/567894/

Signed-off-by: Vince Weaver <vincent.weaver@xxxxxxxxx>

diff --git a/man2/perf_event_open.2 b/man2/perf_event_open.2
index 4ff9690..a443b6e 100644
--- a/man2/perf_event_open.2
+++ b/man2/perf_event_open.2
@@ -1142,8 +1196,13 @@ struct perf_event_mmap_page {
     __u64 time_running; /* time event on CPU */
     union {
         __u64 capabilities;
-        __u64 cap_usr_time : 1,
-              cap_usr_rdpmc : 1,
+        struct {
+            __u64 cap_usr_time / cap_usr_rdpmc / cap_bit0 : 1,
+                  cap_bit0_is_deprecated : 1,
+                  cap_user_rdpmc : 1,
+                  cap_user_time : 1,
+                  cap_user_time_zero : 1,
+        };
     };
     __u16 pmc_width;
     __u16 time_shift;
@@ -1173,8 +1232,9 @@ A seqlock for synchronization.
 A unique hardware counter identifier.
 .TP
 .I offset
-.\" FIXME clarify
-Add this to hardware counter value??
+When using rdpmc for reads this offset value
+must be added to the one returned by rdpmc to get
+the current total event count.
 .TP
 .I time_enabled
 Time the event was active.
@@ -1182,10 +1242,45 @@ Time the event was active.
 .I time_running
 Time the event was running.
 .TP
+.IR cap_usr_time " / " cap_usr_rdpmc " / " cap_bit0 " (Since Linux 3.4)"
+There was a bug in the definition of
 .I cap_usr_time
-User time capability.
+and
+.I cap_usr_rdpmc
+from Linux 3.4 until Linux 3.11.
+Both bits were defined to point to the same location, so it was
+impossible to know if
+.I cap_usr_time
+or
+.I cap_usr_rdpmc
+were actually set.
+
+Starting with 3.12 these are renamed to
+.I cap_bit0
+and you should use the new
+.I cap_user_time
+and
+.I cap_user_rdpmc
+fields instead.
+
 .TP
+.IR cap_bit0_is_deprecated " (Since Linux 3.12)"
+If set this bit indicates that the kernel supports
+the properly separated
+.I cap_user_time
+and
+.I cap_user_rdpmc
+bits.
+
+If not-set, it indicates an older kernel where
+.I cap_usr_time
+and
 .I cap_usr_rdpmc
+map to the same bit and thus both features should
+be used with caution.
+
+.TP
+.IR cap_user_rdpmc " (Since Linux 3.12)"
 If the hardware supports user-space read of performance counters
 without syscall (this is the "rdpmc" instruction on x86), then
 the following code can be used to do a read:
@@ -1195,7 +1290,6 @@ the following code can be used to do a read:
 u32 seq, time_mult, time_shift, idx, width;
 u64 count, enabled, running;
 u64 cyc, time_offset;
-s64 pmc = 0;
 
 do {
     seq = pc\->lock;
@@ -1215,7 +1309,7 @@ do {
 
     if (pc\->cap_usr_rdpmc && idx) {
         width = pc\->pmc_width;
-        pmc = rdpmc(idx \- 1);
+        count += rdpmc(idx \- 1);
     }
 
     barrier();
@@ -1223,6 +1317,16 @@ do {
 .fi
 .in
 .TP
+.IR cap_user_time " (Since Linux 3.12)"
+This bit indicates the hardware has a constant, non-stop
+timestamp counter (TSC on x86).
+.TP
+.IR cap_user_time_zero " (Since Linux 3.12)"
+Indicates the presence of
+.I time_zero
+which allows mapping timestamp values to
+the hardware clock.
+.TP
 .I pmc_width
 If
 .IR cap_usr_rdpmc ,
@@ -1274,6 +1378,27 @@ enabled and possible running (if idx), improving the scaling:
     count = quot * enabled + (rem * enabled) / running;
 .fi
 .TP
+.IR time_zero " (Since Linux 3.12)"
+
+If
+.I cap_user_time_zero
+is set then the hardware clock (the TSC timestamp counter on x86)
+can be calculated from the
+.IR time_zero ", " time_mult ", and " time_shift " values:"
+.nf
+    time = timestamp - time_zero;
+    quot = time / time_mult;
+    rem = time % time_mult;
+    cyc = (quot << time_shift) + (rem << time_shift) / time_mult;
+.fi
+And vice versa:
+.nf
+    quot = cyc >> time_shift;
+    rem = cyc & (((u64)1 << time_shift) - 1);
+    timestamp = time_zero + quot * time_mult +
+        ((rem * time_mult) >> time_shift);
+.fi
+.TP
 .I data_head
 This points to the head of the data section.
 The value continuously increases, it does not wrap.
@@ -2221,6 +2387,17 @@ ioctl argument was broken and would repeatedly operate
 on the event specified rather than iterating across
 all sibling events in a group.
 
+From Linux 3.4 to Linux 3.11 the mmap
+.I cap_usr_rdpmc
+and
+.I cap_usr_time
+bits mapped to the same location.
+Code should migrate to the new
+.I cap_user_rdpmc
+and
+.I cap_user_time
+fields instead.
+
 Always double-check your results!
 Various generalized events have had wrong values.
 For example, retired branches measured
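
For anyone who wants to sanity-check the time_zero arithmetic documented in the patch, the two conversions can be sketched as standalone C helpers. This is only an illustration under stated assumptions: the struct perf_time_params type and the function names cyc_to_timestamp() and timestamp_to_cyc() are hypothetical and exist neither in the kernel nor in the man page, and a real caller would first check cap_user_time_zero and copy the three fields out of the mmap page under its seqlock.

```c
#include <stdint.h>

/* Hypothetical snapshot of the mmap page fields used by the formulas.
 * Real code would copy these from struct perf_event_mmap_page while
 * holding the seqlock, and only if cap_user_time_zero is set. */
struct perf_time_params {
    uint64_t time_zero;
    uint32_t time_mult;
    uint16_t time_shift;
};

/* Hardware clock value (TSC cycles on x86) to perf timestamp. */
static uint64_t cyc_to_timestamp(const struct perf_time_params *p,
                                 uint64_t cyc)
{
    uint64_t quot = cyc >> p->time_shift;
    /* 64-bit constant so the shift is done in 64 bits, not in int. */
    uint64_t rem = cyc & (((uint64_t)1 << p->time_shift) - 1);

    return p->time_zero + quot * p->time_mult +
           ((rem * p->time_mult) >> p->time_shift);
}

/* Perf timestamp back to hardware clock value. */
static uint64_t timestamp_to_cyc(const struct perf_time_params *p,
                                 uint64_t timestamp)
{
    uint64_t time = timestamp - p->time_zero;
    uint64_t quot = time / p->time_mult;
    uint64_t rem = time % p->time_mult;

    return (quot << p->time_shift) +
           (rem << p->time_shift) / p->time_mult;
}
```

With time_mult = 1024 and time_shift = 10 the scale factor is exactly 1, so cyc_to_timestamp() just adds time_zero, which makes a convenient first test. Splitting the value into quot and rem, as the man page text does, keeps the intermediate products smaller than a single multiply of the full 64-bit value would.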