Re: [patch] perf_event_open() updates for Linux 3.12

On 09/21/13 06:15, Vince Weaver wrote:
> 
> Below are the changes to perf_event_open.2 for the upcoming
> Linux 3.12 release.

Vince,

I just wanted to double check with you: everything in this 
old mail has now been applied via other patches, right?
(Indeed, I believe I have no more outstanding patches from you, 
right?)

Cheers,

Michael


> I'm not sure if sending these at 3.12-rc1 time is too early.
> 
> There are some pretty big changes this time, including an
> unfortunate ABI breakage with the cap_usr_rdpmc/cap_usr_time
> bits.
> 
> Signed-off-by: Vince Weaver <vincent.weaver@xxxxxxxxx>
> 
> diff --git a/man2/perf_event_open.2 b/man2/perf_event_open.2
> index 71a09d5..7b87c4c 100644
> --- a/man2/perf_event_open.2
> +++ b/man2/perf_event_open.2
> @@ -468,6 +468,13 @@ This counts the number of emulation faults.
>  The kernel sometimes traps on unimplemented instructions
>  and emulates them for user space.
>  This can negatively impact performance.
> +.TP
> +.BR PERF_COUNT_SW_DUMMY " (Since Linux 3.12)"
> +This is a placeholder event that counts nothing.
> +Informational sample record types such as mmap or comm
> +must be associated with an active event.
> +This dummy event allows gathering such records without requiring
> +a counting event.
>  .RE
>  
>  .RS
> @@ -680,6 +687,27 @@ Records the data source: where in the memory hierarchy
>  the data associated with the sampled instruction came from.
>  This is only available if the underlying hardware
>  supports this feature.
> +.TP
> +.BR PERF_SAMPLE_IDENTIFIER " (Since Linux 3.12)"
> +Places the SAMPLE_ID value in a fixed position in the record,
> +either at the beginning (for sample events) or at the end
> +(if a non-sample event).
> +
> +This was necessary because a sample stream may have
> +records from various different event sources with different
> +.I sample_type
> +settings.
> +Parsing the event stream properly was not possible because the 
> +format of the record was needed to find SAMPLE_ID, but
> +the format could not be found without knowing what
> +event the sample belonged to (causing a circular
> +dependency).
> +
> +This new
> +.B PERF_SAMPLE_IDENTIFIER
> +setting makes the event stream always parsable
> +by putting SAMPLE_ID in a fixed location, even though
> +it means having duplicate SAMPLE_ID values in records.
>  .RE
>  .TP
>  .IR "read_format"
> @@ -860,12 +888,33 @@ field, but enables including data mmap events
>  in the ring-buffer.
>  .TP
>  .IR "sample_id_all" " (Since Linux 2.6.38)"
> -If set, then TID, TIME, ID, CPU, and STREAM_ID can
> +If set, then TID, TIME, ID, STREAM_ID, and CPU can
>  additionally be included in
>  .RB non- PERF_RECORD_SAMPLE s
>  if the corresponding
>  .I sample_type
>  is selected.
> +
> +If 
> +.B PERF_SAMPLE_IDENTIFIER
> +is specified, then an additional ID value is included
> +as the last value to ease parsing the record stream.
> +This may lead to the
> +.I id 
> +value appearing twice.
> +
> +The layout is described by this pseudo-structure:
> +.in +4n
> +.nf
> +struct sample_id {
> +    { u32 pid, tid; } /* if PERF_SAMPLE_TID set        */
> +    { u64 time;     } /* if PERF_SAMPLE_TIME set       */
> +    { u64 id;       } /* if PERF_SAMPLE_ID set         */
> +    { u64 stream_id;} /* if PERF_SAMPLE_STREAM_ID set  */
> +    { u32 cpu, res; } /* if PERF_SAMPLE_CPU set        */
> +    { u64 id;       } /* if PERF_SAMPLE_IDENTIFIER set */
> +};
> +.fi
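
A minimal sketch of how a parser might rely on this fixed placement, assuming every event in the stream was opened with PERF_SAMPLE_IDENTIFIER and sample_id_all set. The names rec_header, REC_SAMPLE and record_event_id() are hypothetical; rec_header is a local mirror of struct perf_event_header so that the example stands alone.

```c
#include <stdint.h>
#include <string.h>

/* Numeric value of PERF_RECORD_SAMPLE in <linux/perf_event.h>. */
#define REC_SAMPLE 9

/* Local mirror of struct perf_event_header (hypothetical name). */
struct rec_header {
    uint32_t type;
    uint16_t misc;
    uint16_t size;   /* total record size in bytes, header included */
};

/*
 * Return the event ID carried by one record.
 * Assumption: PERF_SAMPLE_IDENTIFIER is set for every event feeding
 * this buffer, so the ID is the first u64 after the header in sample
 * records and the last u64 in all other records.
 */
static uint64_t record_event_id(const void *rec)
{
    const unsigned char *p = rec;
    struct rec_header hdr;
    uint64_t id;

    memcpy(&hdr, p, sizeof(hdr));
    if (hdr.type == REC_SAMPLE)
        memcpy(&id, p + sizeof(hdr), sizeof(id));
    else
        memcpy(&id, p + hdr.size - sizeof(id), sizeof(id));
    return id;
}
```

With the ID in hand, the parser can look up the sample_type of the originating event and only then decode the remaining fields.
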
>  .TP
>  .IR "exclude_host" " (Since Linux 3.2)"
>  Do not measure time spent in VM host.
> @@ -879,6 +928,11 @@ Do not include kernel callchains.
>  .IR "exclude_callchain_user" " (Since Linux 3.7)"
>  Do not include user callchains.
>  .TP
> +.IR "mmap2" " (Since Linux 3.12)"
> +Include an extended mmap record that contains enough
> +additional information to uniquely identify
> +shared mappings.
> +.TP
>  .IR "wakeup_events" ", " "wakeup_watermark"
>  This union sets how many samples
>  .RI ( wakeup_events )
> @@ -1142,8 +1196,13 @@ struct perf_event_mmap_page {
>      __u64 time_running;     /* time event on CPU */
>      union {
>          __u64   capabilities;
> -        __u64   cap_usr_time  : 1,
> -                cap_usr_rdpmc : 1,
> +        struct {
> +            __u64   cap_usr_time / cap_usr_rdpmc / cap_bit0 : 1,
> +                    cap_bit0_is_deprecated : 1,
> +                    cap_user_rdpmc         : 1,
> +                    cap_user_time          : 1,
> +                    cap_user_time_zero     : 1;
> +        };
>      };
>      __u16   pmc_width;
>      __u16   time_shift;
> @@ -1173,8 +1232,9 @@ A seqlock for synchronization.
>  A unique hardware counter identifier.
>  .TP
>  .I offset
> -.\" FIXME clarify
> -Add this to hardware counter value??
> +When using rdpmc for reads, this offset value
> +must be added to the one returned by rdpmc to get
> +the current total event count.
>  .TP
>  .I time_enabled
>  Time the event was active.
> @@ -1182,10 +1242,45 @@ Time the event was active.
>  .I time_running
>  Time the event was running.
>  .TP
> +.IR cap_usr_time " / " cap_usr_rdpmc " / " cap_bit0 " (Since Linux 3.4)"
> +There was a bug in the definition of 
> +.I cap_usr_time
> +and
> +.I cap_usr_rdpmc
> +from Linux 3.4 until Linux 3.11.
> +Both bits were defined to point to the same location, so it was
> +impossible to know if 
>  .I cap_usr_time
> -User time capability.
> +or
> +.I cap_usr_rdpmc
> +was actually set.
> +
> +Starting with 3.12 these are renamed to
> +.I cap_bit0
> +and you should use the new
> +.I cap_user_time
> +and
> +.I cap_user_rdpmc
> +fields instead.
> +
>  .TP
> +.IR cap_bit0_is_deprecated " (Since Linux 3.12)"
> +If set, this bit indicates that the kernel supports
> +the properly separated
> +.I cap_user_time
> +and
> +.I cap_user_rdpmc
> +bits.
> +
> +If not set, it indicates an older kernel where
> +.I cap_usr_time
> +and
>  .I cap_usr_rdpmc
> +map to the same bit and thus both features should
> +be used with caution.
> +
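
A hedged sketch of how user code might act on this bit when it reads the capabilities word from the mmap page. It works on the raw 64-bit value and assumes the bit-fields are allocated from the least significant bit upward (as GCC does on x86); the macro names, the enum and rdpmc_state() are hypothetical.

```c
#include <stdint.h>

/* Assumed bit positions within perf_event_mmap_page.capabilities. */
#define CAP_BIT0               (UINT64_C(1) << 0)
#define CAP_BIT0_IS_DEPRECATED (UINT64_C(1) << 1)
#define CAP_USER_RDPMC         (UINT64_C(1) << 2)

enum cap_state { CAP_NO, CAP_YES, CAP_AMBIGUOUS };

/* Decide whether user-space rdpmc may be used, given the raw word. */
static enum cap_state rdpmc_state(uint64_t capabilities)
{
    /* Kernel with the separated bits: trust cap_user_rdpmc. */
    if (capabilities & CAP_BIT0_IS_DEPRECATED)
        return (capabilities & CAP_USER_RDPMC) ? CAP_YES : CAP_NO;

    /* Older kernel: bit 0 could mean rdpmc, time, or both. */
    return (capabilities & CAP_BIT0) ? CAP_AMBIGUOUS : CAP_NO;
}
```

A caller that gets CAP_AMBIGUOUS would probably want to fall back to read() rather than guess.
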
> +.TP
> +.IR cap_user_rdpmc " (Since Linux 3.12)" 
>  If the hardware supports user-space read of performance counters
>  without syscall (this is the "rdpmc" instruction on x86), then
>  the following code can be used to do a read:
> @@ -1195,7 +1290,6 @@ the following code can be used to do a read:
>  u32 seq, time_mult, time_shift, idx, width;
>  u64 count, enabled, running;
>  u64 cyc, time_offset;
> -s64 pmc = 0;
>  
>  do {
>      seq = pc\->lock;
> @@ -1215,7 +1309,7 @@ do {
>  
>      if (pc\->cap_usr_rdpmc && idx) {
>          width = pc\->pmc_width;
> -        pmc = rdpmc(idx \- 1);
> +        count += rdpmc(idx \- 1);
>      }
>  
>      barrier();
> @@ -1223,6 +1317,16 @@ do {
>  .fi
>  .in
>  .TP
> +.IR cap_user_time " (Since Linux 3.12)"
> +This bit indicates the hardware has a constant, non-stop
> +timestamp counter (TSC on x86).
> +.TP
> +.IR cap_user_time_zero " (Since Linux 3.12)"
> +Indicates the presence of
> +.I time_zero
> +which allows mapping timestamp values to
> +the hardware clock.
> +.TP
>  .I pmc_width
>  If
>  .IR cap_usr_rdpmc ,
> @@ -1274,6 +1378,27 @@ enabled and possible running (if idx), improving the scaling:
>      count = quot * enabled + (rem * enabled) / running;
>  .fi
>  .TP
> +.IR time_zero " (Since Linux 3.12)"
> +
> +If 
> +.I cap_user_time_zero
> +is set then the hardware clock (the TSC timestamp counter on x86) 
> +can be calculated from the
> +.IR time_zero ", " time_mult ", and " time_shift " values:"
> +.nf
> +    time = timestamp - time_zero;
> +    quot = time / time_mult;
> +    rem  = time % time_mult;
> +    cyc = (quot << time_shift) + (rem << time_shift) / time_mult;
> +.fi
> +And vice versa:
> +.nf
> +    quot = cyc >> time_shift;
> +    rem  = cyc & ((1 << time_shift) - 1);
> +    timestamp = time_zero + quot * time_mult +
> +        ((rem * time_mult) >> time_shift);
> +.fi
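
The two conversions above can be sketched as plain C helpers. The function names are hypothetical, and the parameters stand in for the time_zero, time_mult and time_shift fields of the mmap page; a real reader would fetch those inside the seqlock loop shown earlier in the page.

```c
#include <stdint.h>

/* Convert a perf timestamp to a hardware clock value (cycles). */
static uint64_t perf_time_to_cyc(uint64_t timestamp, uint64_t time_zero,
                                 uint32_t time_mult, uint16_t time_shift)
{
    uint64_t time = timestamp - time_zero;
    uint64_t quot = time / time_mult;
    uint64_t rem  = time % time_mult;

    return (quot << time_shift) + (rem << time_shift) / time_mult;
}

/* Convert a hardware clock value (cycles) to a perf timestamp. */
static uint64_t cyc_to_perf_time(uint64_t cyc, uint64_t time_zero,
                                 uint32_t time_mult, uint16_t time_shift)
{
    uint64_t quot = cyc >> time_shift;
    uint64_t rem  = cyc & ((UINT64_C(1) << time_shift) - 1);

    return time_zero + quot * time_mult +
           ((rem * time_mult) >> time_shift);
}
```

Splitting into quotient and remainder keeps the intermediate products small, which is presumably why the formulas are written that way instead of as a single multiply and shift.
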
> +.TP
>  .I data_head
>  This points to the head of the data section.
>  The value continuously increases, it does not wrap.
> @@ -1385,6 +1510,7 @@ The values in the corresponding record (that follows the header)
>  depend on the
>  .I type
>  selected as shown.
> +
>  .RS
>  .TP 4
>  .B PERF_RECORD_MMAP
> @@ -1416,6 +1542,7 @@ struct {
>      struct perf_event_header header;
>      u64 id;
>      u64 lost;
> +    struct sample_id sample_id;
>  };
>  .fi
>  .in
> @@ -1437,6 +1564,7 @@ struct {
>      struct perf_event_header header;
>      u32 pid, tid;
>      char comm[];
> +    struct sample_id sample_id;
>  };
>  .fi
>  .in
> @@ -1451,6 +1579,7 @@ struct {
>      u32 pid, ppid;
>      u32 tid, ptid;
>      u64 time;
> +    struct sample_id sample_id;
>  };
>  .fi
>  .in
> @@ -1465,6 +1594,7 @@ struct {
>      u64 time;
>      u64 id;
>      u64 stream_id;
> +    struct sample_id sample_id;
>  };
>  .fi
>  .in
> @@ -1479,6 +1609,7 @@ struct {
>      u32 pid, ppid;
>      u32 tid, ptid;
>      u64 time;
> +    struct sample_id sample_id;
>  };
>  .fi
>  .in
> @@ -1492,6 +1623,7 @@ struct {
>      struct perf_event_header header;
>      u32 pid, tid;
>      struct read_format values;
> +    struct sample_id sample_id;
>  };
>  .fi
>  .in
> @@ -1503,6 +1635,7 @@ This record indicates a sample.
>  .nf
>  struct {
>      struct perf_event_header header;
> +    u64   sample_id;  /* if PERF_SAMPLE_IDENTIFIER */
>      u64   ip;         /* if PERF_SAMPLE_IP */
>      u32   pid, tid;   /* if PERF_SAMPLE_TID */
>      u64   time;       /* if PERF_SAMPLE_TIME */
> @@ -1531,6 +1664,16 @@ struct {
>  .fi
>  .RS 4
>  .TP 4
> +.I sample_id
> +If
> +.B PERF_SAMPLE_IDENTIFIER
> +is enabled, a 64-bit unique ID is included.
> +This is a duplication of the 
> +.B PERF_SAMPLE_ID
> +.I id
> +value, but included at the beginning of the sample
> +so parsers can easily obtain the value.
> +.TP
>  .I ip
>  If
>  .B PERF_SAMPLE_IP
> @@ -1855,6 +1998,29 @@ OS fault handler
>  .PD
>  .RE
>  .RE
> +.TP
> +.B PERF_RECORD_MMAP2
> +This record includes information on mmap() calls.
> +It includes extended fields not available with
> +the
> +.B PERF_RECORD_MMAP
> +record that allow uniquely identifying shared mappings.
> +.in +4n
> +.nf
> +struct {
> +    struct perf_event_header header;
> +    u32 pid, tid;
> +    u64 addr;
> +    u64 len;
> +    u64 pgoff;
> +    u32 maj;
> +    u32 min;
> +    u64 ino;
> +    u64 ino_generation;
> +    char filename[];
> +    struct sample_id sample_id;
> +};
> +.fi
>  .RE
>  .RE
>  .SS Signal overflow
> @@ -1994,6 +2160,12 @@ output should be ignored.
>  This adds an ftrace filter to this event.
>  
>  The argument is a pointer to the desired ftrace filter.
> +.TP
> +.BR PERF_EVENT_IOC_ID " (Since Linux 3.12)"
> +Returns the event ID value for the given event fd.
> +
> +The argument is a pointer to a 64-bit unsigned integer
> +to hold the result.
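
For illustration, a small wrapper around the new ioctl might look like the following. The wrapper name perf_event_get_id() is hypothetical, and the sketch assumes kernel headers from Linux 3.12 or later so that PERF_EVENT_IOC_ID is defined.

```c
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>

/*
 * Fetch the kernel-assigned ID of a perf event file descriptor.
 * Returns 0 on success and stores the ID in *id; returns -1 on
 * failure, with errno set by ioctl().
 */
static int perf_event_get_id(int fd, uint64_t *id)
{
    return ioctl(fd, PERF_EVENT_IOC_ID, id) == -1 ? -1 : 0;
}
```

The returned value is the same ID that shows up in records when PERF_SAMPLE_ID or PERF_SAMPLE_IDENTIFIER is selected, so it can be used to map records back to the fd that produced them.
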
>  .SS Using prctl
>  A process can enable or disable all the event groups that are
>  attached to it using the
> @@ -2200,6 +2372,17 @@ ioctl argument was broken and would repeatedly operate
>  on the event specified rather than iterating across
>  all sibling events in a group.
>  
> +From Linux 3.4 to Linux 3.11 the mmap
> +.I cap_usr_rdpmc
> +and
> +.I cap_usr_time
> +bits mapped to the same location.
> +Code should migrate to the new
> +.I cap_user_rdpmc
> +and
> +.I cap_user_time
> +fields instead.
> +
>  Always double-check your results!
>  Various generalized events have had wrong values.
>  For example, retired branches measured
> 


-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
--
To unsubscribe from this list: send the line "unsubscribe linux-man" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



