On Wed, 2010-02-24 at 14:35 +1300, Michael Cree wrote:
> I am trying to implement arch specific code on the Alpha for hardware
> performance events (yeah, I'm probably a little bit loopy and unsound
> of mind pursuing this on an end-of-line platform, but it's a way in to
> learn a little bit of kernel programming and it scratches an itch).
>
> I have taken a look at the code in the x86, sparc and ppc
> implementations and tried to drum up an Alpha implementation for the
> EV67/7/79 cpus, but it ain't working and is producing obviously
> erroneous counts.  Part of the problem is that I don't understand
> under what conditions, and with what assumptions, the performance
> event subsystem is calling into the architecture specific code.  Is
> there any documentation available that describes the architecture
> specific interface?
>
> The Alpha CPUs of interest have two 20-bit performance monitoring
> counters that can count cycles, instructions, Bcache misses and Mbox
> replays (but not all combinations of those).  For round numbers
> consider a 1GHz CPU, with a theoretical maximal sustained throughput
> of four instructions per cycle, then a single performance counter
> could potentially generate 4000 interrupts per second to signal
> counter overflow when counting instructions.
>
> The x86, sparc and PPC implementations seem to me to assume that calls
> to read back the counters occur more frequently than performance
> counter overflow interrupts, and that the highest bit of the counter
> can safely be used to detect overflow.  (Am I correct?)  That is
> likely not to be true of the Alpha because of the small width of the
> counter.  Is there someone who would be happy to give me, a kernel
> newbie who probably doesn't even make the grade of neophyte, a bit of
> direction on this?

Right, so the architecture interface is twofold: a struct pmu, and a
bunch of weak hw_perf_*() functions.
I'm trying to move away from the hw_perf_*() functions, but for now
they're there and are useful for a number of things. We have:

  hw_perf_event_init();
  hw_perf_disable();
  hw_perf_enable();
  hw_perf_group_sched_in();

hw_perf_event_init() is called when we are creating a counter of type
PERF_TYPE_RAW, PERF_TYPE_HARDWARE or PERF_TYPE_HW_CACHE; it will return
a struct pmu for that event.

hw_perf_disable()/hw_perf_enable() are like
local_irq_disable()/local_irq_enable() but for the Performance Monitor
Interrupt (PMI), which might be an NMI, so we need to disable it in
some arch-specific way -- these basically freeze/unfreeze the PMU.

hw_perf_group_sched_in() is a bit of a nightmare and a source of bugs,
and I really should get around to killing it off; it is used to
optimize multiple pmu->enable() calls.

Then we have struct pmu; it has the following members:

  enable()
  disable()
  start()
  stop()
  read()
  unthrottle()

->enable() will try to program the event onto the hardware and return
0 on success; if however it cannot, due to there not being a suitable
counter available, it shall return an error.

->disable() will remove the event from the hardware and release all
resources that were acquired by ->enable().

->start() will undo ->stop().

->stop() will stop the counter but not release any resources that
might have been acquired by ->enable().

->read() will read the hardware counter and fold the delta into
event->count.

->unthrottle(), when present, will undo whatever is done to stop the
PMI from triggering after perf_event_overflow() returns !0. That is,
we have sysctl_perf_event_sample_rate and we try to ensure the PMI
rate doesn't exceed that; if it does, perf_event_overflow() will
return !0 and the arch code is supposed to inhibit the PMI from firing
again until ->unthrottle() is called.
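To make the ->enable()/->disable()/->read() contract above concrete, here
is a small userspace model of it. This is a sketch, not the real kernel
API: the names (fake_event, fake_enable, ...), the two-counter limit, and
the simulated raw[] registers are all my own illustrative assumptions; the
only thing it is meant to show is the behaviour described above (enable
fails when no counter is free, disable releases the slot, read folds the
hardware delta into event->count).

```c
/* Userspace model of the struct pmu callback contract; all names and
 * the 2-counter limit are illustrative, not the real kernel API. */
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NCOUNTERS 2

struct fake_event {
	int counter;          /* hardware counter index, -1 if not scheduled */
	unsigned long count;  /* software-accumulated event count */
	unsigned long prev;   /* last raw value read from the hardware */
};

static struct fake_event *counters[NCOUNTERS]; /* hardware counter slots */
static unsigned long raw[NCOUNTERS];           /* simulated raw counts */

/* like ->enable(): program the event onto the hardware, or fail with an
 * error when no suitable counter is available */
static int fake_enable(struct fake_event *ev)
{
	for (int i = 0; i < NCOUNTERS; i++) {
		if (!counters[i]) {
			counters[i] = ev;
			ev->counter = i;
			ev->prev = raw[i];
			return 0;
		}
	}
	return -EAGAIN;
}

/* like ->read(): fold the hardware delta into event->count */
static void fake_read(struct fake_event *ev)
{
	unsigned long now = raw[ev->counter];
	ev->count += now - ev->prev;
	ev->prev = now;
}

/* like ->disable(): remove the event from the hardware and release the
 * counter slot acquired by fake_enable() */
static void fake_disable(struct fake_event *ev)
{
	fake_read(ev);
	counters[ev->counter] = NULL;
	ev->counter = -1;
}
```

With two slots, a third fake_enable() returns -EAGAIN until one of the
first two events is disabled -- which is exactly the situation the core
code handles by falling back to time-multiplexing events.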
This prevents users from accidentally live-locking the system by
requesting a PMI on every completed instruction ;-)

[ ->start()/->stop() are a way to reprogram the hardware without
  releasing constraint reservations; this is useful when you change
  the sample period of a running event. ]

As to your counter width: if you have a special overflow bit in a
separate register then you can possibly use that, but otherwise you
need the high bit to keep your count straight.

The PMI will happen _after_ the overflow, at which point you need to
fold the counter delta back into your event->count; if it just
overflowed that's bound to be a very small delta -- I guess you can
always add the max value on PMI, but that might be racy, especially in
the presence of ->read() calls.

Also, if you have multiple registers sharing a PMI you need to be able
to tell which register overflowed and caused the PMI.

> Also, the Alpha CPUs have an interesting mode whereby one programmes
> up one counter with a specified (or random) value that specifies a
> future instruction to profile.  The CPU runs for that number of
> instructions/cycles, then a short monitoring window (of a few cycles)
> is opened about the profiled instruction and when completed an
> interrupt is generated.  One can then read back a whole lot of
> information about the pipeline at the time of the profiled
> instruction.  This can be used for statistical sampling.  Does the
> performance events subsystem support monitoring with such a mode?

That sounds like AMD IBS, which I've been told is based on the Alpha
PMU. We currently do not have AMD IBS support.

AMD has two IBS counters, one does instructions and one does fetches;
I think Robert was going to support these by modeling them as
fixed-purpose counters and providing the extra information through
PERF_SAMPLE_RAW until we can come up with a saner model.

A potentially saner model is adding non-sampling counters into the
group which are used to represent these other aspects of the unit.
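The delta-folding discussed above, for a narrow counter like the Alpha's
20-bit ones, comes down to doing the subtraction modulo 2^20 so that a
single wrap between two reads is still accounted correctly. A minimal
sketch, assuming a 20-bit raw register and my own illustrative names
(struct evt, fold_delta) -- real arch code must additionally serialize
this against concurrent ->read() calls, as noted above:

```c
/* Sketch: folding a 20-bit hardware counter into a 64-bit software
 * count; names are illustrative, not real kernel code. */
#include <assert.h>
#include <stdint.h>

#define CNT_BITS 20
#define CNT_MASK ((1UL << CNT_BITS) - 1)

struct evt {
	uint64_t count;      /* software-accumulated total */
	unsigned long prev;  /* last raw hardware value (20 bits) */
};

/* Fold the current raw counter value into ev->count.  The subtraction
 * is done modulo 2^20, so if the register wrapped once since the last
 * read (e.g. on the PMI that fires after overflow) the small post-wrap
 * value still yields the correct positive delta. */
static void fold_delta(struct evt *ev, unsigned long raw)
{
	unsigned long delta = (raw - ev->prev) & CNT_MASK;

	ev->prev = raw;
	ev->count += delta;
}
```

Note this only stays correct for at most one wrap between reads -- which
is exactly why a 20-bit counter on a fast CPU forces such a high PMI
rate: miss a PMI and a full 2^20 events silently vanish from the count.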
-- To unsubscribe from this list: send the line "unsubscribe linux-alpha" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html