Re: [RFC] Energy/power monitoring within the kernel

Pawel Moll <pawel.moll@xxxxxxx> · Wed, 24 Oct 2012 17:51:47 +0100

On Wed, 2012-10-24 at 01:40 +0100, Thomas Renninger wrote:
> > More and more people are getting interested in the subject of power
> > (energy) consumption monitoring. We have some external tools like
> > "battery simulators", energy probes etc., but some targets can measure
> > their power usage on their own.
> > 
> > Traditionally such data should be exposed to the user via hwmon sysfs
> > interface, and that's exactly what I did for "my" platform - I have
> > a /sys/class/hwmon/hwmon*/device/energy*_input and this was good
> > enough to draw pretty graphs in userspace. Everyone was happy...
> > 
> > Now I am getting new requests to do more with this data. In particular
> > I'm asked how to add such information to ftrace/perf output.
> Why? What is the gain?
> 
> Perf events can be triggered at any point in the kernel.
> A cpufreq event is triggered when the frequency gets changed.
> CPU idle events are triggered when the kernel requests to enter an idle state
> or exits one.
> 
> When would you trigger a thermal or a power event?
> There is the possibility of (critical) thermal limits.
> But if I understand this correctly you want this for debugging and
> I guess you have everything interesting one can do with temperature
> values:
>   - read the temperature
>   - draw some nice graphs from the results
> 
> Hm, I guess I know what you want to do:
> In your temperature/energy graph, you want to have some dots
> when relevant HW states (frequency, sleep states, DDR power,...)
> changed. Then you are able to see the effects over a timeline.
> 
> So you have to bring the existing frequency/idle perf events together
> with temperature readings.
> 
> The cleanest solution could be to enhance the existing userspace apps
> (pytimechart/perf timechart) and let them add another line
> (temperature/energy), but the data would not come from perf, but
> from sysfs/hwmon.
> Not sure whether this works out with the timechart tools.
> Anyway, this sounds like a userspace only problem.

OK, so that is actually what I'm working on right now. Not with the
standard perf tool (there are other users of that API ;-) but indeed I'm
trying to "enrich" the data stream coming from the kernel with values
originating in userspace. I am a little concerned about the effect of the
extra syscalls (reading the value, plus gettimeofday to generate a
timestamp) at higher sampling rates, but most likely it won't be a
problem. I can report back once I know more, if this is of interest to
anyone.
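A minimal sketch of one userspace sampling step, assuming the counter
follows the hwmon sysfs ABI (a cumulative energy value in microjoules,
exposed as decimal text in an energy*_input file). Instead of
gettimeofday() it timestamps with clock_gettime(CLOCK_MONOTONIC), which
does not jump when the wall clock is adjusted. The helper names
energy_open(), energy_sample() and avg_power_uw() are hypothetical. The
file is opened once and re-read at offset 0 with pread(), so each sample
costs one read plus one clock call; that sysfs regenerates the value on
every offset-0 read is an assumption worth checking on the target kernel.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* One timestamped reading of a cumulative energy counter. */
struct energy_sample {
    struct timespec ts;     /* CLOCK_MONOTONIC time taken right after the read */
    unsigned long long uj;  /* counter value, microjoules (hwmon convention) */
};

/* Open the attribute once (hypothetical helper); keep the fd for sampling. */
static int energy_open(const char *path)
{
    return open(path, O_RDONLY);
}

/* Read the current counter value at offset 0 and timestamp it.
 * Returns 0 on success, -1 on any read or parse error. */
static int energy_sample(int fd, struct energy_sample *s)
{
    char buf[32];
    ssize_t n = pread(fd, buf, sizeof(buf) - 1, 0);
    if (n <= 0)
        return -1;
    buf[n] = '\0';

    char *end;
    errno = 0;
    unsigned long long v = strtoull(buf, &end, 10);
    if (errno != 0 || end == buf)
        return -1;
    s->uj = v;

    /* Second syscall per sample: the timestamp. */
    return clock_gettime(CLOCK_MONOTONIC, &s->ts);
}

/* Average power in microwatts between two samples.
 * Assumes b was taken after a and the counter did not wrap in between. */
static double avg_power_uw(const struct energy_sample *a,
                           const struct energy_sample *b)
{
    double dt = (double)(b->ts.tv_sec - a->ts.tv_sec)
              + (double)(b->ts.tv_nsec - a->ts.tv_nsec) / 1e9;
    if (dt <= 0.0)
        return 0.0;
    return (double)(b->uj - a->uj) / dt;
}
```

Usage: call energy_open() on something like
/sys/class/hwmon/hwmon*/device/energy1_input, then energy_sample() at the
desired rate and feed consecutive pairs to avg_power_uw().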

Anyway, there are at least two debug/trace related use cases that
cannot be satisfied that way (of course one could argue about their
usefulness):

1. ftrace-over-network (https://lwn.net/Articles/410200/) which is
particularly appealing for "embedded users", where there's virtually no
useful userspace available (think Android). Here a (functional) trace
event is embedded into a normal trace and available "for free" at the
host side.

2. perf groups - the general idea is that one event (let it be a cycle
counter interrupt or even a timer) triggers a read of other values (e.g.
a cache counter or - in this case - an energy counter). The aim is to get
regular "snapshots" of the system state. I'm not sure if the standard
perf tool can do this, but I do :-)
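The group mechanism from point 2 can be sketched with perf_event_open(2).
This is a minimal sketch, not a tool: the instructions counter stands in
for an energy counter (no energy perf event is assumed to exist), and the
helper names open_group() and snapshot() are hypothetical. It shows only
the group-read layout (with PERF_FORMAT_GROUP alone: a count, then one
value per member, leader first) fetched on demand with a single read().
In the sampling setup described above, the leader would additionally set
sample_period and PERF_SAMPLE_READ so that each leader overflow records
the same group values in the mmap ring buffer; that part is not shown.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* glibc has no perf_event_open() wrapper, so go through syscall(2). */
static int perf_open(struct perf_event_attr *attr, int group_fd)
{
    return (int)syscall(SYS_perf_event_open, attr, 0 /* this thread */,
                        -1 /* any CPU */, group_fd, 0UL);
}

/* read() layout on the leader when read_format == PERF_FORMAT_GROUP. */
struct group_read {
    uint64_t nr;        /* number of events in the group */
    uint64_t values[2]; /* leader first, then members in creation order */
};

/* Open a two-event group: cycles as leader, instructions as the member
 * (stand-in for an energy counter). Returns the leader fd or -1;
 * *member_fd receives the member's fd on success. */
static int open_group(int *member_fd)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;
    attr.read_format = PERF_FORMAT_GROUP;
    attr.disabled = 1;       /* start stopped; enable the group as a whole */
    attr.exclude_kernel = 1; /* user-only counting needs fewer privileges */
    attr.exclude_hv = 1;

    int leader = perf_open(&attr, -1);
    if (leader < 0)
        return -1;

    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 0;       /* members count whenever the leader does */
    *member_fd = perf_open(&attr, leader);
    if (*member_fd < 0) {
        close(leader);
        return -1;
    }
    return leader;
}

/* Take one consistent snapshot of every group member with one read(). */
static int snapshot(int leader, struct group_read *out)
{
    ssize_t n = read(leader, out, sizeof(*out));
    return (n == (ssize_t)sizeof(*out) && out->nr == 2) ? 0 : -1;
}
```

Usage: enable the whole group with
ioctl(leader, PERF_EVENT_IOC_ENABLE, PERF_IOC_FLAG_GROUP), then call
snapshot() whenever a consistent view of all members is needed. Opening
can fail on machines without hardware counters or with a restrictive
perf_event_paranoid setting, hence the -1 return.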

And last, but not least, there are the non-debug/trace clients for
energy data as discussed in other mails in this thread. Of course the
trace event won't really satisfy their needs either.

Thanks for your feedback!

Paweł

_______________________________________________
lm-sensors mailing list
lm-sensors@xxxxxxxxxxxxxx
http://lists.lm-sensors.org/mailman/listinfo/lm-sensors