Marc Zyngier <maz@xxxxxxxxxx> writes:
On Tue, 21 Mar 2023 19:10:04 +0000, Colton Lewis <coltonlewis@xxxxxxxxxx> wrote:
Marc Zyngier <maz@xxxxxxxxxx> writes:
>> +#define MEASURE_CYCLES(x) \
>> + ({ \
>> + uint64_t start; \
>> + start = cycles_read(); \
>> + x; \
> You insert memory accesses inside a sequence that has no dependency
> with it. On a weakly ordered memory system, there is absolutely no
> reason why the memory access shouldn't be moved around. What do you
> exactly measure in that case?
cycles_read is built on another function timer_get_cntct which includes its own barriers. Stripped of some abstraction, the sequence is:
timer_get_cntct (isb+read timer)
whatever is being measured
timer_get_cntct
I hadn't looked at it too closely before but on review of the manual I think you are correct. Borrowing from example D7-2 in the manual, it should be:
timer_get_cntct
isb
whatever is being measured
dsb
timer_get_cntct
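A minimal sketch of that barrier sequence in C, for illustration only. The names `counter_read` and `MEASURE_TICKS` are hypothetical stand-ins for `timer_get_cntct` and the macro in the patch, not the actual selftest code. On arm64 the counter is assumed to be readable through `cntvct_el0`; the non-arm64 branch is an assumption added so the sketch builds elsewhere, and there the barriers are compiler barriers only and the counter is the C library's `clock()`.

```c
#include <stdint.h>
#include <time.h>

#if defined(__aarch64__)
/* isb: later instructions are fetched and executed only after this point. */
static inline void isb(void) { __asm__ volatile("isb" ::: "memory"); }
/* dsb sy: wait until all earlier memory accesses have completed. */
static inline void dsb(void) { __asm__ volatile("dsb sy" ::: "memory"); }

/* Hypothetical stand-in for timer_get_cntct: isb, then read the counter. */
static inline uint64_t counter_read(void)
{
    uint64_t val;

    isb();
    __asm__ volatile("mrs %0, cntvct_el0" : "=r"(val) : : "memory");
    return val;
}
#else
/* Assumption: fallback for other architectures, compiler barriers only. */
static inline void isb(void) { __asm__ volatile("" ::: "memory"); }
static inline void dsb(void) { __asm__ volatile("" ::: "memory"); }

static inline uint64_t counter_read(void)
{
    isb();
    return (uint64_t)clock(); /* processor time, not a hardware counter */
}
#endif

/*
 * counter read, isb, payload, dsb, counter read.
 * Uses a GNU statement expression, as the macro in the patch does.
 */
#define MEASURE_TICKS(x)                      \
    ({                                        \
        uint64_t start_ = counter_read();     \
        isb();                                \
        x;                                    \
        dsb();                                \
        counter_read() - start_;              \
    })
```

The result is in counter ticks, not CPU cycles, so a short payload can legitimately measure as zero.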
That's better, but also very heavy-handed. You'd be better off constructing an address dependency from the timer value, and feeding that into a load-acquire/store-release pair wrapping your payload.
I can do something like that.
>> + cycles_read() - start; \
> I also question the usefulness of this exercise. You're comparing the
> time it takes for a multi-GHz system to put a write in a store buffer
> (assuming it didn't miss in the TLBs) vs a counter that gets updated
> at a frequency of a few tens of MHz.
> My gut feeling is that this results in a big fat zero most of the
> time, but I'm happy to be explained otherwise.
In context, I'm trying to measure the time it takes to write to a buffer *with dirty memory logging enabled*. What do you mean by zero? I can confirm from running this code I am not measuring zero time.
See my earlier point: the counter ticks at a few MHz, and the CPU runs at multiple GHz. So unless "whatever" is something that takes a significant time (several thousands of CPU cycles), you'll measure nothing using the counter. Page faults will probably show, but not a normal access.
The right tool for this job is to use PMU events, as they count at the CPU frequency.
Thanks. I understand you clearly now. I think it takes tens of CPU cycles, not thousands, to observe a timer tick in the usual case (2 GHz / 25 MHz = 80, though slower timers exist), but I agree with you that a more precise tool is called for.
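The arithmetic above can be written as a small helper. This is only a sketch: `cpu_cycles_per_tick` is a hypothetical name, and the frequencies are passed in by the caller, not read from hardware.

```c
#include <stdint.h>

/*
 * CPU cycles that elapse during one counter tick.
 * Integer division, so the result rounds down.
 * Returns 0 if timer_hz is 0 to avoid dividing by zero.
 */
static inline uint64_t cpu_cycles_per_tick(uint64_t cpu_hz, uint64_t timer_hz)
{
    if (timer_hz == 0)
        return 0;
    return cpu_hz / timer_hz;
}
```

With a 2 GHz CPU and a 25 MHz counter this gives 80. With an assumed 1 MHz counter, chosen here only to illustrate a slower timer, it gives 2000, which is where "thousands of CPU cycles" becomes the right order of magnitude.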