* Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

> >  	mask = x86_pmu.lbr_nr - 1;
> > -	tos = intel_pmu_lbr_tos();
> > +	tos = task_ctx->tos;
> >  	for (i = 0; i < tos; i++) {
> >  		lbr_idx = (tos - i) & mask;
> >  		wrmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
> > @@ -247,6 +247,7 @@ static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
> >  		if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)
> >  			wrmsrl(MSR_LBR_INFO_0 + lbr_idx, task_ctx->lbr_info[i]);
> >  	}
> > +	wrmsrl(x86_pmu.lbr_tos, tos);
> > 	task_ctx->lbr_stack_state = LBR_NONE;
> >  }
>
> Any idea how much more expensive that wrmsr() is compared to the rdmsr()
> it replaces?
>
> If it's significant we could think about having this behaviour depend on
> callstacks.

The WRMSR extra cost is probably rather significant - here is a typical
Intel WRMSR vs. RDMSR (non-hardwired) cache-hot/cache-cold cost
difference:

[  170.798574] x86/bench: -------------------------------------------------------------------
[  170.807258] x86/bench: |           RDTSC-cycles:    hot (±noise)  /  cold (±noise)
[  170.816115] x86/bench: -------------------------------------------------------------------
[  212.146982] x86/bench:                    rdtsc:            16   /       60
[  213.725998] x86/bench:                    rdmsr:           100   /      148
[  215.469958] x86/bench:                    wrmsr:           456   /      708

That's on a Xeon E7-4890 (22nm IvyBridge-EX).

So the WRMSR is roughly 350-550 RDTSC cycles more expensive than the
RDMSR it replaces ...

Thanks,

	Ingo
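
For reference, below is a minimal sketch of how such RDMSR-vs-WRMSR cycle
numbers could be gathered from a kernel module. This is not the x86/bench
harness quoted above; the module name, the choice of MSR_LBR_TOS as the
target MSR, the iteration count and the preempt_disable()-only protection
are illustrative assumptions, not a description of Ingo's benchmark.

/* msr_bench.c - toy RDMSR vs. WRMSR cycle comparison (illustrative only) */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/preempt.h>
#include <asm/msr.h>
#include <asm/msr-index.h>

#define BENCH_ITERS	1000

static int __init msr_bench_init(void)
{
	u64 t0, t1, val;
	int i;

	/* Stay on one CPU; IRQs and SMIs can still perturb the numbers. */
	preempt_disable();

	/* Cache-hot RDMSR: read the LBR top-of-stack MSR in a tight loop. */
	t0 = rdtsc_ordered();
	for (i = 0; i < BENCH_ITERS; i++)
		rdmsrl(MSR_LBR_TOS, val);
	t1 = rdtsc_ordered();
	pr_info("msr_bench: rdmsr ~%llu cycles/op\n", (t1 - t0) / BENCH_ITERS);

	/*
	 * Cache-hot WRMSR: write back the value just read, so the LBR TOS
	 * is left unchanged.  Assumes the MSR is writable on this CPU,
	 * which the patch under discussion relies on anyway.
	 */
	t0 = rdtsc_ordered();
	for (i = 0; i < BENCH_ITERS; i++)
		wrmsrl(MSR_LBR_TOS, val);
	t1 = rdtsc_ordered();
	pr_info("msr_bench: wrmsr ~%llu cycles/op\n", (t1 - t0) / BENCH_ITERS);

	preempt_enable();

	/* One-shot measurement: fail the load so no rmmod is needed. */
	return -EAGAIN;
}
module_init(msr_bench_init);

MODULE_LICENSE("GPL");

A more careful measurement would also pin the CPU frequency, filter out
interrupt/SMI noise, and exercise the cache-cold case, which is where the
second column of the x86/bench output above comes from.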