* Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

> >  	mask = x86_pmu.lbr_nr - 1;
> > -	tos = intel_pmu_lbr_tos();
> > +	tos = task_ctx->tos;
> >  	for (i = 0; i < tos; i++) {
> >  		lbr_idx = (tos - i) & mask;
> >  		wrmsrl(x86_pmu.lbr_from + lbr_idx, task_ctx->lbr_from[i]);
> > @@ -247,6 +247,7 @@ static void __intel_pmu_lbr_restore(struct x86_perf_task_context *task_ctx)
> >  		if (x86_pmu.intel_cap.lbr_format == LBR_FORMAT_INFO)
> >  			wrmsrl(MSR_LBR_INFO_0 + lbr_idx, task_ctx->lbr_info[i]);
> >  	}
> > +	wrmsrl(x86_pmu.lbr_tos, tos);
> > 	task_ctx->lbr_stack_state = LBR_NONE;
> >  }
>
> Any idea how much more expensive that wrmsr() is compared to the rdmsr()
> it replaces?
>
> If it's significant we could think about having this behaviour depend on
> callstacks.

The WRMSR extra cost is probably rather significant - here is a typical
Intel WRMSR vs. RDMSR (non-hardwired) cache-hot/cache-cold cost
difference:

[  170.798574] x86/bench: -------------------------------------------------------------------
[  170.807258] x86/bench: |           RDTSC-cycles:    hot (±noise)  /  cold (±noise)
[  170.816115] x86/bench: -------------------------------------------------------------------
[  212.146982] x86/bench:                    rdtsc:            16   /       60
[  213.725998] x86/bench:                    rdmsr:           100   /      148
[  215.469958] x86/bench:                    wrmsr:           456   /      708

That's on a Xeon E7-4890 (22nm IvyBridge-EX).

So the WRMSR is roughly 350-550 RDTSC cycles more expensive than the
RDMSR it replaces ...

Thanks,

	Ingo
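
For reference, below is a minimal sketch of how such RDMSR-vs-WRMSR cycle
numbers could be gathered from a kernel module. This is not the x86/bench
harness quoted above; the module name, the choice of MSR_LBR_TOS as the
target MSR, the iteration count and the preempt_disable()-only protection
are illustrative assumptions, not a description of Ingo's benchmark.

/* msr_bench.c - toy RDMSR vs. WRMSR cycle comparison (illustrative only) */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/preempt.h>
#include <asm/msr.h>
#include <asm/msr-index.h>

#define BENCH_ITERS	1000

static int __init msr_bench_init(void)
{
	u64 t0, t1, val;
	int i;

	/* Stay on one CPU; IRQs and SMIs can still perturb the numbers. */
	preempt_disable();

	/* Cache-hot RDMSR: read the LBR top-of-stack MSR in a tight loop. */
	t0 = rdtsc_ordered();
	for (i = 0; i < BENCH_ITERS; i++)
		rdmsrl(MSR_LBR_TOS, val);
	t1 = rdtsc_ordered();
	pr_info("msr_bench: rdmsr ~%llu cycles/op\n", (t1 - t0) / BENCH_ITERS);

	/*
	 * Cache-hot WRMSR: write back the value just read, so the LBR TOS
	 * is left unchanged.  Assumes the MSR is writable on this CPU,
	 * which the patch under discussion relies on anyway.
	 */
	t0 = rdtsc_ordered();
	for (i = 0; i < BENCH_ITERS; i++)
		wrmsrl(MSR_LBR_TOS, val);
	t1 = rdtsc_ordered();
	pr_info("msr_bench: wrmsr ~%llu cycles/op\n", (t1 - t0) / BENCH_ITERS);

	preempt_enable();

	/* One-shot measurement: fail the load so no rmmod is needed. */
	return -EAGAIN;
}
module_init(msr_bench_init);

MODULE_LICENSE("GPL");

A more careful measurement would also pin the CPU frequency, filter out
interrupt/SMI noise, and exercise the cache-cold case, which is where the
second column of the x86/bench output above comes from.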