RE: [PATCH V1 2/3] perf/x86/intel/bm.c: Add Intel Branch Monitoring support

"Dey, Megha" <megha.dey@xxxxxxxxx> · Mon, 13 Nov 2017 19:22:40 +0000

>-----Original Message-----
>From: Peter Zijlstra [mailto:peterz@xxxxxxxxxxxxx]
>Sent: Monday, November 13, 2017 1:00 AM
>To: Megha Dey <megha.dey@xxxxxxxxxxxxxxx>
>Cc: x86@xxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; linux-
>doc@xxxxxxxxxxxxxxx; tglx@xxxxxxxxxxxxx; mingo@xxxxxxxxxx;
>hpa@xxxxxxxxx; andriy.shevchenko@xxxxxxxxxxxxxxx;
>kstewart@xxxxxxxxxxxxxxxxxxx; Yu, Yu-cheng <yu-cheng.yu@xxxxxxxxx>;
>Brown, Len <len.brown@xxxxxxxxx>; gregkh@xxxxxxxxxxxxxxxxxxx;
>acme@xxxxxxxxxx; alexander.shishkin@xxxxxxxxxxxxxxx; jolsa@xxxxxxxxxx;
>namhyung@xxxxxxxxxx; vikas.shivappa@xxxxxxxxxxxxxxx;
>pombredanne@xxxxxxxx; me@xxxxxxxxxxxx; bp@xxxxxxx; Andrejczuk,
>Grzegorz <grzegorz.andrejczuk@xxxxxxxxx>; Luck, Tony
><tony.luck@xxxxxxxxx>; corbet@xxxxxxx; Shankar, Ravi V
><ravi.v.shankar@xxxxxxxxx>; Dey, Megha <megha.dey@xxxxxxxxx>
>Subject: Re: [PATCH V1 2/3] perf/x86/intel/bm.c: Add Intel Branch
>Monitoring support
>
>On Sat, Nov 11, 2017 at 01:20:05PM -0800, Megha Dey wrote:
>> Currently, the cannonlake family of Intel processors support the
>> branch monitoring feature. Intel's Branch monitoring feature is trying
>> to utilize heuristics to detect the occurrence of an ROP (Return
>> Oriented Programming) attack.
>>
>> A perf-based kernel driver has been used to monitor the occurrence of
>> one of the 6 branch monitoring events. There are 2 counters that each
>> can select between one of these events for evaluation over a specified
>> instruction window size (0 to 1023). For each counter, a threshold
>> value
>> (0 to 127) can be configured to set a point at which ROP detection
>> event action is taken (determined by user-space). Each task can
>> monitor a maximum of 2 events at any given time.
>>
>> Apart from window_size(global) and threshold(per-counter), various
>> sysfs entries are provided for the user to configure: guest_disable,
>> lbr_freeze, window_cnt_sel, cnt_and_mode (all global) and
>mispred_evt_cnt(per-counter).
>> For all events belonging to the same task, the global parameters are
>> shared.
>
>Is there any sensible documentation on this except the MSR listings?

I have documented these sysfs entries in the next patch of this patch set:
Add Documentation for branch monitoring. Apart from that, unfortunately
there is only the MSR listings.

>
>>
>> Everytime a task is scheduled out, we save current window and count
>> associated with the event being monitored. When the task is scheduled
>> next, we start counting from previous count associated with this event.
>> Thus, a full context switch in this case is not necessary.
>
>What? That doesn't make any sense. The fact that we scheduled out and
>then in again _is_ a full context switch no?

What I meant was we need not save and restore all the branch monitoring 
MSRs during a context switch. I agree this is confusing. Will remove this line. 
>
>>
>> Signed-off-by: Megha Dey <megha.dey@xxxxxxxxxxxxxxx>
>> Signed-off-by: Yu-Cheng Yu <yu-cheng.yu@xxxxxxxxx>
>
>That SoB chain is buggered.

Will change the ordering.
>
>
>> +static int intel_bm_event_nmi_handler(unsigned int cmd, struct
>> +pt_regs *regs) {
>> +	struct perf_event *event;
>> +	union bm_detect_status stat;
>> +	int i;
>> +	unsigned long x;
>> +
>> +	rdmsrl(BR_DETECT_STATUS_MSR, stat.raw);
>
>	if (!stat.event)
>		return NMI_DONE;
>
>saves you a whole bunch of indentation, no?

Yep, it does.
>
>> +
>> +	if (stat.event) {
>> +		wrmsrl(BR_DETECT_STATUS_MSR, 0);
>> +		apic_write(APIC_LVTPC, APIC_DM_NMI);
>> +		/*
>> +		 * Issue wake-up to corresponding polling event
>> +		 */
>> +		x = stat.ctrl_hit;
>> +		for_each_set_bit(i, &x, BM_MAX_COUNTERS) {
>> +			event = current->thread.bm_counter_owner[i];
>> +			local64_inc(&event->count);
>> +			atomic_set(&event->hw.bm_poll, POLLIN);
>> +			event->pending_wakeup = 1;
>> +			irq_work_queue(&event->pending);
>> +		}
>> +		return NMI_HANDLED;
>> +	}
>> +	return NMI_DONE;
>> +}
>> +
>> +/*
>> + * Unmask the NMI bit of the local APIC the first time task is
>> +scheduled
>> + * on a particular CPU.
>> + */
>> +static void intel_bm_unmask_nmi(void) {
>> +	this_cpu_write(bm_unmask_apic, 0);
>> +
>> +	if (!(this_cpu_read(bm_unmask_apic))) {
>> +		apic_write(APIC_LVTPC, APIC_DM_NMI);
>> +		this_cpu_inc(bm_unmask_apic);
>> +	}
>> +}
>
>What? Why?

Normally, other drivers using perf create an event on every CPU (thereby calling
perf_init on every CPU), where this bit(APIC_DM_NMI)is explicitly unmasked.
In our driver, we do not do this (since we are worried only about a particular task)
and hence this bit is only disabled on the local APIC where the perf event is initialized.
As such, if the task is scheduled out to some other CPU, this bit is set and hence
would stop the interrupt from reaching the processing core.

>
>> +static int intel_bm_event_add(struct perf_event *event, int mode) {
>> +	union bm_detect_status cur_stat, prev_stat;
>> +
>> +	WARN_ON(event->hw.id >= BM_MAX_COUNTERS);
>> +
>> +	prev_stat.raw = local64_read(&event->hw.prev_count);
>> +
>> +	/*
>> +	 * Start counting from previous count associated with this event
>> +	 */
>> +	rdmsrl(BR_DETECT_STATUS_MSR, cur_stat.raw);
>> +
>> +	cur_stat.count[event->hw.id] = prev_stat.count[event->hw.id];
>> +	cur_stat.count_window = prev_stat.count_window;
>> +	wrmsrl(BR_DETECT_STATUS_MSR, cur_stat.raw);
>
>Why are you writing back the value you read? Just to waste cycles?

We only wanted to update the window count and event count associated with one of
the 2 event IDs. But you are right, we don't read to read the MSR, will directly write to it.
>
>> +	wrmsrl(BR_DETECT_CONTROL_MSR, event->hw.bm_ctrl);
>> +
>> +	intel_bm_unmask_nmi();
>> +
>> +	wrmsrl(BR_DETECT_COUNTER_CONFIG_BASE + event->hw.id,
>> +		(event->hw.bm_counter_conf | 1));
>
>Please use a named construct for that enable bit.

I will switch to using bitfields for this MSR, similar to the status register.
>
>> +
>> +	return 0;
>> +}
>> +
>> +static void intel_bm_event_update(struct perf_event *event) {
>> +	union bm_detect_status cur_stat;
>> +
>> +	rdmsrl(BR_DETECT_STATUS_MSR, cur_stat.raw);
>> +	local64_set(&event->hw.prev_count, (uint64_t)cur_stat.raw); }
>
>That looks wrong... the general point of update functions is to update the
>count, the above does not in fact do that.

Ok will remove this and add this functionality to event_del()
>
>> +
>> +static void intel_bm_event_del(struct perf_event *event, int flags) {
>> +	WARN_ON(event->hw.id >= BM_MAX_COUNTERS);
>> +
>> +	wrmsrl(BR_DETECT_COUNTER_CONFIG_BASE + event->hw.id,
>> +		(event->hw.bm_counter_conf & ~1));
>
>Either that EN bit is part of the bm_counter_conf, in which case you didn't
>need to add it in _add(), or its not and you don't need to clear it here. Make
>up your mind.

We are starting the count in add() (last bit of bm_counter_conf) and stopping it here.
I am not sure what you mean by this. Are you saying that we don't explicitly have to
enable or disable the count if the enable bit is a part of the MSR?
>
>> +
>> +	intel_bm_event_update(event);
>
>Except of course, that does not in fact update...

Will remove this.
>
>> +}
>> +
>> +static void intel_bm_event_destroy(struct perf_event *event) {
>> +	bm_counter_owner[event->hw.id] = NULL; }
>> +
>> +static DEFINE_MUTEX(bm_counter_mutex);
>> +
>> +static int intel_bm_event_init(struct perf_event *event) {
>> +	u64 cfg;
>> +	int counter_to_use = -1, i;
>> +
>> +	local64_set(&event->hw.prev_count, 0);
>> +
>
>> +	/*
>> +	 * Find a hardware counter for the target task
>> +	 */
>> +	bm_counter_owner = event->hw.target-
>>thread.bm_counter_owner;
>> +
>> +	mutex_lock(&bm_counter_mutex);
>> +	for (i = 0; i < BM_MAX_COUNTERS; i++) {
>> +		if (bm_counter_owner[i] == NULL) {
>> +			counter_to_use = i;
>> +			bm_counter_owner[i] = event;
>> +			break;
>> +		}
>> +	}
>> +	mutex_unlock(&bm_counter_mutex);
>> +
>> +	if (counter_to_use == -1)
>> +		return -EBUSY;
>> +
>> +	event->hw.bm_ctrl = (bm_window_size <<
>BM_WINDOW_SIZE_SHIFT) |
>> +			    (bm_guest_disable << BM_GUEST_DISABLE_SHIFT)
>|
>> +			    (bm_lbr_freeze << BM_LBR_FREEZE_SHIFT) |
>> +			    (bm_window_cnt_sel <<
>BM_WINDOW_CNT_SEL_SHIFT) |
>> +			    (bm_cnt_and_mode <<
>BM_CNT_AND_MODE_SHIFT) |
>> +								BM_ENABLE;
>> +	event->hw.bm_counter_conf = (bm_threshold <<
>BM_THRESHOLD_SHIFT) |
>> +			(bm_mispred_evt_cnt <<
>BM_MISPRED_EVT_CNT_SHIFT) |
>> +					(cfg << BM_EVENT_TYPE_SHIFT);
>> +
>> +	event->hw.id = counter_to_use;
>> +	local64_set(&event->count, 0);
>
>That is just a really ugly hack to work around:

Actually, this was intended to make sure event->count is always zero for a new event.
This wasn't intended to be a hack to work around the below. Hence, was not mentioned
in the changelog.
>
>
>> +static struct pmu intel_bm_pmu = {
>> +	.task_ctx_nr     = perf_sw_context,
>
>
>this. And you didn't bother to mention that atrocity in your Changelog.

We don't need this. Will remove it in the next patch series.
>
>
>
>NAK.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html