On 06/03/2016 08:42, Xiao Guangrong wrote:
>>
>> +	rsvdf = pfec & PFERR_RSVD_MASK;
>
> No. RSVD is reserved by SMAP and it should not be used to walk guest
> page table.

Agreed.  You can treat your code as if rsvdf was always false.  Reserved
bits are handled elsewhere.

>> +	pkuf = pfec & PFERR_PK_MASK;
>>  	/*
>>  	 * PFERR_RSVD_MASK bit is set in PFEC if the access is not
>>  	 * subject to SMAP restrictions, and cleared otherwise. The
>> @@ -3824,12 +3830,34 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu,
>>  	 * clearer.
>>  	 */
>>  	smap = cr4_smap && u && !uf && !ff;
>> +
>> +	/*
>> +	 * PKU: additional mechanism by which the paging
>> +	 * controls access to user-mode addresses based
>> +	 * on the value in the PKRU register. A fault is
>> +	 * considered as a PKU violation if all of the
>> +	 * following conditions are true:
>> +	 * 1. CR4_PKE=1.
>> +	 * 2. EFER_LMA=1.
>> +	 * 3. the page is present with no reserved bit
>> +	 *    violations.
>> +	 * 4. the access is not an instruction fetch.
>> +	 * 5. the access is to a user page.
>> +	 * 6. PKRU.AD=1, or the access is a data write
>> +	 *    and PKRU.WD=1 and either CR0.WP=1 or it
>> +	 *    is a user access.
>> +	 *
>> +	 * The 2nd and 6th conditions are computed
>> +	 * dynamically in permission_fault.
>> +	 */
>
> It is not good as there are branches in the next patch.

It's important to note that branches in general are _not_ a problem.
Only badly-predicted branches are a problem; well-predicted branches
are _faster_ than branchless code.

For example, take is_last_gpte.  The branchy way to write it in
walk_addr_generic would be (excluding the 32-bit !PSE case) something
like:

	do {
	} while (level > PT_PAGE_TABLE_LEVEL &&
		 (!(gpte & PT_PAGE_SIZE_MASK) || level == mmu->root_level));

Here none of the branches is easily predicted, so we want to get rid of
them.

The next patch adds three branches, and they are not all equal:

- is_long_vcpu is well predicted to true (and even for 32-bit OSes it
  should be well predicted if the host is not overcommitted);

- pkru != 0 should be well predicted to false, at least for a few
  years...  and perhaps even later, considering that most MMIO access
  happens in the kernel;

- !wf || (!uf && !is_write_protection(vcpu)) is badly predicted and
  should be removed.

So only the last one is a problem.

> However, I do not think we need a new byte index for PK. The conditions
> detecting PK enablement can be fully found in current vcpu content
> (i.e., CR4, EFER and U/S access).

Adding a new byte index lets you cache CR4.PKE (and actually EFER.LMA
too, though Huaitong's patch doesn't do that).  It's a good thing to do.
U/S is also handled by adding a new byte index; see Huaitong's

	pku = cr4_pku && !ff && u;

If this is improved to

	pku = cr4_pku && long_mode_vcpu && !ff && u;

one branch goes away in permission_fault.

The read_pkru() branch, if well predicted, lets you optimize away the
pkru tests.  I think it _would_ be well predicted, so I think it should
remain.

The "(!wf || (!uf && !is_write_protection(vcpu)))" branch is indeed the
worst of the three.  I was lenient in my previous review because this
code won't run on any system being sold now or in the next 1-2 (?)
years.  However, we can indeed get rid of the branch, so let's do
it. :)

I don't like the idea of making permissions[] four times larger.
Instead, we can find the value of the expression elsewhere in
mmu->permissions (!), or cache it separately in a different field of
mmu.
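To make the byte-index idea concrete, here is a rough stand-alone model
of the caching (untested, user-space C; the masks are simplified, NX,
SMAP and SMEP are left out, and the caller is assumed to have set
PFERR_PK in the error code after consulting PKRU, which is roughly what
Huaitong's gva_to_gpa change does):

	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	/* PFEC bits (bit 0, the present bit, is always set and dropped). */
	#define PFERR_WRITE	(1u << 1)
	#define PFERR_USER	(1u << 2)
	#define PFERR_FETCH	(1u << 4)
	#define PFERR_PK	(1u << 5)

	/* pte_access bits; the 3-bit value doubles as a bitmap index. */
	#define ACC_WRITE	(1u << 1)
	#define ACC_USER	(1u << 2)

	static uint8_t permissions[1u << 5];	/* indexed by pfec >> 1 */

	static void update_permission_bitmask(bool cr0_wp, bool cr4_pku,
					      bool long_mode)
	{
		for (unsigned byte = 0; byte < (1u << 5); byte++) {
			unsigned pfec = byte << 1;
			bool wf = pfec & PFERR_WRITE;
			bool uf = pfec & PFERR_USER;
			bool ff = pfec & PFERR_FETCH;
			bool pkf = pfec & PFERR_PK;
			uint8_t map = 0;

			for (unsigned access = 0; access < 8; access++) {
				bool w = access & ACC_WRITE;
				bool u = access & ACC_USER;

				/* Ordinary U/S and W faults. */
				bool fault = (uf && !u) ||
					     (wf && !w && (uf || cr0_wp));

				/* The precomputable part of the pku term. */
				bool pku = cr4_pku && long_mode && !ff && u;

				fault |= pkf && pku;
				map |= (uint8_t)fault << access;
			}
			permissions[byte] = map;
		}
	}

	/* No branch on CR4.PKE, EFER.LMA, fetch or U/S: all in the byte. */
	static bool permission_fault(unsigned pfec, unsigned pte_access)
	{
		return (permissions[pfec >> 1] >> pte_access) & 1;
	}

	int main(void)
	{
		update_permission_bitmask(true, true, true);

		/* Supervisor write to read-only user page: faults (CR0.WP=1). */
		printf("%d\n", permission_fault(PFERR_WRITE, ACC_USER));
		/* User write to a writable user page: no fault. */
		printf("%d\n", permission_fault(PFERR_USER, ACC_USER | ACC_WRITE));
		/* User access with the PK bit set in PFEC: faults. */
		printf("%d\n", permission_fault(PFERR_USER | PFERR_PK, ACC_USER));
		return 0;
	}

The point is that CR4.PKE, EFER.LMA, the fetch bit and U/S all disappear
from the runtime path; permission_fault is a load, a shift and a mask.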
If I interpret the rules correctly, WD works like this.  First, we take
the PTE and imagine that it had W=0.  Then, if this access would not
fault, WD is ignored.  This is because:

- on reads, WD is always ignored;

- on writes, WD is ignored in supervisor mode if !CR0.WP.

... and this is how W=0 pages work, isn't it?  If so, I think it's
something like this in code:

-	if (!wf || (!uf && !is_write_protection(vcpu)))
-		pkru_bits &= ~(1 << PKRU_WRITE);
+	/* Only testing writes, so ignore SMAP and fetch. */
+	pfec_uw = pfec & (PFERR_WRITE_MASK|PFERR_USER_MASK);
+	fault_uw = mmu->permissions[pfec_uw >> 1];
+	/*
+	 * This page has U=1, so check if a U=1 W=0 page faults
+	 * on this access; if not, ignore WD.
+	 */
+	pkru_bits &= ~(1 << PKRU_WRITE) |
+		     (fault_uw >> (ACC_USER_MASK - PKRU_WRITE));

I think I even prefer if update_permission_bitmask sets up a separate
bitmask:

		mmu->fault_u1w0 |= (wf && !w) << byte;

and then this other bitmap can be tested in permission_fault:

-	if (!wf || (!uf && !is_write_protection(vcpu)))
-		pkru_bits &= ~(1 << PKRU_WRITE);
+	/*
+	 * fault_u1w0 ignores SMAP and PKRU, so use the
+	 * partially-computed PFEC that we were given.
+	 */
+	fault_uw = (mmu->fault_u1w0 >> (pfec >> 1)) & 1;
+	pkru_bits &= ~(1 << PKRU_WRITE) |
+		     (fault_uw << PKRU_WRITE);

These ideas are untested, of course.  I apologize for any mistake.
However, they should apply both to Huaitong's current code (which needs
PFERR_PK_MASK in gva_to_gpa) and to my other suggestion from the reply
to patch 5
(http://article.gmane.org/gmane.comp.emulators.kvm.devel/148311).

Paolo
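P.S. The same kind of stand-alone model for the fault_u1w0 trick, again
untested and simplified; fault_u1w0, update_fault_u1w0 and
effective_pkru_bits are made-up names standing in for whatever would
actually go into struct kvm_mmu and permission_fault:

	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	#define PFERR_WRITE	(1u << 1)
	#define PFERR_USER	(1u << 2)

	#define PKRU_AD		0	/* access-disable bit */
	#define PKRU_WRITE	1	/* write-disable (WD) bit */

	static uint32_t fault_u1w0;	/* one bit per (pfec >> 1) value */

	static void update_fault_u1w0(bool cr0_wp)
	{
		for (unsigned byte = 0; byte < 32; byte++) {
			unsigned pfec = byte << 1;
			bool wf = pfec & PFERR_WRITE;
			bool uf = pfec & PFERR_USER;

			/*
			 * A U=1 W=0 page faults only on writes, and
			 * supervisor writes are allowed when CR0.WP=0.
			 */
			bool fault = wf && (uf || cr0_wp);
			fault_u1w0 |= (uint32_t)fault << byte;
		}
	}

	/* Branchless: drop PKRU.WD unless a U=1 W=0 page would fault. */
	static unsigned effective_pkru_bits(unsigned pfec, unsigned pkru_bits)
	{
		unsigned fault_uw = (fault_u1w0 >> (pfec >> 1)) & 1;

		pkru_bits &= ~(1u << PKRU_WRITE) |
			     (fault_uw << PKRU_WRITE);
		return pkru_bits;
	}

	int main(void)
	{
		unsigned pkru_bits = 1u << PKRU_WRITE;	/* WD=1, AD=0 */

		update_fault_u1w0(false);		/* CR0.WP=0 */

		/* Supervisor write, !CR0.WP: W=0 wouldn't fault, WD ignored. */
		printf("%u\n", effective_pkru_bits(PFERR_WRITE, pkru_bits));
		/* User write: W=0 would fault, so WD stays set. */
		printf("%u\n", effective_pkru_bits(PFERR_WRITE | PFERR_USER,
						   pkru_bits));
		return 0;
	}

This prints 0 and then 2, matching the two bullet points above: reads
and !CR0.WP supervisor writes ignore WD, everything else honors it.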