Re: [PATCH V4 3/7] KVM, pkeys: update memory permission bitmask for pkeys

On 03/07/2016 07:14 AM, Paolo Bonzini wrote:


On 06/03/2016 08:42, Xiao Guangrong wrote:

+        rsvdf = pfec & PFERR_RSVD_MASK;

No. The RSVD bit is already reserved by SMAP and it should not be used
to walk the guest page table.

Agreed.  You can treat your code as if rsvdf was always false.  Reserved
bits are handled elsewhere.

+        pkuf = pfec & PFERR_PK_MASK;
           /*
            * PFERR_RSVD_MASK bit is set in PFEC if the access is not
            * subject to SMAP restrictions, and cleared otherwise. The
@@ -3824,12 +3830,34 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu,
                    *   clearer.
                    */
                   smap = cr4_smap && u && !uf && !ff;
+
+                /*
+                 * PKU: an additional mechanism by which paging
+                 * controls access to user-mode addresses based
+                 * on the value in the PKRU register. A fault is
+                 * considered a PKU violation if all of the
+                 * following conditions are true:
+                 * 1. CR4.PKE=1.
+                 * 2. EFER.LMA=1.
+                 * 3. The page is present with no reserved bit
+                 *    violations.
+                 * 4. The access is not an instruction fetch.
+                 * 5. The access is to a user page.
+                 * 6. PKRU.AD=1, or the access is a data write
+                 *    and PKRU.WD=1 and either CR0.WP=1 or it
+                 *    is a user access.
+                 *
+                 * The 2nd and 6th conditions are computed
+                 * dynamically in permission_fault().
+                 */
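
For reference, the six conditions above boil down to something like the
following sketch (one self-contained predicate; the helper name and
parameter list are illustrative, not the patch's actual code):

	/*
	 * Illustrative sketch only: combine the six PKU conditions
	 * into one predicate.  pkey is the protection key taken from
	 * the paging-structure entry.
	 */
	static bool pku_violation(bool cr4_pke, bool efer_lma,
				  bool present, bool rsvd, bool ff,
				  bool user_page, bool wf, bool uf,
				  bool cr0_wp, u32 pkru, unsigned pkey)
	{
		bool ad = pkru & (1u << (pkey * 2));      /* PKRU.AD */
		bool wd = pkru & (1u << (pkey * 2 + 1));  /* PKRU.WD */

		/* Conditions 1-5 must all hold, else PKU is ignored. */
		if (!cr4_pke || !efer_lma || !present || rsvd ||
		    ff || !user_page)
			return false;
		/* Condition 6: AD blocks everything; WD blocks writes
		 * unless it is a supervisor write with CR0.WP=0. */
		return ad || (wd && wf && (cr0_wp || uf));
	}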

That is not good, as there are branches in the next patch.

It's important to note that branches in general are _not_ a problem.
Only badly-predicted branches are a problem;

I agree on this point.
well-predicted branches are
_faster_ than branchless code.

Er, I do not understand this. If the two cases have the same cache hit
rate, how can a branch be faster?

For example, take is_last_gpte.  The
branchy way to write it in walk_addr_generic would be (excluding the
32-bit !PSE case) something like:

	do {
	} while (level > PT_PAGE_TABLE_LEVEL &&
		 (!(gpte & PT_PAGE_SIZE_MASK) ||
		  level == mmu->root_level));

Here none of the branches is easily predicted, so we want to get rid of
them.
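
For comparison, the branchless form folds (level, PS bit) into an index
into a per-mmu bitmap precomputed when the context is set up; roughly
the shape KVM uses (from memory, so field names may differ):

	static bool is_last_gpte(struct kvm_mmu *mmu, unsigned level,
				 unsigned gpte)
	{
		unsigned index = level - 1;

		/* Fold the PS bit (bit 7) down into bit 2 of the index. */
		index |= (gpte & PT_PAGE_SIZE_MASK) >> (PT_PAGE_SIZE_SHIFT - 2);
		/* last_pte_bitmap was precomputed at mmu setup time,
		 * so the hot path is a single masked lookup. */
		return mmu->last_pte_bitmap & (1 << index);
	}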

The next patch adds three branches, and they are not all equal:

- is_long_vcpu is well predicted to true (or even for 32-bit OSes it
should be well predicted if the host is not overcommitted).


But in production, CPU over-commit is the normal case...

- pkru != 0 should be well-predicted to false, at least for a few
years... and perhaps even later considering that most MMIO access
happens in the kernel.

- !wf || (!uf && !is_write_protection(vcpu)) is badly predicted and
should be removed

So only the last one is a problem.

However, I do not think we need a new byte index for PK. The conditions
for detecting PK enablement can be found entirely in the current vCPU
state (i.e., CR4, EFER and the U/S access type).

Adding a new byte index lets you cache CR4.PKE (and actually EFER.LMA
too, though Huaitong's patch doesn't do that).  It's a good thing to do.
U/S is also handled by adding a new byte index, see Huaitong's

These are not the same thing: U/S is the type of memory access, which
depends on the vCPU at runtime. But whether PKEY is enabled or not
depends entirely on the CPU environment, so we should _always_ check PKEY
even if PFEC.PKEY is not set.

As PKEY is not enabled on softmmu, gva_to_gpa requests mostly come from
inside KVM; that means we would have to set PFEC.PKEY on every gva_to_gpa
request. Wasting a bit on that is really unnecessary.

And it is always better to move workload from permission_fault() to
update_permission_bitmask(), as the former is much hotter than the latter.
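
The split being argued for, sketched below: the cold path precomputes one
fault mask per PFEC value, the hot path is a single table lookup.
compute_fault_mask() is a stand-in name, not a real helper:

	/* Cold path, update_permission_bitmask(): runs only when
	 * CR0/CR4/EFER change, so it can afford heavy logic. */
	for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte)
		mmu->permissions[byte] = compute_fault_mask(mmu, byte);

	/* Hot path, permission_fault(): runs on every guest page
	 * walk, so it should remain a pure table lookup. */
	return (mmu->permissions[pfec >> 1] >> pte_access) & 1;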


	pku = cr4_pku && !ff && u;

If this is improved to

	pku = cr4_pku && long_mode_vcpu && !ff && u;

one branch goes away in permission_fault.  The read_pkru() branch, if
well predicted, lets you optimize away the pkru tests.  I think it
_would_ be well predicted, so I think it should remain.
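
With that, only the pkru test remains in the hot path; a condensed sketch
(pte_pkey and pku_shift are assumed names, and the WD refinement is the
subject of the rest of this mail):

	u32 pkru = read_pkru();
	if (unlikely(pkru)) {
		/* Two PKRU bits per key: bit 0 is AD, bit 1 is WD. */
		u32 pkru_bits = (pkru >> (pte_pkey * 2)) & 3;

		/* The cached bit already encodes
		 * cr4_pku && long_mode_vcpu && !ff && u,
		 * so no further branches are needed for those. */
		if ((mmu->permissions[pfec >> 1] >> pku_shift) & 1)
			fault |= (pkru_bits != 0);
	}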

The "(!wf || (!uf && !is_write_protection(vcpu)))" is indeed the worst
of the three.  I was lenient in my previous review because this code
won't run on any system being sold now and in the next 1-2 (?) years.
However, we can indeed get rid of the branch, so let's do it. :)

I don't like the idea of making permissions[] four times larger.

Okay, then let's introduce a new field for PKEY separately. Your
approach, fault_u1w0, looks good to me.

Instead, we can find the value of the expression elsewhere in
mmu->permissions (!), or cache it separately in a different field of mmu.

If I interpret the rules correctly, WD works like this.  First, we take
the PTE and imagine that it had W=0.  Then, if this access would not
fault, WD is ignored.  This is because:

- on reads, WD is always ignored

- on writes, WD is ignored in supervisor mode if !CR0.WP

... and this is how W=0 pages work, isn't it?


Yes, it is.

If so, I think it's something like this in code:

-		if (!wf || (!uf && !is_write_protection(vcpu)))
-			pkru_bits &= ~(1 << PKRU_WRITE);
+		/* Only testing writes, so ignore SMAP and fetch.  */
+		pfec_uw = pfec & (PFERR_WRITE_MASK|PFERR_USER_MASK);
+		fault_uw = mmu->permissions[pfec_uw >> 1];
+		/*
+		 * This page has U=1, so check if a U=1 W=0 page faults
+		 * on this access; if not ignore WD.
+		 */
+		pkru_bits &= ~(1 << PKRU_WRITE) |
+			(fault_uw >> (ACC_USER_MASK - PKRU_WRITE));


This is tricky, and I finally understand it; yeah, it works. :) Except
I do not think PFEC.PKEY should be taken into the index, as I explained above.

I think I even prefer that update_permission_bitmask set up a separate
bitmask:

		mmu->fault_u1w0 |= (wf && !w) << byte;

and then this other bitmap can be tested in permission_fault:


-		if (!wf || (!uf && !is_write_protection(vcpu)))
-			pkru_bits &= ~(1 << PKRU_WRITE);
+		/*
+		 * fault_u1w0 ignores SMAP and PKRU, so use the
+		 * partially-computed PFEC that we were given.
+		 */
+		fault_uw = (mmu->fault_u1w0 >> (pfec >> 1)) & 1;
+		pkru_bits &= ~(1 << PKRU_WRITE) |
+			(fault_uw << PKRU_WRITE);
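
Read together, the two halves pair up like this (a condensed restatement
of the snippets above, not new code):

	/* Setup, update_permission_bitmask(): for each PFEC byte,
	 * record whether a U=1 W=0 page would fault on that access. */
	mmu->fault_u1w0 |= (wf && !w) << byte;

	/* Test, permission_fault(): if such a page would not fault,
	 * this is an access that ignores WD, so clear PKRU_WRITE;
	 * if it would fault, leave PKRU_WRITE set. */
	fault_uw = (mmu->fault_u1w0 >> (pfec >> 1)) & 1;
	pkru_bits &= ~(1 << PKRU_WRITE) | (fault_uw << PKRU_WRITE);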


It looks good to me!
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [KVM ARM]     [KVM ia64]     [KVM ppc]     [Virtualization Tools]     [Spice Development]     [Libvirt]     [Libvirt Users]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite Questions]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux