On 03/08/2016 04:29 PM, Paolo Bonzini wrote:
On 08/03/2016 08:35, Xiao Guangrong wrote:
well-predicted branches are _faster_ than branchless code.
Er, I do not understand this. If the two cases have the same cache-hit
behaviour, how can a branch be faster?
Because branchless code typically executes more instructions.
Take the same example here:
  do {
  } while (level > PT_PAGE_TABLE_LEVEL &&
           (!(gpte & PT_PAGE_SIZE_MASK) ||
            level == mmu->root_level));
The assembly looks like this (assuming %level, %gpte and %mmu are registers):

  cmp     $1, %level
  jbe     1f
  test    $128, %gpte
  jz      beginning_of_loop
  cmpb    ROOT_LEVEL_OFFSET(%mmu), %level
  je      beginning_of_loop
1:
These are two to six instructions, with no dependencies between them, and
the processor can fuse them into one to three macro-ops. For the branchless
code (I posted a patch implementing this algorithm yesterday):
  lea     -2(%level), %temp1
  orl     %temp1, %gpte
  movzbl  LAST_NONLEAF_LEVEL_OFFSET(%mmu), %temp1
  movl    %level, %temp2
  subl    %temp1, %temp2
  andl    %temp2, %gpte
  test    $128, %gpte
  jz      beginning_of_loop
These are eight instructions, with some dependencies between them too.
In some cases branchless code throws away the result of 10-15
instructions (because in the end it's ANDed with 0, for example). If it
weren't for mispredictions, the branchy code would be faster.
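(For reference, the branchless check that the assembly encodes looks roughly
like this in C; a sketch reconstructed from the assembly above, assuming a
last_nonleaf_level byte cached in struct kvm_mmu, as the
LAST_NONLEAF_LEVEL_OFFSET load suggests:)

  static bool is_last_gpte(struct kvm_mmu *mmu, unsigned level, unsigned gpte)
  {
          /*
           * The right-hand side has bit 7 (PT_PAGE_SIZE_MASK) set iff
           * level < mmu->last_nonleaf_level, i.e. iff the PS bit is
           * meaningful at this level; otherwise the AND clears it.
           */
          gpte &= level - mmu->last_nonleaf_level;

          /*
           * level - PT_PAGE_TABLE_LEVEL - 1 has bit 7 set iff
           * level == PT_PAGE_TABLE_LEVEL, so a 4K leaf always terminates
           * the walk regardless of the PS bit.
           */
          gpte |= level - PT_PAGE_TABLE_LEVEL - 1;

          return gpte & PT_PAGE_SIZE_MASK;
  }

The walker loop above then becomes } while (!is_last_gpte(mmu, level, gpte));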
Good lesson, thank you, Paolo. :)
Here none of the branches is easily predicted, so we want to get rid of
them.
The next patch adds three branches, and they are not all equal:

- is_long_vcpu is well predicted to true (or, even for 32-bit OSes, it
  should be well predicted as long as the host is not overcommitted).
But in production, CPU over-commit is the normal case...
It depends on the workload. I would guess that 32-bit OSes are more
common where you have a single legacy guest because e.g. it doesn't have
drivers for recent hardware.
However, I do not think we need a new byte index for PK. The conditions
for detecting PK enablement can be derived entirely from the current vCPU
state (i.e. CR4, EFER and the U/S access type).
Adding a new byte index lets you cache CR4.PKE (and actually EFER.LMA
too, though Huaitong's patch doesn't do that). It's a good thing to do.
U/S is also handled by adding a new byte index, see Huaitong's patch.
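(As an aside, here is a purely hypothetical sketch of what caching CR4.PKE
and EFER.LMA at update time could look like; it is not Huaitong's actual
patch, and it assumes the permissions[] table is doubled so the index can
carry a PK bit:)

  static void update_permission_bitmask(struct kvm_vcpu *vcpu,
                                        struct kvm_mmu *mmu, bool ept)
  {
          /* Evaluated once here, so permission_fault() never reads CR4/EFER. */
          bool pku_active = !ept && is_long_mode(vcpu) &&
                            kvm_read_cr4_bits(vcpu, X86_CR4_PKE);
          unsigned byte;

          for (byte = 0; byte < ARRAY_SIZE(mmu->permissions); ++byte) {
                  unsigned pfec = byte << 1;

                  if ((pfec & PFERR_PK_MASK) && !pku_active) {
                          /*
                           * Protection keys cannot apply: the PK entry is
                           * identical to its non-PK twin, which was filled
                           * earlier in this loop.
                           */
                          mmu->permissions[byte] =
                                  mmu->permissions[(pfec & ~PFERR_PK_MASK) >> 1];
                          continue;
                  }

                  /* ... existing U/S, W, fetch, SMEP/SMAP handling ... */
          }
  }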
That is not the same thing: the U/S here is the type of memory access,
which depends on what the vCPU is doing at runtime.
Do you mean the type of page (ACC_USER_MASK)? Only U=1 pages are
subject to PKRU, even in the kernel. The processor CPL
(PFERR_USER_MASK) only matters if CR0.WP=0.
No. The index is:

| Byte index: page fault error code [4:1]

So the type I mentioned is the type of memory access issued by the CPU,
e.g. whether the CPU is writing to the memory or executing from it.
But whether PKEY is enabled or not depends entirely on the environment of
the CPU, so we should _always_ check PKEY even if PFEC.PKEY is not set.
As PKEY is not enabled under softmmu, gva_to_gpa mostly comes from inside
KVM, which means we should always set PFEC.PKEY for every gva_to_gpa
request. Wasting a bit on it is really unnecessary.
And it is always better to move more workload from permission_fault() to
update_permission_bitmask() as the former is much hotter than the latter.
I agree, but I'm not sure why you say that adding a bit adds more work
to permission_fault().
A branch to check PFEC.PKEY, which is not well predicted with softmmu.
(Under EPT it should always be set, because the guest page table walk is
done by software; so if we only consider EPT we can assume it is always
true.)
Adding a bit lets us skip the CR4.PKE and EFER.LMA checks in
permission_fault() and in all gva_to_gpa() callers.
The point is when we can clear this bit to skip these checks. We should
_always_ check PKEY even if PFEC.PKEY = 0, because:

1) all gva_to_gpa()s issued by KVM should always check PKEY; this is the
   EPT-only case.

2) if the feature is ever enabled for softmmu, the shadow page table may
   change the behaviour; for example, an MMIO access causes a reserved
   page fault, which may clear PFEC.PKEY.
And skipping these checks is not really necessary as we can take them into
account when we update the bitmask.
So my proposal is to compute the "effective" PKRU bits (i.e. extract the
relevant AD and WD bits, and mask away WD if irrelevant) in
update_permission_bitmask(), and add PFERR_PK_MASK to the error code if
they are nonzero.
PFERR_PK_MASK must be computed in permission_fault(). It's a runtime
condition that is not known beforehand.
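(A rough sketch of that runtime piece, just to illustrate the idea;
kvm_read_pkru() and the pte_pkey argument are assumptions for illustration,
not anything from the patch:)

  /* Would be called from permission_fault() once PK checks are known to apply. */
  static bool pkru_blocks_access(struct kvm_vcpu *vcpu, unsigned pfec, u16 pte_pkey)
  {
          /* PKRU holds two bits per key: bit 0 = AD (access disable), bit 1 = WD (write disable). */
          u32 pkru_bits = (kvm_read_pkru(vcpu) >> (pte_pkey * 2)) & 3;

          /*
           * WD only matters for write accesses (and, strictly speaking, not
           * for supervisor writes when CR0.WP = 0); mask it away otherwise.
           * Instruction fetches are never blocked by protection keys, so a
           * real version would bail out early for PFERR_FETCH_MASK.
           */
          if (!(pfec & PFERR_WRITE_MASK))
                  pkru_bits &= ~2u;

          return pkru_bits != 0;
  }

permission_fault() would then OR PFERR_PK_MASK into the returned error code
whenever this returns true.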
Yes, you are right.