On 08/03/2016 08:35, Xiao Guangrong wrote:
>> well-predicted branches are _faster_ than branchless code.
>
> Er, I do not understand this. If these two cases have the same cache
> hit rate, how can a branch be faster?

Because branchy code typically executes fewer instructions.  Take the
same example here:

>>         do {
>>         } while (level > PT_PAGE_TABLE_LEVEL &&
>>                  (!(gpte & PT_PAGE_SIZE_MASK) ||
>>                   level == mmu->root_level));

The assembly looks like this (assuming %level, %gpte and %mmu are
registers):

        cmp $1, %level
        jbe 1f
        test $128, %gpte
        jz beginning_of_loop
        cmpb ROOT_LEVEL_OFFSET(%mmu), %level
        je beginning_of_loop
1:

These are two to six instructions, with no dependencies between them,
and the processor can fuse them into one to three macro-ops.  For the
branchless code (I posted a patch to implement this algorithm
yesterday):

        lea -2(%level), %temp1
        orl %temp1, %gpte
        movzbl LAST_NONLEAF_LEVEL_OFFSET(%mmu), %temp1
        movl %level, %temp2
        subl %temp1, %temp2
        andl %temp2, %gpte
        test $128, %gpte
        jz beginning_of_loop

These are eight instructions, with some dependencies between them too.
In some cases branchless code throws away the result of 10-15
instructions (because in the end it is ANDed with 0, for example).  If
it weren't for mispredictions, the branchy code would be faster.

>> Here none of the branches is easily predicted, so we want to get rid
>> of them.
>>
>> The next patch adds three branches, and they are not all equal:
>>
>> - is_long_vcpu is well predicted to true (or even for 32-bit OSes it
>>   should be well predicted if the host is not overcommitted).
>
> But in production, CPU over-commit is the normal case...

It depends on the workload.  I would guess that 32-bit OSes are more
common where you have a single legacy guest, because e.g. it doesn't
have drivers for recent hardware.

>>> However, I do not think we need a new byte index for PK.  The
>>> conditions detecting PK enablement can be fully found in the current
>>> vcpu state (i.e. CR4, EFER and U/S access).
>>
>> Adding a new byte index lets you cache CR4.PKE (and actually EFER.LMA
>> too, though Huaitong's patch doesn't do that).  It's a good thing to
>> do.  U/S is also handled by adding a new byte index, see Huaitong's
>
> That is not the same thing: the U/S here is the type of memory access,
> which depends on the vCPU at run time.

Do you mean the type of page (ACC_USER_MASK)?  Only U=1 pages are
subject to PKRU, even in the kernel.  The processor CPL
(PFERR_USER_MASK) only matters if CR0.WP=0.

> But whether PKEY is enabled or not depends entirely on the state of
> the CPU, and we should _always_ check PKEY even if PFEC_PKEY is not
> set.
>
> As PKEY is not enabled on softmmu, the gva_to_gpa requests mostly come
> from inside KVM; that means we should always set PFEC.PKEY for all the
> gva_to_gpa requests.  Wasting a bit is really unnecessary.
>
> And it is always better to move more work from permission_fault() to
> update_permission_bitmask(), as the former is much hotter than the
> latter.

I agree, but I'm not sure why you say that adding a bit adds more work
to permission_fault().  Adding a bit lets us skip the CR4.PKE and
EFER.LMA checks in permission_fault() and in all gva_to_gpa() callers.

So my proposal is to precompute in update_permission_bitmask() which
PKRU bits are "effective" (i.e. whether the AD and WD bits are relevant,
masking away WD if it is irrelevant), and add PFERR_PK_MASK to the error
code if the effective bits are nonzero.  That last step must happen in
permission_fault(): it's a runtime condition that is not known earlier.
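To make that split a bit more concrete, here is a rough, untested
userspace sketch; every identifier below is made up for illustration and
is not a real KVM name.  The cold half records, per error-code
combination, which PKRU bits can matter at all; the hot half only picks
the AD/WD pair for the page's key, ANDs it with that mask, and ORs the
PK bit into the error code if anything is left:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PFEC_WRITE  (1u << 1)   /* W/R bit of the page-fault error code */
#define PFEC_USER   (1u << 2)   /* U/S bit */
#define PFEC_FETCH  (1u << 4)   /* I/D bit */
#define PFEC_PK     (1u << 5)   /* protection-key bit */

#define PKRU_AD     (1u << 0)   /* access-disable bit of a PKRU pair */
#define PKRU_WD     (1u << 1)   /* write-disable bit of a PKRU pair */

/* One 2-bit relevance mask per (write, user, fetch) fault combination. */
#define IDX(wf, uf, ff) (((wf) << 0) | ((uf) << 1) | ((ff) << 2))
static uint8_t pk_relevant[8];

/* Cold path, akin to update_permission_bitmask(): runs on CR4/EFER/CR0 changes. */
static void update_pk_relevance(bool cr4_pke, bool long_mode, bool cr0_wp)
{
    for (unsigned int idx = 0; idx < 8; idx++) {
        bool wf = idx & 1, uf = idx & 2, ff = idx & 4;
        uint8_t mask = 0;

        /* Keys only apply in long mode with CR4.PKE, and never to fetches. */
        if (cr4_pke && long_mode && !ff) {
            mask |= PKRU_AD;                /* AD blocks any data access */
            if (wf && (uf || cr0_wp))
                mask |= PKRU_WD;            /* WD only blocks these writes */
        }
        pk_relevant[idx] = mask;
    }
}

/* Hot path, akin to permission_fault(): no CR4/EFER checks left here. */
static uint32_t apply_pkru(uint32_t errcode, uint32_t pkru,
                           unsigned int pkey, bool user_page)
{
    uint8_t bits;

    if (!user_page)                         /* only U=1 pages are subject to PKRU */
        return errcode;

    bits = (pkru >> (pkey * 2)) & 3;        /* AD/WD pair for this key */
    bits &= pk_relevant[IDX(!!(errcode & PFEC_WRITE),
                            !!(errcode & PFEC_USER),
                            !!(errcode & PFEC_FETCH))];
    if (bits)
        errcode |= PFEC_PK;                 /* the runtime part happens here */
    return errcode;
}

int main(void)
{
    /* CR4.PKE=1, long mode, CR0.WP=1; key 1 has WD set in PKRU. */
    update_pk_relevance(true, true, true);
    printf("errcode = %#x\n",
           apply_pkru(PFEC_WRITE, PKRU_WD << (1 * 2), 1, true));
    return 0;
}

Running this prints errcode = 0x22: a supervisor write to a user page
whose key has WD set picks up the PK bit without the hot path ever
looking at CR4 or EFER.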
>> I don't like the idea of making permissions[] four times larger.
>
> Okay, then let's introduce a new field for PKEY separately.  Your
> approach, fault_u1w0, looks good to me.
>
>> I think I even prefer it if update_permission_bitmask sets up a
>> separate bitmask:
>>
>>         mmu->fault_u1w0 |= (wf && !w) << byte;
>>
>> and then this other bitmap can be tested in permission_fault:
>>
>> -        if (!wf || (!uf && !is_write_protection(vcpu)))
>> -                pkru_bits &= ~(1 << PKRU_WRITE);
>> +        /*
>> +         * fault_u1w0 ignores SMAP and PKRU, so use the
>> +         * partially-computed PFEC that we were given.
>> +         */
>> +        fault_uw = (mmu->fault_u1w0 >> (pfec >> 1)) & 1;
>> +        pkru_bits &= ~(1 << PKRU_WRITE) |
>> +                     (fault_uw << PKRU_WRITE);
>
> It looks good to me!

Good.  Thanks for reviewing the idea!

Paolo
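P.S. To make the fault_u1w0 idea concrete, here is a rough, untested
userspace sketch of how the two halves could fit together; the struct
and function names are only illustrative, not what a real patch would
use:

#include <stdbool.h>
#include <stdint.h>

#define PKRU_WRITE  1               /* position of the WD bit inside a PKRU pair */

struct mmu_sketch {
    uint16_t fault_u1w0;            /* one bit per (pfec >> 1) value */
};

/* Cold path: recomputed only when the paging mode (e.g. CR0.WP) changes.
 * A bit is set when a write fault of that kind would be rejected on a
 * U=1/W=0 page, i.e. when PKRU.WD is actually being consulted. */
void setup_fault_u1w0(struct mmu_sketch *mmu, bool cr0_wp)
{
    mmu->fault_u1w0 = 0;
    for (unsigned int byte = 0; byte < 16; byte++) {
        bool wf = byte & (1 << 0);  /* write fault     */
        bool uf = byte & (1 << 1);  /* user-mode fault */
        bool w  = !uf && !cr0_wp;   /* would a U=1/W=0 page allow this write? */

        mmu->fault_u1w0 |= (uint16_t)(wf && !w) << byte;
    }
}

/* Hot path: pkru_bits already holds the AD/WD pair for the page's key;
 * pfec is the partially-computed error code (it ignores SMAP and PKRU). */
uint32_t mask_pkru_write(const struct mmu_sketch *mmu,
                         uint32_t pkru_bits, uint32_t pfec)
{
    uint32_t fault_uw = (mmu->fault_u1w0 >> (pfec >> 1)) & 1;

    /* Drop WD unless the bitmap says this kind of fault checks writes. */
    pkru_bits &= ~(1u << PKRU_WRITE) | (fault_uw << PKRU_WRITE);
    return pkru_bits;
}

The hot-path function is just the snippet quoted above made standalone;
the point is that permission_fault() ends up with a single shift, AND
and OR, and no is_write_protection() call.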