On 06/03/2016 08:42, Xiao Guangrong wrote:
>>
>> +	rsvdf = pfec & PFERR_RSVD_MASK;
>
> No. RSVD is reserved by SMAP and it should not be used to walk guest
> page table.

Agreed.  You can treat your code as if rsvdf was always false.  Reserved
bits are handled elsewhere.

>> +	pkuf = pfec & PFERR_PK_MASK;
>>  	/*
>>  	 * PFERR_RSVD_MASK bit is set in PFEC if the access is not
>>  	 * subject to SMAP restrictions, and cleared otherwise. The
>> @@ -3824,12 +3830,34 @@ static void update_permission_bitmask(struct kvm_vcpu *vcpu,
>>  	 * clearer.
>>  	 */
>>  	smap = cr4_smap && u && !uf && !ff;
>> +
>> +	/*
>> +	 * PKU: additional mechanism by which the paging
>> +	 * controls access to user-mode addresses based
>> +	 * on the value in the PKRU register. A fault is
>> +	 * considered as a PKU violation if all of the
>> +	 * following conditions are true:
>> +	 * 1. CR4_PKE=1.
>> +	 * 2. EFER_LMA=1.
>> +	 * 3. the page is present with no reserved bit
>> +	 *    violations.
>> +	 * 4. the access is not an instruction fetch.
>> +	 * 5. the access is to a user page.
>> +	 * 6. PKRU.AD=1, or the access is a data write
>> +	 *    and PKRU.WD=1 and either CR0.WP=1 or it
>> +	 *    is a user access.
>> +	 *
>> +	 * The 2nd and 6th conditions are computed
>> +	 * dynamically in permission_fault.
>> +	 */
>
> It is not good as there are branches in the next patch.

It's important to note that branches in general are _not_ a problem.
Only badly-predicted branches are a problem; well-predicted branches
are _faster_ than branchless code.

For example, take is_last_gpte.  The branchy way to write it in
walk_addr_generic would be (excluding the 32-bit !PSE case) something
like:

	do {
	} while (level > PT_PAGE_TABLE_LEVEL &&
		 (!(gpte & PT_PAGE_SIZE_MASK) || level == mmu->root_level));

Here none of the branches is easily predicted, so we want to get rid of
them.

The next patch adds three branches, and they are not all equal:

- is_long_vcpu is well predicted to true (and even for 32-bit OSes it
  should be well predicted if the host is not overcommitted);

- pkru != 0 should be well predicted to false, at least for a few
  years...  and perhaps even later, considering that most MMIO access
  happens in the kernel;

- !wf || (!uf && !is_write_protection(vcpu)) is badly predicted and
  should be removed.

So only the last one is a problem.

> However, I do not think we need a new byte index for PK. The conditions
> detecting PK enablement can be fully found in current vcpu content
> (i.e., CR4, EFER and U/S access).

Adding a new byte index lets you cache CR4.PKE (and actually EFER.LMA
too, though Huaitong's patch doesn't do that).  It's a good thing to do.
U/S is also handled by adding a new byte index; see Huaitong's

	pku = cr4_pku && !ff && u;

If this is improved to

	pku = cr4_pku && long_mode_vcpu && !ff && u;

one branch goes away in permission_fault.

The read_pkru() branch, if well predicted, lets you optimize away the
pkru tests.  I think it _would_ be well predicted, so I think it should
remain.

The "(!wf || (!uf && !is_write_protection(vcpu)))" branch is indeed the
worst of the three.  I was lenient in my previous review because this
code won't run on any system being sold now or in the next 1-2 (?)
years.  However, we can indeed get rid of the branch, so let's do
it. :)

I don't like the idea of making permissions[] four times larger.
Instead, we can find the value of the expression elsewhere in
mmu->permissions (!), or cache it separately in a different field of
mmu.
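To make the byte-index idea concrete, here is a rough stand-alone model
of the caching (untested, user-space C; the masks are simplified, NX,
SMAP and SMEP are left out, and the caller is assumed to have set
PFERR_PK in the error code after consulting PKRU, which is roughly what
Huaitong's gva_to_gpa change does):

	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	/* PFEC bits (bit 0, the present bit, is always set and dropped). */
	#define PFERR_WRITE	(1u << 1)
	#define PFERR_USER	(1u << 2)
	#define PFERR_FETCH	(1u << 4)
	#define PFERR_PK	(1u << 5)

	/* pte_access bits; the 3-bit value doubles as a bitmap index. */
	#define ACC_WRITE	(1u << 1)
	#define ACC_USER	(1u << 2)

	static uint8_t permissions[1u << 5];	/* indexed by pfec >> 1 */

	static void update_permission_bitmask(bool cr0_wp, bool cr4_pku,
					      bool long_mode)
	{
		for (unsigned byte = 0; byte < (1u << 5); byte++) {
			unsigned pfec = byte << 1;
			bool wf = pfec & PFERR_WRITE;
			bool uf = pfec & PFERR_USER;
			bool ff = pfec & PFERR_FETCH;
			bool pkf = pfec & PFERR_PK;
			uint8_t map = 0;

			for (unsigned access = 0; access < 8; access++) {
				bool w = access & ACC_WRITE;
				bool u = access & ACC_USER;

				/* Ordinary U/S and W faults. */
				bool fault = (uf && !u) ||
					     (wf && !w && (uf || cr0_wp));

				/* The precomputable part of the pku term. */
				bool pku = cr4_pku && long_mode && !ff && u;

				fault |= pkf && pku;
				map |= (uint8_t)fault << access;
			}
			permissions[byte] = map;
		}
	}

	/* No branch on CR4.PKE, EFER.LMA, fetch or U/S: all in the byte. */
	static bool permission_fault(unsigned pfec, unsigned pte_access)
	{
		return (permissions[pfec >> 1] >> pte_access) & 1;
	}

	int main(void)
	{
		update_permission_bitmask(true, true, true);

		/* Supervisor write to read-only user page: faults (CR0.WP=1). */
		printf("%d\n", permission_fault(PFERR_WRITE, ACC_USER));
		/* User write to a writable user page: no fault. */
		printf("%d\n", permission_fault(PFERR_USER, ACC_USER | ACC_WRITE));
		/* User access with the PK bit set in PFEC: faults. */
		printf("%d\n", permission_fault(PFERR_USER | PFERR_PK, ACC_USER));
		return 0;
	}

The point is that CR4.PKE, EFER.LMA, the fetch bit and U/S all disappear
from the runtime path; permission_fault is a load, a shift and a mask.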
If I interpret the rules correctly, WD works like this.  First, we take
the PTE and imagine that it had W=0.  Then, if this access would not
fault, WD is ignored.  This is because:

- on reads, WD is always ignored;

- on writes, WD is ignored in supervisor mode if !CR0.WP.

... and this is how W=0 pages work, isn't it?  If so, I think it's
something like this in code:

-	if (!wf || (!uf && !is_write_protection(vcpu)))
-		pkru_bits &= ~(1 << PKRU_WRITE);
+	/* Only testing writes, so ignore SMAP and fetch. */
+	pfec_uw = pfec & (PFERR_WRITE_MASK|PFERR_USER_MASK);
+	fault_uw = mmu->permissions[pfec_uw >> 1];
+	/*
+	 * This page has U=1, so check if a U=1 W=0 page faults
+	 * on this access; if not, ignore WD.
+	 */
+	pkru_bits &= ~(1 << PKRU_WRITE) |
+		     (fault_uw >> (ACC_USER_MASK - PKRU_WRITE));

I think I even prefer if update_permission_bitmask sets up a separate
bitmask:

		mmu->fault_u1w0 |= (wf && !w) << byte;

and then this other bitmap can be tested in permission_fault:

-	if (!wf || (!uf && !is_write_protection(vcpu)))
-		pkru_bits &= ~(1 << PKRU_WRITE);
+	/*
+	 * fault_u1w0 ignores SMAP and PKRU, so use the
+	 * partially-computed PFEC that we were given.
+	 */
+	fault_uw = (mmu->fault_u1w0 >> (pfec >> 1)) & 1;
+	pkru_bits &= ~(1 << PKRU_WRITE) |
+		     (fault_uw << PKRU_WRITE);

These ideas are untested, of course.  I apologize for any mistake.
However, they should apply both to Huaitong's current code (which needs
PFERR_PK_MASK in gva_to_gpa) and to my other suggestion from the reply
to patch 5
(http://article.gmane.org/gmane.comp.emulators.kvm.devel/148311).

Paolo
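P.S. The same kind of stand-alone model for the fault_u1w0 trick, again
untested and simplified; fault_u1w0, update_fault_u1w0 and
effective_pkru_bits are made-up names standing in for whatever would
actually go into struct kvm_mmu and permission_fault:

	#include <stdbool.h>
	#include <stdint.h>
	#include <stdio.h>

	#define PFERR_WRITE	(1u << 1)
	#define PFERR_USER	(1u << 2)

	#define PKRU_AD		0	/* access-disable bit */
	#define PKRU_WRITE	1	/* write-disable (WD) bit */

	static uint32_t fault_u1w0;	/* one bit per (pfec >> 1) value */

	static void update_fault_u1w0(bool cr0_wp)
	{
		for (unsigned byte = 0; byte < 32; byte++) {
			unsigned pfec = byte << 1;
			bool wf = pfec & PFERR_WRITE;
			bool uf = pfec & PFERR_USER;

			/*
			 * A U=1 W=0 page faults only on writes, and
			 * supervisor writes are allowed when CR0.WP=0.
			 */
			bool fault = wf && (uf || cr0_wp);
			fault_u1w0 |= (uint32_t)fault << byte;
		}
	}

	/* Branchless: drop PKRU.WD unless a U=1 W=0 page would fault. */
	static unsigned effective_pkru_bits(unsigned pfec, unsigned pkru_bits)
	{
		unsigned fault_uw = (fault_u1w0 >> (pfec >> 1)) & 1;

		pkru_bits &= ~(1u << PKRU_WRITE) |
			     (fault_uw << PKRU_WRITE);
		return pkru_bits;
	}

	int main(void)
	{
		unsigned pkru_bits = 1u << PKRU_WRITE;	/* WD=1, AD=0 */

		update_fault_u1w0(false);		/* CR0.WP=0 */

		/* Supervisor write, !CR0.WP: W=0 wouldn't fault, WD ignored. */
		printf("%u\n", effective_pkru_bits(PFERR_WRITE, pkru_bits));
		/* User write: W=0 would fault, so WD stays set. */
		printf("%u\n", effective_pkru_bits(PFERR_WRITE | PFERR_USER,
						   pkru_bits));
		return 0;
	}

This prints 0 and then 2, matching the two bullet points above: reads
and !CR0.WP supervisor writes ignore WD, everything else honors it.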