On Thu, Aug 30, 2012 at 9:54 PM, Xiao Guangrong
<xiaoguangrong@xxxxxxxxxxxxxxxxxx> wrote:
> On 08/31/2012 02:59 AM, Hugo wrote:
>> On Thu, Aug 30, 2012 at 5:22 AM, Xiao Guangrong
>> <xiaoguangrong@xxxxxxxxxxxxxxxxxx> wrote:
>>> On 08/28/2012 11:30 AM, Felix wrote:
>>>> Xiao Guangrong <xiaoguangrong <at> linux.vnet.ibm.com> writes:
>>>>
>>>>>
>>>>> On 07/31/2012 01:18 AM, Sunil wrote:
>>>>>> Hello List,
>>>>>>
>>>>>> I am a KVM newbie and studying KVM mmu code.
>>>>>>
>>>>>> On the existing guest, I am trying to track all guest writes by
>>>>>> marking page table entry as read-only in EPT entry [ I am using Intel
>>>>>> machine with vmx and ept support ]. Looks like EPT support re-uses
>>>>>> shadow page table(SPT) code and hence some of SPT routines.
>>>>>>
>>>>>> I was thinking of below possible approach. Use pte_list_walk() to
>>>>>> traverse through list of sptes and use mmu_spte_update() to flip the
>>>>>> PT_WRITABLE_MASK flag. But all SPTEs are not part of any single list;
>>>>>> but on separate lists (based on gfn, page level, memory_slot). So,
>>>>>> recording all the faulted guest GFN and then using above method work ?
>>>>>>
>>>>>
>>>>> There are two ways to write-protect all sptes:
>>>>> - use kvm_mmu_slot_remove_write_access() on all memslots
>>>>> - walk the shadow page cache to get the shadow pages in the highest level
>>>>>   (level = 4 on EPT), then write-protect its entries.
>>>>>
>>>>> If you just want to do it for the specified gfn, you can use
>>>>> rmap_write_protect().
>>>>>
>>>>> Just inquisitive, what is your purpose? :)
>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>>>>> the body of a message to majordomo <at> vger.kernel.org
>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>>
>>>> Hi, Guangrong,
>>>>
>>>> I have done similar things like Sunil did. Simply for study purpose. However, I
>>>> found some very weird situations.
>>>> Basically, in the guest vm, I allocate a chunk
>>>> of memory (with size of a page) in a user level program. Through a guest kernel
>>>> level module and my self defined hypercall, I pass the gva of this memory to
>>>> kvm. Then I try different methods in the hypercall handler to write protect this
>>>> page of memory. You can see that I want to write protect it through ETP instead
>>>> of write protected in the guest page tables.
>>>>
>>>> 1. I use kvm_mmu_gva_to_gpa_read to translate the gva into gpa. Based on the
>>>> function, kvm_mmu_get_spte_hierarchy(vcpu, gpa, spte[4]), I change the codes to
>>>> read sptep (the pointer to spte) instead of spte, so I can modify the spte
>>>> corresponding to this gpa. What I observe is that if I modify spte[0] (I think
>>>> this is the lowest level page table entry corresponding to EPT table; I can
>>>> successfully modify it as the changes are reflected in the result of calling
>>>> kvm_mmu_get_spte_hierarchy again), but my user level program in vm can still
>>>> write to this page.
>>>>
>>>> In your this blog post, you mentioned (the shadow pages in the highest level
>>>> (level = 4 on EPT)), I don't understand this part. Does this mean I have to
>>>> modify spte[3] instead of spte[0]? I just try modify spte[1] and spte[3], both
>>>> can cause vmexit. So I am totally confused about the meaning of level used in
>>>> shadow page table and its relations to shadow page table. Can you help me to
>>>> understand this?
>>>>
>>>> 2. As suggested by this post, I also use rmap_write_protect() to write protect
>>>> this page. With kvm_mmu_get_spte_hierarchy(vcpu, gpa, spte[4]), I still can see
>>>> that spte[0] gives me xxxxxx005 such result, this means that the function is
>>>> called successfully. But still I can write to this page.
>>>>
>>>> I even try the function kvm_age_hva() to remove this spte, this gives me 0 of
>>>> spte[0], but I still can write to this page.
>>>> So I am further confused about the
>>>> level used in the shadow page?
>>>>
>>>
>>> kvm_mmu_get_spte_hierarchy get sptes out of mmu-lock, you can hold spin_lock(&vcpu->kvm->mmu_lock)
>>> and use for_each_shadow_entry instead. And, after change, did you flush all tlbs?
>>
>> I do apply the lock in my codes and I do flush tlb.
>>
>>>
>>> If it can not work, please post your code.
>>>
>>
>> Here is my codes. The modifications are made in x86/x86.c in
>>
>> KVM_HC_HL_EPTPER is my hypercall number.
>>
>> Method 1:
>>
>> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu){
>> ................
>>
>> case KVM_HC_HL_EPTPER :
>>         //// This method is not working
>>
>>         localGpa = kvm_mmu_gva_to_gpa_read(vcpu, a0, &localEx);
>>         if(localGpa == UNMAPPED_GVA){
>>                 printk("read is not correct\n");
>>                 return -KVM_ENOSYS;
>>         }
>>
>>         hl_kvm_mmu_update_spte(vcpu, localGpa, 5);
>>         hl_result = kvm_mmu_get_spte_hierarchy(vcpu, localGpa, hl_sptes);
>>
>>         printk("after changes return result is %d , gpa: %llx sptes: %llx , %llx , %llx , %llx \n", hl_result, localGpa, hl_sptes[0], hl_sptes[1], hl_sptes[2], hl_sptes[3]);
>>         kvm_flush_remote_tlbs(vcpu->kvm);
>> ...................
>> }
>>
>> The function hl_kvm_mmu_update_spte is defined as
>>
>> int hl_kvm_mmu_update_spte(struct kvm_vcpu *vcpu, u64 addr, u64 mask)
>> {
>>         struct kvm_shadow_walk_iterator iterator;
>>         int nr_sptes = 0;
>>         u64 sptes[4];
>>         u64* sptep[4];
>>         u64 localMask = 0xFFFFFFFFFFFFFFF8; /// 1000
>>
>>         spin_lock(&vcpu->kvm->mmu_lock);
>>         for_each_shadow_entry(vcpu, addr, iterator) {
>>                 sptes[iterator.level-1] = *iterator.sptep;
>>                 sptep[iterator.level-1] = iterator.sptep;
>>                 nr_sptes++;
>>                 if (!is_shadow_present_pte(*iterator.sptep))
>>                         break;
>>         }
>>
>>         sptes[0] = sptes[0] & localMask;
>>         sptes[0] = sptes[0] | mask ;
>>         __set_spte(sptep[0], sptes[0]);
>>         //update_spte(sptep[0], sptes[0]);
>>         /*
>>         sptes[1] = sptes[1] & localMask;
>>         sptes[1] = sptes[1] | mask ;
>>         update_spte(sptep[1], sptes[1]);
>>         */
>>         /*
>>
>>         sptes[3] = sptes[3] & localMask;
>>         sptes[3] = sptes[3] | mask ;
>>         update_spte(sptep[3], sptes[3]);
>>         */
>>         spin_unlock(&vcpu->kvm->mmu_lock);
>>
>>         return nr_sptes;
>> }
>>
>> The execution results are from kern.log
>>
>> xxxx kernel: [ 4371.002579] hypercall f002, a71000
>> xxxx kernel: [ 4371.002581] after changes return result is 4 , gpa: 723ae000 sptes: 16c7bd275 , 1304c7007 , 136d6f007 , 13cc88007
>>
>> I find that if I write to this page, actually the write protected
>> permission bit is set as writable again. I am not quite sure why.
>>
>> Method 2:
>>
>> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu){
>> ................
>>
>> case KVM_HC_HL_EPTPER :
>>         //// This method is not working
>>         localGpa = kvm_mmu_gva_to_gpa_read(vcpu, a0, &localEx);
>>         localGfn = gpa_to_gfn(localGpa);
>>
>>         spin_lock(&vcpu->kvm->mmu_lock);
>>         hl_result = rmap_write_protect(vcpu->kvm, localGfn);
>>         printk("local gfn is %llx , result of kvm_age_hva is %d\n", localGfn, hl_result);
>>         kvm_flush_remote_tlbs(vcpu->kvm);
>>         spin_unlock(&vcpu->kvm->mmu_lock);
>>
>>         hl_result = kvm_mmu_get_spte_hierarchy(vcpu, localGpa, hl_sptes);
>>         printk("return result is %d , gpa: %llx sptes: %llx , %llx , %llx , %llx \n", hl_result, localGpa, hl_sptes[0], hl_sptes[1], hl_sptes[2], hl_sptes[3]);
>> ...................
>> }
>>
>> The execution results are:
>>
>> xxxx kernel: [ 4044.020816] hypercall f002, 1201000
>> xxxx kernel: [ 4044.020819] local gfn is 70280 , result of kvm_age_hva is 1
>> xxxx kernel: [ 4044.020823] return result is 4 , gpa: 70280000 sptes: 13c2aa275 , 1304ff007 , 15eb3d007 , 15eb3e007
>>
>> My feeling is seems that I have to modify something else instead of spte alone.
>
> Aha.
>
> There two issues i found:
>
> - you should use kvm_mmu_gva_to_gpa_write instead of kvm_mmu_gva_to_gpa_read, since
>   if the page in guest is readonly, it will trigger COW and switch to a new page
>
> - you also need to do some work on page fault path to avoid setting W bit on the spte
>

Thanks for the quick reply. BTW, I am using the KVM 2.6.32.27 kernel
module and virt-manager to manage the guest. The host is Ubuntu 10.04
with kernel 2.6.32.33.

I have switched to kvm_mmu_gva_to_gpa_write. I have also added extra
printk messages to the page_fault, tdp_page_fault, and inject_page_fault
functions, but none of them prints anything when I write to the memory
whose spte was changed to read-only. I also tried to trace when
__set_spte is called after I modify the spte, without any luck. So I
really want to know where the problem is.
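For reference, the low three bits of the spte values in the logs above
can be decoded in user space. This is only a minimal sketch, assuming
the Intel EPT layout in which bit 0 is read, bit 1 is write, and bit 2
is execute. The helper names (ept_set_perm, ept_is_writable,
ept_print_perm) are hypothetical and are not KVM functions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed EPT permission layout: bit 0 = read, bit 1 = write, bit 2 = execute. */
#define EPT_READ      (1ULL << 0)
#define EPT_WRITE     (1ULL << 1)
#define EPT_EXEC      (1ULL << 2)
#define EPT_PERM_MASK (EPT_READ | EPT_WRITE | EPT_EXEC)

/* Hypothetical helper (not a KVM function): replace the R/W/X bits of an
 * entry, mirroring "(spte & 0xFFFFFFFFFFFFFFF8) | mask" in the handler. */
uint64_t ept_set_perm(uint64_t spte, uint64_t perm)
{
        assert((perm & ~EPT_PERM_MASK) == 0); /* only the low three bits */
        return (spte & ~EPT_PERM_MASK) | perm;
}

/* Hypothetical helper: nonzero if the write bit is set. */
int ept_is_writable(uint64_t spte)
{
        return (spte & EPT_WRITE) != 0;
}

/* Hypothetical helper: print an entry with its decoded permissions. */
void ept_print_perm(uint64_t spte)
{
        printf("%llx: %c%c%c\n", (unsigned long long)spte,
               (spte & EPT_READ)  ? 'r' : '-',
               (spte & EPT_WRITE) ? 'w' : '-',
               (spte & EPT_EXEC)  ? 'x' : '-');
}
```

Under that assumption, a mask of 5 means read plus execute without
write. The logged spte[0] value 16c7bd275 decodes as r-x, while the
upper-level entries such as 1304c7007 decode as rwx, so the write bit
was indeed clear at the moment the hierarchy was read back.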
As Davidlohr mentions, this is a basic technique that appears in many
papers, which is why I chose it as a study case.

I am also running another experiment. A comment in the code (the
FNAME(page_fault) function in paging_tmpl.h) says that the page fault
handler is triggered by a "normal guest page fault due to the guest pte
marked not present, not writable, or not executable". In my user program
in the guest OS, I used the mprotect system call to make a guest page
read-only and then wrote to it. The guest Linux kernel handles this with
a segmentation fault, and page_fault is not actually called in KVM. I
don't understand why KVM would want to intercept a guest page fault and
force a VM exit; I believe that would be a performance issue in theory.

Thanks for helping me. Even though I still don't know the answer, this
is an interesting study process.

Best,

Hui

--
Hui Lin
PhD Candidate, Research Assistant
Electrical and Computer Engineering Department
University of Illinois at Urbana-Champaign
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html