On 09/01/2012 05:30 AM, Hui Lin (Hugo) wrote:
> On Thu, Aug 30, 2012 at 9:54 PM, Xiao Guangrong
> <xiaoguangrong@xxxxxxxxxxxxxxxxxx> wrote:
>> On 08/31/2012 02:59 AM, Hugo wrote:
>>> On Thu, Aug 30, 2012 at 5:22 AM, Xiao Guangrong
>>> <xiaoguangrong@xxxxxxxxxxxxxxxxxx> wrote:
>>>> On 08/28/2012 11:30 AM, Felix wrote:
>>>>> Xiao Guangrong <xiaoguangrong <at> linux.vnet.ibm.com> writes:
>>>>>
>>>>>> On 07/31/2012 01:18 AM, Sunil wrote:
>>>>>>> Hello List,
>>>>>>>
>>>>>>> I am a KVM newbie studying the KVM MMU code.
>>>>>>>
>>>>>>> On an existing guest, I am trying to track all guest writes by
>>>>>>> marking page table entries as read-only in the EPT [I am using an
>>>>>>> Intel machine with VMX and EPT support]. It looks like the EPT
>>>>>>> support re-uses the shadow page table (SPT) code and hence some of
>>>>>>> the SPT routines.
>>>>>>>
>>>>>>> I was thinking of the approach below: use pte_list_walk() to
>>>>>>> traverse the list of sptes and use mmu_spte_update() to clear the
>>>>>>> PT_WRITABLE_MASK flag. But the sptes are not all on a single list;
>>>>>>> they are on separate lists (based on gfn, page level and
>>>>>>> memory_slot). So, would recording all the faulted guest GFNs and
>>>>>>> then using the above method work?
>>>>>>
>>>>>> There are two ways to write-protect all sptes:
>>>>>> - use kvm_mmu_slot_remove_write_access() on all memslots
>>>>>> - walk the shadow page cache to get the shadow pages in the highest
>>>>>>   level (level = 4 on EPT), then write-protect their entries.
>>>>>>
>>>>>> If you just want to do it for a specific gfn, you can use
>>>>>> rmap_write_protect().
>>>>>>
>>>>>> Just curious, what is your purpose? :)
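
For reference, the first option above (write-protecting every memslot)
could look roughly like the untested sketch below. The helper name is
invented, and it assumes a tree where kvm_mmu_slot_remove_write_access()
takes (kvm, slot id) and where kvm_for_each_memslot()/memslot->id exist;
older kernels spell these differently.

/*
 * Untested sketch, not code from any tree: write-protect everything by
 * write-protecting each memslot.  Assumes the caller (e.g. a vcpu /
 * hypercall context) already holds kvm->srcu for the memslot walk.
 */
static void hl_write_protect_all_memslots(struct kvm *kvm)
{
    struct kvm_memory_slot *memslot;

    spin_lock(&kvm->mmu_lock);
    kvm_for_each_memslot(memslot, kvm_memslots(kvm))
        kvm_mmu_slot_remove_write_access(kvm, memslot->id);
    spin_unlock(&kvm->mmu_lock);

    /* harmless even if the tree already flushes inside the call above */
    kvm_flush_remote_tlbs(kvm);
}
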
>>>>> Hi, Guangrong,
>>>>>
>>>>> I have done similar things to what Sunil did, simply for study
>>>>> purposes. However, I found some very weird situations. Basically, in
>>>>> the guest VM I allocate a chunk of memory (the size of a page) in a
>>>>> user-level program. Through a guest kernel module and my self-defined
>>>>> hypercall, I pass the gva of this memory to kvm. Then I try different
>>>>> methods in the hypercall handler to write-protect this page of
>>>>> memory. You can see that I want to write-protect it through EPT
>>>>> instead of write-protecting it in the guest page tables.
>>>>>
>>>>> 1. I use kvm_mmu_gva_to_gpa_read to translate the gva into a gpa.
>>>>> Based on the function kvm_mmu_get_spte_hierarchy(vcpu, gpa, spte[4]),
>>>>> I changed the code to read sptep (the pointer to the spte) instead of
>>>>> the spte, so I can modify the spte corresponding to this gpa. What I
>>>>> observe is that I can modify spte[0] (I think this is the lowest-level
>>>>> page table entry of the EPT table; the change is reflected in the
>>>>> result of calling kvm_mmu_get_spte_hierarchy again), but my user-level
>>>>> program in the VM can still write to this page.
>>>>>
>>>>> In your post you mentioned "the shadow pages in the highest level
>>>>> (level = 4 on EPT)", and I don't understand this part. Does it mean I
>>>>> have to modify spte[3] instead of spte[0]? I tried modifying spte[1]
>>>>> and spte[3], and both can cause a vmexit. So I am totally confused
>>>>> about the meaning of the levels used in the shadow page table and
>>>>> their relation to the shadow pages. Can you help me understand this?
>>>>>
>>>>> 2. As suggested by this post, I also used rmap_write_protect() to
>>>>> write-protect this page. With kvm_mmu_get_spte_hierarchy(vcpu, gpa,
>>>>> spte[4]) I can still see that spte[0] gives me a value like
>>>>> xxxxxx005, which means the function was called successfully. But I
>>>>> can still write to this page.
>>>>>
>>>>> I even tried kvm_age_hva() to remove this spte; this gives me 0 for
>>>>> spte[0], but I can still write to this page. So I am further confused
>>>>> about the level used in the shadow pages.
>>>>
>>>> kvm_mmu_get_spte_hierarchy gets the sptes outside of the mmu-lock; you
>>>> can hold spin_lock(&vcpu->kvm->mmu_lock) and use for_each_shadow_entry
>>>> instead. And, after the change, did you flush all TLBs?
>>>
>>> I do take the lock in my code and I do flush the TLB.
>>>
>>>> If it still does not work, please post your code.
>>>
>>> Here is my code. The modifications are made in x86/x86.c.
>>> KVM_HC_HL_EPTPER is my hypercall number.
>>>
>>> Method 1:
>>>
>>> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu){
>>> ................
>>>     case KVM_HC_HL_EPTPER:
>>>         /* this method is not working */
>>>         localGpa = kvm_mmu_gva_to_gpa_read(vcpu, a0, &localEx);
>>>         if (localGpa == UNMAPPED_GVA) {
>>>             printk("read is not correct\n");
>>>             return -KVM_ENOSYS;
>>>         }
>>>
>>>         hl_kvm_mmu_update_spte(vcpu, localGpa, 5);
>>>         hl_result = kvm_mmu_get_spte_hierarchy(vcpu, localGpa, hl_sptes);
>>>
>>>         printk("after changes return result is %d , gpa: %llx "
>>>                "sptes: %llx , %llx , %llx , %llx \n",
>>>                hl_result, localGpa, hl_sptes[0], hl_sptes[1],
>>>                hl_sptes[2], hl_sptes[3]);
>>>         kvm_flush_remote_tlbs(vcpu->kvm);
>>> ...................
>>> }
>>>
>>> The function hl_kvm_mmu_update_spte is defined as:
>>>
>>> int hl_kvm_mmu_update_spte(struct kvm_vcpu *vcpu, u64 addr, u64 mask)
>>> {
>>>     struct kvm_shadow_walk_iterator iterator;
>>>     int nr_sptes = 0;
>>>     u64 sptes[4];
>>>     u64 *sptep[4];
>>>     u64 localMask = 0xFFFFFFFFFFFFFFF8; /* clear the low 3 bits */
>>>
>>>     spin_lock(&vcpu->kvm->mmu_lock);
>>>     for_each_shadow_entry(vcpu, addr, iterator) {
>>>         sptes[iterator.level - 1] = *iterator.sptep;
>>>         sptep[iterator.level - 1] = iterator.sptep;
>>>         nr_sptes++;
>>>         if (!is_shadow_present_pte(*iterator.sptep))
>>>             break;
>>>     }
>>>
>>>     sptes[0] = sptes[0] & localMask;
>>>     sptes[0] = sptes[0] | mask;
>>>     __set_spte(sptep[0], sptes[0]);
>>>     /* update_spte(sptep[0], sptes[0]); */
>>>     /*
>>>     sptes[1] = sptes[1] & localMask;
>>>     sptes[1] = sptes[1] | mask;
>>>     update_spte(sptep[1], sptes[1]);
>>>     */
>>>     /*
>>>     sptes[3] = sptes[3] & localMask;
>>>     sptes[3] = sptes[3] | mask;
>>>     update_spte(sptep[3], sptes[3]);
>>>     */
>>>     spin_unlock(&vcpu->kvm->mmu_lock);
>>>
>>>     return nr_sptes;
>>> }
>>>
>>> The execution results are from kern.log:
>>>
>>> xxxx kernel: [ 4371.002579] hypercall f002, a71000
>>> xxxx kernel: [ 4371.002581] after changes return result is 4 , gpa:
>>> 723ae000 sptes: 16c7bd275 , 1304c7007 , 136d6f007 , 13cc88007
>>>
>>> I find that if I write to this page, the write-protected permission bit
>>> is actually set back to writable. I am not quite sure why.
>>>
>>> Method 2:
>>>
>>> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu){
>>> ................
>>>     case KVM_HC_HL_EPTPER:
>>>         /* this method is not working */
>>>         localGpa = kvm_mmu_gva_to_gpa_read(vcpu, a0, &localEx);
>>>         localGfn = gpa_to_gfn(localGpa);
>>>
>>>         spin_lock(&vcpu->kvm->mmu_lock);
>>>         hl_result = rmap_write_protect(vcpu->kvm, localGfn);
>>>         printk("local gfn is %llx , result of kvm_age_hva is %d\n",
>>>                localGfn, hl_result);
>>>         kvm_flush_remote_tlbs(vcpu->kvm);
>>>         spin_unlock(&vcpu->kvm->mmu_lock);
>>>
>>>         hl_result = kvm_mmu_get_spte_hierarchy(vcpu, localGpa, hl_sptes);
>>>         printk("return result is %d , gpa: %llx sptes: %llx , "
>>>                "%llx , %llx , %llx \n", hl_result, localGpa,
>>>                hl_sptes[0], hl_sptes[1], hl_sptes[2], hl_sptes[3]);
>>> ...................
>>> }
>>>
>>> The execution results are:
>>>
>>> xxxx kernel: [ 4044.020816] hypercall f002, 1201000
>>> xxxx kernel: [ 4044.020819] local gfn is 70280 , result of kvm_age_hva is 1
>>> xxxx kernel: [ 4044.020823] return result is 4 , gpa: 70280000 sptes:
>>> 13c2aa275 , 1304ff007 , 15eb3d007 , 15eb3e007
>>>
>>> My feeling is that I have to modify something else instead of the spte
>>> alone.
>>
>> Aha.
>>
>> There are two issues I found:
>>
>> - you should use kvm_mmu_gva_to_gpa_write instead of
>>   kvm_mmu_gva_to_gpa_read, since if the page in the guest is read-only,
>>   it will trigger COW and switch to a new page
>>
>> - you also need to do some work on the page fault path to avoid setting
>>   the W bit on the spte
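
To make that second point more concrete, the kind of hook meant here
could look roughly like the sketch below. None of this is existing KVM
code: hl_tracked_gfn and hl_gfn_write_tracked are made-up names, and the
exact place to call it (set_spte() in trees of this era, or wherever your
kernel builds the new spte) depends on the version.

/* made-up per-module variable, set by the hypercall handler */
static gfn_t hl_tracked_gfn = ~(gfn_t)0;

static bool hl_gfn_write_tracked(gfn_t gfn)
{
    return gfn == hl_tracked_gfn;
}

    /*
     * In the fault path, just before the new spte value is installed,
     * refuse to grant write access to the tracked gfn, so the next
     * fault does not simply flip the W bit back on:
     */
    if (hl_gfn_write_tracked(gfn))
        spte &= ~PT_WRITABLE_MASK;
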
> Thanks for the quick reply.
>
> BTW, I am using the KVM 2.6.32.27 kernel module, and I use virt-manager
> to manage the guest. The host is Ubuntu 10.04 with kernel 2.6.32.33.
>
> I have changed to use the kvm_mmu_gva_to_gpa_write function.
>
> I am also putting extra printk messages into the page_fault,
> tdp_page_fault, and inject_page_fault functions, but none of them gives

Could you show these changes please?

> me any information when I write to the memory whose spte was changed to
> read-only. I also tried to trace when __set_spte is called after I

Try adding some debug messages in mmu_spte_set and mmu_spte_update.

> modify the spte. I still have no luck. So I really want to know where
> the problem is. As Davidlohr mentioned, this is a basic technique that I
> found in many papers, which is why I used it as a study case.

You'd better show what you did in the guest OS.

> There is another experiment that I am doing. It is said in the comments
> of the code that the page fault handler will be triggered by a "normal
> guest page fault due to the guest pte marked not present, not writable,
> or not executable" (the FNAME(page_fault) function in paging_tmpl.h). I
> have used the mprotect system call in my user program in the guest OS to
> set a guest page as read-only, and then written to this page. In the
> Linux kernel, this is handled as a segmentation fault; page_fault is
> actually not called in kvm. I don't get it: why does kvm want to

cat /sys/module/kvm_intel/parameters/ept; if it is 'Y', this is normal.
If it is 'N', what you see is beyond me. :)

> interfere with the guest page fault and force a vm exit? I believe there
> is a performance issue in theory.

If ept/npt is used, kvm does not care about #PFs in the guest;
FNAME(page_fault) is only used when ept/npt is unsupported.
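
As a guest-side illustration of that last point, a small test like the
untested sketch below never leaves the guest when EPT is enabled: the
guest kernel handles the #PF itself and delivers SIGSEGV, so
FNAME(page_fault) is never involved.

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static void segv_handler(int sig)
{
    /* the guest kernel handled the #PF and sent us SIGSEGV */
    static const char msg[] = "SIGSEGV inside the guest\n";
    (void)sig;
    write(1, msg, sizeof(msg) - 1);
    _exit(0);
}

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    char *buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    signal(SIGSEGV, segv_handler);
    memset(buf, 0, page);             /* fault the page in */
    mprotect(buf, page, PROT_READ);   /* guest-side write protection */
    buf[0] = 1;                       /* #PF handled by the guest kernel */
    printf("unexpected: the write succeeded\n");
    return 0;
}
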