On Sun, Sep 2, 2012 at 8:29 AM, Xiao Guangrong <xiaoguangrong@xxxxxxxxxxxxxxxxxx> wrote:
> On 09/01/2012 05:30 AM, Hui Lin (Hugo) wrote:
>> On Thu, Aug 30, 2012 at 9:54 PM, Xiao Guangrong
>> <xiaoguangrong@xxxxxxxxxxxxxxxxxx> wrote:
>>> On 08/31/2012 02:59 AM, Hugo wrote:
>>>> On Thu, Aug 30, 2012 at 5:22 AM, Xiao Guangrong
>>>> <xiaoguangrong@xxxxxxxxxxxxxxxxxx> wrote:
>>>>> On 08/28/2012 11:30 AM, Felix wrote:
>>>>>> Xiao Guangrong <xiaoguangrong <at> linux.vnet.ibm.com> writes:
>>>>>>
>>>>>>>
>>>>>>> On 07/31/2012 01:18 AM, Sunil wrote:
>>>>>>>> Hello List,
>>>>>>>>
>>>>>>>> I am a KVM newbie studying the KVM MMU code.
>>>>>>>>
>>>>>>>> On an existing guest, I am trying to track all guest writes by
>>>>>>>> marking the page table entry as read-only in the EPT entry [I am
>>>>>>>> using an Intel machine with VMX and EPT support]. It looks like EPT
>>>>>>>> support re-uses the shadow page table (SPT) code and hence some of
>>>>>>>> the SPT routines.
>>>>>>>>
>>>>>>>> I was thinking of the approach below: use pte_list_walk() to
>>>>>>>> traverse the list of sptes and use mmu_spte_update() to flip the
>>>>>>>> PT_WRITABLE_MASK flag. But the sptes are not all on one single
>>>>>>>> list; they are on separate lists (based on gfn, page level, and
>>>>>>>> memory slot). So, would recording all the faulted guest GFNs and
>>>>>>>> then using the above method work?
>>>>>>>
>>>>>>> There are two ways to write-protect all sptes:
>>>>>>> - use kvm_mmu_slot_remove_write_access() on all memslots
>>>>>>> - walk the shadow page cache to get the shadow pages at the highest
>>>>>>>   level (level = 4 on EPT), then write-protect their entries.
>>>>>>>
>>>>>>> If you just want to do it for a specified gfn, you can use
>>>>>>> rmap_write_protect().
>>>>>>>
>>>>>>> Just inquisitive, what is your purpose?
>>>>>>> :)
>>>>>>>
>>>>>> Hi Guangrong,
>>>>>>
>>>>>> I have done similar things to what Sunil did, simply for study
>>>>>> purposes. However, I found some very weird situations. Basically, in
>>>>>> the guest VM, I allocate a chunk of memory (the size of a page) in a
>>>>>> user-level program. Through a guest kernel module and my
>>>>>> self-defined hypercall, I pass the gva of this memory to KVM. Then I
>>>>>> try different methods in the hypercall handler to write-protect this
>>>>>> page of memory. Note that I want to write-protect it through EPT
>>>>>> instead of through the guest page tables.
>>>>>>
>>>>>> 1. I use kvm_mmu_gva_to_gpa_read to translate the gva into a gpa.
>>>>>> Based on kvm_mmu_get_spte_hierarchy(vcpu, gpa, spte[4]), I change
>>>>>> the code to read sptep (the pointer to the spte) instead of the
>>>>>> spte, so I can modify the spte corresponding to this gpa. What I
>>>>>> observe is that I can modify spte[0] (I think this is the
>>>>>> lowest-level page table entry of the EPT table; I can successfully
>>>>>> modify it, as the changes are reflected in the result of calling
>>>>>> kvm_mmu_get_spte_hierarchy again), but my user-level program in the
>>>>>> VM can still write to this page.
>>>>>>
>>>>>> In your post, you mentioned "the shadow pages in the highest level
>>>>>> (level = 4 on EPT)", which I don't understand. Does this mean I have
>>>>>> to modify spte[3] instead of spte[0]? I tried modifying spte[1] and
>>>>>> spte[3]; both can cause a vmexit. So I am totally confused about the
>>>>>> meaning of "level" and its relation to the shadow page table. Can
>>>>>> you help me understand this?
>>>>>>
>>>>>> 2.
>>>>>> As suggested by this post, I also used rmap_write_protect() to
>>>>>> write-protect this page. With kvm_mmu_get_spte_hierarchy(vcpu, gpa,
>>>>>> spte[4]), I can still see that spte[0] gives me a result like
>>>>>> xxxxxx005, which means the function was called successfully. But I
>>>>>> can still write to this page.
>>>>>>
>>>>>> I even tried kvm_age_hva() to remove this spte, which gives me 0 for
>>>>>> spte[0], but I can still write to this page. So I am further
>>>>>> confused about the level used in the shadow page.
>>>>>
>>>>> kvm_mmu_get_spte_hierarchy gets sptes outside of the mmu-lock; you
>>>>> can hold spin_lock(&vcpu->kvm->mmu_lock) and use
>>>>> for_each_shadow_entry instead. And, after the change, did you flush
>>>>> all TLBs?
>>>>
>>>> I do apply the lock in my code and I do flush the TLB.
>>>>
>>>>> If it does not work, please post your code.
>>>>
>>>> Here is my code. The modifications are made in x86/x86.c.
>>>>
>>>> KVM_HC_HL_EPTPER is my hypercall number.
>>>>
>>>> Method 1:
>>>>
>>>> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu){
>>>> ................
>>>>
>>>>         case KVM_HC_HL_EPTPER:
>>>>                 /* This method is not working */
>>>>                 localGpa = kvm_mmu_gva_to_gpa_read(vcpu, a0, &localEx);
>>>>                 if (localGpa == UNMAPPED_GVA) {
>>>>                         printk("read is not correct\n");
>>>>                         return -KVM_ENOSYS;
>>>>                 }
>>>>
>>>>                 hl_kvm_mmu_update_spte(vcpu, localGpa, 5);
>>>>                 hl_result = kvm_mmu_get_spte_hierarchy(vcpu, localGpa,
>>>>                                                        hl_sptes);
>>>>
>>>>                 printk("after changes return result is %d , gpa: %llx "
>>>>                        "sptes: %llx , %llx , %llx , %llx \n",
>>>>                        hl_result, localGpa, hl_sptes[0], hl_sptes[1],
>>>>                        hl_sptes[2], hl_sptes[3]);
>>>>                 kvm_flush_remote_tlbs(vcpu->kvm);
>>>> ...................
>>>> }
>>>>
>>>> The function hl_kvm_mmu_update_spte is defined as:
>>>>
>>>> int hl_kvm_mmu_update_spte(struct kvm_vcpu *vcpu, u64 addr, u64 mask)
>>>> {
>>>>         struct kvm_shadow_walk_iterator iterator;
>>>>         int nr_sptes = 0;
>>>>         u64 sptes[4];
>>>>         u64 *sptep[4];
>>>>         u64 localMask = 0xFFFFFFFFFFFFFFF8; /* keep all but the low
>>>>                                                3 permission bits */
>>>>
>>>>         spin_lock(&vcpu->kvm->mmu_lock);
>>>>         for_each_shadow_entry(vcpu, addr, iterator) {
>>>>                 sptes[iterator.level-1] = *iterator.sptep;
>>>>                 sptep[iterator.level-1] = iterator.sptep;
>>>>                 nr_sptes++;
>>>>                 if (!is_shadow_present_pte(*iterator.sptep))
>>>>                         break;
>>>>         }
>>>>
>>>>         sptes[0] = sptes[0] & localMask;
>>>>         sptes[0] = sptes[0] | mask;
>>>>         __set_spte(sptep[0], sptes[0]);
>>>>         /* update_spte(sptep[0], sptes[0]); */
>>>>         /*
>>>>         sptes[1] = sptes[1] & localMask;
>>>>         sptes[1] = sptes[1] | mask;
>>>>         update_spte(sptep[1], sptes[1]);
>>>>         */
>>>>         /*
>>>>         sptes[3] = sptes[3] & localMask;
>>>>         sptes[3] = sptes[3] | mask;
>>>>         update_spte(sptep[3], sptes[3]);
>>>>         */
>>>>         spin_unlock(&vcpu->kvm->mmu_lock);
>>>>
>>>>         return nr_sptes;
>>>> }
>>>>
>>>> The execution results are from kern.log:
>>>>
>>>> xxxx kernel: [ 4371.002579] hypercall f002, a71000
>>>> xxxx kernel: [ 4371.002581] after changes return result is 4 , gpa:
>>>> 723ae000 sptes: 16c7bd275 , 1304c7007 , 136d6f007 , 13cc88007
>>>>
>>>> I find that when I write to this page, the write-protect permission
>>>> bit is actually set back to writable. I am not quite sure why.
>>>>
>>>> Method 2:
>>>>
>>>> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu){
>>>> ................
>>>>
>>>>         case KVM_HC_HL_EPTPER:
>>>>                 /* This method is not working */
>>>>                 localGpa = kvm_mmu_gva_to_gpa_read(vcpu, a0, &localEx);
>>>>                 localGfn = gpa_to_gfn(localGpa);
>>>>
>>>>                 spin_lock(&vcpu->kvm->mmu_lock);
>>>>                 hl_result = rmap_write_protect(vcpu->kvm, localGfn);
>>>>                 printk("local gfn is %llx , result of kvm_age_hva is "
>>>>                        "%d\n", localGfn, hl_result);
>>>>                 kvm_flush_remote_tlbs(vcpu->kvm);
>>>>                 spin_unlock(&vcpu->kvm->mmu_lock);
>>>>
>>>>                 hl_result = kvm_mmu_get_spte_hierarchy(vcpu, localGpa,
>>>>                                                        hl_sptes);
>>>>                 printk("return result is %d , gpa: %llx sptes: %llx , "
>>>>                        "%llx , %llx , %llx \n", hl_result, localGpa,
>>>>                        hl_sptes[0], hl_sptes[1], hl_sptes[2],
>>>>                        hl_sptes[3]);
>>>> ...................
>>>> }
>>>>
>>>> The execution results are:
>>>>
>>>> xxxx kernel: [ 4044.020816] hypercall f002, 1201000
>>>> xxxx kernel: [ 4044.020819] local gfn is 70280 , result of kvm_age_hva is 1
>>>> xxxx kernel: [ 4044.020823] return result is 4 , gpa: 70280000 sptes:
>>>> 13c2aa275 , 1304ff007 , 15eb3d007 , 15eb3e007
>>>>
>>>> My feeling is that I have to modify something else besides the spte
>>>> alone.
>>>
>>> Aha.
>>>
>>> There are two issues I found:
>>>
>>> - you should use kvm_mmu_gva_to_gpa_write instead of
>>>   kvm_mmu_gva_to_gpa_read, since if the page in the guest is
>>>   read-only, it will trigger COW and switch to a new page
>>>
>>> - you also need to do some work on the page fault path to avoid
>>>   setting the W bit on the spte
>>
>> Thanks for the quick reply.
>>
>> BTW, I am using the KVM 2.6.32.27 kernel module and virt-manager to
>> run the guest. The host is Ubuntu 10.04 with kernel 2.6.32.33.
>>
>> I have changed to use the kvm_mmu_gva_to_gpa_write function.
>>
>> I am also putting extra printk messages into the page_fault,
>> tdp_page_fault, and inject_page_fault functions; none of them gives
>
> Could you show these changes please?
What I did in tdp_page_fault and inject_page_fault is simple: I added
the same piece of code at the beginning of each function. The
target_gpa is set in x86/x86.c by the vmcall handler:

/////
if (gpa == target_gpa) {
        printk("XXXX Debug %llx \n", gpa);
}
/////

This way, no crazy kernel logs are made.

>
>> me any information when I write to the memory whose spte is changed
>> to read-only. I also tried to trace when __set_spte is called after I
>
> Try to add some debug messages in mmu_spte_set and mmu_spte_update.
>
>> modify the spte. I still don't have any luck, so I really want to know
>> where the problem is. As Davidlohr mentioned, this is a basic
>> technique that I found in many papers, which is why I used it as a
>> study case.
>
> You'd better show what you did in the guest OS.

What I did in the guest OS includes two parts:

Kernel level: a pseudo device driver with read and write functions. The
write function accepts the virtual address defined in a user program
and then passes this virtual address to KVM through vmcall. This is the
basic device driver module introduced in Linux Device Drivers.

Guest user level: I allocate a page of memory in the program's address
space:

        pagesize = sysconf(_SC_PAGE_SIZE);
        if (pagesize == -1) {
                printf("sysconf error\n");
                return -1;
        }

        /* buffer = (char*)memalign(pagesize, pagesize); */
        ori = (char*)malloc(1024 + pagesize - 1);
        if (ori == NULL) {
                printf("memalign\n");
                return -1;
        }
        /* note: cast through unsigned long, not int, so the pointer
           is not truncated on 64-bit */
        buffer = (char *)(((unsigned long) ori + pagesize - 1)
                          & ~(pagesize - 1));
        address = (unsigned long) buffer;

Then pass the "address" to the kernel module:

        size = write(fd, &address, sizeof(unsigned long));

>
>> There is another experiment that I am doing. The code comments say
>> that the page fault handler will be triggered by a "normal guest page
>> fault due to the guest pte marked not present, not writable, or not
>> executable" (the FNAME(page_fault) function in paging_tmpl.h).
>> I have used the mprotect system call in my user program in the guest
>> OS to set a guest page as read-only, and then written to this page. In
>> the Linux kernel, this is handled as a seg fault; page_fault is
>> actually not called in KVM. I don't get it: why would KVM want to
>
> cat /sys/module/kvm_intel/parameters/ept; if it is 'Y', this is
> normal. If it is 'N', what you see is beyond me. :)

Luckily, the answer is 'Y'.

>> interfere with the guest page fault and force a VM exit? I believe
>> there is a performance issue in theory.
>
> If ept/npt is used, KVM does not care about #PF in the guest;
> FNAME(page_fault) is used when ept/npt is unsupported.

Ah, I see. I got it.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html