On 08/31/2012 02:59 AM, Hugo wrote:
> On Thu, Aug 30, 2012 at 5:22 AM, Xiao Guangrong
> <xiaoguangrong@xxxxxxxxxxxxxxxxxx> wrote:
>> On 08/28/2012 11:30 AM, Felix wrote:
>>> Xiao Guangrong <xiaoguangrong <at> linux.vnet.ibm.com> writes:
>>>>
>>>> On 07/31/2012 01:18 AM, Sunil wrote:
>>>>> Hello List,
>>>>>
>>>>> I am a KVM newbie studying the KVM MMU code.
>>>>>
>>>>> On an existing guest, I am trying to track all guest writes by
>>>>> marking the page table entry as read-only in the EPT entry [I am
>>>>> using an Intel machine with VMX and EPT support]. It looks like
>>>>> the EPT support re-uses the shadow page table (SPT) code and hence
>>>>> some of the SPT routines.
>>>>>
>>>>> I was thinking of the approach below: use pte_list_walk() to
>>>>> traverse the list of sptes and use mmu_spte_update() to clear the
>>>>> PT_WRITABLE_MASK flag. But the sptes are not all on a single
>>>>> list; they are on separate lists (based on gfn, page level, and
>>>>> memory slot). So, would recording all the faulted guest gfns and
>>>>> then using the above method work?
>>>>
>>>> There are two ways to write-protect all sptes:
>>>> - use kvm_mmu_slot_remove_write_access() on all memslots
>>>> - walk the shadow page cache to get the shadow pages at the
>>>>   highest level (level = 4 on EPT), then write-protect their
>>>>   entries.
>>>>
>>>> If you just want to do it for a specific gfn, you can use
>>>> rmap_write_protect().
>>>>
>>>> Just inquisitive, what is your purpose? :)
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>>>> the body of a message to majordomo <at> vger.kernel.org
>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>> Hi, Guangrong,
>>>
>>> I have done similar things to what Sunil did, simply for study
>>> purposes. However, I found some very weird situations. Basically,
>>> in the guest VM, I allocate a chunk of memory (the size of a page)
>>> in a user-level program.
>>> Through a guest kernel-level module and my self-defined hypercall,
>>> I pass the gva of this memory to kvm. Then I try different methods
>>> in the hypercall handler to write-protect this page of memory. You
>>> can see that I want to write-protect it through EPT instead of
>>> write-protecting it in the guest page tables.
>>>
>>> 1. I use kvm_mmu_gva_to_gpa_read to translate the gva into a gpa.
>>> Based on the function kvm_mmu_get_spte_hierarchy(vcpu, gpa,
>>> spte[4]), I change the code to read sptep (the pointer to the spte)
>>> instead of the spte, so I can modify the spte corresponding to this
>>> gpa. What I observe is that if I modify spte[0] (I think this is
>>> the lowest-level page table entry of the EPT table; I can
>>> successfully modify it, as the changes are reflected in the result
>>> of calling kvm_mmu_get_spte_hierarchy again), my user-level program
>>> in the VM can still write to this page.
>>>
>>> In this post you mentioned "the shadow pages at the highest level
>>> (level = 4 on EPT)", and I don't understand this part. Does this
>>> mean I have to modify spte[3] instead of spte[0]? I tried modifying
>>> spte[1] and spte[3]; both can cause a vmexit. So I am totally
>>> confused about the meaning of "level" as used in the shadow page
>>> table and its relation to the shadow page table. Can you help me
>>> understand this?
>>>
>>> 2. As suggested by this post, I also use rmap_write_protect() to
>>> write-protect this page. With kvm_mmu_get_spte_hierarchy(vcpu, gpa,
>>> spte[4]), I can still see that spte[0] gives me a result like
>>> xxxxxx005, which means the function was called successfully. But I
>>> can still write to this page.
>>>
>>> I even tried the function kvm_age_hva() to remove this spte, which
>>> gives me 0 for spte[0], but I can still write to this page. So I am
>>> further confused about the level used in the shadow page.
>>
>> kvm_mmu_get_spte_hierarchy reads the sptes without holding the
>> mmu-lock; you can hold spin_lock(&vcpu->kvm->mmu_lock) and use
>> for_each_shadow_entry instead. And, after the change, did you flush
>> all TLBs?
>
> I do take the lock in my code and I do flush the TLB.
>
>> If it does not work, please post your code.
>
> Here is my code. The modifications are made in x86/x86.c.
>
> KVM_HC_HL_EPTPER is my hypercall number.
>
> Method 1:
>
> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu){
> ................
>
>         case KVM_HC_HL_EPTPER:
>                 /* This method is not working */
>
>                 localGpa = kvm_mmu_gva_to_gpa_read(vcpu, a0, &localEx);
>                 if (localGpa == UNMAPPED_GVA) {
>                         printk("read is not correct\n");
>                         return -KVM_ENOSYS;
>                 }
>
>                 hl_kvm_mmu_update_spte(vcpu, localGpa, 5);
>                 hl_result = kvm_mmu_get_spte_hierarchy(vcpu, localGpa,
>                                                        hl_sptes);
>
>                 printk("after changes return result is %d , gpa: %llx "
>                        "sptes: %llx , %llx , %llx , %llx \n",
>                        hl_result, localGpa, hl_sptes[0], hl_sptes[1],
>                        hl_sptes[2], hl_sptes[3]);
>                 kvm_flush_remote_tlbs(vcpu->kvm);
> ...................
> }
>
> The function hl_kvm_mmu_update_spte is defined as:
>
> int hl_kvm_mmu_update_spte(struct kvm_vcpu *vcpu, u64 addr, u64 mask)
> {
>         struct kvm_shadow_walk_iterator iterator;
>         int nr_sptes = 0;
>         u64 sptes[4];
>         u64 *sptep[4];
>         u64 localMask = 0xFFFFFFFFFFFFFFF8; /* clears low 3 bits */
>
>         spin_lock(&vcpu->kvm->mmu_lock);
>         for_each_shadow_entry(vcpu, addr, iterator) {
>                 sptes[iterator.level-1] = *iterator.sptep;
>                 sptep[iterator.level-1] = iterator.sptep;
>                 nr_sptes++;
>                 if (!is_shadow_present_pte(*iterator.sptep))
>                         break;
>         }
>
>         sptes[0] = sptes[0] & localMask;
>         sptes[0] = sptes[0] | mask;
>         __set_spte(sptep[0], sptes[0]);
>         //update_spte(sptep[0], sptes[0]);
>         /*
>         sptes[1] = sptes[1] & localMask;
>         sptes[1] = sptes[1] | mask;
>         update_spte(sptep[1], sptes[1]);
>         */
>         /*
>         sptes[3] = sptes[3] & localMask;
>         sptes[3] = sptes[3] | mask;
>         update_spte(sptep[3], sptes[3]);
>         */
>         spin_unlock(&vcpu->kvm->mmu_lock);
>
>         return nr_sptes;
> }
>
> The execution results are from kern.log:
>
> xxxx kernel: [ 4371.002579] hypercall f002, a71000
> xxxx kernel: [ 4371.002581] after changes return result is 4 , gpa:
> 723ae000 sptes: 16c7bd275 , 1304c7007 , 136d6f007 , 13cc88007
>
> I find that if I write to this page, the write-protected permission
> bit is actually set back to writable again. I am not quite sure why.
>
> Method 2:
>
> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu){
> ................
>
>         case KVM_HC_HL_EPTPER:
>                 /* This method is not working */
>                 localGpa = kvm_mmu_gva_to_gpa_read(vcpu, a0, &localEx);
>                 localGfn = gpa_to_gfn(localGpa);
>
>                 spin_lock(&vcpu->kvm->mmu_lock);
>                 hl_result = rmap_write_protect(vcpu->kvm, localGfn);
>                 printk("local gfn is %llx , result of kvm_age_hva is "
>                        "%d\n", localGfn, hl_result);
>                 kvm_flush_remote_tlbs(vcpu->kvm);
>                 spin_unlock(&vcpu->kvm->mmu_lock);
>
>                 hl_result = kvm_mmu_get_spte_hierarchy(vcpu, localGpa,
>                                                        hl_sptes);
>                 printk("return result is %d , gpa: %llx sptes: %llx , "
>                        "%llx , %llx , %llx \n", hl_result, localGpa,
>                        hl_sptes[0], hl_sptes[1], hl_sptes[2],
>                        hl_sptes[3]);
> ...................
> }
>
> The execution results are:
>
> xxxx kernel: [ 4044.020816] hypercall f002, 1201000
> xxxx kernel: [ 4044.020819] local gfn is 70280 , result of
> kvm_age_hva is 1
> xxxx kernel: [ 4044.020823] return result is 4 , gpa: 70280000 sptes:
> 13c2aa275 , 1304ff007 , 15eb3d007 , 15eb3e007
>
> My feeling is that it seems I have to modify something else instead
> of the spte alone.

Aha. There are two issues I found:

- you should use kvm_mmu_gva_to_gpa_write instead of
  kvm_mmu_gva_to_gpa_read, since if the page in the guest is read-only,
  it will trigger COW and switch to a new page

- you also need to do some work on the page fault path to avoid
  setting the W bit on the spte