On Sun, Sep 2, 2012 at 8:29 AM, Xiao Guangrong <xiaoguangrong@xxxxxxxxxxxxxxxxxx> wrote:
> On 09/01/2012 05:30 AM, Hui Lin (Hugo) wrote:
>> On Thu, Aug 30, 2012 at 9:54 PM, Xiao Guangrong
>> <xiaoguangrong@xxxxxxxxxxxxxxxxxx> wrote:
>>> On 08/31/2012 02:59 AM, Hugo wrote:
>>>> On Thu, Aug 30, 2012 at 5:22 AM, Xiao Guangrong
>>>> <xiaoguangrong@xxxxxxxxxxxxxxxxxx> wrote:
>>>>> On 08/28/2012 11:30 AM, Felix wrote:
>>>>>> Xiao Guangrong <xiaoguangrong <at> linux.vnet.ibm.com> writes:
>>>>>>
>>>>>>>
>>>>>>> On 07/31/2012 01:18 AM, Sunil wrote:
>>>>>>>> Hello List,
>>>>>>>>
>>>>>>>> I am a KVM newbie studying the KVM MMU code.
>>>>>>>>
>>>>>>>> On an existing guest, I am trying to track all guest writes by
>>>>>>>> marking the page table entry as read-only in the EPT entry [I am
>>>>>>>> using an Intel machine with VMX and EPT support]. It looks like EPT
>>>>>>>> support re-uses the shadow page table (SPT) code and hence some of
>>>>>>>> the SPT routines.
>>>>>>>>
>>>>>>>> I was thinking of the approach below: use pte_list_walk() to
>>>>>>>> traverse the list of sptes and use mmu_spte_update() to flip the
>>>>>>>> PT_WRITABLE_MASK flag. But the sptes are not all on one single
>>>>>>>> list; they are on separate lists (based on gfn, page level, and
>>>>>>>> memory slot). So, would recording all the faulted guest GFNs and
>>>>>>>> then using the above method work?
>>>>>>>
>>>>>>> There are two ways to write-protect all sptes:
>>>>>>> - use kvm_mmu_slot_remove_write_access() on all memslots
>>>>>>> - walk the shadow page cache to get the shadow pages at the highest
>>>>>>>   level (level = 4 on EPT), then write-protect their entries.
>>>>>>>
>>>>>>> If you just want to do it for a specified gfn, you can use
>>>>>>> rmap_write_protect().
>>>>>>>
>>>>>>> Just inquisitive, what is your purpose?
>>>>>>> :)
>>>>>>>
>>>>>> Hi Guangrong,
>>>>>>
>>>>>> I have done similar things to what Sunil did, simply for study
>>>>>> purposes. However, I found some very weird situations. Basically, in
>>>>>> the guest VM, I allocate a chunk of memory (the size of a page) in a
>>>>>> user-level program. Through a guest kernel module and my
>>>>>> self-defined hypercall, I pass the gva of this memory to KVM. Then I
>>>>>> try different methods in the hypercall handler to write-protect this
>>>>>> page of memory. Note that I want to write-protect it through EPT
>>>>>> instead of through the guest page tables.
>>>>>>
>>>>>> 1. I use kvm_mmu_gva_to_gpa_read to translate the gva into a gpa.
>>>>>> Based on kvm_mmu_get_spte_hierarchy(vcpu, gpa, spte[4]), I change
>>>>>> the code to read sptep (the pointer to the spte) instead of the
>>>>>> spte, so I can modify the spte corresponding to this gpa. What I
>>>>>> observe is that I can modify spte[0] (I think this is the
>>>>>> lowest-level page table entry of the EPT table; I can successfully
>>>>>> modify it, as the changes are reflected in the result of calling
>>>>>> kvm_mmu_get_spte_hierarchy again), but my user-level program in the
>>>>>> VM can still write to this page.
>>>>>>
>>>>>> In your post, you mentioned "the shadow pages in the highest level
>>>>>> (level = 4 on EPT)", which I don't understand. Does this mean I have
>>>>>> to modify spte[3] instead of spte[0]? I tried modifying spte[1] and
>>>>>> spte[3]; both can cause a vmexit. So I am totally confused about the
>>>>>> meaning of "level" and its relation to the shadow page table. Can
>>>>>> you help me understand this?
>>>>>>
>>>>>> 2.
>>>>>> As suggested by this post, I also used rmap_write_protect() to
>>>>>> write-protect this page. With kvm_mmu_get_spte_hierarchy(vcpu, gpa,
>>>>>> spte[4]), I can still see that spte[0] gives me a result like
>>>>>> xxxxxx005, which means the function was called successfully. But I
>>>>>> can still write to this page.
>>>>>>
>>>>>> I even tried kvm_age_hva() to remove this spte, which gives me 0 for
>>>>>> spte[0], but I can still write to this page. So I am further
>>>>>> confused about the level used in the shadow page.
>>>>>
>>>>> kvm_mmu_get_spte_hierarchy gets sptes outside of the mmu-lock; you
>>>>> can hold spin_lock(&vcpu->kvm->mmu_lock) and use
>>>>> for_each_shadow_entry instead. And, after the change, did you flush
>>>>> all TLBs?
>>>>
>>>> I do apply the lock in my code and I do flush the TLB.
>>>>
>>>>> If it does not work, please post your code.
>>>>
>>>> Here is my code. The modifications are made in x86/x86.c.
>>>>
>>>> KVM_HC_HL_EPTPER is my hypercall number.
>>>>
>>>> Method 1:
>>>>
>>>> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu){
>>>> ................
>>>>
>>>>         case KVM_HC_HL_EPTPER:
>>>>                 /* This method is not working */
>>>>                 localGpa = kvm_mmu_gva_to_gpa_read(vcpu, a0, &localEx);
>>>>                 if (localGpa == UNMAPPED_GVA) {
>>>>                         printk("read is not correct\n");
>>>>                         return -KVM_ENOSYS;
>>>>                 }
>>>>
>>>>                 hl_kvm_mmu_update_spte(vcpu, localGpa, 5);
>>>>                 hl_result = kvm_mmu_get_spte_hierarchy(vcpu, localGpa,
>>>>                                                        hl_sptes);
>>>>
>>>>                 printk("after changes return result is %d , gpa: %llx "
>>>>                        "sptes: %llx , %llx , %llx , %llx \n",
>>>>                        hl_result, localGpa, hl_sptes[0], hl_sptes[1],
>>>>                        hl_sptes[2], hl_sptes[3]);
>>>>                 kvm_flush_remote_tlbs(vcpu->kvm);
>>>> ...................
>>>> }
>>>>
>>>> The function hl_kvm_mmu_update_spte is defined as:
>>>>
>>>> int hl_kvm_mmu_update_spte(struct kvm_vcpu *vcpu, u64 addr, u64 mask)
>>>> {
>>>>         struct kvm_shadow_walk_iterator iterator;
>>>>         int nr_sptes = 0;
>>>>         u64 sptes[4];
>>>>         u64 *sptep[4];
>>>>         u64 localMask = 0xFFFFFFFFFFFFFFF8; /* keep all but the low
>>>>                                                3 permission bits */
>>>>
>>>>         spin_lock(&vcpu->kvm->mmu_lock);
>>>>         for_each_shadow_entry(vcpu, addr, iterator) {
>>>>                 sptes[iterator.level-1] = *iterator.sptep;
>>>>                 sptep[iterator.level-1] = iterator.sptep;
>>>>                 nr_sptes++;
>>>>                 if (!is_shadow_present_pte(*iterator.sptep))
>>>>                         break;
>>>>         }
>>>>
>>>>         sptes[0] = sptes[0] & localMask;
>>>>         sptes[0] = sptes[0] | mask;
>>>>         __set_spte(sptep[0], sptes[0]);
>>>>         /* update_spte(sptep[0], sptes[0]); */
>>>>         /*
>>>>         sptes[1] = sptes[1] & localMask;
>>>>         sptes[1] = sptes[1] | mask;
>>>>         update_spte(sptep[1], sptes[1]);
>>>>         */
>>>>         /*
>>>>         sptes[3] = sptes[3] & localMask;
>>>>         sptes[3] = sptes[3] | mask;
>>>>         update_spte(sptep[3], sptes[3]);
>>>>         */
>>>>         spin_unlock(&vcpu->kvm->mmu_lock);
>>>>
>>>>         return nr_sptes;
>>>> }
>>>>
>>>> The execution results are from kern.log:
>>>>
>>>> xxxx kernel: [ 4371.002579] hypercall f002, a71000
>>>> xxxx kernel: [ 4371.002581] after changes return result is 4 , gpa:
>>>> 723ae000 sptes: 16c7bd275 , 1304c7007 , 136d6f007 , 13cc88007
>>>>
>>>> I find that when I write to this page, the write-protect permission
>>>> bit is actually set back to writable. I am not quite sure why.
>>>>
>>>> Method 2:
>>>>
>>>> int kvm_emulate_hypercall(struct kvm_vcpu *vcpu){
>>>> ................
>>>>
>>>>         case KVM_HC_HL_EPTPER:
>>>>                 /* This method is not working */
>>>>                 localGpa = kvm_mmu_gva_to_gpa_read(vcpu, a0, &localEx);
>>>>                 localGfn = gpa_to_gfn(localGpa);
>>>>
>>>>                 spin_lock(&vcpu->kvm->mmu_lock);
>>>>                 hl_result = rmap_write_protect(vcpu->kvm, localGfn);
>>>>                 printk("local gfn is %llx , result of kvm_age_hva is "
>>>>                        "%d\n", localGfn, hl_result);
>>>>                 kvm_flush_remote_tlbs(vcpu->kvm);
>>>>                 spin_unlock(&vcpu->kvm->mmu_lock);
>>>>
>>>>                 hl_result = kvm_mmu_get_spte_hierarchy(vcpu, localGpa,
>>>>                                                        hl_sptes);
>>>>                 printk("return result is %d , gpa: %llx sptes: %llx , "
>>>>                        "%llx , %llx , %llx \n", hl_result, localGpa,
>>>>                        hl_sptes[0], hl_sptes[1], hl_sptes[2],
>>>>                        hl_sptes[3]);
>>>> ...................
>>>> }
>>>>
>>>> The execution results are:
>>>>
>>>> xxxx kernel: [ 4044.020816] hypercall f002, 1201000
>>>> xxxx kernel: [ 4044.020819] local gfn is 70280 , result of kvm_age_hva is 1
>>>> xxxx kernel: [ 4044.020823] return result is 4 , gpa: 70280000 sptes:
>>>> 13c2aa275 , 1304ff007 , 15eb3d007 , 15eb3e007
>>>>
>>>> My feeling is that I have to modify something else besides the spte
>>>> alone.
>>>
>>> Aha.
>>>
>>> There are two issues I found:
>>>
>>> - you should use kvm_mmu_gva_to_gpa_write instead of
>>>   kvm_mmu_gva_to_gpa_read, since if the page in the guest is
>>>   read-only, it will trigger COW and switch to a new page
>>>
>>> - you also need to do some work on the page fault path to avoid
>>>   setting the W bit on the spte
>>
>> Thanks for the quick reply.
>>
>> BTW, I am using the KVM 2.6.32.27 kernel module and virt-manager to
>> run the guest. The host is Ubuntu 10.04 with kernel 2.6.32.33.
>>
>> I have changed to use the kvm_mmu_gva_to_gpa_write function.
>>
>> I am also putting extra printk messages into the page_fault,
>> tdp_page_fault, and inject_page_fault functions; none of them gives
>
> Could you show these changes please?
What I did in tdp_page_fault and inject_page_fault is simple: I added
the same piece of code at the beginning of each function. The
target_gpa is set in x86/x86.c by the vmcall handler:

/////
if (gpa == target_gpa) {
        printk("XXXX Debug %llx \n", gpa);
}
/////

This way, no crazy kernel logs are made.

>
>> me any information when I write to the memory whose spte is changed
>> to read-only. I also tried to trace when __set_spte is called after I
>
> Try to add some debug messages in mmu_spte_set and mmu_spte_update.
>
>> modify the spte. I still don't have any luck, so I really want to know
>> where the problem is. As Davidlohr mentioned, this is a basic
>> technique that I found in many papers, which is why I used it as a
>> study case.
>
> You'd better show what you did in the guest OS.

What I did in the guest OS includes two parts:

Kernel level: a pseudo device driver with read and write functions. The
write function accepts the virtual address defined in a user program
and then passes this virtual address to KVM through vmcall. This is the
basic device driver module introduced in Linux Device Drivers.

Guest user level: I allocate a page of memory in the program's address
space:

        pagesize = sysconf(_SC_PAGE_SIZE);
        if (pagesize == -1) {
                printf("sysconf error\n");
                return -1;
        }

        /* buffer = (char*)memalign(pagesize, pagesize); */
        ori = (char*)malloc(1024 + pagesize - 1);
        if (ori == NULL) {
                printf("memalign\n");
                return -1;
        }
        /* note: cast through unsigned long, not int, so the pointer
           is not truncated on 64-bit */
        buffer = (char *)(((unsigned long) ori + pagesize - 1)
                          & ~(pagesize - 1));
        address = (unsigned long) buffer;

Then pass the "address" to the kernel module:

        size = write(fd, &address, sizeof(unsigned long));

>
>> There is another experiment that I am doing. The code comments say
>> that the page fault handler will be triggered by a "normal guest page
>> fault due to the guest pte marked not present, not writable, or not
>> executable" (the FNAME(page_fault) function in paging_tmpl.h).
>> I have used the mprotect system call in my user program in the guest
>> OS to set a guest page as read-only, and then written to this page. In
>> the Linux kernel, this is handled as a seg fault; page_fault is
>> actually not called in KVM. I don't get it: why would KVM want to
>
> cat /sys/module/kvm_intel/parameters/ept; if it is 'Y', this is
> normal. If it is 'N', what you see is beyond me. :)

Luckily, the answer is 'Y'.

>> interfere with the guest page fault and force a VM exit? I believe
>> there is a performance issue in theory.
>
> If ept/npt is used, KVM does not care about #PF in the guest;
> FNAME(page_fault) is used when ept/npt is unsupported.

Ah, I see. I got it.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html