Re: KVM: MMU: Tracking guest writes through EPT entries ?

Hugo <hugolin615@xxxxxxxxx> · Thu, 30 Aug 2012 13:59:59 -0500

On Thu, Aug 30, 2012 at 5:22 AM, Xiao Guangrong
<xiaoguangrong@xxxxxxxxxxxxxxxxxx> wrote:
> On 08/28/2012 11:30 AM, Felix wrote:
>> Xiao Guangrong <xiaoguangrong <at> linux.vnet.ibm.com> writes:
>>
>>>
>>> On 07/31/2012 01:18 AM, Sunil wrote:
>>>> Hello List,
>>>>
>>>> I am a KVM newbie and studying KVM mmu code.
>>>>
>>>> On the existing guest, I am trying to track all guest writes by
>>>> marking page table entry as read-only in EPT entry [ I am using Intel
>>>> machine with vmx and ept support ]. Looks like EPT support re-uses
>>>> shadow page table(SPT) code and hence some of SPT routines.
>>>>
>>>> I was thinking of below possible approach. Use pte_list_walk() to
>>>> traverse through list of sptes and use mmu_spte_update()  to flip the
>>>> PT_WRITABLE_MASK flag. But all SPTEs are not part of any single list;
>>>> but on separate lists (based on gfn, page level, memory_slot). So,
>>>> recording all the faulted guest GFN and then using above method work ?
>>>>
>>>
>>> There are two ways to write-protect all sptes:
>>> - use kvm_mmu_slot_remove_write_access() on all memslots
>>> - walk the shadow page cache to get the shadow pages in the highest level
>>>   (level = 4 on EPT), then write-protect its entries.
>>>
>>> If you just want to do it for the specified gfn, you can use
>>> rmap_write_protect().
>>>
>>> Just inquisitive, what is your purpose? :)
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>>> the body of a message to majordomo <at> vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>> Hi, Guangrong,
>>
>> I have done something similar to what Sunil did, purely for study purposes. However, I
>> found some very weird behavior. In the guest VM, I allocate a chunk
>> of memory (one page in size) in a user-level program. Through a guest kernel
>> module and my self-defined hypercall, I pass the gva of this memory to
>> kvm. Then I try different methods in the hypercall handler to write-protect this
>> page. I want to write-protect it through EPT instead
>> of write-protecting it in the guest page tables.
>>
>> 1. I use kvm_mmu_gva_to_gpa_read to translate the gva into a gpa. Based on the
>> function kvm_mmu_get_spte_hierarchy(vcpu, gpa, spte[4]), I changed the code to
>> read sptep (the pointer to the spte) instead of the spte, so I can modify the spte
>> corresponding to this gpa. What I observe is that if I modify spte[0] (I think
>> this is the lowest-level entry in the EPT table; I can
>> successfully modify it, as the change is reflected in the result of calling
>> kvm_mmu_get_spte_hierarchy again), my user-level program in the VM can still
>> write to this page.
>>
>> In your post, you mentioned "the shadow pages in the highest level
>> (level = 4 on EPT)". I don't understand this part. Does this mean I have to
>> modify spte[3] instead of spte[0]? I tried modifying spte[1] and spte[3]; both
>> cause a vmexit. So I am totally confused about the meaning of the level used in
>> the shadow page table and its relation to the shadow page table. Can you help me
>> understand this?
>>
>> 2. As suggested by this post, I also used rmap_write_protect() to write-protect
>> this page. With kvm_mmu_get_spte_hierarchy(vcpu, gpa, spte[4]), I can see
>> that spte[0] reads xxxxxx005, which means that the function was
>> called successfully. But I can still write to this page.
>>
>> I even tried the function kvm_age_hva() to remove this spte. This gives me 0 for
>> spte[0], but I can still write to this page. So I am further confused about the
>> level used in the shadow page.
>>
>
> kvm_mmu_get_spte_hierarchy reads sptes outside of mmu_lock; you can hold spin_lock(&vcpu->kvm->mmu_lock)
> and use for_each_shadow_entry instead. Also, after the change, did you flush all TLBs?

I do take the lock in my code, and I do flush the TLB.

>
> If it can not work, please post your code.
>

Here is my code. The modifications are made in x86/x86.c, in kvm_emulate_hypercall().

KVM_HC_HL_EPTPER is my hypercall number.

Method 1:

int kvm_emulate_hypercall(struct kvm_vcpu *vcpu){
                   ................

case KVM_HC_HL_EPTPER :
                //// This method is not working

                localGpa = kvm_mmu_gva_to_gpa_read(vcpu, a0, &localEx);
                if(localGpa == UNMAPPED_GVA){
                        printk("read is not correct\n");
                        return -KVM_ENOSYS;
                }

                hl_kvm_mmu_update_spte(vcpu, localGpa, 5);
                hl_result = kvm_mmu_get_spte_hierarchy(vcpu, localGpa, hl_sptes);

                printk("after changes return result is %d , gpa: %llx sptes: %llx , %llx , %llx , %llx \n",
                       hl_result, localGpa, hl_sptes[0], hl_sptes[1], hl_sptes[2], hl_sptes[3]);
                kvm_flush_remote_tlbs(vcpu->kvm);
                 ...................
}

The function hl_kvm_mmu_update_spte is defined as

int hl_kvm_mmu_update_spte(struct kvm_vcpu *vcpu, u64 addr, u64 mask)
{
        struct kvm_shadow_walk_iterator iterator;
        int nr_sptes = 0;
        u64 sptes[4] = { 0 };
        u64 *sptep[4] = { NULL };
        u64 localMask = 0xFFFFFFFFFFFFFFF8ULL;   /* clears the low 3 bits (EPT R/W/X) */

        spin_lock(&vcpu->kvm->mmu_lock);
        for_each_shadow_entry(vcpu, addr, iterator) {
                sptes[iterator.level-1] = *iterator.sptep;
                sptep[iterator.level-1] = iterator.sptep;
                nr_sptes++;
                if (!is_shadow_present_pte(*iterator.sptep))
                        break;
        }

        /* Only touch the leaf entry if the walk reached level 1 and it is present */
        if (sptep[0] && is_shadow_present_pte(sptes[0])) {
                sptes[0] = sptes[0] & localMask;
                sptes[0] = sptes[0] | mask ;
                __set_spte(sptep[0], sptes[0]);
        }
        //update_spte(sptep[0], sptes[0]);
/*
        sptes[1] = sptes[1] & localMask;
        sptes[1] = sptes[1] | mask ;
        update_spte(sptep[1], sptes[1]);
*/
/*

        sptes[3] = sptes[3] & localMask;
        sptes[3] = sptes[3] | mask ;
        update_spte(sptep[3], sptes[3]);
*/
        spin_unlock(&vcpu->kvm->mmu_lock);

        return nr_sptes;
}
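
To make the sptes[iterator.level-1] indexing above concrete, here is a small user-space sketch (not kernel code) of how a gpa splits into per-level table indices. It assumes 4 KiB pages and a 4-level EPT with 9 index bits per level; the helper names level_index() and level_span() are made up for illustration. Under those assumptions, sptes[0] is the leaf entry that maps a single 4 KiB page, while sptes[3] is the top-level entry, which covers 512 GiB of guest-physical space. If the gpa is mapped by a large page, the walk would stop above level 1 and this picture does not apply as-is.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, 9 index bits per level (4-level EPT). */
#define SKETCH_PAGE_SHIFT      12
#define SKETCH_BITS_PER_LEVEL  9

/* Index into the table at `level` (1 = leaf ... 4 = top) for a gpa. */
static unsigned int level_index(uint64_t gpa, int level)
{
        assert(level >= 1 && level <= 4);
        return (gpa >> (SKETCH_PAGE_SHIFT +
                        SKETCH_BITS_PER_LEVEL * (level - 1))) & 0x1ff;
}

/* Bytes of guest-physical address space covered by one entry at `level`. */
static uint64_t level_span(int level)
{
        assert(level >= 1 && level <= 4);
        return 1ULL << (SKETCH_PAGE_SHIFT +
                        SKETCH_BITS_PER_LEVEL * (level - 1));
}
```

For the gpa 723ae000 from my log, level_index() gives 0x1ae at level 1 and 0 at level 4, so changing the level-4 entry affects far more than the one page I care about.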

The execution results, from kern.log, are:

xxxx kernel: [ 4371.002579] hypercall f002, a71000
xxxx kernel: [ 4371.002581] after changes return result is 4 , gpa: 723ae000 sptes: 16c7bd275 , 1304c7007 , 136d6f007 , 13cc88007

I find that when I write to this page, the write-protected permission
bit is set back to writable. I am not sure why.
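
For reference, this is how I read the low bits of the values in the log. In an EPT entry, bit 0 is read, bit 1 is write and bit 2 is execute, so the mask value 5 that I pass in means read + execute without write. The sketch below is plain user-space C, not kernel code; the macro and function names are my own for illustration and are not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Low three permission bits of an EPT entry (names are illustrative). */
#define SK_EPT_READ   (1ULL << 0)
#define SK_EPT_WRITE  (1ULL << 1)
#define SK_EPT_EXEC   (1ULL << 2)
#define SK_EPT_PERMS  (SK_EPT_READ | SK_EPT_WRITE | SK_EPT_EXEC)

/* True if the entry allows guest writes. */
static bool ept_writable(uint64_t spte)
{
        return (spte & SK_EPT_WRITE) != 0;
}

/* Same update as in hl_kvm_mmu_update_spte: clear the low 3 bits, OR in perm. */
static uint64_t set_perm(uint64_t spte, uint64_t perm)
{
        assert((perm & ~SK_EPT_PERMS) == 0);
        return (spte & ~SK_EPT_PERMS) | perm;
}
```

With this reading, the leaf value 16c7bd275 ends in 5 and is not writable, while the three upper-level values end in 7 and allow read, write and execute.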

Method 2:

int kvm_emulate_hypercall(struct kvm_vcpu *vcpu){
                   ................

case KVM_HC_HL_EPTPER :
                //// This method is not working
                localGpa = kvm_mmu_gva_to_gpa_read(vcpu, a0, &localEx);
                localGfn = gpa_to_gfn(localGpa);

                spin_lock(&vcpu->kvm->mmu_lock);
                hl_result = rmap_write_protect(vcpu->kvm, localGfn);
                /* Note: the label below is stale; the value printed is the result of rmap_write_protect() */
                printk("local gfn is %llx , result of kvm_age_hva is %d\n",
                       localGfn, hl_result);
                kvm_flush_remote_tlbs(vcpu->kvm);
                spin_unlock(&vcpu->kvm->mmu_lock);

                hl_result = kvm_mmu_get_spte_hierarchy(vcpu, localGpa, hl_sptes);
                printk("return result is %d , gpa: %llx sptes: %llx , %llx , %llx , %llx \n",
                       hl_result, localGpa, hl_sptes[0], hl_sptes[1], hl_sptes[2], hl_sptes[3]);
                 ...................
}

The execution results are:

xxxx kernel: [ 4044.020816] hypercall f002, 1201000
xxxx kernel: [ 4044.020819] local gfn is 70280 , result of kvm_age_hva is 1
xxxx kernel: [ 4044.020823] return result is 4 , gpa: 70280000 sptes: 13c2aa275 , 1304ff007 , 15eb3d007 , 15eb3e007
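
As a sanity check on the numbers above, the gfn is just the gpa with the page offset shifted away. Below is a user-space sketch of that arithmetic, assuming 4 KiB pages; the _sketch function names are made up here so they are not confused with the kernel's gpa_to_gfn().

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, so the page offset is the low 12 bits. */
#define SK_PAGE_SHIFT 12

/* Guest frame number that contains the given guest-physical address. */
static uint64_t gpa_to_gfn_sketch(uint64_t gpa)
{
        return gpa >> SK_PAGE_SHIFT;
}

/* Guest-physical address of the first byte of the given frame. */
static uint64_t gfn_to_gpa_sketch(uint64_t gfn)
{
        assert((gfn >> (64 - SK_PAGE_SHIFT)) == 0); /* shift must not overflow */
        return gfn << SK_PAGE_SHIFT;
}
```

This matches the log: gpa 70280000 gives gfn 70280, so the gfn passed to rmap_write_protect() does correspond to the page I am testing.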

My feeling is that I have to modify something else, not the spte alone.

Thanks for your help,

Felix