Re: DANGER WILL ROBINSON, DANGER

Paolo Bonzini <pbonzini@xxxxxxxxxx> · Wed, 2 Oct 2019 15:46:30 +0200

On 02/10/19 21:27, Jerome Glisse wrote:
> On Tue, Sep 10, 2019 at 07:49:51AM +0000, Mircea CIRJALIU - MELIU wrote:
>>> On 05/09/19 20:09, Jerome Glisse wrote:
>>>> Not sure I understand; are you saying that the solution I outlined
>>>> above does not work? If so, then I think you are wrong. In the above
>>>> solution the importing process mmaps a device file, and the resulting
>>>> vma is then populated using insert_pfn() and constantly kept
>>>> synchronized with the target process through mirroring, which means that
>>>> you never have to look at the struct page ... you can mirror any kind
>>>> of memory from the remote process.
>>>
>>> If insert_pfn in turn calls MMU notifiers for the target VMA (which would be
>>> the KVM MMU notifier), then that would work.  Though I guess it would be
>>> possible to call MMU notifier update callbacks around the call to insert_pfn.
>>
>> Can't do that.
>> First, insert_pfn() uses set_pte_at() which won't trigger the MMU notifier on
>> the target VMA. It's also static, so I'll have to access it thru vmf_insert_pfn()
>> or vmf_insert_mixed().
> 
> Why would you need an mmu notifier on the target vma?

If the mapping of the source VMA changes, mirroring can update the
target VMA via insert_pfn.  But what ensures that KVM's MMU notifier
dismantles its own existing page tables (so that they can be recreated
with the new mapping from the source VMA)?
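
To make the concern concrete, here is a toy userspace model in plain C.
It uses no kernel APIs, and every name in it (struct model, source_remap,
notify_target, ...) is made up for illustration. It only shows that a
secondary page table which cached a translation from the target VMA stays
stale after the target PTE is rewritten, unless something tells it to
drop that entry:

```c
#include <assert.h>
#include <stdbool.h>

#define NO_PFN (-1L)

/* Hypothetical three-level model; none of these are kernel structures. */
struct model {
    long source_pfn;    /* page backing the source VMA */
    long target_pte;    /* the mirror's PTE in the target VMA */
    long secondary_pfn; /* translation cached by a secondary MMU */
    bool notify_target; /* are target-side notifiers invoked on update? */
};

/* Secondary MMU fault: fill the cached entry from the target PTE. */
static void secondary_fault(struct model *m)
{
    if (m->secondary_pfn == NO_PFN)
        m->secondary_pfn = m->target_pte;
}

/* The source mapping changes and mirroring rewrites the target PTE. */
static void source_remap(struct model *m, long new_pfn)
{
    m->source_pfn = new_pfn;
    m->target_pte = new_pfn;          /* stands in for insert_pfn */
    if (m->notify_target)
        m->secondary_pfn = NO_PFN;    /* secondary MMU drops its entry */
}

/* True if the secondary MMU still points at a page the target no longer maps. */
static bool secondary_is_stale(const struct model *m)
{
    return m->secondary_pfn != NO_PFN && m->secondary_pfn != m->target_pte;
}
```

With notify_target == false the cached entry keeps pointing at the old
page after source_remap(); with notify_target == true the next
secondary_fault() picks up the new mapping.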

Thanks,

Paolo

> You do not need
> that. The workflow is:
> 
>     userspace:
>         ptr = mmap(/dev/kvm-mirroring-device, virtual_address_of_target)
> 
> Then when the mirroring process accesses ptr, it triggers a page fault that
> ends up in vm_operations_struct->fault(), which just does:
> 
>     kernel-kvm-mirroring-function:
>         kvm_mirror_page_fault(struct vm_fault *vmf) {
>             struct kvm_mirror_struct *kvmms;
> 
>             kvmms = kvm_mirror_struct_from_file(vmf->vma->vm_file);
>             ...
>         again:
>             hmm_range_register(&range);
>             hmm_range_snapshot(&range);
>             take_lock(kvmms->update);
>             /* insert only if the snapshot is still valid */
>             if (hmm_range_valid(&range)) {
>                 vm_insert_pfn();
>                 drop_lock(kvmms->update);
>                 hmm_range_unregister(&range);
>                 return VM_FAULT_NOPAGE;
>             }
>             /* the snapshot was invalidated in the meantime; retry */
>             drop_lock(kvmms->update);
>             goto again;
>         }
> 
> The notifier callback:
>         kvmms_notifier_start() {
>             take_lock(kvmms->update);
>             clear_pte(start, end);
>             drop_lock(kvmms->update);
>         }
> 
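
The snapshot / validate-under-lock / retry pattern in the fault handler
and notifier above can be tried out in userspace. This is only a sketch
under assumptions: a sequence counter stands in for hmm_range_valid(), a
pthread mutex for kvmms->update, and struct mirror, mirror_fault(),
mirror_invalidate() and the race hook are hypothetical names, not HMM or
KVM API.

```c
#include <assert.h>
#include <stddef.h>
#include <pthread.h>

/* Hypothetical userspace stand-in for the mirror state; not kernel API. */
struct mirror {
    pthread_mutex_t update;   /* plays the role of kvmms->update */
    unsigned long   seq;      /* bumped by every invalidation */
    long            source;   /* current "source mapping" */
    long            mirrored; /* "mirror PTE"; -1 means cleared */
};

/* Counterpart of the notifier callback: clear the mirror, bump seq. */
static void mirror_invalidate(struct mirror *m, long new_source)
{
    pthread_mutex_lock(&m->update);
    m->source = new_source;
    m->mirrored = -1;
    m->seq++;
    pthread_mutex_unlock(&m->update);
}

/* Example race: an invalidation that lands after the snapshot was taken. */
static void example_race(struct mirror *m)
{
    mirror_invalidate(m, 30);
}

/*
 * Counterpart of the fault handler: snapshot, then install the snapshot
 * only if no invalidation happened since. `race` is a hypothetical hook
 * that runs once between snapshot and validation. Returns the number of
 * attempts that were needed.
 */
static int mirror_fault(struct mirror *m, void (*race)(struct mirror *))
{
    int attempts = 0;

    for (;;) {
        unsigned long seq;
        long snap;

        attempts++;

        /* snapshot the source together with the sequence number */
        pthread_mutex_lock(&m->update);
        seq = m->seq;
        snap = m->source;
        pthread_mutex_unlock(&m->update);

        if (race) {
            race(m);
            race = NULL;
        }

        /* validate under the lock; install only if still current */
        pthread_mutex_lock(&m->update);
        if (seq == m->seq) {
            m->mirrored = snap;
            pthread_mutex_unlock(&m->update);
            return attempts;
        }
        pthread_mutex_unlock(&m->update);
        /* stale snapshot: loop and take a fresh one */
    }
}
```

Passing example_race as the hook forces exactly one retry, after which
the mirror holds the value written by the invalidation rather than the
stale snapshot.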
>>
>> Our model (the importing process is encapsulated in another VM) forces us
>> to mirror certain pages from the anon VMA backing one VM's system RAM to 
>> the other VM's anon VMA. 
> 
> The mirror does not have to be an anon vma; it can very well be a
> device vma, i.e. an mmap of a device file. I do not see any reason why
> the mirror needs to be an anon vma. Please explain why.
> 
>>
>> Using the functions above means setting VM_PFNMAP|VM_MIXEDMAP on 
>> the target anon VMA, but I guess this breaks the VMA. Is this recommended?
> 
> The mirror vma should not be an anon vma.
> 
>>
>> Then, mapping anon pages from one VMA to another without fixing the 
>> refcount and the mapcount breaks the daemons that think they're working 
>> on a pure anon VMA (kcompactd, khugepaged).
> 
> Note that here the target vma, i.e. the mirroring one, is an mmap of a
> device file and is thus skipped by all of the above (kcompactd,
> khugepaged, ...); it is fully ignored by core mm.
> 
> Thus you do not need to fix the refcount in any way. If any part of core
> mm tries to reclaim memory from the original vma, you will get mmu
> notifier callbacks, and all you have to do is clear the page table of your
> device vma.
> 
> I did exactly that in a tool in the past, and it works just fine with
> no change to core mm whatsoever.
> 
> Cheers,
> Jérôme
>