On Thu, Apr 17 2014 at 2:34:19 am BST, Mario Smarduch <m.smarduch@xxxxxxxxxxx> wrote:
> Add support for dirty bitmap management. Wanted to make it generic but function
> does a couple things different then the x86 version.
>
>
> Signed-off-by: Mario Smarduch <m.smarduch@xxxxxxxxxxx>
> ---
>  arch/arm/include/asm/kvm_host.h |  3 ++
>  arch/arm/kvm/arm.c              | 71 ++++++++++++++++++++++++++++++++++++++-
>  arch/arm/kvm/mmu.c              | 53 +++++++++++++++++++++++++++++
>  3 files changed, 126 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index 7ac1fdc..16ed4e4 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -230,5 +230,8 @@ int kvm_arm_timer_set_reg(struct kvm_vcpu *, u64 regid, u64 value);
>
>  void kvm_tlb_flush_vm(struct kvm *kvm);
>  void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot);
> +void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
> +		struct kvm_memory_slot *slot,
> +		gfn_t gfn_offset, unsigned long mask);
>
>  #endif /* __ARM_KVM_HOST_H__ */
> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> index 7714cc6..7882343 100644
> --- a/arch/arm/kvm/arm.c
> +++ b/arch/arm/kvm/arm.c
> @@ -785,9 +785,78 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>  	}
>  }
>
> +
> +/**
> + * kvm_mmu_slot_remove_access - retrieves the log of dirty pages for a memslot.
> + *	It's itteratively during migration to retrieve pages written since
> + *	last call. In the process write protects ptes that are dirty for next
> + *	time, holds the mmu_lock while write protecting dirty pages.
> + *
> + * @kvm:	The KVM pointer
> + * @log:	Bitmap of dirty pages return.
> + */

So let's play the difference game with x86:

>  int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
>  {
> -	return -EINVAL;
> +	int r;
> +	struct kvm_memory_slot *memslot;
> +	unsigned long n, i;
> +	unsigned long *dirty_bitmap;
> +	unsigned long *dirty_bitmap_buffer;
> +	bool is_dirty = false;
> +	gfn_t offset;

You've moved offset out of the loop.

> +	mutex_lock(&kvm->slots_lock);
> +	r = -EINVAL;
> +
> +	/* Return with error code will cause migration to abort, this happens
> +	 * when initial write protection of VM to manage dirty pages fails
> +	 */
> +	if (kvm->arch.migration_in_progress == -1)
> +		goto out;

You've added this test. How does x86 cope with the same case?

> +	if (log->slot >= KVM_USER_MEM_SLOTS)
> +		goto out;
> +
> +	memslot = id_to_memslot(kvm->memslots, log->slot);
> +	dirty_bitmap = memslot->dirty_bitmap;
> +
> +	r = -ENOENT;
> +	if (!dirty_bitmap)
> +		goto out;
> +
> +	n = kvm_dirty_bitmap_bytes(memslot);
> +	dirty_bitmap_buffer = dirty_bitmap + n / sizeof(long);
> +	memset(dirty_bitmap_buffer, 0, n);
> +
> +	spin_lock(&kvm->mmu_lock);
> +	for (i = 0; i < n / sizeof(long); i++) {
> +		unsigned long mask;
> +
> +		if (!dirty_bitmap[i])
> +			continue;
> +
> +		is_dirty = true;
> +		offset = i * BITS_PER_LONG;
> +		kvm_mmu_write_protect_pt_masked(kvm, memslot, offset,
> +						dirty_bitmap[i]);
> +		mask = dirty_bitmap[i];
> +		dirty_bitmap_buffer[i] = mask;
> +		dirty_bitmap[i] = 0;

You've expanded the xchg macro, and moved it around.

> +	}
> +
> +	if (is_dirty)
> +		kvm_tlb_flush_vm(kvm);

This can be easily abstracted to be a kvm_flush_remote_tlbs on x86, and a
HW broadcast on ARM.
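Something like this would be enough (only a sketch -- the
kvm_arch_flush_remote_tlbs name and the __weak split are made up here to
illustrate the idea):

/* virt/kvm/kvm_main.c: default implementation, fine for x86 */
void __weak kvm_arch_flush_remote_tlbs(struct kvm *kvm)
{
	/* IPI-based flush of every VCPU */
	kvm_flush_remote_tlbs(kvm);
}

/* arch/arm/kvm/arm.c: override, the invalidation is broadcast by the HW */
void kvm_arch_flush_remote_tlbs(struct kvm *kvm)
{
	kvm_tlb_flush_vm(kvm);
}

The shared dirty-log code then simply calls kvm_arch_flush_remote_tlbs()
and never has to know which flavour it is getting.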
> +
> +	spin_unlock(&kvm->mmu_lock);
> +	r = -EFAULT;
> +
> +	if (copy_to_user(log->dirty_bitmap, dirty_bitmap_buffer, n))
> +		goto out;
> +
> +	r = 0;
> +out:
> +	mutex_unlock(&kvm->slots_lock);
> +	return r;
>  }

So only two small differences that can be easily factored out. I stand by
my initial analysis, and ask again you move this function to the generic
code as a weak symbol. There isn't any value in duplicating it.

>  static int kvm_vm_ioctl_set_device_addr(struct kvm *kvm,
> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
> index b85ab56..47bec1c 100644
> --- a/arch/arm/kvm/mmu.c
> +++ b/arch/arm/kvm/mmu.c
> @@ -773,6 +773,59 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm, int slot)
>  	spin_unlock(&kvm->mmu_lock);
>  }
>
> +
> +/**
> + * kvm_mmu_write_protected_pt_masked - after migration thread write protects
> + *	the entire VM address space itterative calls are made to get diry pages
> + *	as the VM pages are being migrated. New dirty pages may be subset
> + *	of initial WPed VM or new writes faulted in. Here write protect new
> + *	dirty pages again in preparation of next dirty log read. This function is
> + *	called as a result KVM_GET_DIRTY_LOG ioctl, to determine what pages
> + *	need to be migrated.
> + *	'kvm->mmu_lock' must be held to protect against concurrent modification
> + *	of page tables (2nd stage fault, mmu modifiers, device writes)
> + *
> + * @kvm:	The KVM pointer
> + * @slot:	The memory slot the dirty log is retrieved for
> + * @gfn_offset:	The gfn offset in memory slot
> + * @mask:	The mask of dirty pages at offset 'gnf_offset in this memory
> + *		slot to be writ protect
> + */
> +
> +void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
> +		struct kvm_memory_slot *slot,
> +		gfn_t gfn_offset, unsigned long mask)
> +{
> +	phys_addr_t ipa;
> +	pgd_t *pgdp = kvm->arch.pgd, *pgd;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	pte_t *pte, new_pte;
> +
> +	/* walk set bits in the mask and write protect corresponding pages */
> +	while (mask) {
> +		ipa = (slot->base_gfn + gfn_offset + __ffs(mask)) << PAGE_SHIFT;
> +		pgd = pgdp + pgd_index(ipa);
> +		if (!pgd_present(*pgd))
> +			goto update_mask;

I think something is wrong in your logic. If there is no PGD, it means a
whole 1GB isn't present. Yet you're just clearing one bit from the mask
and doing it again. As you're only looking at BITS_PER_LONG contiguous
pages at a time, it is likely that the same thing will happen for the
other pages, and you're just wasting precious CPU cycles here.

> +		pud = pud_offset(pgd, ipa);
> +		if (!pud_present(*pud))
> +			goto update_mask;
> +		pmd = pmd_offset(pud, ipa);
> +		if (!pmd_present(*pmd))
> +			goto update_mask;
> +		pte = pte_offset_kernel(pmd, ipa);
> +		if (!pte_present(*pte))
> +			goto update_mask;
> +		if ((*pte & L_PTE_S2_RDWR) == L_PTE_S2_RDONLY)
> +			goto update_mask;
> +		new_pte = pfn_pte(pte_pfn(*pte), PAGE_S2);
> +		*pte = new_pte;

I'd like to see these two lines in a separate function (something like
"stage2_mark_pte_ro")...

> +update_mask:
> +		mask &= mask - 1;
> +	}
> +}
> +
>  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
>  			 struct kvm_memory_slot *memslot,
>  			 unsigned long fault_status)

--
Jazz is not dead. It just smells funny.
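Putting the two mmu.c comments above together, this is roughly the shape
I'd expect -- an untested sketch only: stage2_mark_pte_ro is just the
helper name suggested above, and the empty-PGD handling shown here is one
possible way of skipping the dead range, not the only one.

/* Write protect a single stage-2 pte (the two lines quoted above) */
static void stage2_mark_pte_ro(pte_t *pte)
{
	pte_t new_pte;

	new_pte = pfn_pte(pte_pfn(*pte), PAGE_S2);
	*pte = new_pte;
}

void kvm_mmu_write_protect_pt_masked(struct kvm *kvm,
		struct kvm_memory_slot *slot,
		gfn_t gfn_offset, unsigned long mask)
{
	phys_addr_t ipa;
	pgd_t *pgd;
	pud_t *pud;
	pmd_t *pmd;
	pte_t *pte;

	while (mask) {
		unsigned long idx;

		ipa = (slot->base_gfn + gfn_offset + __ffs(mask)) << PAGE_SHIFT;
		idx = pgd_index(ipa);

		pgd = kvm->arch.pgd + idx;
		if (!pgd_present(*pgd)) {
			/*
			 * Nothing is mapped in this 1GB range, so drop
			 * every remaining bit that falls under the same
			 * (empty) PGD entry instead of re-walking it up
			 * to BITS_PER_LONG times.
			 */
			do {
				mask &= mask - 1;
				if (!mask)
					break;
				ipa = (slot->base_gfn + gfn_offset +
				       __ffs(mask)) << PAGE_SHIFT;
			} while (pgd_index(ipa) == idx);
			continue;
		}

		pud = pud_offset(pgd, ipa);
		if (!pud_present(*pud))
			goto next_bit;
		pmd = pmd_offset(pud, ipa);
		if (!pmd_present(*pmd))
			goto next_bit;

		pte = pte_offset_kernel(pmd, ipa);
		if (pte_present(*pte) &&
		    (*pte & L_PTE_S2_RDWR) != L_PTE_S2_RDONLY)
			stage2_mark_pte_ro(pte);

next_bit:
		mask &= mask - 1;
	}
}

The bits sitting under a missing PUD or PMD could be batched the same way
if it turns out to matter; the empty PGD is where most of the cycles go.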