Re: [Qemu-devel] [RFC PATCH 00/20] Kemari for KVM v0.1

Dor Laor wrote:
On 04/22/2010 04:16 PM, Yoshiaki Tamura wrote:
2010/4/22 Dor Laor<dlaor@xxxxxxxxxx>:
On 04/22/2010 01:35 PM, Yoshiaki Tamura wrote:

Dor Laor wrote:

On 04/21/2010 08:57 AM, Yoshiaki Tamura wrote:

Hi all,

We have been implementing the prototype of Kemari for KVM, and we're
sending this message to share what we have now and the TODO lists. We
hope to get early feedback to keep us in the right direction. Although
the advanced approaches in the TODO lists are fascinating, we would
like to run this project step by step while absorbing comments from
the community. The current code is based on qemu-kvm.git
2b644fd0e737407133c88054ba498e772ce01f27.

For those who are new to Kemari for KVM, please take a look at the
following RFC which we posted last year.

http://www.mail-archive.com/kvm@xxxxxxxxxxxxxxx/msg25022.html

The transmission/transaction protocol and most of the control logic
are implemented in QEMU. However, we needed a hack in KVM to prevent
rip from proceeding before synchronizing VMs. It may also need some
plumbing on the kernel side to guarantee replayability of certain
events and instructions, to integrate the RAS capabilities of newer
x86 hardware with the HA stack, and for optimization purposes, for
example.

[snip]


The rest of this message describes TODO lists grouped by each topic.

=== event tapping ===

Event tapping is the core component of Kemari, and it decides on which
event the primary should synchronize with the secondary. The basic
assumption here is that outgoing I/O operations are idempotent, which
is usually true for disk I/O and reliable network protocols such as
TCP.

IMO any type of network event should be stalled too. What if the VM
runs a non-TCP protocol, and the packet that the master node sent
reached some remote client before the sync to the slave, and then the
master failed?

In the current implementation, it actually stalls any type of network
traffic that goes through virtio-net.

However, if the application is using unreliable protocols, it should
have its own recovery mechanism, or it should be completely
stateless.

Why do you treat TCP differently? You can damage the entire VM this
way - think of a DHCP request that was dropped at the moment you
switched between the master and the slave.

I'm not trying to say that we should treat TCP differently, just that
it's more severe there.
In the case of a DHCP request, the client would have a chance to retry
after failover, correct?

But until it times out it won't have networking.

BTW, in the current implementation, it synchronizes before the DHCP
ack is sent.
But in the case of TCP, once you send an ack to the client before the
sync, there is no way to recover.

What if the guest is running a DHCP server? If we provide an IP to a
client and then fail over to the secondary, it will run without
knowing the master allocated this IP.

That's problematic. So it needs to sync when the DHCP ack is sent.

I should apologize for my misunderstanding and explanation. I agree that we should stall every type of network output.
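
To make the idea a bit more concrete, below is a minimal, self-contained
sketch of the event tap for outgoing packets (all names are made up for
illustration; this is not the actual patch): the device model hands
packets to the tap instead of the backend, and they are released only
after the secondary has acknowledged the checkpoint.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* One queued outgoing packet. */
struct tapped_packet {
    uint8_t *data;
    size_t len;
    struct tapped_packet *next;
};

static struct tapped_packet *pending_head, *pending_tail;

/* Stand-ins for the real plumbing: send a checkpoint and block until
 * the secondary acks it; hand a packet to the real backend. */
bool kemari_checkpoint_and_wait_ack(void);
void backend_send_packet(const uint8_t *data, size_t len);

/* Called by the emulated NIC instead of sending directly. */
void kemari_tap_outgoing_packet(const uint8_t *data, size_t len)
{
    struct tapped_packet *pkt = malloc(sizeof(*pkt));

    pkt->data = malloc(len);
    memcpy(pkt->data, data, len);
    pkt->len = len;
    pkt->next = NULL;
    if (pending_tail)
        pending_tail->next = pkt;
    else
        pending_head = pkt;
    pending_tail = pkt;
}

/* Called at every synchronization point. */
void kemari_flush_pending(void)
{
    struct tapped_packet *p, *next;

    if (!kemari_checkpoint_and_wait_ack())
        return;                        /* failover path not shown */

    for (p = pending_head; p; p = next) {
        next = p->next;
        backend_send_packet(p->data, p->len);
        free(p->data);
        free(p);
    }
    pending_head = pending_tail = NULL;
}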



[snip]


=== clock ===

Since synchronizing the virtual machines every time the TSC is
accessed would be prohibitive, the transmission of the TSC will be
done lazily, which means delaying it until a non-TSC synchronization
point arrives.

Why do you specifically care about the TSC sync? When you sync the
whole I/O model on a snapshot it also synchronizes the TSC.

So, do you agree that an extra clock synchronization is not needed
since it
is done anyway as part of the live migration state sync?

I agree that it's sent as part of the live migration.
What I wanted to say here is that this is not something for real-time
applications.
I usually get questions like whether this can guarantee fault
tolerance for real-time applications.

First, the huge cost of snapshots won't suit any real-time app.

I see.

Second, even if that weren't the case, the TSC delta and kvmclock are
synchronized as part of the VM state, so there is no point in trapping
it in the middle.

I should study the clock in KVM, but won't the TSC get updated by the
hardware after migration?
I was wondering about the following case, for example:

1. The application on the guest calls rdtsc on host A.
2. The application uses rdtsc value for something.
3. Failover to host B.
4. The application on the guest replays the rdtsc call on host B.
5. If the rdtsc value is different between A and B, the application may get into trouble because of it.

If I were wrong, my apologies.
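
If the offset trick below is what actually happens, then my concern
above may be moot; I'm writing it down mainly to check my
understanding: the guest never sees the raw host TSC but host TSC plus
a per-VCPU offset, and that offset is recomputed on the destination
from the guest TSC value carried in the snapshot, so the guest TSC
stays monotonic across failover even though the raw host TSCs differ.
A tiny illustrative sketch (not kernel code, just the arithmetic):

#include <stdint.h>

/* Raw host TSC. */
static inline uint64_t rdtsc_host(void)
{
    uint32_t lo, hi;

    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Recompute the offset on the receiver when the snapshot is resumed:
 * guest_tsc = host_tsc + offset  =>  offset = guest_tsc - host_tsc.
 * The hypervisor programs this offset (TSC_OFFSET in the VMCS on VMX,
 * tsc_offset in the VMCB control area on SVM) before entering the
 * guest. */
uint64_t compute_tsc_offset(uint64_t guest_tsc_from_snapshot)
{
    return guest_tsc_from_snapshot - rdtsc_host();
}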



In general, can you please explain the 'algorithm' for continuous
snapshots (is that what you'd like to do?):

Yes, of course.
Sorry for being less informative.

A trivial one would be to:
- do X online snapshots/sec

I don't have good numbers that I can share right now.
Snapshots/sec depends on what kind of workload is running; if the
guest is almost idle, there will be no snapshots for 5 sec. On the
other hand, if the guest is running I/O-intensive workloads (netperf
or iozone, for example), there will be about 50 snapshots/sec.

- Stall all I/O (disk/block) from the guest to the outside world until
the previous snapshot reaches the slave.

Yes, it does.

- Snapshots are made of

Full device model + diff of dirty pages from the last snapshot.

- diff of dirty pages from last snapshot

This also depends on the workload.
In the case of I/O-intensive workloads, dirty pages are usually fewer
than 100.

The hardest would be memory-intensive loads.
So 100 snap/sec means a latency of 10 msec, right?
(Not that it's not OK; with faster hardware and IB you'll be able to
get much more.)

Doesn't 100 snap/sec mean the snapshot interval is 10 msec?
IIUC, to get the latency, you need to add the time to transfer the VM
state and the time to get a response from the receiver.

It's hard to say which load is the hardest.
A memory-intensive load, which doesn't generate I/O often, will suffer
a long sync time at that moment, but has a chance to continue
processing until the sync.
An I/O-intensive load, which doesn't dirty many pages, will suffer
from getting the VCPU stopped often, but its sync time is relatively
shorter.
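
As a rough back-of-the-envelope illustration with the numbers in this
thread (an assumption, not a measurement): ~100 dirty 4KB pages plus
~13KB of device state is on the order of 400-500KB per checkpoint;
over a 1Gbps link that is roughly 3-4 msec of transfer time, plus one
network round trip for the ack. A few msec per checkpoint is
consistent with the ~50 snapshots/sec we see for I/O-intensive
workloads.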

- Qemu device model (+kvm's) diff from last.

We're currently sending a full copy because we're completely reusing
this part of the existing live migration framework.

Last time we measured, it was about 13KB.
But it varies with the QEMU version used.

You can do 'light' snapshots in between to send dirty pages to reduce
snapshot time.

I agree. That's one of the advanced topics we would like to try too.

I wrote the above to serve as a reference for your comments so it will
map into my mind. Thanks, dor

Thank you for the guidance.
I hope this answers your questions.

At the same time, I would also be happy if we could discuss how to
implement this, too. In fact, we needed a hack to prevent rip from
proceeding in KVM, which turned out not to be the best workaround.

There are brute-force solutions like
- stop the guest until you send all of the snapshot to the remote
(like standard live migration)

We've implemented this way so far.

- Stop + fork + cont the father

Or mark the recent dirty pages that were not sent to the remote as
write-protected and copy them if touched.

I think I had that suggestion from Avi before.
And yes, it's very fascinating.
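
To spell the idea out, here is a self-contained userspace analogy
(mprotect plus a fault handler; the real thing would live in KVM's
dirty tracking, and the helper name below is made up): pages that are
dirty but not yet sent are made read-only, and if the guest touches
one before it has been transmitted, the old contents are copied aside
for the in-flight checkpoint and writes are re-enabled.

#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE_SZ 4096UL

/* Stand-in: stash the pre-touch copy so the in-flight checkpoint can
 * still send the old contents of the page. */
void save_copy_for_checkpoint(void *page, const void *copy);

/* Pages that are dirty but not yet sent are made read-only. */
void protect_unsent_page(void *page)
{
    mprotect(page, PAGE_SZ, PROT_READ);
}

/* Copy the page aside and re-enable writes so the guest can keep
 * running.  (malloc/mprotect in a signal handler is not strictly
 * async-signal-safe; a real implementation would use a preallocated
 * pool.) */
static void wp_fault_handler(int sig, siginfo_t *si, void *uctx)
{
    void *page = (void *)((uintptr_t)si->si_addr & ~(PAGE_SZ - 1));
    void *copy = malloc(PAGE_SZ);

    memcpy(copy, page, PAGE_SZ);
    save_copy_for_checkpoint(page, copy);
    mprotect(page, PAGE_SZ, PROT_READ | PROT_WRITE);
    (void)sig; (void)uctx;
}

void install_wp_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = wp_fault_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);
}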

Meanwhile, if you look at the diffstat, it needed to touch many parts
of QEMU.
Before going into further implementation, I wanted to check that I'm
on the right track with this project.


Thanks,

Yoshi



TODO:
- Synchronization of clock sources (need to intercept TSC reads, etc).

=== usability ===

These are items that define how users interact with Kemari.

TODO:
- Kemarid daemon that takes care of the cluster management/monitoring
side of things.
- Some device emulators might need minor modifications to work well
with Kemari. Use white(black)-listing to take the burden of
choosing the right device model off the users.

=== optimizations ===

Although the big picture can be realized by completing the TODO list
above, we need some optimizations/enhancements to make Kemari useful
in the real world, and these are the items that need to be done for
that.

TODO:
- SMP (for the sake of performance might need to implement a
synchronization protocol that can maintain two or more
synchronization points active at any given moment)
- VGA (leverage VNC's subtiling mechanism to identify fb pages that
are really dirty; a rough illustration follows below).
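
A generic illustration of the subtiling idea above (this is not the
actual VNC code; just hashing fixed-size tiles and comparing against
the hashes recorded at the last checkpoint, so only tiles that really
changed need to go into the next snapshot):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TILE_W 64
#define TILE_H 64

/* Cheap FNV-1a hash of one tile of the framebuffer. */
static uint64_t hash_tile(const uint8_t *fb, size_t stride,
                          size_t x, size_t y, size_t bpp)
{
    uint64_t h = 1469598103934665603ULL;
    size_t row, i;

    for (row = 0; row < TILE_H; row++) {
        const uint8_t *p = fb + (y + row) * stride + x * bpp;
        for (i = 0; i < TILE_W * bpp; i++) {
            h ^= p[i];
            h *= 1099511628211ULL;
        }
    }
    return h;
}

/* Mark the tiles whose hash changed since the last checkpoint and
 * return how many there are. */
size_t find_dirty_tiles(const uint8_t *fb, size_t width, size_t height,
                        size_t stride, size_t bpp,
                        uint64_t *prev_hashes, bool *dirty)
{
    size_t n = 0, idx = 0, x, y;

    for (y = 0; y + TILE_H <= height; y += TILE_H) {
        for (x = 0; x + TILE_W <= width; x += TILE_W, idx++) {
            uint64_t h = hash_tile(fb, stride, x, y, bpp);

            dirty[idx] = (h != prev_hashes[idx]);
            if (dirty[idx]) {
                prev_hashes[idx] = h;
                n++;
            }
        }
    }
    return n;
}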


Any comments/suggestions would be greatly appreciated.

Thanks,

Yoshi

--

Kemari starts synchronizing VMs when QEMU handles I/O requests.
Without this patch the VCPU state has already been advanced before the
synchronization, and after failover to the VM on the receiver it hangs
because of this.

Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@xxxxxxxxxxxxx>
---
 arch/x86/include/asm/kvm_host.h |    1 +
 arch/x86/kvm/svm.c              |   11 ++++++++---
 arch/x86/kvm/vmx.c              |   11 ++++++++---
 arch/x86/kvm/x86.c              |    4 ++++
 4 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 26c629a..7b8f514 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -227,6 +227,7 @@ struct kvm_pio_request {
 	int in;
 	int port;
 	int size;
+	bool lazy_skip;
 };
 
 /*
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index d04c7ad..e373245 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -1495,7 +1495,7 @@ static int io_interception(struct vcpu_svm *svm)
 {
 	struct kvm_vcpu *vcpu = &svm->vcpu;
 	u32 io_info = svm->vmcb->control.exit_info_1; /* address size bug? */
-	int size, in, string;
+	int size, in, string, ret;
 	unsigned port;
 
 	++svm->vcpu.stat.io_exits;
@@ -1507,9 +1507,14 @@ static int io_interception(struct vcpu_svm *svm)
 	port = io_info >> 16;
 	size = (io_info & SVM_IOIO_SIZE_MASK) >> SVM_IOIO_SIZE_SHIFT;
 	svm->next_rip = svm->vmcb->control.exit_info_2;
-	skip_emulated_instruction(&svm->vcpu);
 
-	return kvm_fast_pio_out(vcpu, size, port);
+	ret = kvm_fast_pio_out(vcpu, size, port);
+	if (ret)
+		skip_emulated_instruction(&svm->vcpu);
+	else
+		vcpu->arch.pio.lazy_skip = true;
+
+	return ret;
 }
 
 static int nmi_interception(struct vcpu_svm *svm)
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 41e63bb..09052d6 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2975,7 +2975,7 @@ static int handle_triple_fault(struct kvm_vcpu *vcpu)
 static int handle_io(struct kvm_vcpu *vcpu)
 {
 	unsigned long exit_qualification;
-	int size, in, string;
+	int size, in, string, ret;
 	unsigned port;
 
 	exit_qualification = vmcs_readl(EXIT_QUALIFICATION);
@@ -2989,9 +2989,14 @@ static int handle_io(struct kvm_vcpu *vcpu)
 
 	port = exit_qualification >> 16;
 	size = (exit_qualification & 7) + 1;
-	skip_emulated_instruction(vcpu);
 
-	return kvm_fast_pio_out(vcpu, size, port);
+	ret = kvm_fast_pio_out(vcpu, size, port);
+	if (ret)
+		skip_emulated_instruction(vcpu);
+	else
+		vcpu->arch.pio.lazy_skip = true;
+
+	return ret;
 }
 
 static void
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fd5c3d3..cc308d2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4544,6 +4544,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 	if (!irqchip_in_kernel(vcpu->kvm))
 		kvm_set_cr8(vcpu, kvm_run->cr8);
 
+	if (vcpu->arch.pio.lazy_skip)
+		kvm_x86_ops->skip_emulated_instruction(vcpu);
+	vcpu->arch.pio.lazy_skip = false;
+
 	if (vcpu->arch.pio.count || vcpu->mmio_needed ||
 	    vcpu->arch.emulate_ctxt.restart) {
 		if (vcpu->mmio_needed) {