Re: [Qemu-devel] [RFC] COLO HA Project proposal

"Dr. David Alan Gilbert" <dgilbert@xxxxxxxxxx> · Fri, 4 Jul 2014 09:35:46 +0100

* Dong, Eddie (eddie.dong@xxxxxxxxx) wrote:
> > >
> > > I didn't quite understand a couple of things though, perhaps you can
> > > explain:
> > >    1) If we ignore the TCP sequence number problem, in an SMP machine
> > > don't we get other randomnesses - e.g. which core completes something
> > > first, or who wins a lock contention, so the output stream might not
> > > be identical - so do those normal bits of randomness cause the
> > > machines to flag as out-of-sync?
> > 
> > It's about COLO agent, CCing Congyang, he can give the detailed
> > explanation.
> > 
> 
> Let me clarify this issue. COLO doesn't ignore the TCP sequence number; it uses a
> new implementation to make the sequence number identical, on a best-effort basis,
> between the primary VM (PVM) and secondary VM (SVM). Likely, the VMM has to synchronize
> the emulation of the random number generation mechanism between the
> PVM and SVM, as the lock-stepping mechanism does.
> 
> Furthermore, for long TCP connections, we can rely on the (on-demand) VM checkpoint to get
> identical sequence numbers in both the PVM and SVM.

That wasn't really my question; I was worrying about other forms of randomness,
such as winners of lock contention, and other SMP non-determinisms,
and I'm also worried by what proportion of time the system can't recover
from a failure due to being unable to distinguish an SVM failure from
a randomness issue.
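
To make the concern concrete, here is a toy Python sketch. It is not the
COLO agent and models no real QEMU interface; run_replica and
compare_outputs are hypothetical names, and the seed merely stands in for
"whichever vCPU happens to win each lock contention". Both replicas get
the same requests and both are healthy, yet their output streams can
differ in order:

```python
import random


def run_replica(requests, schedule_seed):
    """Simulate two vCPUs emitting responses through one shared output lock.

    schedule_seed is an assumption of this sketch: it stands in for the
    hardware timing that decides who wins each contention.
    """
    rng = random.Random(schedule_seed)
    # Split the work between two vCPUs.
    queues = [list(requests[0::2]), list(requests[1::2])]
    output = []
    while any(queues):
        ready = [q for q in queues if q]
        winner = rng.choice(ready)  # the non-deterministic lock winner
        output.append(winner.pop(0))
    return output


def compare_outputs(pvm_out, svm_out):
    """Return the index of the first differing packet, or None if equal."""
    for i, (a, b) in enumerate(zip(pvm_out, svm_out)):
        if a != b:
            return i
    if len(pvm_out) != len(svm_out):
        return min(len(pvm_out), len(svm_out))
    return None


if __name__ == "__main__":
    reqs = ["r0", "r1", "r2", "r3", "r4", "r5"]
    pvm = run_replica(reqs, schedule_seed=1)
    svm = run_replica(reqs, schedule_seed=2)
    print("PVM:", pvm)
    print("SVM:", svm)
    print("first mismatch at:", compare_outputs(pvm, svm))
```

A byte-for-byte comparison of the two streams cannot, on its own, tell
this kind of benign reordering apart from a genuinely broken SVM.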

Dave

> 
> 
> Thanks, Eddie
--
Dr. David Alan Gilbert / dgilbert@xxxxxxxxxx / Manchester, UK