----- "Lucas Meneghel Rodrigues" <lmr@xxxxxxxxxx> wrote: > Hi Michael, I am reviewing your patchset and have just a minor remark > to make here: > > On Wed, Oct 7, 2009 at 2:54 PM, Michael Goldish <mgoldish@xxxxxxxxxx> > wrote: > > This patch adds a new test that checks the timedrift introduced by > migrations. > > It uses the same parameters used by the timedrift test to get the > guest time. > > In addition, the number of migrations the test performs is > controlled by the > > parameter 'migration_iterations'. > > > > Signed-off-by: Michael Goldish <mgoldish@xxxxxxxxxx> > > --- > > client/tests/kvm/kvm_tests.cfg.sample | 33 ++++--- > > client/tests/kvm/tests/timedrift_with_migration.py | 95 > ++++++++++++++++++++ > > 2 files changed, 115 insertions(+), 13 deletions(-) > > create mode 100644 > client/tests/kvm/tests/timedrift_with_migration.py > > > > diff --git a/client/tests/kvm/kvm_tests.cfg.sample > b/client/tests/kvm/kvm_tests.cfg.sample > > index 540d0a2..618c21e 100644 > > --- a/client/tests/kvm/kvm_tests.cfg.sample > > +++ b/client/tests/kvm/kvm_tests.cfg.sample > > @@ -100,19 +100,26 @@ variants: > > type = linux_s3 > > > > - timedrift: install setup > > - type = timedrift > > extra_params += " -rtc-td-hack" > > - # Pin the VM and host load to CPU #0 > > - cpu_mask = 0x1 > > - # Set the load and rest durations > > - load_duration = 20 > > - rest_duration = 20 > > - # Fail if the drift after load is higher than 50% > > - drift_threshold = 50 > > - # Fail if the drift after the rest period is higher than > 10% > > - drift_threshold_after_rest = 10 > > - # For now, make sure this test is executed alone > > - used_cpus = 100 > > + variants: > > + - with_load: > > + type = timedrift > > + # Pin the VM and host load to CPU #0 > > + cpu_mask = 0x1 > > + # Set the load and rest durations > > + load_duration = 20 > > + rest_duration = 20 > > Even the default duration here seems way too brief here, is there any > reason why 20s was chosen instead of, let's say, 1800s? I am under > the > impression that 20s of load won't be enough to cause any noticeable > drift... Apparently I've been working with a bad qemu version for quite a while, because after 20s of load I often get a huge (80%) drift. This normally shouldn't happen. We might want to wait a little longer than 20s, but there's no need to wait as long as 1800s AFAIK. The test is meant to catch drift problems, and apparently when there's a problem, it reveals itself quickly. I'm not sure there are drift problems that reveal themselves after only 1800s of load, but quite frankly, I know very little about this, so the timeout value should be changed by the user. > > + # Fail if the drift after load is higher than 50% > > + drift_threshold = 50 > > + # Fail if the drift after the rest period is higher > than 10% > > + drift_threshold_after_rest = 10 > > I am also curious about those tresholds and the reasoning behind > them. > Is there any official agreement on what we consider to be an > unreasonable drift? I really don't know. After asking around I got the impression that the threshold should depend on the load. Theoretically it should be possible to get any amount of drift with enough load -- that's the way I see it, but I'm really not sure. Maybe the best thing for us to do is run the time drift test a few times with functional qemu versions as well as with broken ones (but not as broken as the one I'm using), so we can see what thresholds best differentiate between functional and broken code. 
> Another thing that struck me was the drift calculation: on the original
> timedrift test, the guest drift is normalized against the host drift:
>
> drift = 100.0 * (host_delta - guest_delta) / host_delta
>
> While in the new drift tests, we consider only the guest drift. I
> believe it is better to normalize all tests based on one drift
> calculation criterion; those values should be reviewed, and at least a
> certain level of agreement should be reached in our development
> community.

The new tests use the host clock as reference, like the original test.
We check how much time passed on the host, and then how much time passed
in the guest, and take the difference between those two. The result is
an absolute number of seconds, because there's no "load duration" with
which to normalize the result. We're just interested in how much drift
is caused by a single reboot or a single migration (or a few). I don't
think it matters how long the reboot/migration procedure took -- we
treat it as a single discrete event.

> Other than this concern that came to my mind, the new tests look good
> and work fine here. I had to do a slight rebase in one of the patches,
> very minor stuff. The default values and the drift calculation can be
> changed at a later time.

Thanks!

Since the two new tests are similar, it may be a good idea to merge them
into one and put just the differing code in conditional blocks, e.g.

(common code)
if op == "migration":
    (migration code)
elif op == "reboot":
    (reboot code)
(common code)

And where we use logging.info() we can do something like:

logging.info("Drift after %d %ss: %s seconds" % (iterations, op, drift))

so for reboot we get "Drift after 5 reboots: 3 seconds".

If you think this is a good idea, we can do it in a different patch, or
we can do it now. I'm not even sure it's necessary, because the tests
are rather simple as they are.

Another question is what to call the merged test -- timedrift with what?

Thanks,
Michael
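For illustration, the merged test Michael proposes above might look
roughly like this; the helper callables and parameter names below are
placeholders rather than the actual kvm test utilities, and the drift is
the absolute number of seconds described in the reply above:

    # Sketch only: the helpers passed in here are placeholders, not the
    # real kvm_utils/kvm_test_utils API.
    import logging

    def run_timedrift_with_op(get_host_time, get_guest_time,
                              do_migrate, do_reboot, op, iterations):
        # (common code) record the starting time on the host and in the guest
        host_start, guest_start = get_host_time(), get_guest_time()
        for _ in range(iterations):
            if op == "migration":
                do_migrate()    # (migration code)
            elif op == "reboot":
                do_reboot()     # (reboot code)
        # (common code) drift is an absolute number of seconds, since there
        # is no load duration with which to normalize it
        host_delta = get_host_time() - host_start
        guest_delta = get_guest_time() - guest_start
        drift = abs(host_delta - guest_delta)
        logging.info("Drift after %d %ss: %s seconds" % (iterations, op, drift))
        return drift

Only the marked migration/reboot lines differ between the two existing
tests; everything else is the common code referred to above.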