On Fri, Nov 11, 2011 at 01:03:20PM +0000, Daniel P. Berrange wrote:
> Libvirt recently introduced a change to the way it does 'save to file'
> with QEMU. Historically QEMU has a 32MB/s I/O limit on migration by
> default. When saving to file, we didn't want any artificial limit,
> but rather to max out the underlying storage. So when doing save to
> file, we set a large bandwidth limit (INT64_MAX / (1024 * 1024)) so
> it is effectively unlimited.
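
For reference, that works out to 8796093022207 MiB/s. A minimal C
sketch of the arithmetic (my reading is that libvirt tracks bandwidth
in MiB/s while QEMU's monitor takes bytes/s, hence the division; that
detail is an assumption, not stated above):

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
      /* The "unlimited" cap above, in MiB/s; multiplying back by
       * 1024*1024 gives ~INT64_MAX bytes/s, i.e. no effective limit */
      int64_t mib_per_sec = INT64_MAX / (1024 * 1024);
      printf("%lld MiB/s\n", (long long)mib_per_sec); /* 8796093022207 */
      return 0;
  }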
>
> After doing this, we discovered that the QEMU monitor was becoming
> entirely blocked. It did not even return from the 'migrate' command
> until migration was complete despite the 'detach' flag being set.
> This was a bug in libvirt, because we passed a plain file descriptor
> which does not support EAGAIN. Thank you POSIX.
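
The underlying gotcha: POSIX gives O_NONBLOCK no meaning for regular
files, so in practice a write() to a plain file descriptor simply
blocks and never returns EAGAIN. A minimal sketch demonstrating this
(filename hypothetical):

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      /* O_NONBLOCK is accepted but has no effect on a regular file:
       * write() blocks until the data is accepted, never EAGAIN */
      int fd = open("save.img", O_WRONLY | O_CREAT | O_NONBLOCK, 0644);
      if (fd < 0) {
          perror("open");
          return 1;
      }
      char buf[4096] = { 0 };
      ssize_t n = write(fd, buf, sizeof(buf)); /* blocks, no EAGAIN */
      printf("wrote %zd bytes\n", n);
      close(fd);
      return 0;
  }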
>
> Libvirt has another mode where it uses an I/O helper command to get
> O_DIRECT, and in this mode we pass a pipe() FD to QEMU. After ensuring
> that this pipe FD really does have O_NONBLOCK set, we still saw some
> odd behaviour.
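
For completeness, the kind of sanity check meant here is a one-liner
with fcntl(); a minimal sketch (function name hypothetical):

  #include <fcntl.h>
  #include <stdbool.h>

  /* Return true if 'fd' currently has O_NONBLOCK set */
  static bool fd_is_nonblocking(int fd)
  {
      int flags = fcntl(fd, F_GETFL);
      return flags >= 0 && (flags & O_NONBLOCK);
  }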
>
> I'm not sure whether what I describe can necessarily be called a QEMU
> bug, but I wanted to raise it for discussion anyway....
>
> The sequence of steps is
>
> - libvirt sets qemu migration bandwidth to "unlimited"
> - libvirt opens a pipe() and sets O_NONBLOCK on the write end
>   (see the sketch after this list)
> - libvirt spawns libvirt-iohelper giving it the target file
> on disk, and the read end of the pipe
> - libvirt does 'getfd migfile' monitor command to give QEMU
> the write end of the pipe
> - libvirt does 'migrate fd:migfile -d' to run migration
> - In parallel
> - QEMU is writing to the pipe (which is non-blocking)
> - libvirt-iohelper reading the pipe & writing to disk with O_DIRECT
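
To make steps 2-4 concrete, a minimal sketch of the pipe wiring (not
the actual libvirt code; error handling is trimmed and the SCM_RIGHTS
handoff of the write end over the monitor socket is omitted):

  #include <fcntl.h>
  #include <unistd.h>

  /* Create the migration pipe: read end for the I/O helper child,
   * non-blocking write end destined for QEMU via 'getfd migfile' */
  static int setup_migration_pipe(int *read_fd, int *write_fd)
  {
      int fds[2];
      if (pipe(fds) < 0)
          return -1;

      /* The write end must be non-blocking so QEMU sees EAGAIN
       * instead of stalling its monitor */
      int flags = fcntl(fds[1], F_GETFL);
      if (flags < 0 || fcntl(fds[1], F_SETFL, flags | O_NONBLOCK) < 0) {
          close(fds[0]);
          close(fds[1]);
          return -1;
      }

      *read_fd = fds[0];   /* inherited by libvirt-iohelper */
      *write_fd = fds[1];  /* passed to QEMU with SCM_RIGHTS */
      return 0;
  }
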
I should have mentioned that the way I'm testing this is with
libvirt 0.9.7, with both QEMU 0.14 and QEMU GIT master, using
a guest with 2 GB of RAM:
$ virsh start l3
Domain l3 started
$ virsh dominfo l3
Id:             17
Name:           l3
UUID:           c7a3edbd-edaf-9455-926a-d65c16db1803
OS Type:        hvm
State:          running
CPU(s):         1
CPU time:       1.1s
Max memory:     2292000 kB
Used memory:    2292736 kB
Persistent:     yes
Autostart:      disable
Managed save:   no
Security model: selinux
Security DOI:   0
Security label: system_u:system_r:unconfined_t:s0:c94,c700 (permissive)
To actually perform the save-to-file, I use the '--bypass-cache' flag
for libvirt, which ensures we pass a pipe to QEMU and run our I/O
helper for O_DIRECT, instead of directly giving QEMU a plain file:
$ virsh save --bypass-cache l3 l3.image
Domain l3 saved to l3.image
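
The reason a separate helper is involved at all: O_DIRECT requires
buffers, offsets and sizes to be suitably aligned, which the migration
stream coming out of the pipe does not guarantee. A minimal sketch of
the helper's copy loop (names hypothetical; assumes 4 KiB alignment
suffices and glosses over the short final block, which a real helper
has to pad):

  #define _GNU_SOURCE            /* for O_DIRECT on Linux */
  #include <fcntl.h>
  #include <stdlib.h>
  #include <unistd.h>

  /* Drain 'pipe_fd' into 'path' using O_DIRECT writes */
  static int copy_to_odirect(int pipe_fd, const char *path)
  {
      int out = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);
      if (out < 0)
          return -1;

      void *buf;
      if (posix_memalign(&buf, 4096, 64 * 1024) != 0) {
          close(out);
          return -1;
      }

      ssize_t n;
      while ((n = read(pipe_fd, buf, 64 * 1024)) > 0) {
          /* O_DIRECT rejects unaligned lengths with EINVAL; a real
           * implementation must pad or fall back for the tail */
          if (write(out, buf, n) != n)
              break;
      }

      free(buf);
      close(out);
      return n == 0 ? 0 : -1;
  }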
> - Most of the qemu_savevm_state_iterate() calls complete in 10-20 ms
>
> - Reasonably often a qemu_savevm_state_iterate() call takes 300-400 ms
>
> - Fairly rarely a qemu_savevm_state_iterate() call takes 10-20 *seconds*
I use the systemtap script attached below to measure these timings,
e.g. run it before starting the migration to disk:
# stap qemu-mig.stp
Begin
0.000 Start
5.198 > Begin
5.220 < Begin 0.022
5.220 > Iterate
5.224 < Iterate 0.004
...snip...
6.299 > Iterate
6.314 < Iterate 0.015
6.314 > Iterate
6.319 < Iterate 0.005
6.409 > Iterate
8.139 < Iterate 1.730 <<< very slow iteration
8.152 > Iterate
13.078 < Iterate 4.926 <<< very slow iteration
13.963 > Iterate
14.248 < Iterate 0.285
14.441 > Iterate
14.448 < Iterate 0.007
...snip...
24.171 > Iterate
24.178 < Iterate 0.007
24.178 > Complete
24.588 < Complete 0.410
<Ctrl-C>
avg 79 = sum 8033 / count 101; min 3 max 4926
value |-------------------------------------------------- count
0 | 0
1 | 0
2 | 1
4 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 74
8 |@@@@@@@@@ 19
16 |@ 3
32 | 0
64 | 0
128 | 0
256 |@ 2
512 | 0
1024 | 1
2048 | 0
4096 | 1
8192 | 0
16384 | 0
Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
# qemu-mig.stp: time the QEMU savevm phases during migration to file.
# Probes the entry and return of qemu_savevm_state_begin/iterate/complete,
# prints a timestamped line for each, and accumulates the per-call
# durations (in milliseconds) into a histogram printed on exit.
# Adjust the path to the qemu binary to match your installation.

global then;
global deltas;
global start;

# Print 'str' prefixed with the elapsed time since the script started,
# formatted as seconds.milliseconds
function print_ts(str) {
    now = gettimeofday_ns() / (1000*1000)
    delta = (now - start)
    printf("%3d.%03d %s\n",
           (delta / 1000), (delta % 1000), str);
}

probe begin {
    printf("Begin\n");
    then = 0;
    start = gettimeofday_ns() / (1000*1000);
    print_ts("Start");
}

# For each phase: record the entry timestamp in 'then', then on return
# compute the duration, log it, and feed it into the 'deltas' histogram

probe process("/home/berrange/usr/qemu-git/bin/qemu-system-x86_64").function("qemu_savevm_state_begin") {
    then = gettimeofday_ns() / (1000*1000);
    print_ts("> Begin");
}

probe process("/home/berrange/usr/qemu-git/bin/qemu-system-x86_64").function("qemu_savevm_state_begin").return {
    now = gettimeofday_ns() / (1000*1000);
    if (then != 0) {
        delta = now - then;
        deltas <<< delta;
        print_ts(sprintf("< Begin %3d.%03d",
                         (delta / 1000), (delta % 1000)));
    }
}

probe process("/home/berrange/usr/qemu-git/bin/qemu-system-x86_64").function("qemu_savevm_state_iterate") {
    then = gettimeofday_ns() / (1000*1000);
    print_ts("> Iterate");
}

probe process("/home/berrange/usr/qemu-git/bin/qemu-system-x86_64").function("qemu_savevm_state_iterate").return {
    now = gettimeofday_ns() / (1000*1000);
    if (then != 0) {
        delta = now - then;
        deltas <<< delta;
        print_ts(sprintf("< Iterate %3d.%03d",
                         (delta / 1000), (delta % 1000)));
    }
}

probe process("/home/berrange/usr/qemu-git/bin/qemu-system-x86_64").function("qemu_savevm_state_complete") {
    then = gettimeofday_ns() / (1000*1000);
    print_ts("> Complete");
}

probe process("/home/berrange/usr/qemu-git/bin/qemu-system-x86_64").function("qemu_savevm_state_complete").return {
    now = gettimeofday_ns() / (1000*1000);
    if (then != 0) {
        delta = now - then;
        deltas <<< delta;
        print_ts(sprintf("< Complete %3d.%03d",
                         (delta / 1000), (delta % 1000)));
    }
}

# On Ctrl-C, print summary statistics and a log2 histogram of durations
probe end {
    printf("avg %d = sum %d / count %d; min %d max %d\n",
           @avg(deltas), @sum(deltas), @count(deltas),
           @min(deltas), @max(deltas));
    print(@hist_log(deltas));
}