On 01/13/2016 06:45 PM, Cole Robinson wrote:
> On 01/13/2016 05:18 AM, Richard W.M. Jones wrote:
>> As people may know, we frequently encounter errors caused by libvirt
>> when running the libguestfs appliance.
>>
>> I wanted to find out exactly how frequently these happen and classify
>> the errors, so I ran the 'virt-df' tool overnight 1700 times.  This
>> tool runs several parallel qemu:///session libvirt connections, each
>> creating a short-lived appliance guest.
>>
>> Note that I have added Cole's patch to fix https://bugzilla.redhat.com/1271183
>> "XML-RPC error : Cannot write data: Transport endpoint is not connected"
>>
>> Results:
>>
>> The test failed 538 times (32% of the time), which is pretty dismal.
>> To be fair, virt-df is aggressive about how it launches parallel
>> libvirt connections.  Most other virt-* tools use only a single
>> libvirt connection and are consequently more reliable.
>>
>> Of the failures, 518 (96%) were of the form:
>>
>>   process exited while connecting to monitor: qemu: could not load kernel '/home/rjones/d/libguestfs/tmp/.guestfs-1000/appliance.d/kernel': Permission denied
>>
>> which is https://bugzilla.redhat.com/921135 or maybe
>> https://bugzilla.redhat.com/1269975.  It's not clear to me whether these
>> bugs have different causes, but if they do, we are potentially seeing a
>> mix of both, since my test has no way to distinguish them.
>>
>
> I just experimented with this, and I think it's the issue I suggested at:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1269975#c4
>
> I created two VMs, kernel1 and kernel2, just booting off a kernel in
> $HOME/session-kernel/vmlinuz.  Then I added this patch:
>
> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
> index f083f3f..5d9f0fa 100644
> --- a/src/qemu/qemu_process.c
> +++ b/src/qemu/qemu_process.c
> @@ -4901,6 +4901,13 @@ qemuProcessLaunch(virConnectPtr conn,
>                                        incoming ? incoming->path : NULL) < 0)
>          goto cleanup;
>
> +    if (STREQ(vm->def->name, "kernel1")) {
> +        for (int z = 0; z < 30; z++) {
> +            printf("kernel1: sleeping %d of 30\n", z + 1);
> +            sleep(1);
> +        }
> +    }
> +
>      /* Security manager labeled all devices, therefore
>       * if any operation from now on fails, we need to ask the caller to
>       * restore labels.
>
> The added sleep sits right after SELinux labels are set on VM startup.
> This is then easy to reproduce with:
>
>   virsh start kernel1    (sleeps)
>   virsh start kernel2 && virsh destroy kernel2
>
> The shared vmlinuz is reset to user_home_t after kernel2 is shut down, so
> kernel1 fails to start once the patch's sleep expires.
>
> When we detect similar issues with <disk> devices, for example when the
> media already has the expected label, we encode 'relabel=no' in the disk
> XML, which tells libvirt not to run restorecon on the disk's path when
> the VM is shut down.  However, the kernel/initrd XML doesn't support this
> attribute, so it won't work there.  Adding that could be one fix.
>
> But I think there are longer-term plans for this type of issue using ACLs,
> virtlockd or something similar; Michal had patches but I don't know the
> specifics.
>
> Unfortunately even hardlinks share SELinux labels, so I don't think there's
> any workaround on the libguestfs side short of using a separate copy of the
> appliance kernel for each VM.

Whoops, I should have checked my libvirt mail first; you guys already came to
this conclusion elsewhere in the thread :)

- Cole
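
For reference, the per-device override mentioned above is the <seclabel>
sub-element of a disk's <source>.  A minimal sketch of what that looks like
in domain XML (the image path and target device here are made up purely for
illustration):

  <disk type='file' device='disk'>
    <source file='/var/lib/libvirt/images/shared.img'>
      <!-- ask libvirt not to restorecon this path when the VM shuts down -->
      <seclabel model='selinux' relabel='no'/>
    </source>
    <target dev='vda' bus='virtio'/>
  </disk>

As the message notes, <kernel> and <initrd> have no equivalent sub-element,
which is why this override cannot help the shared appliance kernel case.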
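
To illustrate the hardlink point: SELinux stores the label on the inode, so a
hard link shares its label with the original file, while a real copy gets its
own.  A rough shell sketch of how to observe this (file names follow the
$HOME/session-kernel/vmlinuz example above; svirt_image_t merely stands in
for whatever label libvirt would actually apply):

  cd $HOME/session-kernel
  ln vmlinuz vmlinuz.link            # hard link: same inode, same label
  cp vmlinuz vmlinuz.copy            # copy: new inode, its own label
  chcon -t svirt_image_t vmlinuz     # stand-in for libvirt's relabel
  ls -Z vmlinuz vmlinuz.link vmlinuz.copy
  # vmlinuz and vmlinuz.link now show svirt_image_t,
  # while vmlinuz.copy keeps its original user_home_t label

This is why a per-VM copy of the appliance kernel avoids the relabel race
while a hardlink does not.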