If there is an unresponsive qemu process and libvirt accesses its monitor, libvirt will not get any response and the calling thread will block indefinitely, until the qemu process resumes or is destroyed. If users continue executing APIs against that domain, libvirt will run out of worker threads and hang (if those APIs access the monitor as well). Those APIs will time out after approx 30 seconds, which frees some workers, but during that time libvirt is unable to process any request. Even worse - if the number of unresponsive qemu processes exceeds the size of the worker thread pool, libvirt will hang forever, and even restarting the daemon will not make it any better.

This patch set heals the daemon on several levels, so that none of the above will cause it to hang:

1. RPC dispatching - all APIs are now annotated as 'high' or 'low' priority, and a special thread pool is created. Low priority APIs are still placed into the usual pool, but high priority APIs can be placed into this new pool if the former has no free worker. Which APIs should be marked high and which low? The split presented here is my best guess. It is not written in stone, but it is clearly not safe to annotate as a high priority call any API which is NOT guaranteed to finish in a reasonably short time.

2. Job queue size limit - this sets a bound on the number of threads blocked by a stuck qemu. Okay, there is a timeout on this, but if a user application continues dispatching low priority calls, it can still consume all (low priority) worker threads and therefore affect other users/VMs, even though each call times out after approx 30 secs.

3. Run monitor re-connect in a separate thread per VM. If libvirtd is restarted, it tries to reconnect to all running qemu processes. This is potentially risky - one stuck qemu blocks daemon startup.
However, putting the monitor startup code into one thread per VM allows libvirtd to start up, accept client connections and work with all VMs whose monitor was successfully re-opened. An unresponsive qemu will hold the job until we open the monitor, so a clever user application can destroy such a domain. All APIs requiring a job will just fail to acquire the lock.

Michal Privoznik (3):
  daemon: Create priority workers pool
  qemu: Introduce job queue size limit
  qemu: Deal with stucked qemu on daemon startup

 daemon/libvirtd.aug          |    1 +
 daemon/libvirtd.c            |   10 +-
 daemon/libvirtd.conf         |    6 +
 daemon/remote.c              |   26 ++
 daemon/remote.h              |    2 +
 src/qemu/libvirtd_qemu.aug   |    1 +
 src/qemu/qemu.conf           |    7 +
 src/qemu/qemu_conf.c         |    4 +
 src/qemu/qemu_conf.h         |    2 +
 src/qemu/qemu_domain.c       |   17 ++
 src/qemu/qemu_domain.h       |    2 +
 src/qemu/qemu_driver.c       |   23 +--
 src/qemu/qemu_process.c      |   89 ++++++-
 src/remote/qemu_protocol.x   |   13 +-
 src/remote/remote_protocol.x |  544 +++++++++++++++++++++---------------------
 src/rpc/gendispatch.pl       |   48 ++++-
 src/rpc/virnetserver.c       |   32 +++-
 src/rpc/virnetserver.h       |    6 +-
 src/util/threadpool.c        |   38 ++-
 src/util/threadpool.h        |    1 +
 20 files changed, 554 insertions(+), 318 deletions(-)

-- 
1.7.3.4
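
For readers who want the dispatch rule from point 1 and the queue bound from point 2 in concrete form, here is a minimal C sketch. It is only an illustration, not the code in these patches: every type and function name below (server_state, choose_pool, domain_jobs, job_try_enter, job_leave) is hypothetical, and two behaviours are my assumptions rather than anything stated above: a call that finds no free worker in either pool simply waits in the ordinary queue, and a limit of 0 means "unlimited".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the two worker pools (not libvirt's structs). */
typedef struct {
    size_t normal_free;    /* idle workers in the ordinary pool */
    size_t priority_free;  /* idle workers in the priority pool */
} server_state;

typedef enum {
    DISPATCH_NORMAL,    /* run on an ordinary worker */
    DISPATCH_PRIORITY,  /* run on a priority worker */
    DISPATCH_QUEUED     /* assumption: wait in the ordinary queue */
} dispatch_target;

/* Point 1: any call uses the ordinary pool while it has a free worker;
 * only high priority calls may fall back to the priority pool. */
dispatch_target choose_pool(const server_state *s, bool high_priority)
{
    assert(s != NULL);
    if (s->normal_free > 0)
        return DISPATCH_NORMAL;
    if (high_priority && s->priority_free > 0)
        return DISPATCH_PRIORITY;
    return DISPATCH_QUEUED;
}

/* Hypothetical per-domain job accounting. */
typedef struct {
    size_t jobs_queued;  /* threads holding or waiting for the job */
    size_t max_queued;   /* upper bound; 0 = unlimited (assumption) */
} domain_jobs;

/* Point 2: refuse to queue another thread once the bound is reached,
 * so a stuck qemu can block at most max_queued workers. */
bool job_try_enter(domain_jobs *d)
{
    assert(d != NULL);
    if (d->max_queued != 0 && d->jobs_queued >= d->max_queued)
        return false;
    d->jobs_queued++;
    return true;
}

/* Release one slot when a thread finishes or gives up on the job. */
void job_leave(domain_jobs *d)
{
    assert(d != NULL && d->jobs_queued > 0);
    d->jobs_queued--;
}
```

With such a bound in place, a caller that is refused can fail fast instead of tying up a worker for the whole timeout, which is the effect point 2 is after.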