Re: [PATCH] libvirtd: fix potential deadlock when starting vm

"owen.si@xxxxxxxxx" <owen.si@xxxxxxxxx> · Fri, 19 Oct 2018 10:05:21 +0800

Thanks for your detailed explanation. Maybe I can set the udev_monitor to non-blocking on CentOS 6.
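
Something along these lines, as a rough sketch only. It assumes the descriptor returned by udev_monitor_get_fd() can simply be switched to non-blocking mode with fcntl(); set_nonblocking() and fd_would_block() are made-up helper names, not libvirt or libudev functions, and I have not tried this against libvirt itself.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: put a descriptor into non-blocking mode.
 * In the real code the fd would presumably come from
 * udev_monitor_get_fd(); here it is just any socket. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    if (flags & O_NONBLOCK)
        return 0;               /* already non-blocking */
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Hypothetical helper: peek at the socket without consuming data.
 * Returns 1 if a read would block (nothing queued), 0 if something
 * could be read, -1 on any other error. */
static int fd_would_block(int fd)
{
    char c;
    ssize_t n = recv(fd, &c, 1, MSG_PEEK);

    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 1;
    return n < 0 ? -1 : 0;
}
```

With the socket in this mode, a receive on an empty queue fails with EAGAIN instead of sleeping while the lock is held.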

owen.si@xxxxxxxxx

From: Erik Skultety
Date: 2018-10-18 20:23
To: John Ferlan
CC: Bingsong Si; libvir-list
Subject: Re: [PATCH] libvirtd: fix potential deadlock when starting vm
On Tue, Oct 16, 2018 at 05:57:17PM -0400, John Ferlan wrote:
>
>
> On 10/11/18 4:13 AM, Bingsong Si wrote:
> > On CentOS 6, udev_monitor_receive_device will block until the socket becomes
>
> Is this really CentOS6 only or just where you've seen it?
>
> > readable, so udevEventHandleThread holds the lock the whole time and
> > udevEventHandleCallback can hardly acquire it, which blocks the event poll.
> > To fix this, set dataReady to false after receiving a udev event.
> >
> > Signed-off-by: Bingsong Si <owen.si@xxxxxxxxx>
> > ---
> >  src/node_device/node_device_udev.c | 5 +----
> >  1 file changed, 1 insertion(+), 4 deletions(-)
> >
>
> I've CC'd Erik since he wrote and perhaps remembers all the "gotchas" he
> discovered in the udev callback code.
>
> I wonder if this has to do with the EAGAIN and EWOULDBLOCK @errno checks
> done in the !device loop that are different in "older" (much older) code.
>
> Although I have this very vague recollection that there was some problem
> with centos6 that was fixed by some OS patch.  Hopefully Erik remembers
> (and maybe we should log it in the code at this point ;-)) - I did do
> some searching, but came up empty.

Remembering a year-old issue, let me tell you, my head hurts :) (and we probably
should put a note somewhere, so that we don't have to dig out dinosaurs
again)...the only thing I remember is that there was a reason why I did things
this way and not the way this patch is proposing, and indeed I then found this:

https://www.redhat.com/archives/libvir-list/2017-September/msg00683.html

TL;DR:
The scheduler comes into play here. The problem I had was that the event loop
could be scheduled (and it in fact was) earlier than the handler thread here.
What that essentially means is that by the time the thread actually handled the
event and read the data from the monitor, the event loop fired the very same
event, simply because the data hadn't been retrieved from the socket at that
point yet.
This was mainly connected to a design flaw in that specific version of the patch
series. With the current design, clearing dataReady immediately after reading the
data or after encountering the first EAGAIN doesn't matter and the scheduler
wouldn't have an impact either way, that's true. However, with CentOS 6 the
scheduler would still come into play even with your patch (it was much more
noticeable the more devices you had in/added into the system) and you'd still
end up blocking on the recv call. The correct fix would be more
complex and IIRC it would involve pulling the monitor object out of the private
data lockable object and would need to be guarded by a separate lock (I haven't
thought about it much though, so I might be wrong).
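
To make that idea a bit more concrete, here is a very rough sketch, not the actual node device driver code. Everything in it is an assumption made for illustration: struct event_state, event_callback() and drain_events() are hypothetical names, a plain datagram socket stands in for the udev monitor, and pthread mutexes stand in for libvirt's lockable objects. The point is only that the socket gets its own lock, the worker drains it with non-blocking reads until EAGAIN, and dataReady is cleared afterwards.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical stand-in for the driver's private data. */
struct event_state {
    pthread_mutex_t lock;        /* guards dataReady only */
    bool dataReady;
    pthread_mutex_t monitorLock; /* separate lock guarding the socket */
    int monitorFd;               /* stands in for the udev monitor */
};

/* Event loop side: only flags that data is pending. It never takes
 * monitorLock, so it cannot get stuck behind a slow receive. */
static void event_callback(struct event_state *st)
{
    pthread_mutex_lock(&st->lock);
    st->dataReady = true;
    pthread_mutex_unlock(&st->lock);
}

/* Worker side: read everything that is queued without blocking, then
 * clear dataReady. Returns the number of messages consumed, or -1 on
 * an unexpected receive error. */
static int drain_events(struct event_state *st)
{
    char buf[256];
    int handled = 0;

    for (;;) {
        ssize_t n;
        int err;

        pthread_mutex_lock(&st->monitorLock);
        n = recv(st->monitorFd, buf, sizeof(buf), MSG_DONTWAIT);
        err = errno;             /* save before unlocking */
        pthread_mutex_unlock(&st->monitorLock);

        if (n > 0) {
            handled++;           /* a real handler would process buf */
            continue;
        }
        if (n < 0 && err == EINTR)
            continue;
        if (n < 0 && err != EAGAIN && err != EWOULDBLOCK)
            return -1;

        /* Queue is empty (or an empty message arrived): nothing left */
        pthread_mutex_lock(&st->lock);
        st->dataReady = false;
        pthread_mutex_unlock(&st->lock);
        return handled;
    }
}
```

Whether something like this would be enough on CentOS 6 is a separate question, since it relies on the receive call actually honouring the non-blocking request.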

That said, we already dropped upstream support for CentOS 6, so I'm
not really keen on "fixing" anything, unless the currently supported platforms
suffer from a related issue which would require code changes in which case we
could merge a patch like this upstream. You should upgrade your platform to a
newer CentOS if you want to rely on features provided by new(ish) libvirt.

Erik

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list