Thanks for your detailed explanation, Maybe i can set udev_monitor to nonblocking on centos6.
owen.si@xxxxxxxxx
From: Erik SkultetyDate: 2018-10-18 20:23To: John FerlanCC: Bingsong Si; libvir-listSubject: Re: [PATCH] libvirtd: fix potential deadlock when starting vmOn Tue, Oct 16, 2018 at 05:57:17PM -0400, John Ferlan wrote:>>> On 10/11/18 4:13 AM, Bingsong Si wrote:> > On CentOS 6, udev_monitor_receive_device will block until the socket becomes>> Is this really CentOS6 only or just where you've seen it?>> > readable, udevEventHandleThread will hold the lock all the time and> > udevEventHandleCallback hard to get the lock, will block the event poll.> > To fix this, set dataReady to false after receive an udev event.> >> > Signed-off-by: Bingsong Si <owen.si@xxxxxxxxx>> > ---> > src/node_device/node_device_udev.c | 5 +----> > 1 file changed, 1 insertion(+), 4 deletions(-)> >>> I've CC'd Erik since he wrote and perhaps remembers all the "gotchas" he> discovered in the udev callback code.>> I wonder if this has to do with the EAGAIN and EWOULDBLOCK @errno checks> done in the !device loop that are different in "older" (much older) code.>> Although I have this very vague recollection that there was some problem> with centos6 that was fixed by some OS patch. Hopefully Erik remembers> (and maybe we should log it in the code at this point ;-)) - I did do> some searching, but came up empty.Remembering a year old issue, let me tell you, my head hurts :) (and we probablyshould put a note somewhere, so that we don't have to dig out dinosaursagain)...the only thing I remember is that there was a reason why I did thingsthis way and not the way this patch is proposing, and indeed I then found this:https://www.redhat.com/archives/libvir-list/2017-September/msg00683.htmlTL;DR:The scheduler comes into play here. The problem I had was that the event loopcould be scheduled (and it in fact was) earlier than the handler thread here.What that essentially means is that by the time the thread actually handled theevent and read the data from the monitor, the event loop fired the very sameevent, simply because the data hadn't been retrieved from the socket at thatpoint yet.This was mainly connected to the design flaw of that specific version of patchseries. With the current design, setting dataReady immediately after reading thedata or after encoutering the first EAGAIN doesn't matter and the schedulerwouldn't have an impact either way, that's true. However, with CentOS 6 thescheduler would still come into play even with your patch (it was much morenoticeable the more devices you had in/added into the system), you'd stillremain blocking on the recv call. The correct fix would be morecomplex and IIRC it would involve pulling the monitor object out of the privatedata lockable object and would need to be guarded by a separate lock (I haven'tthought about it much though, so I might be wrong).That said, we already dropped upstream support for CentOS 6, so I'mnot really keen on "fixing" anything, unless the currently supported platformssuffer from a related issue which would require code changes in which case wecould merge a patch like this upstream. You should upgrade your platform to anewer CentOS if you want to rely on features provided by new(ish) libvirt.Erik
-- libvir-list mailing list libvir-list@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/libvir-list