On 10/18/18 8:23 AM, Erik Skultety wrote: > On Tue, Oct 16, 2018 at 05:57:17PM -0400, John Ferlan wrote: >> >> >> On 10/11/18 4:13 AM, Bingsong Si wrote: >>> On CentOS 6, udev_monitor_receive_device will block until the socket becomes >> >> Is this really CentOS6 only or just where you've seen it? >> >>> readable, udevEventHandleThread will hold the lock all the time and >>> udevEventHandleCallback hard to get the lock, will block the event poll. >>> To fix this, set dataReady to false after receive an udev event. >>> >>> Signed-off-by: Bingsong Si <owen.si@xxxxxxxxx> >>> --- >>> src/node_device/node_device_udev.c | 5 +---- >>> 1 file changed, 1 insertion(+), 4 deletions(-) >>> >> >> I've CC'd Erik since he wrote and perhaps remembers all the "gotchas" he >> discovered in the udev callback code. >> >> I wonder if this has to do with the EAGAIN and EWOULDBLOCK @errno checks >> done in the !device loop that are different in "older" (much older) code. >> >> Although I have this very vague recollection that there was some problem >> with centos6 that was fixed by some OS patch. Hopefully Erik remembers >> (and maybe we should log it in the code at this point ;-)) - I did do >> some searching, but came up empty. > > Remembering a year old issue, let me tell you, my head hurts :) (and we probably > should put a note somewhere, so that we don't have to dig out dinosaurs > again)...the only thing I remember is that there was a reason why I did things > this way and not the way this patch is proposing, and indeed I then found this: > > https://www.redhat.com/archives/libvir-list/2017-September/msg00683.html > > TL;DR: > The scheduler comes into play here. The problem I had was that the event loop > could be scheduled (and it in fact was) earlier than the handler thread here. > What that essentially means is that by the time the thread actually handled the > event and read the data from the monitor, the event loop fired the very same > event, simply because the data hadn't been retrieved from the socket at that > point yet. > This was mainly connected to the design flaw of that specific version of patch > series. With the current design, setting dataReady immediately after reading the > data or after encoutering the first EAGAIN doesn't matter and the scheduler > wouldn't have an impact either way, that's true. However, with CentOS 6 the > scheduler would still come into play even with your patch (it was much more > noticeable the more devices you had in/added into the system), you'd still > remain blocking on the recv call. The correct fix would be more > complex and IIRC it would involve pulling the monitor object out of the private > data lockable object and would need to be guarded by a separate lock (I haven't > thought about it much though, so I might be wrong). > > That said, we already dropped upstream support for CentOS 6, so I'm > not really keen on "fixing" anything, unless the currently supported platforms > suffer from a related issue which would require code changes in which case we > could merge a patch like this upstream. You should upgrade your platform to a > newer CentOS if you want to rely on features provided by new(ish) libvirt. > > Erik > Thanks for the details - I could support such a patch (or write it using the above description). The one thing I have forgotten or perhaps I should say struck me - should this code use virCondWaitUntil or would it just not matter? Let's say 10 devices were added and on the 10th one we had this issue, set dataReady to false, but then didn't get another udev device ready event for "a while" leaving that 10th one in limbo until such time a new device was available (e.g. udevEventHandleCallback is called by udev). Usage of the *Until would only occur if we've fallen into the !device and continue code path and would be cleared prior before getting @device again. Too much over thinking ;->? John John -- libvir-list mailing list libvir-list@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/libvir-list