On Fri, Oct 08, 2021 at 03:13:27PM +0200, Michal Prívozník wrote:
> On 10/8/21 2:23 PM, Daniel P. Berrangé wrote:
> > On Fri, Oct 08, 2021 at 01:56:24PM +0200, Michal Prívozník wrote:
> >> On 10/8/21 1:36 PM, Daniel P. Berrangé wrote:
> >
> > So, ACK to this patch as a quick fix, but it feels like we
> > ultimately have a more serious problem.
> >
> > I wonder if we should temporarily stop watching events on
> > the incoming server if we get EMFILE first time on accept,
> > and set a timer to re-enable events 60 seconds later. Rinse,
> > repeat.
>
> Yes, good idea. But I worry that 60 seconds may be too long. Or
> needlessly short - depending on usage. Nevertheless, let us start with
> 60 seconds and we can fine tune later.

Once we're in this out-of-FDs situation we're in a bad mess no matter
what, so the length of the timeout isn't a big issue IMHO. Essentially we
just need to stop burning CPU gratuitously. Even 5 seconds is long enough
to achieve that goal, so just pick a number you're happy with - I don't
feel strongly about it being 60 seconds, it was just an arbitrary number
I picked.

> So are you okay with me pushing this patch and sending this ^^ as a
> follow up?

Yes, consider it

Reviewed-by: Daniel P. Berrangé <berrange@xxxxxxxxxx>

> >
> >>>> * Each event source is assigned a priority. The default priority,
> >>>> * %G_PRIORITY_DEFAULT, is 0. Values less than 0 denote higher priorities.
> >>>> * Values greater than 0 denote lower priorities. Events from high priority
> >>>> * sources are always processed before events from lower priority sources.
> >>>>
> >>>> and per g_idle_add() documentation:
> >>>>
> >>>> * Adds a function to be called whenever there are no higher priority
> >>>> * events pending to the default main loop. The function is given the
> >>>> * default idle priority, %G_PRIORITY_DEFAULT_IDLE.
> >>>>
> >>>> Now, because we did not accept() the client we are constantly
> >>>> seeing POLLIN on the main socket and thus the removal of the
> >>>> client socket won't ever happen.
> >>>>
> >>>> The fix is to set at least the same priority as other sources,
> >>>> but since we want to just close an FD, let's give it the highest
> >>>> priority and call it before handling other events.
> >>>>
> >>>> This issue can be easily reproduced, for instance:
> >>>>
> >>>>   # ulimit -S -n 40   (tweak this number if needed)
> >>>>   # ./src/libvirtd
> >>>>
> >>>> from another terminal:
> >>>>
> >>>>   # for ((i=0; i<100; i++)); do virsh list & done; virsh list
> >>>>
> >>>> The last `virsh list` must not get stuck.
> >>>
> >>> The bug below describes libvirt core dumps rather than
> >>> hangs. Do you know why it dumped?
> >>
> >> Because of glib. We called some glib function which internally wanted
> >> to open a pipe, which failed, so glib aborted. I don't remember the
> >> exact function, but I can look it up if you want.
> >
> > Oh, so this patch is not actually fixing the core dump scenario
> > then. Even after we fix the bug that prevents clients disconnecting,
> > it will still dump if this API is called at a point where we are in
> > an EMFILE state?
>
> Yes. But I am afraid there's no way around it.
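For illustration, here is a minimal sketch of the accept() backoff idea
discussed above, written against plain GLib rather than libvirt's own event
wrappers. The names handle_incoming() and reenable_listener() are made up
for this example, and the 60-second interval is just the number from the
thread:

/*
 * Minimal sketch (not libvirt code) of backing off the listener watch
 * when accept() fails with EMFILE, using plain GLib.
 */
#include <errno.h>
#include <glib.h>
#include <glib-unix.h>
#include <sys/socket.h>

static gboolean reenable_listener(gpointer user_data);

/* Invoked whenever the listening socket reports G_IO_IN (POLLIN). */
static gboolean
handle_incoming(gint fd, GIOCondition condition, gpointer user_data)
{
    int clientfd = accept(fd, NULL, NULL);

    if (clientfd < 0) {
        if (errno == EMFILE || errno == ENFILE) {
            /* Out of FDs: drop this watch so the main loop stops spinning
             * on POLLIN, and retry after a timeout. */
            g_timeout_add_seconds(60, reenable_listener, GINT_TO_POINTER(fd));
            return G_SOURCE_REMOVE;
        }
        return G_SOURCE_CONTINUE;   /* transient error, keep watching */
    }

    /* ... hand clientfd over to the RPC server as usual ... */
    return G_SOURCE_CONTINUE;
}

/* One-shot timer callback: re-arm the listener watch, then stop the timer. */
static gboolean
reenable_listener(gpointer user_data)
{
    gint fd = GPOINTER_TO_INT(user_data);

    g_unix_fd_add(fd, G_IO_IN, handle_incoming, NULL);
    return G_SOURCE_REMOVE;
}

Returning G_SOURCE_REMOVE from the FD callback removes the listener watch
entirely, so the loop stops spinning on POLLIN until the one-shot timer
re-adds it. The patch being ACKed addresses the complementary client side:
the callback that closes a client FD can be registered with a
higher-than-default priority (e.g. via g_idle_add_full() with
G_PRIORITY_HIGH) so it is dispatched before the listener's source fires
again.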
> Found the stack trace:
>
> Stack trace of thread 52231:
> #0  0x000003ffa90be9e4 raise (libc.so.6)
> #1  0x000003ffa96590ee g_log_default_handler (libglib-2.0.so.0)
> #2  0x000003ffa965935e g_logv (libglib-2.0.so.0)
> #3  0x000003ffa965955a g_log (libglib-2.0.so.0)
> #4  0x000003ffa969dd70 g_wakeup_new (libglib-2.0.so.0)
> #5  0x000003ffa964eb38 g_main_context_new (libglib-2.0.so.0)
> #6  0x000003ffa9946886 vir_event_thread_init (libvirt.so.0)
> #7  0x000003ffa95b5ff8 g_type_create_instance (libgobject-2.0.so.0)
> #8  0x000003ffa95978cc g_object_new_internal (libgobject-2.0.so.0)
> #9  0x000003ffa95990de g_object_new_with_properties (libgobject-2.0.so.0)
> #10 0x000003ffa9599ab6 g_object_new (libgobject-2.0.so.0)
> #11 0x000003ffa9946b78 virEventThreadNew (libvirt.so.0)
> #12 0x000003ff6c2c8bd8 qemuProcessQMPNew (libvirt_driver_qemu.so)
> #13 0x000003ff6c1f57b8 virQEMUCapsInitQMPSingle (libvirt_driver_qemu.so)
> #14 0x000003ff6c1f59ec virQEMUCapsNewForBinaryInternal (libvirt_driver_qemu.so)
> #15 0x000003ff6c1f5c92 virQEMUCapsNewData (libvirt_driver_qemu.so)
> #16 0x000003ffa99510f4 virFileCacheValidate (libvirt.so.0)
> #17 0x000003ffa995133c virFileCacheLookup (libvirt.so.0)
> #18 0x000003ff6c1f5e4a virQEMUCapsCacheLookup (libvirt_driver_qemu.so)
> #19 0x000003ff6c1f621e virQEMUCapsCacheLookupCopy (libvirt_driver_qemu.so)
> #20 0x000003ff6c2c4a44 qemuProcessInit (libvirt_driver_qemu.so)
> #21 0x000003ff6c2c67f2 qemuProcessStart (libvirt_driver_qemu.so)
> #22 0x000003ff6c26358c qemuDomainObjStart.constprop.55 (libvirt_driver_qemu.so)
> #23 0x000003ff6c2639c8 qemuDomainCreateWithFlags (libvirt_driver_qemu.so)
> #24 0x000003ffa9b77902 virDomainCreate (libvirt.so.0)
> #25 0x000002aa34ad2d52 remoteDispatchDomainCreateHelper (libvirtd)
> #26 0x000003ffa9a6094a virNetServerProgramDispatch (libvirt.so.0)
> #27 0x000003ffa9a665b6 virNetServerHandleJob (libvirt.so.0)
> #28 0x000003ffa99a0d32 virThreadPoolWorker (libvirt.so.0)
> #29 0x000003ffa99a02d6 virThreadHelper (libvirt.so.0)
> #30 0x000003ffa85880c6 start_thread (libpthread.so.0)
> #31 0x000003ffa917bd36 thread_start (libc.so.6)
>
> So my bad, it failed opening an eventfd (in frame #4, g_wakeup_new()), but
> anyway, I don't think we can mitigate this. I think there was another
> stack trace attached to the BZ where it failed in glib in pipe().
> Doesn't really matter though. It's glib - we accepted the abort policy
> when adopting it.

Yeah, we're basically doomed with FD exhaustion. If a host got into this
state once, it's likely to do it again, so the only real fix is to
re-configure the host to have sufficient FDs. Anything we do to cope with
FD exhaustion is just a "nice to have", not a real fix.

Regards,
Daniel
--
|: https://berrange.com         -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org          -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org   -o-    https://www.instagram.com/dberrange :|