On Fri, Oct 08, 2021 at 03:13:27PM +0200, Michal Prívozník wrote:
> On 10/8/21 2:23 PM, Daniel P. Berrangé wrote:
> > On Fri, Oct 08, 2021 at 01:56:24PM +0200, Michal Prívozník wrote:
> >> On 10/8/21 1:36 PM, Daniel P. Berrangé wrote:
> >
> > So, ACK to this patch as a quick fix, but it feels like we
> > ultimately have a more serious problem.
> >
> > I wonder if we should temporarily stop watching events on
> > the incoming server if we get EMFILE first time on accept,
> > and set a timer to re-enable events 60 seconds later. Rinse,
> > repeat.
>
> Yes, good idea. But I worry that 60 seconds may be too long. Or
> needlessly short - depending on usage. Nevertheless, let us start with
> 60 seconds and we can fine tune later.

Once we're in this out-of-FDs situation we're in a bad mess no matter
what, so the length of the timeout isn't a big issue IMHO. Essentially we
just need to stop burning CPU gratuitously. Even 5 seconds is long enough
to achieve that goal, so just pick a number you're happy with - I don't
feel strongly about it being 60 seconds, it was just an arbitrary number
I picked.

> So are you okay with me pushing this patch and sending this ^^ as a
> follow up?

Yes, consider it

Reviewed-by: Daniel P. Berrangé <berrange@xxxxxxxxxx>

> >
> >>>> * Each event source is assigned a priority. The default priority,
> >>>> * %G_PRIORITY_DEFAULT, is 0. Values less than 0 denote higher priorities.
> >>>> * Values greater than 0 denote lower priorities. Events from high priority
> >>>> * sources are always processed before events from lower priority sources.
> >>>>
> >>>> and per g_idle_add() documentation:
> >>>>
> >>>> * Adds a function to be called whenever there are no higher priority
> >>>> * events pending to the default main loop. The function is given the
> >>>> * default idle priority, %G_PRIORITY_DEFAULT_IDLE.
> >>>>
> >>>> Now, because we did not accept() the client we are constantly
> >>>> seeing POLLIN on the main socket and thus the removal of the
> >>>> client socket won't ever happen.
> >>>>
> >>>> The fix is to set at least the same priority as other sources,
> >>>> but since we want to just close an FD, let's give it the highest
> >>>> priority and call it before handling other events.
> >>>>
> >>>> This issue can be easily reproduced, for instance:
> >>>>
> >>>>   # ulimit -S -n 40   (tweak this number if needed)
> >>>>   # ./src/libvirtd
> >>>>
> >>>> from another terminal:
> >>>>
> >>>>   # for ((i=0; i<100; i++)); do virsh list & done; virsh list
> >>>>
> >>>> The last `virsh list` must not get stuck.
> >>>
> >>> The bug below describes libvirt core dumps rather than
> >>> hangs. Do you know why it dumped?
> >>
> >> Because of glib. We called some glib function which internally wanted
> >> to open a pipe, which failed, so glib aborted. I don't remember the
> >> exact function, but I can look it up if you want.
> >
> > Oh, so this patch is not actually fixing the core dump scenario
> > then. Even after we fix the bug that prevents clients disconnecting,
> > it will still dump if this API is called at a point where we are in
> > an EMFILE state?
>
> Yes. But I am afraid there's no way around it.
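For illustration, here is a minimal sketch of the accept() backoff idea
discussed above, written against plain GLib rather than libvirt's own event
wrappers. The names handle_incoming() and reenable_listener() are made up
for this example, and the 60-second interval is just the number from the
thread:

/*
 * Minimal sketch (not libvirt code) of backing off the listener watch
 * when accept() fails with EMFILE, using plain GLib.
 */
#include <errno.h>
#include <glib.h>
#include <glib-unix.h>
#include <sys/socket.h>

static gboolean reenable_listener(gpointer user_data);

/* Invoked whenever the listening socket reports G_IO_IN (POLLIN). */
static gboolean
handle_incoming(gint fd, GIOCondition condition, gpointer user_data)
{
    int clientfd = accept(fd, NULL, NULL);

    if (clientfd < 0) {
        if (errno == EMFILE || errno == ENFILE) {
            /* Out of FDs: drop this watch so the main loop stops spinning
             * on POLLIN, and retry after a timeout. */
            g_timeout_add_seconds(60, reenable_listener, GINT_TO_POINTER(fd));
            return G_SOURCE_REMOVE;
        }
        return G_SOURCE_CONTINUE;   /* transient error, keep watching */
    }

    /* ... hand clientfd over to the RPC server as usual ... */
    return G_SOURCE_CONTINUE;
}

/* One-shot timer callback: re-arm the listener watch, then stop the timer. */
static gboolean
reenable_listener(gpointer user_data)
{
    gint fd = GPOINTER_TO_INT(user_data);

    g_unix_fd_add(fd, G_IO_IN, handle_incoming, NULL);
    return G_SOURCE_REMOVE;
}

Returning G_SOURCE_REMOVE from the FD callback removes the listener watch
entirely, so the loop stops spinning on POLLIN until the one-shot timer
re-adds it. The patch being ACKed addresses the complementary client side:
the callback that closes a client FD can be registered with a
higher-than-default priority (e.g. via g_idle_add_full() with
G_PRIORITY_HIGH) so it is dispatched before the listener's source fires
again.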
> Found the stack trace:
>
> Stack trace of thread 52231:
> #0  0x000003ffa90be9e4 raise (libc.so.6)
> #1  0x000003ffa96590ee g_log_default_handler (libglib-2.0.so.0)
> #2  0x000003ffa965935e g_logv (libglib-2.0.so.0)
> #3  0x000003ffa965955a g_log (libglib-2.0.so.0)
> #4  0x000003ffa969dd70 g_wakeup_new (libglib-2.0.so.0)
> #5  0x000003ffa964eb38 g_main_context_new (libglib-2.0.so.0)
> #6  0x000003ffa9946886 vir_event_thread_init (libvirt.so.0)
> #7  0x000003ffa95b5ff8 g_type_create_instance (libgobject-2.0.so.0)
> #8  0x000003ffa95978cc g_object_new_internal (libgobject-2.0.so.0)
> #9  0x000003ffa95990de g_object_new_with_properties (libgobject-2.0.so.0)
> #10 0x000003ffa9599ab6 g_object_new (libgobject-2.0.so.0)
> #11 0x000003ffa9946b78 virEventThreadNew (libvirt.so.0)
> #12 0x000003ff6c2c8bd8 qemuProcessQMPNew (libvirt_driver_qemu.so)
> #13 0x000003ff6c1f57b8 virQEMUCapsInitQMPSingle (libvirt_driver_qemu.so)
> #14 0x000003ff6c1f59ec virQEMUCapsNewForBinaryInternal (libvirt_driver_qemu.so)
> #15 0x000003ff6c1f5c92 virQEMUCapsNewData (libvirt_driver_qemu.so)
> #16 0x000003ffa99510f4 virFileCacheValidate (libvirt.so.0)
> #17 0x000003ffa995133c virFileCacheLookup (libvirt.so.0)
> #18 0x000003ff6c1f5e4a virQEMUCapsCacheLookup (libvirt_driver_qemu.so)
> #19 0x000003ff6c1f621e virQEMUCapsCacheLookupCopy (libvirt_driver_qemu.so)
> #20 0x000003ff6c2c4a44 qemuProcessInit (libvirt_driver_qemu.so)
> #21 0x000003ff6c2c67f2 qemuProcessStart (libvirt_driver_qemu.so)
> #22 0x000003ff6c26358c qemuDomainObjStart.constprop.55 (libvirt_driver_qemu.so)
> #23 0x000003ff6c2639c8 qemuDomainCreateWithFlags (libvirt_driver_qemu.so)
> #24 0x000003ffa9b77902 virDomainCreate (libvirt.so.0)
> #25 0x000002aa34ad2d52 remoteDispatchDomainCreateHelper (libvirtd)
> #26 0x000003ffa9a6094a virNetServerProgramDispatch (libvirt.so.0)
> #27 0x000003ffa9a665b6 virNetServerHandleJob (libvirt.so.0)
> #28 0x000003ffa99a0d32 virThreadPoolWorker (libvirt.so.0)
> #29 0x000003ffa99a02d6 virThreadHelper (libvirt.so.0)
> #30 0x000003ffa85880c6 start_thread (libpthread.so.0)
> #31 0x000003ffa917bd36 thread_start (libc.so.6)
>
> So my bad, it failed opening an eventfd (in frame #4, g_wakeup_new()), but
> anyway, I don't think we can mitigate this. I think there was another
> stack trace attached to the BZ where it failed in glib in pipe().
> Doesn't really matter though. It's glib - we accepted the abort policy
> when adopting it.

Yeah, we're basically doomed with FD exhaustion. If a host got into this
state once, it's likely to do it again, so the only real fix is to
re-configure the host to have sufficient FDs. Anything we do to cope with
FD exhaustion is just a "nice to have", not a real fix.

Regards,
Daniel
--
|: https://berrange.com         -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org          -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org   -o-    https://www.instagram.com/dberrange :|