Re: [PATCH 3/3] nodedev: Move the udevPCITranslateInit call

Erik Skultety <eskultet@xxxxxxxxxx> · Thu, 14 Dec 2017 14:19:15 +0100

On Sat, Dec 09, 2017 at 12:29:14PM -0500, John Ferlan wrote:
> If the timing is "just right", there is a possibility that the
> udev nodeStateInitialize conflicts with another systemd thread
> running an lspci command leaving both waiting for "something",
> but resulting in a hung libvirtd (and hung lspci thread) from
> which the only recovery is a reboot because killing either thread
> is impossible and results in a defunct libvirtd process if a
> SIGKILL is performed.
>
> In order to avoid this let's move where the PCI initialization
> is done to be where it's actually needed. Ensure we only perform
> the initialization once via a driver bool.  Likewise, during
> cleanup ensure we only call udevPCITranslateDeinit once the
> initialization is successful.
>
> At least a failure for this driver won't hang out the rest of the
> libvirt event loop. May not make certain things usable though.
> Still a libvirtd restart is far easier than a host reboot.
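
For reference, the pattern described above (initialize lazily where it is
needed, guard it with a driver bool, and only deinitialize after a successful
init) could be sketched roughly as below. This is only an illustration: every
name in it (demo_driver, demo_pci_init, demo_ensure_pci, ...) is a hypothetical
stand-in and not the actual node device driver code, and locking is left out on
the assumption that the caller already holds the driver lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical driver state; not the real libvirt structure. */
struct demo_driver {
    bool pci_initialized;   /* set only after a successful init */
    int init_calls;         /* counters exist purely for illustration */
    int deinit_calls;
};

/* Stand-in for the PCI translation init; returns 0 on success, -1 on error. */
static int demo_pci_init(struct demo_driver *drv)
{
    drv->init_calls++;
    return 0;
}

/* Stand-in for the matching deinit call. */
static void demo_pci_deinit(struct demo_driver *drv)
{
    drv->deinit_calls++;
}

/* Called from the code path that actually needs PCI data.
 * Assumption: the caller serializes access to drv. */
static int demo_ensure_pci(struct demo_driver *drv)
{
    assert(drv != NULL);

    if (drv->pci_initialized)
        return 0;               /* already done, nothing to repeat */

    if (demo_pci_init(drv) < 0)
        return -1;              /* leave the flag unset on failure */

    drv->pci_initialized = true;
    return 0;
}

/* Cleanup only undoes the init if it ever succeeded. */
static void demo_cleanup(struct demo_driver *drv)
{
    assert(drv != NULL);

    if (drv->pci_initialized) {
        demo_pci_deinit(drv);
        drv->pci_initialized = false;
    }
}
```

With this shape, calling demo_ensure_pci() repeatedly performs the init once,
and demo_cleanup() is a no-op if the init never ran or failed. It does nothing
about the underlying race, it only narrows where a hang could occur.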

Is there a BZ for this, or can you at least share what steps are necessary to
have a chance of hitting this issue? I'm asking because it sounds like we
should file a BZ against udev as well (possibly the kernel), and a thorough
investigation of where the deadlock happens is necessary, because I don't see
any guarantee that simply moving the logic (and adding a trigger condition)
will make a race outside of our scope disappear for good. On the
other hand, having to choose between a hung process requiring a host restart and
a hung worker thread requiring a service restart, I'd obviously opt for the
latter. So I'd say the next steps depend on how frequently and under what
circumstances (specific host devices, kernel version, etc.) this happens,
because to me it sounds odd how systemd and libpciaccess clash here.

Erik

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list