On Mon, Apr 1, 2019 at 4:35 PM Michal Privoznik <mprivozn@xxxxxxxxxx> wrote:
>
> On 4/1/19 4:25 PM, Christian Ehrhardt wrote:
> > Hi,
> > I happened to analyze a bug [1] report I got from a friend, and for
> > quite a while it was rather elusive. But I now finally got it
> > reproducible [2] enough to share it with the community.
> >
> > The TL;DR of what I see is:
> > - an automation using python-libvirt gets a SIGINT
> > - its cleanup runs destroy and then undefine
> > - the guest closes FDs due to the SIGINT and/or the destroy, which
> >   triggers daemonStreamHandleAbort
> > - those two fight over the lock
> >
> > There I get libvirtd into a deadlock which ends up with all threads
> > stuck [4], with two of them fighting over the locks [3] in particular
> > (details in the links).
> >
> > The two related stacks, summarized, look like:
> >
> > daemonStreamHandleWrite (failing to write)
> >   -> daemonStreamHandleAbort (closing things and cleaning up)
> >     -> ... virChrdevFDStreamCloseCb
> >          virMutexLock(&priv->devs->lock);
> >
> > # there is code meant to avoid such issues, emitting "Unable to close"
> > # if a lock is held, but the log doesn't show it triggering even with
> > # debug enabled
> >
> > # frame #10 seems triggered via an "undefine" call
> > remoteDispatchDomainUndefine
> >   ... -> virChrdevFree
> >   ... -> virFDStreamSetInternalCloseCb
> >       -> virObjectLock(virFDStreamDataPtr fdst)
> >         -> virMutexLock(&obj->lock);
> > # closing all streams of a guest (requiring the same locks)
> >
> > While that already feels quite close, I struggle to see where exactly
> > we'd want to fix it.
> > But finally having a repro script [2], I hope that someone else here
> > might be able to help me with that.
> >
> > After all it is a race - on my s390x system it usually triggers in
> > fewer than 5 tries, while on x86 I have needed up to 18 runs of the
> > test to hit the hang. Depending on your system config it might be
> > better or worse for you.
> >
> > FYI we initially hit this with libvirt 4.0, but libvirt 5.0 behaves
> > just the same. I haven't built 5.1 or a recent master, but the commits
> > since 5.0 don't mention any issue that seems related. OTOH I'm willing
> > and able to build and try suggestions if anyone comes up with ideas.
> >
> > [1]: https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1822096
> > [2]: https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1822096/+attachment/5251655/+files/test4.py
> > [3]: https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1822096/comments/3
> > [4]: https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1822096/comments/17
> >
>
> You may want to look at d63c82df8b11b583dec8e72dfb216d8c14783876
> (contained in 5.1.0) because this smells like the issue you're facing.

Thanks Michal,
I agree that this appears to be similar.
Unfortunately, 5.0 plus the full 9-patch series leading into d63c82df
still triggers the deadlock that we found.
So it seems to be a new issue :-/

As I said before, any further suggestions (on commits to test and/or how
to resolve this with new changes) are welcome.
Thanks in advance!

> Michal

--
Christian Ehrhardt
Software Engineer, Ubuntu Server
Canonical Ltd

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list
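
For anyone who wants to see the failure mode in isolation: the two stacks
above amount to a classic ABBA lock-order inversion. Below is a minimal,
standalone C sketch of that pattern; the names (devs_lock, stream_lock,
stream_abort, chrdev_free) are illustrative stand-ins rather than libvirt
code, and the sleeps merely widen the race window so the hang reproduces
on every run.

/*
 * Minimal sketch of the ABBA lock-order inversion described above.
 * These names are illustrative stand-ins, NOT libvirt code.
 *
 * Thread A mimics the stream-error path: it holds the stream lock and
 * then wants the device-hash lock.  Thread B mimics the undefine /
 * chrdev-free path: it holds the device-hash lock and then wants the
 * stream lock.  Once both have taken their first lock, neither can
 * make progress.
 *
 * Build: cc -pthread abba-sketch.c -o abba-sketch
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t devs_lock = PTHREAD_MUTEX_INITIALIZER;   /* cf. priv->devs->lock */
static pthread_mutex_t stream_lock = PTHREAD_MUTEX_INITIALIZER; /* cf. the fdst object lock */

/* Thread A: write fails -> abort the stream -> run the close callback. */
static void *stream_abort(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&stream_lock);   /* holds the stream */
    sleep(1);                           /* widen the race window */
    printf("A: holding stream_lock, waiting for devs_lock\n");
    pthread_mutex_lock(&devs_lock);     /* blocks forever: B holds it */
    pthread_mutex_unlock(&devs_lock);
    pthread_mutex_unlock(&stream_lock);
    return NULL;
}

/* Thread B: undefine -> free all chrdevs -> close each open stream. */
static void *chrdev_free(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&devs_lock);     /* holds the device hash */
    sleep(1);
    printf("B: holding devs_lock, waiting for stream_lock\n");
    pthread_mutex_lock(&stream_lock);   /* blocks forever: A holds it */
    pthread_mutex_unlock(&stream_lock);
    pthread_mutex_unlock(&devs_lock);
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, stream_abort, NULL);
    pthread_create(&b, NULL, chrdev_free, NULL);
    pthread_join(a, NULL);              /* never returns once both threads block */
    pthread_join(b, NULL);
    return 0;
}

The usual ways out of this class of bug are to agree on a single lock
order for both paths, or to drop the first lock before taking the second
(re-checking state afterwards); which of those fits the libvirt code here
is exactly the open question in the thread above.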