On Mon, Dec 13, 2021 at 09:42:04AM -0700, Jim Fehlig wrote:
> On 12/12/21 14:15, Martin Kletzander wrote:
>> On Sun, Dec 12, 2021 at 10:40:46AM -0700, Jim Fehlig wrote:
>>> On 12/11/21 03:28, Martin Kletzander wrote:
>>>> On Sat, Dec 11, 2021 at 11:16:13AM +0100, Martin Kletzander wrote:
>>>>> On Fri, Dec 10, 2021 at 05:48:03PM -0700, Jim Fehlig wrote:
>>>>>> Hi Martin! I recently received a bug report (sorry, not public) about
>>>>>> simple operations like 'virsh list' hanging when invoked with an
>>>>>> internal test tool. I found this commit to be the culprit.
>>>>
>>>> OK, one more thing though: the fact that pkttyagent is spawned cannot
>>>> cause virsh to hang. If authentication is not required, then it will
>>>> just wait there for a while and then be killed. If authentication *is*
>>>> required, then either you already have an agent running, and that one
>>>> should be used since we're starting pkttyagent with `--fallback`, or
>>>> you do not have any agent running, in which case virsh list would fail
>>>> to connect. Where does virsh hang, what's the backtrace?
>>>
>>> The last scenario you describe appears to be the case. virsh fails to
>>> connect, then gets stuck trying to kill off pkttyagent:
>>>
>>> #0  0x00007f9f07530241 in clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
>>> #1  0x00007f9f07535ad3 in nanosleep () from /lib64/libc.so.6
>>> #2  0x00007f9f07f478af in g_usleep () from /usr/lib64/libglib-2.0.so.0
>>> #3  0x00007f9f086694fa in virProcessAbort (pid=367) at ../src/util/virprocess.c:187
>>> #4  0x00007f9f0861ed9b in virCommandAbort (cmd=cmd@entry=0x55a798660c50) at ../src/util/vircommand.c:2774
>>> #5  0x00007f9f08621478 in virCommandFree (cmd=0x55a798660c50) at ../src/util/vircommand.c:3061
>>> #6  0x00007f9f08668581 in virPolkitAgentDestroy (agent=0x55a7986426e0) at ../src/util/virpolkit.c:164
>>> #7  0x000055a797836d93 in virshConnect (ctl=ctl@entry=0x7ffc551dd980, uri=0x0, readonly=readonly@entry=false) at ../tools/virsh.c:187
>>> #8  0x000055a797837007 in virshReconnect (ctl=ctl@entry=0x7ffc551dd980, name=name@entry=0x0, readonly=<optimized out>, readonly@entry=false, force=force@entry=false) at ../tools/virsh.c:223
>>> #9  0x000055a7978371e0 in virshConnectionHandler (ctl=0x7ffc551dd980) at ../tools/virsh.c:325
>>> #10 0x000055a797880172 in vshCommandRun (ctl=ctl@entry=0x7ffc551dd980, cmd=0x55a79865f580) at ../tools/vsh.c:1308
>>> #11 0x000055a7978367b7 in main (argc=2, argv=<optimized out>) at ../tools/virsh.c:907
>>>
>>> Odd thing is, I attached gdb to this virsh process several minutes
>>> after invoking the test tool that calls 'virsh list'. I can't explain
>>> why the process is still blocked in g_usleep, which should only have
>>> slept for 10 milliseconds. Even odder, detaching from the process
>>> appears to awaken g_usleep and allows process shutdown to continue.
>>> The oddness can also be seen in the debug output:
>>>
>>> 2021-12-12 16:35:38.783+0000: 5912: debug : virCommandRunAsync:2629 : About to run /usr/bin/pkttyagent --process 5912 --notify-fd 4 --fallback
>>> 2021-12-12 16:35:38.787+0000: 5912: debug : virCommandRunAsync:2632 : Command result 0, with PID 5914
>>> ...
>>> 2021-12-12 16:35:38.830+0000: 5912: debug : virProcessAbort:177 : aborting child process 5914
>>> 2021-12-12 16:35:38.830+0000: 5912: debug : virProcessAbort:185 : trying SIGTERM to child process 5914
>>>
>>> Attach gdb to the process, observe above backtrace, quit gdb.
>>>
>>> 2021-12-12 16:44:18.059+0000: 5912: debug : virProcessAbort:195 : trying SIGKILL to child process 5914
>>> 2021-12-12 16:44:18.061+0000: 5912: debug : virProcessAbort:201 : process has ended: fatal signal 9
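For anyone following along, the abort path in frame #3 (virProcessAbort)
boils down to: send SIGTERM, sleep briefly, and fall back to SIGKILL if the
child has not exited, which matches the "trying SIGTERM" / "trying SIGKILL"
lines in the log. A minimal sketch of that escalation (simplified, not the
actual virprocess.c code):

  #include <signal.h>
  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  /* Ask the child to exit with SIGTERM, give it a short grace period
   * (the 10ms sleep that shows up stuck in the backtrace), then fall
   * back to SIGKILL. */
  static void
  abort_child(pid_t pid)
  {
      int status;

      kill(pid, SIGTERM);

      for (int i = 0; i < 10; i++) {
          usleep(10 * 1000);                        /* 10ms between checks */
          if (waitpid(pid, &status, WNOHANG) == pid)
              return;                               /* exited after SIGTERM */
      }

      kill(pid, SIGKILL);
      waitpid(pid, &status, 0);   /* "process has ended: fatal signal 9" */
  }

  int
  main(void)
  {
      pid_t child = fork();

      if (child == 0) {           /* stand-in for the pkttyagent child */
          pause();
          _exit(0);
      }

      abort_child(child);
      return 0;
  }

Nothing in that sequence should block for more than a fraction of a second,
which is why the stuck g_usleep() discussed below looks so suspicious.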
>> The stuck g_usleep() is weird. Isn't there a tremendous load on the
>> machine? I can't think of much else.
>
> The machine is mostly idle.
>
>> Also I am looking at the pkttyagent code and it looks like it blocks
>> the first SIGTERM, and sending two of them should help in that case,
>> but if we want to wait a few ms between them, then we'll be in the
>> same pickle. Anyway, does just adding
>>
>>     if (!isatty(STDIN_FILENO))
>>         return false;
>>
>> help?
>
> This indeed fixes the regression in the test tool.
>
>> That just means that it won't start the agent. Let's do this, but I
>> would really, *really* like to figure out what the heck is happening
>> there, because there has to be something wrong and it might just be
>> waiting around the corner to bite us in a year or so. Although I
>> understand how improbable that is.
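To make the suggested guard concrete, here is a standalone sketch. The
helper name is made up (in libvirt the check would live in whatever
virpolkit.c routine decides whether virsh spawns pkttyagent), but the
isatty(STDIN_FILENO) test is exactly the proposed one:

  #include <stdbool.h>
  #include <stdio.h>
  #include <unistd.h>

  /* Illustrative helper: skip spawning pkttyagent entirely when stdin
   * is not a terminal (virsh driven by a test harness, a pipe, cron,
   * ...), since there is nothing the agent could prompt on anyway. */
  static bool
  polkit_tty_agent_usable(void)
  {
      if (!isatty(STDIN_FILENO))
          return false;

      return true;
  }

  int
  main(void)
  {
      printf("would spawn pkttyagent: %s\n",
             polkit_tty_agent_usable() ? "yes" : "no");
      return 0;
  }

Running it once from an interactive shell and once with stdin redirected
from /dev/null shows the two outcomes, which is exactly the difference
between a developer's terminal and the test tool that hit the hang.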
> Do you have additional suggestions that may help us gain a better
> understanding of the problem?

Unfortunately I cannot think of anything other than trying to figure out
whether the kernel is scheduling that task at all and, if not, why. How to
do that, however, is unfortunately out of my territory :(
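One crude way to watch that from userspace (a sketch, not something that
was tried in this thread) is to poll /proc/<pid>/status of the stuck virsh
process: if State: stays S (sleeping) and the
voluntary_ctxt_switches/nonvoluntary_ctxt_switches counters never move, the
kernel is not waking the task at all.

  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  /* Print the scheduling-related lines of /proc/<pid>/status once per
   * second; stop with Ctrl-C.  Frozen counters mean the task is never
   * being scheduled. */
  int
  main(int argc, char **argv)
  {
      char path[64];
      char line[256];

      if (argc != 2) {
          fprintf(stderr, "usage: %s <pid>\n", argv[0]);
          return 1;
      }

      snprintf(path, sizeof(path), "/proc/%s/status", argv[1]);

      for (;;) {
          FILE *fp = fopen(path, "r");

          if (!fp) {
              perror(path);
              return 1;
          }

          while (fgets(line, sizeof(line), fp)) {
              if (strncmp(line, "State:", 6) == 0 ||
                  strstr(line, "ctxt_switches"))
                  fputs(line, stdout);
          }

          fclose(fp);
          puts("---");
          sleep(1);
      }
  }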
> Regards,
> Jim