On Wed, 6 May 2020 at 15:58, Mark Bannister <mbannister@xxxxxxxxxxxxxx> wrote: > > I repeated the exact same test with a different user account, one > set-up specifically for this test and nothing else, so there was only > one SSH session. But 'systemctl list-jobs' always reported 'No jobs > running' even after the SSH session was terminated. My script kept > running with the parent PID of 1. Thanks for your help so far, but I > have still been unable to reproduce the problem. What am I doing > wrong? > Alright, so I dug into this, and managed to reproduce it at last. This issue is NOT reproducible with the version you reported originally. The stock CentOS image on my cloud provider comes with the same version (219-67.el7_7.4) that you said you see this issue with, but that can't be, as it doesn't include this change: https://github.com/systemd-rhel/rhel-7/commit/0600681f04e3818282a2d518ec3e6afee85f7978, unless I'm missing something. It took me a while to figure this out... This was added in 219-69 (see the tag). I guess you built your copy with some custom patches, which pulled it in. This is the reason essentially why the transaction becomes destructive. Prior to this change, the session scope added a Wants= dependency, instead of Requires=, on user-UID.slice, so the conflicting start job could simply be *dropped* instead of failing the transaction on a login attempt, but the change upgrades Slice= deps to Requires=, so they start failing when stop is waiting and start is included in a newly generated transaction. You can verify this with systemctl show session-N.scope, the 219-67 says Wants=user-UID.slice, and 219-69 and upwards say Requires=user-UID.slice. One key thing I missed was that systemd on CentOS wouldn't clean up the session for the user since KillUserProcesses=no by default, and also the fact that the user slice's stop job is not added until all session scopes finish cleaning up. So the scope is simply abandoned, and should exist until all processes die themselves. So, observing this issue is not easy, it would only trigger if you try to login right when systemd tries to stop the user slice and the job exists in queue (which is a very short window), and it took a lot of tries before I could reproduce it (for a start I manually queued the stop job for the slice with a hung session scope and tried logging in to verify the transaction problem). So this is racy in nature, the sequence of events can be that your processes die, session scopes are gone, and now the manager will try to shut down the user slice, and if you happen to be amidst a login at this moment, pam_systemd's call for CreateSession would conflict with the already waiting stop job. It doesn't matter if KillUserProcesses= is set or not, that just decides if the scopes are cleaned up on logout (they would become inactive when all session processes are gone anyway). It does decide when the user-UID.slice would be stopped, however. The reason you didn't see any jobs was that the scope was abandoned, and your long-lived process pinned it. So the session scope should never stop, as a consequence neither the slice. Here are a series of steps to widen the race (by queuing a stop job for user slice) and reproduce it: 1. Create a script badproc.sh; chmod +x badproc.sh #!/bin/bash trap "" SIGHUP trap "" SIGINT trap "" SIGTERM exec "$@" 2. From user's shell, do: ./badproc.sh sleep infinity &; disown; ^D (logout) The scope is now abandoned, or being stopped (if KillUserProcesses=yes) 3. From root's shell, do: systemctl --no-block stop session-N.scope; systemctl --no-block stop user-UID.slice; systemctl list-jobs should list both jobs, the scope job as running, and the slice job as waiting. The sequence that actually happens is processes dying, (last) session scope stopping, user slice stopping, in order. The point at which it fails is the user slice stopping, which is a very very short time window, so it makes sense if you only see these sporadically. 4. Try a login, which gives me: May 06 17:03:12 systemd[1]: Requested transaction contradicts existing jobs: Transaction is destructive. May 06 17:03:12 sshd[3134]: pam_systemd(sshd:session): Failed to create session: Resource deadlock avoided May 06 17:03:12 systemd-logind[3078]: Failed to start session scope session-37.scope: Transaction is destructive. All in all, you can't do much about these errors, they're mostly harmless (at worst you need to retry the login), and upgrading to v228 (the patchset mentioned in the previous exchange) or higher should fix all of this for you. -- Kartikeya _______________________________________________ systemd-devel mailing list systemd-devel@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/systemd-devel