On Mon, 4 May 2020 at 16:02, Mark Bannister <mbannister@xxxxxxxxxxxxxx> wrote:
>
> ...
>
> When you say the session slice is taking time to clean up, please
> excuse my ignorance as I'm not really up to speed on how session
> slices are being managed by systemd, but why would it matter if a
> slice takes time to clean up? Is there a limit on how many SSH
> session slices a user can have or something? I don't see how that
> would cause this particular error.
>

"Session slice" was poorly worded, I only realize that now. I should
have said user slice. This is the cgroup hierarchy after a few logins:

user-UID.slice
 `- user@UID.service
 `- session-1.scope
 `- ...
 `- session-N.scope

I see that CentOS 7 doesn't really support systemd --user, so
user@UID.service isn't present on your system, but I'll include it for
the complete picture anyway. You can have multiple session scopes, one
per login.

Unless linger is enabled, when the last session goes away,
systemd-logind will GC the user object it keeps: it stops the
session-N.scope(s), user@UID.service and user-UID.slice, and then frees
the internal bookkeeping structures. That means stop jobs are enqueued
for the session scopes, user-UID.slice and user@UID.service, and since
session scopes and user@UID.service are ordered after their parent
slice, they are stopped first; user-UID.slice waits for them to stop
before its own stop job is run by systemd.

In your case, it could be that some process running under
session-N.scope is not responding to SIGTERM, and this in turn is
making session-N.scope take time to stop (it has a default timeout of
90s), because it has to wait for all processes running under it to die
before exiting. This also means that your user-UID.slice is waiting
for session-N.scope before being stopped.
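To make the "not responding to SIGTERM" case concrete, here is a small
synthetic demonstration (not from your system; the 5s sleep stands in
for the 90s scope timeout) of a process that traps and ignores SIGTERM,
which is exactly the kind of process that keeps a session scope in
"stop running" until systemd falls back to SIGKILL:

```shell
# Synthetic demo: a process that ignores SIGTERM, the kind that keeps
# session-N.scope alive until TimeoutStopSec= expires, since the scope
# must wait for all of its processes to die before it exits.
sh -c 'trap "" TERM; sleep 5' &   # child shell installs an empty TERM trap
pid=$!
sleep 1                           # give it time to install the trap
kill -TERM "$pid"                 # what systemd sends first when stopping a scope
sleep 1
if kill -0 "$pid" 2>/dev/null; then
    echo "still alive after SIGTERM"   # TERM was ignored
fi
kill -KILL "$pid"                 # SIGKILL cannot be trapped; systemd's fallback
wait "$pid" 2>/dev/null || true
```

Running it prints "still alive after SIGTERM". On a real system,
`systemd-cgls /user.slice/user-UID.slice` while the logout is hanging
would show which scope and which processes are still around.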
Now, if you try to log in at this point, v219 has no checks to prevent
this problem:

UNIT            JOB  STATUS
session-N.scope stop running
user-UID.slice  stop waiting

and logind tries to start session-N+1.scope, which needs
user-UID.slice/start. The job mode used is "fail", which means: if a
conflicting queued job that start cannot merge into exists (like stop),
fail the transaction. This is why you see "user-17132.slice has 'stop'
job queued, but 'start' is included in transaction". The transaction
mentioned in this error is the one generated from
session-N+1.scope/start.

The fixes landed in v228 (see
https://github.com/systemd/systemd/pull/1918): the job mode used by
manager_start_unit was changed to "replace", which ends up resolving
the problem because by the time manager_start_scope is called (which
still uses "fail"), user-UID.slice already has a start job waiting,
which cancelled the pending stop job from the previous call. The login
may be delayed, but it does not fail.

The problem you're seeing only happens if you log in again within 90s
of logging out of the last user session (assuming the default timeout).
So, upgrading systemd should solve your problem. If that is not an
option, I am not sure there is much you can do. You'd need to modify
the TimeoutStopSec= value for scopes, which isn't configurable since
the scope is created at runtime by logind, short of changing the
default for PID 1, which affects every other service on the system.

Enabling linger is another solution. It should fix the problem by not
tearing down the user slice on last logout, but it also means the user
slice is created at boot. Though since you don't have systemd --user,
it should not take up any resources (unless I'm missing something),
otherwise it wouldn't be desirable at all, of course. Note that you'd
have to enable it for every user, or allow users to do so through
policykit, if you want this to work for everyone.
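The linger mechanism boils down to a per-user flag file. Here is a
minimal self-contained sketch of that convention, using a temp
directory instead of the real /var/lib/systemd/linger so it runs
anywhere; enable_linger/is_lingering are hypothetical helper names,
not logind functions:

```shell
# Sketch of the linger flag-file convention. The real directory is
# /var/lib/systemd/linger; a temp dir is used so the example is
# self-contained. The helpers below are illustrative, not logind code.
linger_dir=$(mktemp -d)

enable_linger() { touch "${linger_dir}/$1"; }   # ~ loginctl enable-linger USER
is_lingering()  { [ -e "${linger_dir}/$1" ]; }  # the presence check logind makes

enable_linger kkd
if is_lingering kkd; then
    echo "kkd lingers: user slice survives last logout"
fi

rm -r "${linger_dir}"
```

On a real system, `loginctl show-user USER --property=Linger` shows
whether the flag is set.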
Creating a file for each username under /var/lib/systemd/linger (like
/var/lib/systemd/linger/kkd), or allowing users to call loginctl
enable-linger themselves, should suffice.

> ...
>
> ... we also see this message in /var/log/secure:
>
> 2020-05-03T16:09:38.737122-04:00 jupiter sshd[11031]:
> pam_systemd(sshd:session): Failed to create session: Resource deadlock
> avoided

Yes, this is the string for the errno returned from the bus call,
EDEADLOCK, as the transaction is destructive in nature.

> Does this 'Resource deadlock avoided' message from pam_systemd help
> identify the root cause, or is that just a side-effect?

A side-effect, see above.

--
Kartikeya
_______________________________________________
systemd-devel mailing list
systemd-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/systemd-devel