Re: unable to attach pid to service delegated directory in unified mode after restart

Felip Moll <felip@xxxxxxxxxxx> · Wed, 16 Mar 2022 16:15:23 +0100

On Tue, Mar 15, 2022 at 5:24 PM Michal Koutný <mkoutny@xxxxxxxx> wrote:
> On Tue, Mar 15, 2022 at 04:35:12PM +0100, Felip Moll <felip@xxxxxxxxxxx> wrote:
> > Meaning that it would be great to have a delegated cgroup subtree without
> > the need of a service or scope.
> > Just an empty subtree.
>
> It looks appealing to add Delegate= directive to slice units.
> Firstly, that'd prevent the use of the slice by anything systemd.
> Then some notion of owner of that subtree would have to be defined (if
> only for cleanup).
> That owner would be a process -- bang, you created a service with
> delegation or a scope with "keepalive" process.

Correct, this is how the current systemd design works.
But... what if the concept of owner was irrelevant? What if we could just tell systemd: hey, give me /sys/fs/cgroup/mysubdir and never touch it, or the PIDs residing in it, in any way?

> (The above is slightly misleading) there could be an alternative of
> something like RemainAfterExit=yes for scopes, i.e. such scopes would
> not be stopped after last process exiting (but systemd would still be in
> charge of cleaning the cgroup after explicit stop request and that'd
> also mark the scope as truly stopped).
> Such a recycled scope would only be useful via
> org.freedesktop.systemd1.Manager.AttachProcessesToUnit().

This is also a good idea.
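
To make the call above concrete, here is a rough sketch of how a process could be attached to an existing unit from the outside. It only builds and runs a busctl command line for AttachProcessesToUnit (signature "ssau": unit name, sub-cgroup, array of PIDs). The helper names and the unit name in the usage note are made up for illustration, and this is not what Slurm does today.

```python
import subprocess


def attach_argv(unit, pids, subcgroup=""):
    """Build a busctl command line calling AttachProcessesToUnit(s, s, au)."""
    argv = [
        "busctl", "call",
        "org.freedesktop.systemd1",
        "/org/freedesktop/systemd1",
        "org.freedesktop.systemd1.Manager",
        "AttachProcessesToUnit",
        # busctl encodes an array as its length followed by the elements.
        "ssau", unit, subcgroup, str(len(pids)),
    ]
    argv += [str(pid) for pid in pids]
    return argv


def attach(unit, pids, subcgroup=""):
    """Hypothetical helper: run the call and raise if busctl reports failure."""
    subprocess.run(attach_argv(unit, pids, subcgroup), check=True)
```

For example, attach("example.scope", [1234]) would ask systemd to move PID 1234 into the (hypothetical) unit example.scope; it needs sufficient privileges and a unit that still exists.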

> BTW I'm also wondering how do you detect a job finishing in the case
> original parent is gone (due to main service restart) and job's main
> process reparented?

slurmstepd connects to slurmd through a socket and sends an RPC.
If slurmd is gone, slurmstepd (the child) retries the RPC and stays around until slurmd appears again and responds.
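
The retry behaviour can be sketched roughly like this. It is only an illustration of the idea, not the slurmstepd code: the Unix stream socket, the function name, the fixed delay and the max_tries limit are all assumptions of mine.

```python
import socket
import time


def send_rpc_with_retry(path, payload, delay=1.0, max_tries=None):
    """Send payload to a Unix socket, retrying until the peer answers.

    With max_tries=None this loops forever, which mirrors a child that
    stays alive until the daemon comes back.
    """
    tries = 0
    while True:
        tries += 1
        try:
            with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
                s.connect(path)
                s.sendall(payload)
                s.shutdown(socket.SHUT_WR)  # tell the peer the request is complete
                return s.recv(4096)
        except OSError:
            # The daemon is not listening (yet): give up only if a limit was set.
            if max_tries is not None and tries >= max_tries:
                raise
            time.sleep(delay)
```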

The main process doesn't wait for its children; instead we do a double fork so that the child is reparented to init (PID 1).
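
A minimal sketch of the double fork, for reference. The function name and the pipe used to report the grandchild's PID are only for illustration. Note that on Linux the orphan goes to the nearest child subreaper if one is set, otherwise to PID 1.

```python
import os


def spawn_detached(child_main):
    """Run child_main() in a grandchild that the caller never has to wait for."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Intermediate child: fork once more, then exit immediately so the
        # grandchild is orphaned and reparented.
        os.close(r)
        if os.fork() == 0:
            os.write(w, str(os.getpid()).encode())
            os.close(w)
            child_main()
            os._exit(0)
        os._exit(0)
    # Original process: reap only the short-lived intermediate child.
    os.close(w)
    os.waitpid(pid, 0)
    grandchild = int(os.read(r, 32))
    os.close(r)
    return grandchild
```

The caller gets the grandchild's PID back but is not its parent, so it cannot waitpid() on it; that is why the step has to report completion through the RPC instead.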

> BTW 2 You didn't like having a scope for each job. Is it because of the
> setup time (IOW jobs are short-lived) or persistent scopes overhead (too
> many units, PID1 scalability)?

It is not that I didn't like it. I observed a delay in step creation (forking slurmstepd), because after sending an async D-Bus message the stepd had to wait for the systemd job to be executed, and that can take time: computationally a lot more than a plain mkdir in the cgroup subtree. As an example, 'srun hostname' starts a job that runs hostname. The response is instantaneous with mkdir, but takes almost 1 second with a call to systemd through D-Bus.

Slurm is used for HPC, but also for HTC (High Throughput Computing), where hundreds of jobs can be started in a short period of time. So yes, this delay is critical, and not only because jobs can be short-lived: there can also be a massive job finish plus job start at the same time. I ran just one test of our regression suite and the responsiveness of 'systemctl list-unit-files' was compromised. From a sysadmin's point of view this was not ideal either, so, as you say, scalability of PID1 is also a concern.

This is why I will not be using one scope per job; I prefer the other solution of having a single scope with Delegate=yes.
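
For comparison, the mkdir path inside a single delegated scope boils down to something like the sketch below. The function name and the directory layout are assumptions for illustration; the real cgroup path depends on where systemd places the delegated scope.

```python
import os


def place_in_subcgroup(delegated_root, name, pid):
    """Create a child cgroup under a delegated subtree and move pid into it."""
    path = os.path.join(delegated_root, name)
    # On cgroup v2, creating the directory is what creates the cgroup.
    os.makedirs(path, exist_ok=True)
    # Writing a PID to cgroup.procs migrates that process into the cgroup.
    with open(os.path.join(path, "cgroup.procs"), "w") as f:
        f.write(str(pid))
    return path
```

This is two filesystem operations and no round trip through PID 1, which is where the difference in latency comes from.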

Does it make sense?