Hi folks, I wanted to keep the case as generic as possible, but I think at this point it is important to explain what we're talking about, so let me clarify the case I am dealing with at the moment.
At SchedMD, we want Slurm to support cgroup v2. As you may know, Slurm is an HPC resource manager, and for the moment we are limited to cgroup v1, where we use the freezer, memory, cpuset, cpuacct and devices controllers. We think it is a good time to add a plugin that makes Slurm capable of running on unified-hierarchy systems, and since systemd is widely used we want this integration to coexist with systemd as well as we can, without getting our pids moved or making systemd mad.
We have a 'slurmd' daemon running on every compute node, waiting for communications from the controller. The controller sends different kinds of RPCs to slurmd, and one of them can instruct slurmd to start a new job step for a specific uid. Slurmd then forks twice; the original slurmd process simply goes back to other work. The first fork (the child) sets up a bunch of pipes and prepares initialization data, then forks again, generating a grandchild. The grandchild finally execs the slurmstepd daemon, which receives the initialization data, prepares the cgroups, and finally fork+execs the user software. This can happen many times per second, because a user can submit a "job array" which, with one single RPC call, can submit thousands of steps, while thousands of other steps can be finishing at the same time. The work systemd would need to do starting up new scopes/services, stopping them, and monitoring all of it could therefore be considerable.
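As a rough illustration of the launch sequence above, here is a simplified Python sketch. It is not Slurm's actual code (which is C); the function name `launch_stepd` and the second pipe that echoes the data back are made up for the demo, and the grandchild only stands in for the exec of slurmstepd.

```python
import os


def launch_stepd(init_data: bytes) -> bytes:
    """Double fork: child prepares data, grandchild stands in for slurmstepd."""
    init_r, init_w = os.pipe()  # child -> grandchild: initialization data
    done_r, done_w = os.pipe()  # grandchild -> caller: echo, demo only

    child = os.fork()
    if child == 0:
        # First fork (child): set up pipes and init data, then fork again.
        os.close(done_r)
        if os.fork() == 0:
            # Grandchild: real code would exec slurmstepd at this point.
            os.close(init_w)
            with os.fdopen(init_r, "rb") as f:
                data = f.read()  # receive the initialization data
            os.write(done_w, data)
            os._exit(0)
        os.close(init_r)
        os.close(done_w)
        os.write(init_w, init_data)
        os.close(init_w)
        os._exit(0)

    # Original process: close unused ends and reap the short-lived child.
    os.close(init_r)
    os.close(init_w)
    os.close(done_w)
    os.waitpid(child, 0)
    # Demo only: the real daemon would go straight back to serving RPCs.
    with os.fdopen(done_r, "rb") as f:
        return f.read()
```

The point of the sketch is that the original daemon never becomes the parent of the long-lived process, so it can return to its RPC loop right after reaping the intermediate child.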
After this introduction, I have to say that we successfully managed to follow systemd's rules by just starting slurmd from a unit file with Delegate=yes and creating our own hierarchy inside. Every slurmstepd is forked and started in the delegated cgroup, creates its own directory there, and moves itself to where it belongs (always inside the delegated cgroup), according to our needs. Everything ran smoothly until I restarted slurmd while slurmstepds were still running in the cgroup: systemd was unable to start slurmd again because the cgroup had not been deleted, since it was still busy with directories and slurmstepds. That is the main reason for this bug.
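To make the "create a directory and move itself" step concrete, here is a minimal Python sketch assuming a cgroup v2 layout, where a process joins a cgroup by writing its pid to that cgroup's `cgroup.procs` file. The function name, the `job_1/step_0` naming, and the example path in the comment are hypothetical and not Slurm's actual layout.

```python
import os
from pathlib import Path


def join_step_cgroup(delegated_root: str, step_name: str, pid: int) -> Path:
    """Create a sub-cgroup under the delegated root and move pid into it."""
    step_dir = Path(delegated_root) / step_name
    # On cgroupfs, mkdir creates the new cgroup (and any missing parents).
    step_dir.mkdir(parents=True, exist_ok=True)
    # Writing a pid to cgroup.procs migrates that process into the cgroup.
    (step_dir / "cgroup.procs").write_text(f"{pid}\n")
    return step_dir


# Hypothetical usage from inside a freshly started step daemon:
# join_step_cgroup("/sys/fs/cgroup/system.slice/slurmd.service",
#                  "job_1/step_0", os.getpid())
```

Because every such directory lives below the service's own cgroup, the service cgroup stays populated for as long as any step is alive, which is exactly what gets in the way on restart.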
Note that one feature of Slurm is that one can upgrade/restart slurmd without affecting running jobs (slurmstepds) in the compute node.
I have read and studied all your suggestions and I understand them.
I also ran some performance tests in which I fork+exec'ed systemd-run to launch a service for every step, and overall performance was bad.
One of our QA tests (test 9.8 of our testsuite) shows a 3x decrease in performance.
On the positive side, we ran a test in which slurmd, at startup, manually fork+execs one separate service with Delegate=yes, and we then moved newly forked slurmstepd pids *manually* into the cgroup associated with that service. The service runs 'sleep infinity' as its main pid, so that the cgroup does not disappear even when no slurmstepds are running. As I said, this is a dirty test, but it works.
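For reference, the keeper service from that test could be started with something like the command built below. This is only a sketch: the unit name `slurmstepd-keeper.service` and the helper name are made up here, and the exact command we ran in the test may have differed. `--unit=` and `--property=` are regular systemd-run options.

```python
from typing import List


def keeper_service_argv(unit: str = "slurmstepd-keeper.service") -> List[str]:
    """Build a systemd-run command for a delegated placeholder service."""
    return [
        "systemd-run",
        f"--unit={unit}",            # hypothetical unit name
        "--property=Delegate=yes",   # leave the sub-hierarchy to us
        "sleep", "infinity",         # main pid that keeps the cgroup alive
    ]


# Hypothetical usage at slurmd startup:
# import subprocess
# subprocess.run(keeper_service_argv(), check=True)
```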
After reading your last two emails, I think the most efficient way for us to go is this one:
> Firing an async D-Bus packet to systemd should be hardly measurable.
> But note that you can also run your main service as a service, and
> then allocate a *single* scope unit for *all* your payloads. That way
> you can restart your main service unit independently of the scope
> unit, but you only have to issue a single request once for allocating
> the scope, and not for each of your payloads.
My questions are: where would the scope reside? Does it have an associated cgroup?
If I am a new slurmstepd, can I attach myself to this scope, or must slurmd attach me to it before I am exec'ed?
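To check that I understand the proposal, I believe the single request would be a StartTransientUnit call on the systemd manager object, which could be sketched with busctl as below. The scope name `slurmstepd.scope` and the helper name are hypothetical, and I am assuming the `PIDs` and `Delegate` properties are the right ones to pass for a scope; please correct me if not.

```python
from typing import List


def scope_request_argv(scope: str, pids: List[int]) -> List[str]:
    """Build a busctl call that asks systemd for one delegated scope."""
    return [
        "busctl", "call",
        "org.freedesktop.systemd1",
        "/org/freedesktop/systemd1",
        "org.freedesktop.systemd1.Manager",
        "StartTransientUnit",
        "ssa(sv)a(sa(sv))",          # name, mode, properties, aux units
        scope, "fail",
        "2",                         # two properties follow
        "PIDs", "au", str(len(pids)), *[str(p) for p in pids],
        "Delegate", "b", "true",
        "0",                         # no auxiliary units
    ]


# Hypothetical usage, issued once rather than once per step:
# import os, subprocess
# subprocess.run(scope_request_argv("slurmstepd.scope", [os.getpid()]), check=True)
```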
> But that too means you have to issue a bus call. If you really don't
> like talking to systemd this is not going to work of course, but quite
> frankly, that's a problem you are making yourself, and I am not
> particularly sympathetic to it.
I can study this option. It is not that I like or dislike talking to systemd; the point is that Slurm must work on other OSes, possibly without systemd but still with cgroup v2, and must also remain compatible with cgroup v1 and with no cgroups at all. We are thinking about the future: the less complexity and the fewer special cases the software has, the more maintainable and flexible it is. I think this is understandable, but if it is not possible at all we will have to adapt.
>> DelegateCgroupLeaf=<yes|no>. If set to yes an extra directory will be
>> created into the unit cgroup to place the newly spawned service process.
>> This is useful for services which need to be restarted while its forked
>> pids remain in the cgroup and the service cgroup is not a leaf
>> anymore.
>
> No. Let's not add that.
I could foresee the benefits of such an option, but I can also see the issues from a systemd perspective of having to deal with remaining processes of the unit which is being restarted.
I am not sure why you think it is not OK, though; an option like this would fix all my problems, which are only about restarting slurmd :)
I am also curious about what exactly this sentence means:
"You might break systemd as a whole though (for example, add a process directly to a slice's cgroup and systemd will be very sad).".
Thank you for all your comments.
I hope we can arrive at a good solution that is compliant with systemd and at the same time gives us the flexibility we want.
--
Felip Moll
SchedMD - http://schedmd.com