At the start of this discussion, some months ago, we offered to
co-devel this with Lennart et al. They did not seem keen on the idea.

If they have an established DBUS protocol spec, we should consider
adopting it instead of a new one, but we CAN'T just play follow the
leader and do whatever they do, change whenever they feel like
changing.

It would be best if we could get a common DBUS api specc'ed and all
agree to it. Serge, do you feel up to that?

On Mon, Nov 25, 2013 at 6:18 PM, Michael H. Warfield <mhw@xxxxxxxxxxxx> wrote:
> Serge...
>
> You have no idea how much I dread mentioning this (well, after
> LinuxPlumbers, maybe you can) but... You do realize that some of this
> is EXACTLY what the systemd crowd was talking about there in NOLA back
> then. I sat in those sessions grinding my teeth and listening to
> comments from some others around me about when systemd might subsume
> bash or even vi or quake.
>
> Somehow, you and others have tagged me as a "systemd expert" but I am
> far from it and even you noted that Lennart and I were on the edge of a
> physical discussion when I made some "off the cuff" remarks there about
> systemd design during my talk. I personally rank systemd in the same
> category as NetworkMangler (err, NetworkManager) in its propensity for
> committing inexplicable random acts of terrorism and changing its
> behavior from release to release to release. I'm not a fan and I'm not
> an expert, but I have to be involved with it and watch the damned thing
> like a trapped rat, like it or not.
>
> Like it or not, we cannot go off on divergent designs. As much as they
> have delusions of taking over the Linux world, they are still going to
> be a major factor and this sort of thing needs to be coordinated. We
> are going to need exactly what you are proposing whether we have systemd
> in play or not. IF we CAN kick it to the curb, when we need to, we
> still need to know how to without tearing shit up and breaking shit that
> thinks it's there.
> Ideally, it shouldn't matter if systemd were in play or not.
>
> All I ask is that we not get so far off track that we have a major
> architectural divergence here. The risk is there.
>
> Mike
>
>
> On Mon, 2013-11-25 at 22:43 +0000, Serge E. Hallyn wrote:
>> Hi,
>>
>> As I've mentioned several times, I want to write a standalone cgroup
>> management daemon. Basic requirements are that it be a standalone
>> program; that a single instance running on the host be usable from
>> containers nested at any depth; that it not allow escaping one's
>> assigned limits; that it not allow subjugating tasks which do not
>> belong to you; and that, within your limits, you be able to parcel
>> those limits to your tasks as you like.
>>
>> Additionally, Tejun has specified that we do not want users to be
>> too closely tied to the cgroupfs implementation. Therefore
>> commands will be just a hair more general than specifying cgroupfs
>> filenames and values. I may go so far as to avoid specifying
>> specific controllers, as AFAIK there should be no redundancy in
>> features. On the other hand, I don't want to get too general.
>> So I'm basing the API loosely on the lmctfy command line API.
>>
>> One of the driving goals is to enable nested lxc as simply and safely as
>> possible. If this project is a success, then a large chunk of code can
>> be removed from lxc. I'm considering this project a part of the larger
>> lxc project, but given how central it is to systems management that
>> doesn't mean that I'll consider anyone else's needs as less important
>> than our own.
>>
>> This document consists of two parts. The first describes how I
>> intend the daemon (cgmanager) to be structured and how it will
>> enforce the safety requirements. The second describes the commands
>> which clients will be able to send to the manager. The list of
>> controller keys which can be set is very incomplete at this point,
>> serving mainly to show the approach I was thinking of taking.
>>
>> Summary
>>
>> Each 'host' (identified by a separate instance of the Linux kernel) will
>> have exactly one running daemon to manage control groups. This daemon
>> will answer cgroup management requests over a dbus socket, located at
>> /sys/fs/cgroup/manager. This socket can be bind-mounted into various
>> containers, so that one daemon can support the whole system.
>>
>> Programs will be able to make cgroup requests using dbus calls, or
>> indirectly by linking against lmctfy which will be modified to use the
>> dbus calls if available.
>>
>> Outline:
>>   . A single manager, cgmanager, is started on the host, very early
>>     during boot. It has very few dependencies, and requires only
>>     /proc, /run, and /sys to be mounted, with /etc ro. It will mount
>>     the cgroup hierarchies in a private namespace and set defaults
>>     (clone_children, use_hierarchy, sane_behavior, release_agent?) It
>>     will open a socket at /sys/fs/cgroup/cgmanager (in a small tmpfs).
>>   . A client (requestor 'r') can make cgroup requests over
>>     /sys/fs/cgroup/manager using dbus calls. Detailed privilege
>>     requirements for r are listed below.
>>   . The client request will pertain to an existing or new cgroup A. r's
>>     privilege over the cgroup must be checked. r is said to have
>>     privilege over A if A is owned by r's uid, or if A's owner is mapped
>>     into r's user namespace, and r is root in that user namespace.
>>   . The client request may pertain to a victim task v, which may be moved
>>     to a new cgroup. In that case r's privilege over both the cgroup
>>     and v must be checked. r is said to have privilege over v if v
>>     is mapped in r's pid namespace, v's uid is mapped into r's user ns,
>>     and r is root in its userns. Or if r and v have the same uid
>>     and v is mapped in r's pid namespace.
>>   . r's credentials will be taken from socket's peercred, ensuring that
>>     pid and uid are translated.
>>   . r passes PID(v) as SCM_CREDENTIALS, so that cgmanager receives the
>>     translated global pid.
>>     It will then read UID(v) from /proc/PID(v)/status,
>>     which is the global uid, and check /proc/PID(r)/uid_map to see whether
>>     UID(v) is mapped there.
>>   . dbus-send can be enhanced to send a pid as SCM_CREDENTIALS to have
>>     the kernel translate it for the reader. Only 'move task v to cgroup
>>     A' will require SCM_CREDENTIALS to be sent.
>>
>> Privilege requirements by action:
>>   * Requestor of an action (r) over a socket may only make
>>     changes to cgroups over which it has privilege.
>>   * Requestors may be limited to a certain #/depth of cgroups
>>     (to limit memory usage) - DEFER?
>>   * Cgroup hierarchy is responsible for resource limits
>>   * A requestor must either be uid 0 in its userns with victim mapped
>>     into its userns, or the same uid and in same/ancestor pidns as the
>>     victim
>>   * If r requests creation of cgroup '/x', /x will be interpreted
>>     as relative to r's cgroup. r cannot make changes to cgroups not
>>     under its own current cgroup.
>>   * If r is not in the initial user_ns, then it may not change settings
>>     in its own cgroup, only descendants. (Not strictly necessary -
>>     we could require the use of extra cgroups when wanted, as lxc does
>>     currently)
>>   * If r requests creation of cgroup '/x', it must have write access
>>     to its own cgroup (not strictly necessary)
>>   * If r requests chown of cgroup /x to uid Y, Y is passed in a
>>     ucred over the unix socket, and therefore translated to init
>>     userns.
>>   * If r requests setting a limit under /x, then
>>     . either r must be root in its own userns, and UID(/x) be mapped
>>       into its userns, or else UID(r) == UID(/x)
>>     . /x must not be / (not strictly necessary, all users know to
>>       ensure an extra cgroup layer above '/')
>>     . setns(UIDNS(r)) would not work, due to in-kernel capable() checks
>>       which won't be satisfied. Therefore we'll need to do privilege
>>       checks ourselves, then perform the write as the host root user.
>>       (see devices.allow/deny).
>>       Further we need to support older kernels
>>       which don't support setns for pid.
>>   * If r requests action on victim V, it passes V's pid in a ucred,
>>     so that gets translated.
>>     Daemon will verify that V's uid is mapped into r's userns. Since
>>     r is either root or the same uid as V, it is allowed to classify.
>>
>> The above addresses
>>   * creating cgroups
>>   * chowning cgroups
>>   * setting cgroup limits
>>   * moving tasks into cgroups
>>   . but does not address a 'cgexec <group> -- command' type of behavior.
>>   * To handle that (specifically for upstart), recommend that r do:
>>       if (!pid) {
>>           request_reclassify(cgroup, getpid());
>>           do_execve();
>>       }
>>   . alternatively, the daemon could, if kernel is new enough, setns to
>>     the requestor's namespaces to execute a command in a new cgroup.
>>     The new command would be daemonized to that pid namespace's pid 1.
>>
>> Types of requests:
>>   * r requests creating cgroup A'/A
>>     . lmctfy/cli/commands/create.cc
>>     . Verify that UID(r) mapped to 0 in r's userns
>>     . R=cgroup_of(r)
>>     . Verify that UID(R) is mapped into r's userns
>>     . Create R/A'/A
>>     . chown R/A'/A to UID(r)
>>   * r requests to move task x to cgroup A.
>>     . lmctfy/cli/commands/enter.cc
>>     . r must send PID(x) as ancillary message
>>     . Verify that UID(r) mapped to 0 in r's userns, and UID(x) is mapped into
>>       that userns
>>       (is it safe to allow if UID(x) == UID(r))?
>>     . R=cgroup_of(r)
>>     . Verify that R/A is owned by UID(r) or UID(x)? (not sure that's needed)
>>     . echo PID(x) >> /R/A/tasks
>>   * r requests chown of cgroup A to uid X
>>     . X is passed in ancillary message
>>       * ensures it is valid in r's userns
>>       * maps the userid to host for us
>>     . Verify that UID(r) mapped to 0 in r's userns
>>     . R=cgroup_of(r)
>>     . Chown R/A to X
>>   * r requests setting cgroup A's 'property=value'
>>     . Verify that either
>>       * A != ''
>>       * UID(r) == 0 on host
>>       In other words, r in a userns may not set root cgroup settings.
>>     . Verify that UID(r) mapped to 0 in r's userns
>>     . R=cgroup_of(r)
>>     . Set property=value for R/A
>>       * Expect kernel to guarantee hierarchical constraints
>>   * r requests deletion of cgroup A
>>     . lmctfy/cli/commands/destroy.cc (without -f)
>>     . same requirements as setting 'property=value'
>>   * r requests purge of cgroup A
>>     . lmctfy/cli/commands/destroy.cc (with -f)
>>     . same requirements as setting 'property=value'
>>
>> Long-term we will want the cgroup manager to become more intelligent -
>> to place its own limits on clients, to address cpu and device hotplug,
>> etc. Since we will not be doing that in the first prototype, the daemon
>> will not keep any state about the clients.
>>
>> Client DBus Message API
>>
>>   <name>: a-zA-Z0-9
>>   <name>: "a-zA-Z0-9 "
>>   <controllerlist>: <controller1>[:controllerlist]
>>   <valueentry>: key:value
>>   <valueentry>: frozen
>>   <valueentry>: thawed
>>   <values>: valueentry[:values]
>>   keys:
>>     {memory,swap}.{limit,soft_limit}
>>     cpus_allowed         # set of allowed cpus
>>     cpus_fraction        # % of allowed cpus
>>     cpus_number          # number of allowed cpus
>>     cpu_share_percent    # percent of cpushare
>>     devices_whitelist
>>     devices_blacklist
>>     net_prio_index
>>     net_prio_interface_map
>>     net_classid
>>     hugetlb_limit
>>     blkio_weight
>>     blkio_weight_device
>>     blkio_throttle_{read,write}
>>   readkeys:
>>     devices_list
>>     {memory,swap}.{failcnt,max_use,limitnuma_stat}
>>     hugetlb_max_usage
>>     hugetlb_usage
>>     hugetlb_failcnt
>>     cpuacct_stat
>>     <etc>
>>   Commands:
>>     ListControllers
>>     Create <name> <controllerlist> <values>
>>     Setvalue <name> <values>
>>     Getvalue <name> <readkeys>
>>     ListChildren <name>
>>     ListTasks <name>
>>     ListControllers <name>
>>     Chown <name> <uid>
>>     Chown <name> <uid>:<gid>
>>     Move <pid> <name>    [[ pid is sent as SCM_CREDENTIALS ]]
>>     Delete <name>
>>     Delete-force <name>
>>     Kill <name>
>
>
> --
> Michael H. Warfield (AI4NB) | (770) 978-7061 |  mhw@xxxxxxxxxxxx
>    /\/\|=mhw=|\/\/          | (678) 463-0932 |  http://www.wittsend.com/mhw/
>    NIC whois: MHW9          | An optimist believes we live in the best of all
>  PGP Key: 0x674627FF        | possible worlds. A pessimist is sure of it!
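
For reference, the "privilege over a cgroup" rule from the quoted outline
(A is owned by r's uid, or A's owner is mapped into r's user namespace and
r is root there) could be sketched roughly as follows. This is only an
illustration, not cgmanager code. The function names are hypothetical, and
it assumes the daemon reads /proc/PID(r)/uid_map from the initial user
namespace, so that the second column of each line is a host uid.

```python
from typing import Optional


def host_uid_to_ns(uid_map_text: str, host_uid: int) -> Optional[int]:
    """Translate a host uid into the namespace described by uid_map_text.

    Each uid_map line is "<first uid in ns> <first uid outside> <count>".
    Returns None when the host uid is not mapped into the namespace.
    """
    for line in uid_map_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        ns_start, host_start, count = (int(f) for f in fields)
        if host_start <= host_uid < host_start + count:
            return ns_start + (host_uid - host_start)
    return None


def has_privilege_over_cgroup(r_uid: int, owner_uid: int, r_uid_map: str) -> bool:
    """Hypothetical check: r_uid and owner_uid are host (global) uids."""
    if owner_uid == r_uid:
        return True  # A is owned by r's uid
    r_is_ns_root = host_uid_to_ns(r_uid_map, r_uid) == 0
    owner_is_mapped = host_uid_to_ns(r_uid_map, owner_uid) is not None
    return r_is_ns_root and owner_is_mapped


# Example: a container whose uids 0..65535 map to host uids 100000..165535.
container_map = "0 100000 65536\n"
print(has_privilege_over_cgroup(100000, 100123, container_map))  # True
print(has_privilege_over_cgroup(100123, 100000, container_map))  # False
```

In the example, host uid 100000 is root inside the container and the
cgroup owner 100123 is mapped there, so the first call succeeds. The
second caller is an unprivileged container user, so it fails. A real
daemon would take r's uid from the socket's peer credentials rather than
from the caller.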