On Mon, 2013-11-25 at 21:43 -0500, Stéphane Graber wrote:
> Haha,
>
> I was wondering how long it'd take before we got the first comment about
> systemd's own cgroup manager :)
>
> To try and keep this short, there are a lot of cases where systemd's
> plan of having an in-pid1 manager, as practical as it is for them, just
> isn't going to work for us.
>
> I believe our design makes things a bit cleaner by not having it tied to
> any specific init system or feature and by having a relatively low level,
> very simple API that people can use as a building block for anything
> that wants to manage cgroups.
>
> At this point in time, there's no hard limitation for having one or more
> processes writing to the cgroup hierarchy, as much as some people may
> want this to change. I very much doubt it'll happen any time soon and
> until then, even if not perfectly adequate, there won't be any problem
> running both systemd's manager and our own.
>
> There's also the possibility, if someone felt sufficiently strongly about
> this to contribute patches, to have our manager talk to systemd's if
> present and go through their manager instead of accessing cgroupfs
> itself. That's assuming systemd offers a sufficiently low level API that
> could be used for that without bringing an unreasonable amount of
> dependencies to our code.
>
>
> I don't want this thread to turn into some kind of flamewar or similarly
> overheated discussion about systemd vs everyone else, so I'll just state
> that from my point of view (and I suspect that of the group who worked
> on this early draft), systemd's manager, while perfect for grouping and
> resource allocation for systemd units and user sessions, doesn't quite
> fit our bill with regard to supporting multiple levels of full
> distro-agnostic containers using nesting and mixing user namespaces.
> It also has what as a non-systemd person I consider a big drawback of
> being built into an init system which quite a few major distributions
> don't use (specifically those distros that account for the majority of
> LXC's users).
>
> I think there's room for two implementations and competition (even if we
> have slightly different goals) is a good thing and will undoubtedly help
> both projects consider use cases they didn't think of, leading to a better
> solution for everyone. And if some day one of the two wins or we can
> somehow converge into a solution that works for everyone, that'd be
> great. But our discussions at Linux Plumbers and other conferences have
> shown that this isn't going to happen now, so it's best to stop arguing
> and instead get some stuff done.

Concur.

And, as you know, I'm not a fan or supporter of that camp. I just want
to make sure everyone is aware of all the gorillas in the room before the
fecal flakes hit the rapidly whirling blades.

That being said, I think this is a laudable goal. If we do it right, it
may well become the standard they have to adhere to.

Regards,
Mike

> On Mon, Nov 25, 2013 at 09:18:04PM -0500, Michael H. Warfield wrote:
> > Serge...
> >
> > You have no idea how much I dread mentioning this (well, after
> > LinuxPlumbers, maybe you can) but... You do realize that some of this
> > is EXACTLY what the systemd crowd was talking about there in NOLA back
> > then. I sat in those sessions grinding my teeth and listening to
> > comments from some others around me about when systemd might subsume
> > bash or even vi or quake.
> >
> > Somehow, you and others have tagged me as a "systemd expert" but I am
> > far from it and even you noted that Lennart and I were on the edge of a
> > physical discussion when I made some "off the cuff" remarks there about
> > systemd design during my talk.
> > I personally rank systemd in the same
> > category as NetworkMangler (err, NetworkManager) in its propensity for
> > committing inexplicable random acts of terrorism and changing its
> > behavior from release to release to release. I'm not a fan and I'm not
> > an expert, but I have to be involved with it and watch the damned thing
> > like a trapped rat, like it or not.
> >
> > Like it or not, we cannot go off on divergent designs. As much as they
> > have delusions of taking over the Linux world, they are still going to
> > be a major factor and this sort of thing needs to be coordinated. We
> > are going to need exactly what you are proposing whether we have systemd
> > in play or not. IF we CAN kick it to the curb, when we need to, we
> > still need to know how to without tearing shit up and breaking shit that
> > thinks it's there. Ideally, it shouldn't matter if systemd were in
> > play or not.
> >
> > All I ask is that we not get so far off track that we have a major
> > architectural divergence here. The risk is there.
> >
> > Mike
> >
> >
> > On Mon, 2013-11-25 at 22:43 +0000, Serge E. Hallyn wrote:
> > > Hi,
> > >
> > > As I've mentioned several times, I want to write a standalone cgroup
> > > management daemon. Basic requirements are that it be a standalone
> > > program; that a single instance running on the host be usable from
> > > containers nested at any depth; that it not allow escaping one's
> > > assigned limits; that it not allow subjugating tasks which do not
> > > belong to you; and that, within your limits, you be able to parcel
> > > those limits to your tasks as you like.
> > >
> > > Additionally, Tejun has specified that we do not want users to be
> > > too closely tied to the cgroupfs implementation. Therefore
> > > commands will be just a hair more general than specifying cgroupfs
> > > filenames and values. I may go so far as to avoid specifying
> > > specific controllers, as AFAIK there should be no redundancy in
> > > features.
> > > On the other hand, I don't want to get too general.
> > > So I'm basing the API loosely on the lmctfy command line API.
> > >
> > > One of the driving goals is to enable nested lxc as simply and safely as
> > > possible. If this project is a success, then a large chunk of code can
> > > be removed from lxc. I'm considering this project a part of the larger
> > > lxc project, but given how central it is to systems management that
> > > doesn't mean that I'll consider anyone else's needs as less important
> > > than our own.
> > >
> > > This document consists of two parts. The first describes how I
> > > intend the daemon (cgmanager) to be structured and how it will
> > > enforce the safety requirements. The second describes the commands
> > > which clients will be able to send to the manager. The list of
> > > controller keys which can be set is very incomplete at this point,
> > > serving mainly to show the approach I was thinking of taking.
> > >
> > > Summary
> > >
> > > Each 'host' (identified by a separate instance of the linux kernel) will
> > > have exactly one running daemon to manage control groups. This daemon
> > > will answer cgroup management requests over a dbus socket, located at
> > > /sys/fs/cgroup/manager. This socket can be bind-mounted into various
> > > containers, so that one daemon can support the whole system.
> > >
> > > Programs will be able to make cgroup requests using dbus calls, or
> > > indirectly by linking against lmctfy which will be modified to use the
> > > dbus calls if available.
> > >
> > > Outline:
> > >   . A single manager, cgmanager, is started on the host, very early
> > >     during boot. It has very few dependencies, and requires only
> > >     /proc, /run, and /sys to be mounted, with /etc ro. It will mount
> > >     the cgroup hierarchies in a private namespace and set defaults
> > >     (clone_children, use_hierarchy, sane_behavior, release_agent?) It
> > >     will open a socket at /sys/fs/cgroup/cgmanager (in a small tmpfs).
> > >   . A client (requestor 'r') can make cgroup requests over
> > >     /sys/fs/cgroup/manager using dbus calls. Detailed privilege
> > >     requirements for r are listed below.
> > >   . The client request will pertain to an existing or new cgroup A. r's
> > >     privilege over the cgroup must be checked. r is said to have
> > >     privilege over A if A is owned by r's uid, or if A's owner is mapped
> > >     into r's user namespace, and r is root in that user namespace.
> > >   . The client request may pertain to a victim task v, which may be moved
> > >     to a new cgroup. In that case r's privilege over both the cgroup
> > >     and v must be checked. r is said to have privilege over v if v
> > >     is mapped in r's pid namespace, v's uid is mapped into r's user ns,
> > >     and r is root in its userns. Or if r and v have the same uid
> > >     and v is mapped in r's pid namespace.
> > >   . r's credentials will be taken from socket's peercred, ensuring that
> > >     pid and uid are translated.
> > >   . r passes PID(v) in an SCM_CREDENTIALS message, so that cgmanager
> > >     receives the translated global pid. It will then read UID(v) from
> > >     /proc/PID(v)/status, which is the global uid, and check
> > >     /proc/PID(r)/uid_map to see whether UID is mapped there.
> > >   . dbus-send can be enhanced to send a pid in an SCM_CREDENTIALS
> > >     message to have the kernel translate it for the reader. Only
> > >     'move task v to cgroup A' will require an SCM_CREDENTIALS message
> > >     to be sent.
> > >
> > > Privilege requirements by action:
> > >     * Requestor of an action (r) over a socket may only make
> > >       changes to cgroups over which it has privilege.
> > >     * Requestors may be limited to a certain #/depth of cgroups
> > >       (to limit memory usage) - DEFER?
> > >     * Cgroup hierarchy is responsible for resource limits
> > >     * A requestor must either be uid 0 in its userns with victim mapped
> > >       into its userns, or the same uid and in same/ancestor pidns as the
> > >       victim
> > >     * If r requests creation of cgroup '/x', /x will be interpreted
> > >       as relative to r's cgroup. r cannot make changes to cgroups not
> > >       under its own current cgroup.
> > >     * If r is not in the initial user_ns, then it may not change settings
> > >       in its own cgroup, only descendants. (Not strictly necessary -
> > >       we could require the use of extra cgroups when wanted, as lxc does
> > >       currently)
> > >     * If r requests creation of cgroup '/x', it must have write access
> > >       to its own cgroup (not strictly necessary)
> > >     * If r requests chown of cgroup /x to uid Y, Y is passed in a
> > >       ucred over the unix socket, and therefore translated to init
> > >       userns.
> > >     * If r requests setting a limit under /x, then
> > >       . either r must be root in its own userns, and UID(/x) be mapped
> > >         into its userns, or else UID(r) == UID(/x)
> > >       . /x must not be / (not strictly necessary, all users know to
> > >         ensure an extra cgroup layer above '/')
> > >       . setns(UIDNS(r)) would not work, due to in-kernel capable() checks
> > >         which won't be satisfied. Therefore we'll need to do privilege
> > >         checks ourselves, then perform the write as the host root user.
> > >         (see devices.allow/deny). Further we need to support older kernels
> > >         which don't support setns for pid.
> > >     * If r requests action on victim V, it passes V's pid in a ucred,
> > >       so that gets translated.
> > >       Daemon will verify that V's uid is mapped into r's userns. Since
> > >       r is either root or the same uid as V, it is allowed to classify.
> > >
> > > The above addresses
> > >     * creating cgroups
> > >     * chowning cgroups
> > >     * setting cgroup limits
> > >     * moving tasks into cgroups
> > >   . but does not address a 'cgexec <group> -- command' type of behavior.
> > >     * To handle that (specifically for upstart), recommend that r do:
> > >         if (!pid) {
> > >             request_reclassify(cgroup, getpid());
> > >             do_execve();
> > >         }
> > >   . alternatively, the daemon could, if kernel is new enough, setns to
> > >     the requestor's namespaces to execute a command in a new cgroup.
> > >     The new command would be daemonized to that pid namespace's pid 1.
> > >
> > > Types of requests:
> > >   * r requests creating cgroup A'/A
> > >     . lmctfy/cli/commands/create.cc
> > >     . Verify that UID(r) mapped to 0 in r's userns
> > >     . R=cgroup_of(r)
> > >     . Verify that UID(R) is mapped into r's userns
> > >     . Create R/A'/A
> > >     . chown R/A'/A to UID(r)
> > >   * r requests to move task x to cgroup A.
> > >     . lmctfy/cli/commands/enter.cc
> > >     . r must send PID(x) as ancillary message
> > >     . Verify that UID(r) mapped to 0 in r's userns, and UID(x) is mapped into
> > >       that userns
> > >       (is it safe to allow if UID(x) == UID(r))?
> > >     . R=cgroup_of(r)
> > >     . Verify that R/A is owned by UID(r) or UID(x)? (not sure that's needed)
> > >     . echo PID(x) >> /R/A/tasks
> > >   * r requests chown of cgroup A to uid X
> > >     . X is passed in ancillary message
> > >       * ensures it is valid in r's userns
> > >       * maps the userid to host for us
> > >     . Verify that UID(r) mapped to 0 in r's userns
> > >     . R=cgroup_of(r)
> > >     . Chown R/A to X
> > >   * r requests cgroup A's 'property=value'
> > >     . Verify that either
> > >       * A != ''
> > >       * UID(r) == 0 on host
> > >       In other words, r in a userns may not set root cgroup settings.
> > >     . Verify that UID(r) mapped to 0 in r's userns
> > >     . R=cgroup_of(r)
> > >     . Set property=value for R/A
> > >       * Expect kernel to guarantee hierarchical constraints
> > >   * r requests deletion of cgroup A
> > >     . lmctfy/cli/commands/destroy.cc (without -f)
> > >     . same requirements as setting 'property=value'
> > >   * r requests purge of cgroup A
> > >     . lmctfy/cli/commands/destroy.cc (with -f)
> > >     . same requirements as setting 'property=value'
> > >
> > > Long-term we will want the cgroup manager to become more intelligent -
> > > to place its own limits on clients, to address cpu and device hotplug,
> > > etc. Since we will not be doing that in the first prototype, the daemon
> > > will not keep any state about the clients.
> > >
> > > Client DBus Message API
> > >
> > > <name>: a-zA-Z0-9
> > > <name>: "a-zA-Z0-9 "
> > > <controllerlist>: <controller1>[:controllerlist]
> > > <valueentry>: key:value
> > > <valueentry>: frozen
> > > <valueentry>: thawed
> > > <values>: valueentry[:values]
> > > keys:
> > >     {memory,swap}.{limit,soft_limit}
> > >     cpus_allowed       # set of allowed cpus
> > >     cpus_fraction      # % of allowed cpus
> > >     cpus_number        # number of allowed cpus
> > >     cpu_share_percent  # percent of cpushare
> > >     devices_whitelist
> > >     devices_blacklist
> > >     net_prio_index
> > >     net_prio_interface_map
> > >     net_classid
> > >     hugetlb_limit
> > >     blkio_weight
> > >     blkio_weight_device
> > >     blkio_throttle_{read,write}
> > > readkeys:
> > >     devices_list
> > >     {memory,swap}.{failcnt,max_use,limitnuma_stat}
> > >     hugetlb_max_usage
> > >     hugetlb_usage
> > >     hugetlb_failcnt
> > >     cpuacct_stat
> > >     <etc>
> > > Commands:
> > >     ListControllers
> > >     Create <name> <controllerlist> <values>
> > >     Setvalue <name> <values>
> > >     Getvalue <name> <readkeys>
> > >     ListChildren <name>
> > >     ListTasks <name>
> > >     ListControllers <name>
> > >     Chown <name> <uid>
> > >     Chown <name> <uid>:<gid>
> > >     Move <pid> <name> [[ pid is sent in an SCM_CREDENTIALS message ]]
> > >     Delete <name>
> > >     Delete-force <name>
> > >     Kill <name>
> > >
> > > ------------------------------------------------------------------------------
> > > Shape the Mobile Experience: Free Subscription
> > > Software experts and developers: Be at the forefront of tech innovation.
> > > Intel(R) Software Adrenaline delivers strategic insight and game-changing
> > > conversations that shape the rapidly evolving mobile landscape. Sign up now.
> > > http://pubads.g.doubleclick.net/gampad/clk?id=63431311&iu=/4140/ostg.clktrk
> > > _______________________________________________
> > > Lxc-devel mailing list
> > > Lxc-devel@xxxxxxxxxxxxxxxxxxxxx
> > > https://lists.sourceforge.net/lists/listinfo/lxc-devel
> > >
> >
> > --
> > Michael H. Warfield (AI4NB) | (770) 978-7061 | mhw@xxxxxxxxxxxx
> >    /\/\|=mhw=|\/\/          | (678) 463-0932 | http://www.wittsend.com/mhw/
> >    NIC whois: MHW9          | An optimist believes we live in the best of all
> >  PGP Key: 0x674627FF        | possible worlds. A pessimist is sure of it!

--
Michael H. Warfield (AI4NB) | (770) 978-7061 | mhw@xxxxxxxxxxxx
   /\/\|=mhw=|\/\/          | (678) 463-0932 | http://www.wittsend.com/mhw/
   NIC whois: MHW9          | An optimist believes we live in the best of all
 PGP Key: 0x674627FF        | possible worlds. A pessimist is sure of it!
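
For reference, the uid_map check and the cgroup privilege rule from the
quoted outline (r has privilege over A if A is owned by r's uid, or if A's
owner is mapped into r's user namespace and r is root there) could be
sketched roughly as below. This is only an illustration under stated
assumptions, not cgmanager code: the function names are hypothetical, the
map is passed in as text rather than read from /proc/PID(r)/uid_map, and
the daemon is assumed to sit in the initial user namespace so that the
second column of each extent is a host uid.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not cgmanager code.
 * uid_map_text holds uid_map contents as seen from the initial user
 * namespace: one "inside outside count" extent per line.
 * Returns host_uid translated into the namespace, or -1 if unmapped. */
long long ns_uid_for_host_uid(const char *uid_map_text, unsigned long host_uid)
{
    const char *line = uid_map_text;

    while (line && *line) {
        unsigned long inside, outside, count;

        if (sscanf(line, "%lu %lu %lu", &inside, &outside, &count) == 3 &&
            host_uid >= outside && host_uid - outside < count)
            return (long long)inside + (long long)(host_uid - outside);

        line = strchr(line, '\n');   /* advance to the next extent, if any */
        if (line)
            line++;
    }
    return -1;
}

/* Assumed reading of the outline's rule: r has privilege over cgroup A if
 * A is owned by r's uid, or if A's owner is mapped into r's user namespace
 * and r is root (uid 0) in that namespace. All uids here are host uids. */
bool has_privilege_over_cgroup(const char *r_uid_map,
                               unsigned long r_host_uid,
                               unsigned long owner_host_uid)
{
    if (r_host_uid == owner_host_uid)
        return true;
    return ns_uid_for_host_uid(r_uid_map, r_host_uid) == 0 &&
           ns_uid_for_host_uid(r_uid_map, owner_host_uid) >= 0;
}
```

With a map of "0 100000 65536", host uid 100000 is root inside the
namespace and so has privilege over a cgroup owned by host uid 101000, but
not over one owned by host uid 0, which is not mapped.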
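
The rule in the quoted proposal that a requested cgroup '/x' is interpreted
relative to r's own cgroup, and that r cannot touch cgroups outside it,
might look something like this at the string level. Again a sketch only:
resolve_cgroup_path is a hypothetical name, the example paths are made up,
and a real daemon would still have to deal with symlinks and do the
ownership checks described in the proposal.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not cgmanager code.
 * Builds "<requestor_cgroup>/<requested>" into out, treating a leading
 * '/' in the request as relative to the requestor's cgroup.
 * Returns 0 on success, -1 if the request contains a ".." component or
 * the result does not fit in out. */
int resolve_cgroup_path(const char *requestor_cgroup, const char *requested,
                        char *out, size_t outlen)
{
    const char *p = requested;
    size_t base;
    int n;

    /* Reject any ".." component so the result stays under the requestor. */
    while (*p) {
        size_t len = strcspn(p, "/");

        if (len == 2 && p[0] == '.' && p[1] == '.')
            return -1;
        p += len;
        while (*p == '/')
            p++;
    }

    /* '/x' means "x under my cgroup", so drop the leading slashes. */
    while (*requested == '/')
        requested++;

    /* Drop trailing slashes on the base so "/" plus "x" gives "/x". */
    base = strlen(requestor_cgroup);
    while (base > 0 && requestor_cgroup[base - 1] == '/')
        base--;

    n = snprintf(out, outlen, "%.*s/%s", (int)base, requestor_cgroup,
                 requested);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

For a requestor in the (made-up) cgroup /lxc/c1, a request for '/x'
resolves to /lxc/c1/x, while '/../c2' is refused.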