On 28 May 2015 at 04:30, Zefan Li <lizefan@xxxxxxxxxx> wrote:
> On 2015/5/27 20:37, Dimitri John Ledkov wrote:
>> On 27 May 2015 at 12:22, Zefan Li <lizefan@xxxxxxxxxx> wrote:
>>> On 2015/5/27 6:07, Dimitri John Ledkov wrote:
>>>> Add a kernel API to send a proc connector notification that a cgroup
>>>> has become empty. A userspace daemon can then act upon such
>>>> information, and usually clean up and remove such a group as it's no
>>>> longer needed.
>>>>
>>>> Currently there are two other ways (one for current & one for unified
>>>> cgroups) to receive such notifications, but they either involve
>>>> spawning a userspace helper or monitoring a lot of files. This is a
>>>> firehose of all such events instead from a single place.
>>>>
>>>> In the current cgroups structure the way to get notifications is by
>>>> enabling `release_agent' and setting `notify_on_release' for a given
>>>> cgroup hierarchy. This will then spawn a userspace helper with the
>>>> removed cgroup as an argument. It has been acknowledged that this is
>>>> expensive, especially in exit-heavy workloads. In userspace this
>>>> is currently used by systemd and CGmanager that I know of; both
>>>> agents establish a connection to the long-running daemon and pass the
>>>> message to it. As a courtesy to other processes, such an event is
>>>> sometimes forwarded further on, e.g. systemd forwards it to the system
>>>> DBus.
>>>>
>>>> In the future/unified cgroups structure support for `release_agent' is
>>>> removed, without a direct replacement. However, there is a new
>>>> `cgroup.populated' file exposed that recursively reports if there are
>>>> any tasks in a given cgroup hierarchy. It's a very good flag to
>>>> quickly/lazily scan for empty things, however one would need to
>>>> establish an inotify watch on each and every cgroup.populated file at
>>>> cgroup setup time (ideally before any pids enter said cgroup).
>>>> Thus again nobody else but the original creator of a given cgroup has a
>>>> chance to reliably monitor a cgroup becoming empty (since there is no
>>>> reliable recursive inotify watch).
>>>>
>>>> Hence, the addition to the proc connector firehose. Multiple things
>>>> (albeit with CAP_NET_ADMIN in the init pid/user namespace) could
>>>> connect and monitor cgroups release notifications. In a way, this
>>>> repeats udev history: at first it was a userspace helper, which later
>>>> became a netlink socket. And I hope that the proc connector is a
>>>> naturally good fit for this notification type.
>>>>
>>>> For precisely when cgroups should emit this event, see next patch
>>>> against kernel/cgroup.c.
>>>>
>>>
>>> We really don't want yet another way for cgroup notification.
>>>
>>
>> We do have multiple information sources for similar events in other
>> places... e.g. fork events can be tracked with ptrace and with
>> proc-connector, ditto other things.
>>
>>> Systemd is happy with this cgroup.populated interface. Do you have any
>>> real use case in mind that can't be satisfied with inotify watch?
>>>
>>
>> cgroup.populated is not implemented in systemd and would require a lot
>> of inotify watches.
>
> I believe systemd will use cgroup.populated, though I don't know its
> roadmap. Maybe it's waiting for the kernel to remove the experimental
> flag of unified hierarchy.
>

There is no code in master to support unified hierarchy in systemd that
I can see. And more and more things rely on the current hierarchy,
especially around container-like technologies.

>> Also it's only set on the unified structure and
>> not exposed on the current one.
>>
>> Also it will not allow anybody else to establish notify watch in a
>> timely manner. Thus anyone external to the cgroups creator will not be
>> able to monitor cgroup.populated at the right time.
>
> I guess this isn't a problem, as you can watch the IN_CREATE event, and
> then you'll get notified when a cgroup is created.
>

It is a problem: there is no effective way to establish race-free
inotify watches, which is well known. With a watch on /sys/fs/cgroup,
one has to establish an inotify watch on each directory created there,
and then another watch on the cgroup.populated file inside it. By that
time a process could have already entered, run and exited.

>> With
>> proc_connector I was thinking processes entering cgroups would be
>> useful events as well, but I don't have a use-case for them yet thus
>> I'm not sure what the event should look like.
>>
>> Would cgroup.populated be exposed on the legacy cgroup hierarchy? At the
>> moment I see about ~20ms of my ~200ms boot wasted on spawning the
>> cgroups agent and I would like to get rid of that as soon as possible.
>> This patch solves it for me. (I have a matching one to connect to
>> proc connector and then feed notifications to systemd via systemd's
>> private api end-point.)
>>
>> Exposing cgroup.populated irrespective of the cgroup mount options
>> would be great, but would result in many watches being established
>> waiting for a once-in-a-lifecycle condition of a cgroup. Imho this is
>> wasteful, but nonetheless will be much better than spawning the agent.
>>
>
> Each inotify watch will consume a little memory, which should be
> acceptable.
>
>> Would a patch that exposes cgroup.populated on legacy cgroup structure
>> be accepted? It is forward-compatible after all... or no?
>>
>
> I'm afraid not... All new features are done in unified hierarchy, and we've
> been refraining from adding them to the legacy hierarchy.
>

Am I right to say it's been a year with little movement in the unified
hierarchy?! What's the current status on unmarking it experimental,
and/or what else needs doing in kernel and/or userspace?

What you are saying is that we have an inefficient notification
mechanism that hammers everyone's boot time significantly, and no
current path to resolve it. What can I do to get us efficient cgroup
release notifications soon?

This patch-set is a no-op if one doesn't subscribe from userspace, has
no other side effects that I can trivially see, and is very similar in
spirit to other notifications that proc-connector generates. E.g.
/proc/pid/comm is exposed as a file, yet there is a proc connector
notification about comm name changes as well.

Maybe Evgeniy can chip in on whether such a notification would be
beneficial to proc-connector.

--
Regards,

Dimitri.
Pura Vida!

https://clearlinux.org
Open Source Technology Center
Intel Corporation (UK) Ltd. - Co. Reg. #1134945 - Pipers Way, Swindon SN3 1RJ.
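
For reference, "subscribing from userspace" to the proc connector amounts
to sending one small netlink message. Below is a minimal sketch of that
message, assuming the constants from linux/netlink.h, linux/connector.h
and linux/cn_proc.h. The helper names build_listen_request() and
subscribe() are made up for illustration. It only shows the existing
PROC_CN_MCAST_LISTEN handshake; it does not define or parse the
cgroup-release event proposed in this patch-set.

```python
import os
import socket
import struct

# Constants copied from linux/netlink.h, linux/connector.h, linux/cn_proc.h.
NETLINK_CONNECTOR = 11
NLMSG_DONE = 3
CN_IDX_PROC = 1
CN_VAL_PROC = 1
PROC_CN_MCAST_LISTEN = 1

NLMSGHDR = struct.Struct("=IHHII")  # nlmsg_len, type, flags, seq, pid
CN_MSG = struct.Struct("=IIIIHH")   # id.idx, id.val, seq, ack, len, flags


def build_listen_request(pid):
    """Build the message that asks the proc connector to start multicasting."""
    payload = struct.pack("=I", PROC_CN_MCAST_LISTEN)
    cn = CN_MSG.pack(CN_IDX_PROC, CN_VAL_PROC, 0, 0, len(payload), 0)
    total = NLMSGHDR.size + len(cn) + len(payload)
    return NLMSGHDR.pack(total, NLMSG_DONE, 0, 0, pid) + cn + payload


def subscribe():
    """Hypothetical helper: open the socket and send the request.

    Linux only, and needs CAP_NET_ADMIN as discussed above.
    """
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM,
                         NETLINK_CONNECTOR)
    sock.bind((os.getpid(), CN_IDX_PROC))  # (port id, multicast group mask)
    sock.send(build_listen_request(os.getpid()))
    return sock
```

A daemon would call subscribe() once and then recv() on the returned
socket. Until some process does that, nothing is delivered to userspace,
which is the sense in which the patch-set is a no-op without subscribers.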