"Daniel P. Berrange" <berrange@xxxxxxxxxx> writes: > On Thu, Aug 30, 2012 at 03:15:17PM -0700, Eric W. Biederman wrote: >> "Daniel P. Berrange" <berrange@xxxxxxxxxx> writes: >> >> > One of the features that SystemD folks have asked us to fix in LXC, is >> > to make sure that /proc/sys/kernel/random/boot_id changes each time a >> > container is started. >> >> There may be a good reason for this. Most of the time what I have seen >> of kernel requests from the direction of SystemD is that while there may >> be a real problem but usually their imagined solution is not a >> particularly good solution. So a description of the problem is needed. >> >> Justifying something with just SystemD wants this is a good way to get >> a nack. > > SystemD records log messages for all system services in their journal. > They can show you all log messages for the current service execution, > all log messages for a service since system boot, or all log messsages > ever. The boot_id value is used as a unique tag to allow grouping of > the log messages per system boot. When we run systemd inside a container > we want to get that grouping of log messages generated by services inside > the container, to take account of the container boot, not the host boot. > Hence the desire to have the boot_id value reflect when a container is > booted. Since SystemD post-dates containers and since the logging feature is not currently in wide use that use case is completely non-persuasive. So far this just sounds like a plain SystemD bug and something that can be easily changed at this point in time. It has been a long time but my fuzzy memory says that the originial boot_id justification was based on use cases that could not be solved any other way. My memory says it was this thread https://lkml.org/lkml/1999/5/31/233 that inspired the implementation of boot_id. 
However, reading the current emacs source code, it appears emacs gave up
before boot_id was implemented and stats /var/run/random-seed (which we
seem to have removed) or looks in wtmp or utmp for the latest boot
record.

I did a quick grep through the binaries on my system and I could not
find anything using /proc/sys/random/boot_id.  That suggests to me that
the proper solution is to actually just remove boot_id.

Hmm.  And then there is another interesting detail: what should boot_id
return after the processes have migrated from one system to another?

>> > The current semantics are that this file produces a new random UUID each
>> > time the host OS is booted. Obviously each time we start a container now,
>> > they just see the host's random boot_id, so from a container's POV this
>> > does not change each time it starts.
>>
>> That is correct.  As I recall the contract with boot_id is to provide
>> a unique per boot value to assist in dealing with boots etc.  I seem
>> to recall emacs uses the combination of hostname+boot_id to help
>> generate unique lock file names.
>>
>> I would definitely need a refresher on how boot_id is used in practice
>> by applications other than SystemD before I could suggest a good design.
>>
>> There is also a question of uptime.
>
> Agreed, as you say, this is one of many other /proc values needing
> virtualizing for containers.

If you think of it as virtualization, and you figure the requirement is
to exactly replicate a non-containerized system, you won't come up with
suggestions that make sense to implement.  For the most part the
semantics of namespaces exist to support process migration.

>> > There seems to be general agreement that, aside from the PID directories,
>> > changes to data in proc should be done by a FUSE filesystem overlay of
>> > some kind.
>>
>> No.  I have yet to see a justification for using FUSE in containers on
>> top of proc files.
>>
>> I have seen a lot of bad ideas suggested like hacking /proc/cpuinfo
>> instead of providing a proper mechanism to tell applications how
>> parallel they can/should be.
>>
>> For hacks and controversial ideas FUSE is good because it makes it
>> someone else's problem and it means it isn't something we have to
>> support in the kernel for the indefinite future.  At the same time, in
>> general a FUSE solution does not really solve anything; it just sort of
>> papers over a problem.
>>
>> For some problems papering over them is good enough; other problems
>> really should be solved properly.
>
> Ok, well I guess things aren't as clear cut as I understood then. I've
> been told that FUSE was the desired approach to dealing with all the
> various files in /proc that might need changing for containers. Personally
> I don't much care what approach is used - if the kernel wants to do more
> stuff that's fine with me from a libvirt LXC POV. I'll just follow whatever
> the consensus is in this area.

Largely what I have seen is a bunch of half thought out hacks, with the
consensus being (ick don't bother me...).  In which case FUSE is a good
answer, as it doesn't obligate anyone to maintain or care about the
code, except those who want the hack.

>> > We could use that mechanism to fix 'boot_id' in userspace, but
>> > I'm wondering if this is a better candidate for dealing with in kernel
>> > space, since as well as the /proc/sys tree, the data is also visible via
>> > the sysctl() system call which a FUSE overlay won't address.
>>
>> Any application that uses the sysctl() system call needs to be fixed.
>> When I looked years ago the number of applications using sysctl() could
>> be numbered on one hand and most of those applications were the fedora
>> installer, and the fedora installer hasn't used sysctl.
>
> Ok, I did wonder whether anyone would actually use sysctl() instead
> of reading /proc/sys. If we can ignore the sysctl that gives us more
> options.
Most definitely.  The warning sysctl spews when you figure out how to
call it should be a good clue.

>> > The kernel doesn't have a real concept of a 'container' to associate
>> > a boot_id value with as such, but maybe it is reasonable to associate
>> > a boot_id value with each PID namespace ?
>>
>> There is also the question of uptime and clocks and things like that.
>>
>> The UTS namespace might be a more reasonable place to tack on that kind
>> of extended functionality.
>>
>> Just changing boot_id itself and not all of the other bits that track
>> when we have booted does not seem reasonable.
>>
>> Once we can sort out the details a kernel implementation should be quite
>> trivial.  It just requires the appropriate sysctl registration dance.
>
> Ok, I'll try to identify a list of other related parts which need changing
> wrt boot.
>
> Thanks for the feedback.

I hope it helps.  There may be a justification and a good case for
messing with boot_id but I don't currently see it.  What I see (so far)
is SystemD unnecessarily tying itself to Linux implementation details.

Eric
_______________________________________________
Containers mailing list
Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/containers
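
For readers following the thread, here is a minimal sketch of the usage
pattern under discussion: read the per-boot UUID from
/proc/sys/kernel/random/boot_id and use it as a tag to group log
messages by boot.  This is not how SystemD's journal is implemented; the
function names read_boot_id and group_by_boot, and the (boot_id, message)
record shape, are hypothetical and chosen only for illustration.

```python
import uuid
from collections import defaultdict
from pathlib import Path

# Path named in the thread; present on Linux hosts with procfs mounted.
BOOT_ID_PATH = Path("/proc/sys/kernel/random/boot_id")


def read_boot_id(path=BOOT_ID_PATH):
    """Return the boot_id as a uuid.UUID, or None if it cannot be read.

    Hypothetical helper: returning None lets callers cope with systems
    where the file is missing or does not contain a valid UUID.
    """
    try:
        return uuid.UUID(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def group_by_boot(records):
    """Group (boot_id, message) pairs by boot_id, keeping message order.

    The record shape is an assumption made for this sketch.
    """
    groups = defaultdict(list)
    for boot_id, message in records:
        groups[boot_id].append(message)
    return dict(groups)


if __name__ == "__main__":
    current = read_boot_id()
    records = [(current, "service started"), (current, "service stopped")]
    for boot_id, messages in group_by_boot(records).items():
        print(boot_id, messages)
```

Inside a container that shares the host's /proc/sys view, read_boot_id()
returns the host's value, so every container start lands in the same
group as the host boot.  That is the behaviour the original request
wanted to change.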