Re: Virtualizing /proc/sys/kernel/random/boot_id per container ?

ebiederm@xxxxxxxxxxxx (Eric W. Biederman) · Thu, 30 Aug 2012 17:13:25 -0700

"Daniel P. Berrange" <berrange@xxxxxxxxxx> writes:

> On Thu, Aug 30, 2012 at 03:15:17PM -0700, Eric W. Biederman wrote:
>> "Daniel P. Berrange" <berrange@xxxxxxxxxx> writes:
>> 
>> > One of the features that SystemD folks have asked us to fix in LXC, is
>> > to make sure that /proc/sys/kernel/random/boot_id changes each time a
>> > container is started.
>> 
>> There may be a good reason for this.  Most of the time what I have seen
>> of kernel requests from the direction of SystemD is that while there may
>> be a real problem, their imagined solution is usually not a
>> particularly good solution.  So a description of the problem is needed.
>> 
>> Justifying something with just "SystemD wants this" is a good way to get
>> a nack.
>
> SystemD records log messages for all system services in their journal.
> They can show you all log messages for the current service execution,
> all log messages for a service since system boot, or all log messages
> ever. The boot_id value is used as a unique tag to allow grouping of
> the log messages per system boot. When we run systemd inside a container
> we want to get that grouping of log messages generated by services inside
> the container, to take account of the container boot, not the host boot.
> Hence the desire to have the boot_id value reflect when a container is
> booted.

Since SystemD post-dates containers and since the logging feature is not
currently in wide use, that use case is completely non-persuasive.

So far this just sounds like a plain SystemD bug and something that can
be easily changed at this point in time.
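
To make the use case concrete: the grouping described above only needs an
opaque per-boot tag attached to each log record. A minimal sketch, assuming
hypothetical names (new_boot_tag, group_by_boot) and a tag that is simply
passed in, so it could come from boot_id or from an ID generated in
userspace at container start. This is not the journal's actual
implementation.

```python
import uuid
from collections import defaultdict


def new_boot_tag():
    """Hypothetical helper: generate a random per-start tag in userspace."""
    return str(uuid.uuid4())


def group_by_boot(records):
    """Group (boot_tag, message) pairs by tag, keeping message order."""
    groups = defaultdict(list)
    for tag, message in records:
        groups[tag].append(message)
    return dict(groups)
```

Nothing in the grouping itself depends on where the tag comes from.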

It has been a long time but my fuzzy memory says that the original
boot_id justification was based on use cases that could not be solved
any other way.

My memory says it was this thread https://lkml.org/lkml/1999/5/31/233
that inspired the implementation of boot_id.  However reading the
current emacs source code it appears emacs gave up before boot_id
was implemented and stats /var/run/random-seed (which we seem to
have removed) or looks in wtmp or utmp for the latest boot record.
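
For reference, a minimal sketch of the hostname+boot_id idea, assuming
hypothetical helper names (read_boot_id, lock_owner_tag) and a made-up
"hostname:boot_id" format. It is not what emacs does today, and it falls
back to the bare hostname when the file cannot be read.

```python
BOOT_ID_PATH = "/proc/sys/kernel/random/boot_id"


def read_boot_id(path=BOOT_ID_PATH):
    """Return the per-boot UUID string, or None if the file is unreadable."""
    try:
        with open(path, encoding="ascii") as f:
            return f.read().strip()
    except OSError:
        return None


def lock_owner_tag(hostname, boot_id):
    """Build an owner tag for a lock file (format is an assumption)."""
    if boot_id is None:
        return hostname
    return f"{hostname}:{boot_id}"
```

A caller would combine the two, e.g.
lock_owner_tag(socket.gethostname(), read_boot_id()), and treat a lock
whose tag carries a different boot_id as stale.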

I did a quick grep through the binaries on my system and I could not
find anything using /proc/sys/kernel/random/boot_id.

That suggests to me that the proper solution is to actually just remove
boot_id.

Hmm.  And then there is another interesting detail.  What should boot_id
return after the processes have migrated from one system to another?

>> > The current semantics are that this file produces a new random UUID each
>> > time the host OS is booted. Obviously each time we start a container now,
>> > they just see the host's random boot_id, so from a container's POV this
>> > does not change each time it starts.
>> 
>> That is correct.  As I recall the contract with boot_id is to provide
>> a unique per boot value to assist in dealing with boots etc.  I seem
>> to recall emacs uses the combination of hostname+boot_id to help
>> generate unique lock files names.
>> 
>> I would definitely need a refresher on how boot_id is used in practice
>> by applications other than SystemD before I could suggest a good design.
>> 
>> There is also a question of uptime.
>
> Agreed, as you say, this is one of many other /proc values needing
> virtualizing for containers.

If you think of it as virtualization and you figure the requirement is
to exactly replicate a non-containerized system you won't come up with
suggestions that make sense to implement.

For the most part the semantics of namespaces exist to support process
migration.

>> > There seems to be general agreement that, aside from the PID directories,
>> > changes to data in proc should be done by a FUSE filesystem overlay of
>> > some kind.
>> 
>> No.  I have yet to see a justification for using FUSE in containers on
>> top of proc files.
>> 
>> I have seen a lot of bad ideas suggested like hacking /proc/cpuinfo
>> instead of providing a proper mechanism to tell applications how
>> parallel they can/should be.
>> 
>> For hacks and controversial ideas FUSE is good because it makes it
>> someone else's problem and it means it isn't something we have to
>> support in the kernel for the indefinite future.  At the same time in
>> general a FUSE solution does not really solve anything it just sort of
>> papers over a problem.
>> 
>> For some problems papering over them is good enough, for other problems
>> they really should be solved properly.
>
> Ok, well I guess things aren't as clear cut as I understood then. I've
> been told that FUSE was the desired approach to dealing with all the
> various files in /proc that might need changing for containers. Personally
> I don't much care what approach is used - if the kernel wants to do more
> stuff that's fine with me from a libvirt LXC POV. I'll just follow whatever
> the consensus is in this area.

Largely what I have seen is a bunch of half thought out hacks and the
consensus being (ick don't bother me...).  In which case FUSE is a good
answer as it doesn't obligate anyone to maintain or care about the code,
except those who want the hack.

>> > We could use that mechanism to fix 'boot_id' in userspace, but
>> > I'm wondering if this is a better candidate for dealing with in kernel
>> > space, since as well as the /proc/sys tree, the data is also visible via
>> > the sysctl() system call which a FUSE overlay won't address.
>> 
>> Any application that uses the sysctl() system call needs to be fixed.
>> When I looked years ago the number of applications using sysctl() could
>> be numbered on one hand and most of those applications were the fedora
>> installer, and the fedora installer hasn't used sysctl.
>
> Ok, I did wonder whether anyone would actually use sysctl() instead
> of reading /proc/sys. If we can ignore the sysctl that gives us more
> options.

Most definitely.  The warning sysctl spews when you figure out how
to call it should be a good clue.

>> > The kernel doesn't have a real concept of a 'container' to associate
>> > a boot_id value with as such, but maybe it is reasonable to associate
>> > a boot_id value with each PID namespace ?
>> 
>> There is also the question of uptime and clocks and things like that.
>> 
>> The utsnamespace might be a more reasonable place to tack on that kind
>> of extended functionality.
>>
>> Just changing boot_id itself and not all of the other bits that track
>> when we have booted does not seem reasonable.
>> 
>> Once we can sort out the details a kernel implementation should be quite
>> trivial.  It just requires the appropriate sysctl registration dance.
>
> Ok, I'll try to identify a list of other related parts which need changing
> wrt boot.
>
> Thanks for the feedback.

I hope it helps.

There may be a justification and a good case for messing with boot_id
but I don't currently see it.

What I see (so far) is SystemD unnecessarily tying itself to Linux
implementation details.

Eric
