Re: clone3: allow creation of time namespace with offset

"Michael Kerrisk (man-pages)" <mtk.manpages@xxxxxxxxx> · Wed, 1 Apr 2020 13:40:25 +0200

On 3/25/20 12:26 PM, Christian Brauner wrote:
> On Wed, Mar 25, 2020 at 08:58:36AM +0100, Adrian Reber wrote:
>> On Tue, Mar 24, 2020 at 06:56:49PM +0100, Christian Brauner wrote:
>>> On Tue, Mar 24, 2020 at 05:25:46PM +0100, Adrian Reber wrote:
>>>> On Tue, Mar 24, 2020 at 05:09:45PM +0100, Christian Brauner wrote:
>>>>> On Fri, Mar 20, 2020 at 11:33:55AM -0700, Andrei Vagin wrote:
>>>>>> On Thu, Mar 19, 2020 at 11:29:55AM +0100, Christian Brauner wrote:
>>>>>>> On Thu, Mar 19, 2020 at 09:16:43AM +0100, Arnd Bergmann wrote:
>>>>>>>> On Thu, Mar 19, 2020 at 9:11 AM Adrian Reber <areber@xxxxxxxxxx> wrote:
>>>>>>>>
>>>>>>>>> With Arnd's idea of only using nanoseconds, timens_offset would then
>>>>>>>>> contain something like this:
>>>>>>>>>
>>>>>>>>> struct timens_offset {
>>>>>>>>>         __aligned_s64 monotonic_offset_ns;
>>>>>>>>>         __aligned_s64 boottime_offset_ns;
>>>>>>>>> };
>>>>>>>>>
>>>>>>>>> I kind of prefer adding boottime and monotonic directly to struct clone_args
>>>>>>>>>
>>>>>>>>>         __aligned_u64 tls;
>>>>>>>>>         __aligned_u64 set_tid;
>>>>>>>>>         __aligned_u64 set_tid_size;
>>>>>>>>> +       __aligned_s64 monotonic_offset_ns;
>>>>>>>>> +       __aligned_s64 boottime_offset_ns;
>>>>>>>>>  };
>>>>>>>>
>>>>>>>> I would also prefer the second approach using two 64-bit integers
>>>>>>>> instead of a pointer, as it keeps the interface simpler to implement
>>>>>>>> and simpler to interpret by other tools.
>>>>>>>
>>>>>>> Why I don't like has two reasons. There's the scenario where we have
>>>>>>> added new extensions after the new boottime member and then we introduce
>>>>>>> another offset. Then you'd be looking at:
>>>>>>>
>>>>>>> __aligned_u64 tls;
>>>>>>> __aligned_u64 set_tid;
>>>>>>> __aligned_u64 set_tid_size;
>>>>>>> + __aligned_s64 monotonic_offset_ns;
>>>>>>> + __aligned_s64 boottime_offset_ns;
>>>>>>> __aligned_s64 something_1
>>>>>>> __aligned_s64 anything_2
>>>>>>> + __aligned_s64 sometime_offset_ns
>>>>>>>
>>>>>>> which bothers me just by looking at it. That's in addition to adding two
>>>>>>> new members to the struct when most people will never set CLONE_NEWTIME.
>>>>>>> We'll also likely have more features in the future that will want to
>>>>>>> pass down more info than we want to directly expose in struct
>>>>>>> clone_args, e.g. for a long time I have been thinking about adding a
>>>>>>> struct for CLONE_NEWUSER that allows you to specify the id mappings you
>>>>>>> want the new user namespace to get. We surely don't want to force all
>>>>>>> new info into the uppermost struct. So I'm not convinced we should here.
>>>>>>
>>>>>> I think here we can start thinking about a netlink-like interface.
>>>>>
>>>>> I think netlink is just not a great model for an API and I would not
>>>>> want us to go down that route.
>>>>>
>>>>> I kept thinking about this for a bit and I think that we will end up
>>>>> growing more namespace-related functionality. So one thing that came to
>>>>> my mind is the following layout:
>>>>>
>>>>> struct {
>>>>> 	struct {
>>>>> 		__s64 monotonic;
>>>>> 		__s64 boot;
>>>>> 	} time;
>>>>> } namespaces;
>>>>>
>>>>> struct _clone_args {
>>>>> 	__aligned_u64 flags;
>>>>> 	__aligned_u64 pidfd;
>>>>> 	__aligned_u64 child_tid;
>>>>> 	__aligned_u64 parent_tid;
>>>>> 	__aligned_u64 exit_signal;
>>>>> 	__aligned_u64 stack;
>>>>> 	__aligned_u64 stack_size;
>>>>> 	__aligned_u64 tls;
>>>>> 	__aligned_u64 set_tid;
>>>>> 	__aligned_u64 set_tid_size;
>>>>> 	__aligned_u64 namespaces;
>>>>> 	__aligned_u64 namespaces_size;
>>>>> };
>>>>>
>>>>> Then when we end up adding id mapping support for CLONE_NEWUSER we can
>>>>> extend this with:
>>>>>
>>>>> struct {
>>>>> 	struct {
>>>>> 		__aligned_u64 monotonic;
>>>>> 		__aligned_u64 boot;
>>>
>>> s/__aligned_u64/__s64/g
>>>
>>> Sorry, leftover from my first draft.
>>>
>>>>> 	} time;
>>>>>
>>>>> 	struct {
>>>>> 		/* id mapping members */
>>>>> 	} user;
>>>>> } namespaces;
>>>>>
>>>>> Thoughts? Other ideas?
>>>>
>>>> Works for me.
>>>>
>>>> If we add the user namespace id mappings and then at some point a third
>>>> element for the time namespace appears it would also start to be mixed.
>>>> Just as you mentioned that a few mails ago.
>>>
>>> I think you misunderstand me or I'm misunderstanding you. That new time
>>> namespace member would go into struct time {} so
>>>
>>> struct {
>>> 	struct {
>>> 		__s64 monotonic;
>>> 		__s64 boot;
>>> 		__s64 someothertime;
>>> 	} time;
>>>
>>> 	struct {
>>> 		/* id mapping members */
>>> 	} user;
>>> } namespaces;

So far, this seems like the least worst approach to me :-).

I think it's reasonable to assume that there will be another
time NS offset to add one day. I don't think anyone expected
CLOCK_BOOTIME (added in 2011) at the time that CLOCK_MONOTONIC
appeared (as part of the POSIX timers API in Linux 2.6.0 2003);
similarly, we probably can't conceive now what clock might be
added in the future that should also be governed by time
namespaces.

But...

>> My question was about how does the kernel know how 'struct namespaces'
>> is structured. How can an older kernel (which only is aware of two
>> clocks) deal with a, like in this example, third clock. Will the size
>> '__aligned_u64 namespaces_size' be used for versioning?
> 
> Yes, that would be the idea.

The idea implied here (if I understand correctly) of "binary chop
on the structure size to see what your kernel supports" is pretty
clunky, IMO. It's worth at least considering alternatives. 
For example, seccomp has a number of interesting interfaces to
discover what the running kernel supports [1]. Maybe it is worth
considering something similar for clone3()?

Cheers,

Michael

[1]
/proc/sys/kernel/seccomp/actions_avail (see seccomp(2))
SECCOMP_GET_ACTION_AVAIL (see secommp(2))
cgroups also provides something similar in the form of
/sys/kernel/cgroup/features (see cgroups(7))

-- 
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/