Re: [RFC][PATCH] ns: Syscalls for better namespace sharing control.

Oren Laadan <orenl@xxxxxxxxxxxxxxx> · Wed, 03 Mar 2010 15:59:05 -0500

Daniel Lezcano wrote:
> Eric W. Biederman wrote:
>> Pavel Emelyanov <xemul@xxxxxxxxxxxxx> writes:
>>
>>> Eric W. Biederman wrote:
>>>> Pavel Emelyanov <xemul@xxxxxxxxxxxxx> writes:
>>>>
>>>>> Eric W. Biederman wrote:
>>>>>> Pavel Emelyanov <xemul@xxxxxxxxxxxxx> writes:
>>>>>>
>>>>>>> Thanks. What's the problem with setns?
>>>>>> joining a preexisting namespace is roughly the same problem as
>>>>>> unsharing a namespace.  We simply haven't figured out how to do it
>>>>>> safely for the pid and the uid namespaces.
>>>>> The pid may change after this for sure. What problems do you know
>>>>> about it? What if we try to allocate the same PID in a new space
>>>>> or return -EBUSY? This will be a good starting point. If we manage
>>>>> to fix it later this will not break the API at all.
>>>> Parentage.  The pid is the identity of a process and all kinds of things
>>>> make assumptions in all kinds of strange places.  I don't see how
>>>> waitpid can work if you change the pid.
>>> Agreed. But what if we enter a pid space that is a subnamespace of the current
>>> one? In that case the parent will still see the task by its old pid. We can restrict
>>> the first version of entering with this rule as well, and this restriction will not
>>> block us in the typical use case (I mean entering a container from a host).
>> When I was last thinking about pid namespaces and unshare, the idea I came
>> to was that an unshare of the pid namespace should only affect which pid
>> namespace your children are in.
>>
>> I remember that to do that there were a few cases where you would have to access
>> task->pid->pid_ns instead of task->nsproxy->pid_ns, but essentially it was pretty
>> simple.
>>
>>>> glibc doesn't cope if you change someone's pid.
>>> OK, but what if we try to allocate the same pid returning -EBUSY on failure?
>>>
>>> My aim is to provide even a restricted enter. For most of the cases this
>>> should work and make our lives easier. So two restrictions currently:
>>> a) enter a sub namespace
>>> b) allocate the same pid as we have now
>>>
>>> Hm? :)
>> Replacing struct pid is guaranteed to do all kinds of nasty things with
>> signal handling and the like, de_thread is nasty enough and you are talking
>> something worse.  So if we can change pid namespaces without changing
>> the pid I am for it.
> 
> I agree with all the points you and Pavel talked about but I don't 
> feel comfortable to have the current process to switch the pid namespace 
> because of the process tree hierarchy (what will be the parent of the 
> process when you enter the pid namespace for example). What is the 
> difference with the sys_bindns or the sys_hijack, proposed a couple of 
> years ago ?
> 
> I made a suggestion some weeks ago about a new syscall 'cloneat' where 
> the child process becomes the child of the targeted process specified in 
> the syscall. Maybe it would be interesting to replace the 'setns' by, or 
> add, a 'cloneat' syscall with the file descriptor passed as parameter. 
> The copy_process function shall not use the nsproxy of the caller but 
> the one provided in the fd argument.
> 
> The newly created process becomes the child of the process from which we 
> retrieve the namespace with nsfd, and that process has to 'waitpid' it (the 
> caller of 'cloneat' cannot wait for it). It's a bit similar to the 
> CLONE_PARENT flag, except the creation order is inverted (the father 
> creates for the child).
> 
> So when entering the container, we specify the pid 1 of the container 
> which is usually a child reaper.
> 
> Does it make sense ?

For what it's worth, I think that this suggestion (cloneat) is so far
the cleanest way to allow a process to enter an existing namespace.

Oren.
