On Sat, Jul 3, 2010 at 4:32 PM, Sukadev Bhattiprolu
<sukadev@xxxxxxxxxxxxxxxxxx> wrote:

> +struct clone_args {
> +	u64 clone_flags_high;
> +	u64 child_stack_base;
> +	u64 child_stack_size;
> +	u64 parent_tid_ptr;
> +	u64 child_tid_ptr;
> +	u32 nr_pids;
> +	u32 reserved0;
> +};
> +
> +
> +sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size,
> +		pid_t * __user pids)

I don't see why cargs_size is needed for expansion if you have flags.

> +	The order of pids in @pids is oldest in pids[0] to youngest pid
> +	namespace in pids[nr_pids-1]. If the number of pids specified in the
> +	@pids list is fewer than the nesting level of the process, the pids
> +	are applied from youngest namespace. I.e if the process is nested in
> +	a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids
> +	are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to
> +	have a pid of '0' (the kernel will assign a pid in those namespaces).

That feels backwards. I'd have guessed pids[0] is how the process
sees itself. You'd truncate the array to reduce nesting level
rather than pointing into it.

> +	On failure, eclone() returns -1 and sets 'errno' to one of following
> +	values (the child process is not created).

Careful here: do you intend to document the system call itself,
or an expected glibc wrapper that doesn't exist yet?

> +	EPERM	Caller does not have the CAP_SYS_ADMIN privilege needed to
> +		specify the pids in this call (if pids are not specifed
> +		CAP_SYS_ADMIN is not required).

It seems appropriate to let PID 1 in any PID namespace be able
to assign PIDs in its own namespace and in any child namespaces.

> +	EINVAL	The child_stack_size field is not 0 (on architectures that
> +		pass in a stack pointer in ->child_stack field).

need to change this

> +		"int $0x80\n\t"		/* Linux/i386 system call */
> +		"testl %0,%0\n\t"	/* check return value */
> +		"jne 1f\n\t"		/* jump if parent */
> +
> +		"popl %%esi\n\t"	/* get subthread function */
> +		"call *%%esi\n\t"	/* start subthread function */
> +		"movl %2,%0\n\t"
> +		"int $0x80\n"		/* exit system call: exit subthread */
...
> +/*
> + * Allocate a stack for the clone-child and arrange to have the child
> + * execute @child_fn with @child_arg as the argument.
> + */
...
> +	*--stack = child_arg;
> +	*--stack = child_fn;
...
> +static int do_clone(int (*child_fn)(void *), void *child_arg,
> +		unsigned int flags_low, int nr_pids, pid_t *pids_list)

There needs to be a way to pass child_fn and child_arg via the
kernel. Besides being required for kernel-managed stacks, it's
normally a saner interface. Stack setup would be much like the
stack setup for signal handlers. Imagine using this for a
vfork-like interface that didn't have painful interactions
with the compiler.

Speaking of vfork....

1. can you implement it for i386 (register starved) using eclone?
2. can you restart a pair of processes between vfork and execve?

_______________________________________________
Containers mailing list
Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linux-foundation.org/mailman/listinfo/containers