Re: [PATCH 11/11][v15]: Document sys_eclone

Albert Cahalan <acahalan@xxxxxxxxx> · Sat, 3 Jul 2010 19:41:30 -0400

On Sat, Jul 3, 2010 at 4:32 PM, Sukadev Bhattiprolu
<sukadev@xxxxxxxxxxxxxxxxxx> wrote:

> +struct clone_args {
> +       u64 clone_flags_high;
> +       u64 child_stack_base;
> +       u64 child_stack_size;
> +       u64 parent_tid_ptr;
> +       u64 child_tid_ptr;
> +       u32 nr_pids;
> +       u32 reserved0;
> +};
> +
> +
> +sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size,
> +               pid_t * __user pids)

I don't see why cargs_size is needed for expansion if you have flags.

> +       The order of pids in @pids is oldest in pids[0] to youngest pid
> +       namespace in pids[nr_pids-1]. If the number of pids specified in the
> +       @pids list is fewer than the nesting level of the process, the pids
> +       are applied from youngest namespace. I.e if the process is nested in
> +       a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids
> +       are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to
> +       have a pid of '0' (the kernel will assign a pid in those namespaces).

That feels backwards. I'd have guessed pids[0] is how the
process sees itself. You'd truncate the array to reduce nesting
level rather than pointing into it.
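
To make the two orderings concrete, here is a minimal sketch. It is not code from the patch: the helper names and the flat per-level output array are assumptions made purely for illustration. Level 0 is the oldest (init) namespace and `level` is the nesting level of the process.

```c
#include <sys/types.h>

/*
 * Ordering as documented in the patch: pids[0] is the oldest of the
 * specified namespaces and pids[nr_pids-1] is the youngest, so a short
 * array lands on the deepest levels. Levels not covered get 0, meaning
 * "let the kernel pick". (Hypothetical helper, not part of the patch.)
 */
static int map_pids_patch_order(int level, int nr_pids,
                                const pid_t *pids, pid_t *out)
{
        int i;

        if (level < 0 || nr_pids < 0 || nr_pids > level + 1)
                return -1;
        for (i = 0; i <= level; i++)
                out[i] = 0;
        for (i = 0; i < nr_pids; i++)
                out[level - nr_pids + 1 + i] = pids[i];
        return 0;
}

/*
 * Alternative ordering: pids[0] is how the process sees itself (its own,
 * youngest namespace), pids[1] is the parent namespace, and so on.
 * Truncating the array simply drops the older levels.
 * (Hypothetical helper, not part of the patch.)
 */
static int map_pids_self_first(int level, int nr_pids,
                               const pid_t *pids, pid_t *out)
{
        int i;

        if (level < 0 || nr_pids < 0 || nr_pids > level + 1)
                return -1;
        for (i = 0; i <= level; i++)
                out[i] = 0;
        for (i = 0; i < nr_pids; i++)
                out[level - i] = pids[i];
        return 0;
}
```

With level 6 and three pids, both helpers fill levels 4 through 6 and leave levels 0 through 3 at 0. The only difference is which end of the array refers to the process's own namespace.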

> +       On failure, eclone() returns -1 and sets 'errno' to one of following
> +       values (the child process is not created).

Careful here: do you intend to document the system call itself,
or an expected glibc wrapper that doesn't exist yet?

> +       EPERM   Caller does not have the CAP_SYS_ADMIN privilege needed to
> +               specify the pids in this call (if pids are not specifed
> +               CAP_SYS_ADMIN is not required).

It seems appropriate to let PID 1 in any PID namespace be
able to assign PIDs in its own namespace and in any
child namespaces.

> +       EINVAL  The child_stack_size field is not 0 (on architectures that
> +               pass in a stack pointer in ->child_stack field).

This needs to change.

> +                "int $0x80\n\t"        /* Linux/i386 system call */
> +                "testl %0,%0\n\t"      /* check return value */
> +                "jne 1f\n\t"           /* jump if parent */
> +
> +                "popl %%esi\n\t"       /* get subthread function */
> +                "call *%%esi\n\t"      /* start subthread function */
> +                "movl %2,%0\n\t"
> +                "int $0x80\n"          /* exit system call: exit subthread */
...
> +/*
> + * Allocate a stack for the clone-child and arrange to have the child
> + * execute @child_fn with @child_arg as the argument.
> + */
...
> +       *--stack = child_arg;
> +       *--stack = child_fn;
...
> +static int do_clone(int (*child_fn)(void *), void *child_arg,
> +               unsigned int flags_low, int nr_pids, pid_t *pids_list)

There needs to be a way to pass child_fn and child_arg
via the kernel. Besides being required for kernel-managed
stacks, it's normally a saner interface. Stack setup would
be much like the stack setup for signal handlers. Imagine
using this for a vfork-like interface that didn't have painful
interactions with the compiler.
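
As a rough illustration of what I mean, the argument block could carry the entry point and its argument. The sketch below is purely hypothetical: the struct name, the `child_fn` and `child_arg` fields, and the helper are not in the patch, and the fixed-width types stand in for the kernel's u64/u32.

```c
#include <stdint.h>

/*
 * Hypothetical variant of the patch's struct clone_args with two extra
 * fields, so the kernel (not userspace assembly) could arrange for the
 * child to start in child_fn(child_arg).
 */
struct clone_args_fn {
        uint64_t clone_flags_high;
        uint64_t child_stack_base;
        uint64_t child_stack_size;
        uint64_t parent_tid_ptr;
        uint64_t child_tid_ptr;
        uint32_t nr_pids;
        uint32_t reserved0;
        uint64_t child_fn;      /* hypothetical: entry point for the child */
        uint64_t child_arg;     /* hypothetical: argument passed to child_fn */
};

/* Store the entry point and argument as 64-bit values, like the other pointer fields. */
static void clone_args_set_entry(struct clone_args_fn *ca,
                                 int (*fn)(void *), void *arg)
{
        ca->child_fn = (uint64_t)(uintptr_t)fn;
        ca->child_arg = (uint64_t)(uintptr_t)arg;
}
```

With something along these lines, userspace would no longer need to push child_fn and child_arg onto the child's stack by hand before the system call.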

Speaking of vfork....

1. Can you implement it for i386 (register starved) using eclone?

2. Can you restart a pair of processes between vfork and execve?