Hi, Andrei Vagin > -----Original Message----- > From: 'Andrei Vagin' [mailto:avagin@xxxxxxxxx] > Sent: Saturday, August 06, 2016 1:14 AM > To: Zhao Lei <zhaolei@xxxxxxxxxxxxxx> > Cc: 'LKML' <linux-kernel@xxxxxxxxxxxxxxx>; 'Linux Containers' > <containers@xxxxxxxxxxxxxxxxxxxxxxxxxx>; 'Eric W. Biederman' > <ebiederm@xxxxxxxxxxxx> > Subject: Re: [PATCH v2 1/2] Limit dump_pipe program's permission to init for > container > > On Fri, Aug 05, 2016 at 03:52:25PM +0800, Zhao Lei wrote: > > Hi, Andrei Vagin > > > > Thanks for your detailed review and suggestion. > > > > > -----Original Message----- > > > From: Andrei Vagin [mailto:avagin@xxxxxxxxx] > > > Sent: Friday, August 05, 2016 2:32 PM > > > To: Zhao Lei <zhaolei@xxxxxxxxxxxxxx> > > > Cc: LKML <linux-kernel@xxxxxxxxxxxxxxx>; Linux Containers > > > <containers@xxxxxxxxxxxxxxxxxxxxxxxxxx>; Eric W. Biederman > > > <ebiederm@xxxxxxxxxxxx> > > > Subject: Re: [PATCH v2 1/2] Limit dump_pipe program's permission to init > for > > > container > > > > > > On Tue, Aug 2, 2016 at 2:08 AM, Zhao Lei <zhaolei@xxxxxxxxxxxxxx> wrote: > > > > Currently when we set core_pattern to a pipe, the pipe program is > > > > forked by kthread running with root's permission, and write dumpfile > > > > into host's filesystem. > > > > Same thing happened for container, the dumper and dumpfile are also > > > > in host(not in container). > > > > > > > > It have following program: > > > > 1: Not consistent with file_type core_pattern > > > > When we set core_pattern to a file, the container will write dump > > > > into container's filesystem instead of host. > > > > 2: Not safe for privileged container > > > > In a privileged container, user can destroy host system by following > > > > command: > > > > # # In a container > > > > # echo "|/bin/dd of=/boot/vmlinuz" >/proc/sys/kernel/core_pattern > > > > # make_dump > > > > > > > > This patch switch dumper program's environment to init task, so, for > > > > container, dumper program have same environment with init task in > > > > container, which make dumper program put in container's filesystem, and > > > > write coredump into container's filesystem. > > > > The dumper's permission is also limited into subset of container's init > > > > process. > > > > > > Does it change the current behavior? A pid namespace may be used for > > > sandboxes. For example, chrome uses it. In this case, we probably want > > > to collect core dumps from a root pid namespace. > > > > > > If we are going to virtualize core_pattern relative to the pid > > > namespace, maybe it's better to make it optional for a pid namespace. > > > I mean it core_pattern is not set for the current pid namespace, we > > > can check a parent pid namespace and so on. A helper will be executed > > > in a pid namespace, where core_pattern is set. > > > > > Is it means, we don't want to always gather dump from the container ifself, > > wnd we want to custom which level of the container can gather dump > > information? > > > > If it is exact, current patch have similar function as above. > > > > If user had not set core_pattern in a namespace, the dump setting in > > the parent namespace will be used. > > For example, > > HOST change core_pattern > > Container1 no change > > Container2 change core_pattern > > Container3 no change > > Container4 no change > > > > If code dump happened in container 2, 3 and 4, the dumper's environment > > is same as container2. > > And if dump happened in container 1 and HOST, the dumper's environment > > will be same as HOST. > > > > So, > > 1: if a container want to make the custom core dump, > > it can set core_pattern in container, and the gathered information > > will be limited in this container. > > 2: If the host(or some top level container) want to gather dump information > > more than a container internal, the host(or some top level container) > > can set core_pattern, and not change core_pattern in the container, > > then the pipe program will be run in same environment as the host, > > and gather information from host's env. > > The behavior which you described is the same with what I'm thinking > about. > > But I don't understand how you implement it in a code. > I see that you call find_task_by_vpid(1) to get a base task, so the base > that is always the init task of the current pid namespace. I think we > need to get the init task from a pid namespace where core_pattern is > set, don't we? > > pidns1 task1 (core_pattern is set) > \_pidns2 task2 (core_pattern isn't set) > \_pidns2 ... > \_pidns2 taskX (crashed and call do_coredump()) > > find_task_by_vpid(1) is called from do_coredump() and will return task2, > but if I understand you correctly, the base_task has to be task1 in this > case. > find_task_by_vpid(1) is changed in [PATCH 2/2]. It changed core_pattern to per-ns instead of global one, and save pid_ns of the task who changed core_pattern, and use above pid_ns for find_task_by_vpid(). In above pid tree, the pidns2 taskX crashed, then get core_pattern and pid_ns from pidns1's struct, so we can run pipe_program in pidns1. > > > > > After reading these patches, I think it may be a good idea to add one > > > more mode to handle core files, when a core file is sent to an > > > existing process instead of executing a usermode helpers. This mode > > > will cover of use-cases where the pipe mode is used now. And it looks > > > more secure, because in this case we control namespaces and credential > > > for the abort daemon (a process which handles core files).u > > > > > > How it can be done. An abort daemon creates an abstract listening unix > > > socket and writes its name into /proc/sys/kernel/core_pattern. The > > > kernel saves this name and the network namespace. Then when any > > > process is crashed, the kernel creates a new unix socket and connect > > > it to the socket of the abort daemon and streams a core file into this > > > socket. > > > > > Good idea. > > > > And do it means if we want to custom core_pattern for each container, > > we can create more than one abort daemon for it? > > Yes. > > > If the abort daemon is running in the host, the dump file will be write > > into host's env, and if abort daemon is running in container, the dump > > file will be write into container's fs. > > Yes. An abort daemon handles core files in its namespaces. The abort > daemon knows a pid of a crashed process, so it has access to namespaces of > the crashed task via /proc/pid/ns/* and can work with them. I want to say > that this daemon can decided in which context to handle a core file. > > I don't like the idea to play with namespaces, credentials, an security > context for user helpers. > Thanks for your detained explanation, I think it should be a useful function for core dump. Thanks Zhaolei > > > > > > > > > > Suggested-by: Eric W. Biederman <ebieyderm@xxxxxxxxxxxx> > > > > Suggested-by: KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx> > > > > > > > > Signed-off-by: Zhao Lei <zhaolei@xxxxxxxxxxxxxx> > > > > --- > > > > fs/coredump.c | 87 > > > ++++++++++++++++++++++++++++++++++++++++++++++++- > > > > include/linux/binfmts.h | 1 + > > > > 2 files changed, 87 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/fs/coredump.c b/fs/coredump.c > > > > index 281b768..8511267 100644 > > > > --- a/fs/coredump.c > > > > +++ b/fs/coredump.c > > > > @@ -516,6 +516,8 @@ static int umh_pipe_setup(struct > subprocess_info > > > *info, struct cred *new) > > > > { > > > > struct file *files[2]; > > > > struct coredump_params *cp = (struct coredump_params > > > *)info->data; > > > > + struct task_struct *base_task; > > > > + > > > > int err = create_pipe_files(files, 0); > > > > if (err) > > > > return err; > > > > @@ -524,10 +526,79 @@ static int umh_pipe_setup(struct > subprocess_info > > > *info, struct cred *new) > > > > > > > > err = replace_fd(0, files[0], 0); > > > > fput(files[0]); > > > > + if (err) > > > > + return err; > > > > + > > > > /* and disallow core files too */ > > > > current->signal->rlim[RLIMIT_CORE] = (struct rlimit){1, 1}; > > > > > > > > - return err; > > > > + base_task = cp->base_task; > > > > + if (base_task) { > > > > + const struct cred *base_cred; > > > > + > > > > + /* Set fs_root to base_task */ > > > > + spin_lock(&base_task->fs->lock); > > > > + set_fs_root(current->fs, &base_task->fs->root); > > > > + spin_unlock(&base_task->fs->lock); > > > > + > > > > + /* Set namespaces to base_task */ > > > > + get_nsproxy(base_task->nsproxy); > > > > + switch_task_namespaces(current, > base_task->nsproxy); > > > > > > This task will continue running in the current pid namespace, because > > > switch_task_namespaces doesn't change a pid for the task. Ussualy, we > > > need to make setns+fork to switch a pid namespace. > > > > > Yes. > > Now I known why I had not find this problem in my test, > > I output pids by a shell command, and the shell command is already got > forked. > > > > Seems need double fork(make code complex...) to do it. > > > > > > + > > > > + /* Set cgroup to base_task */ > > > > + current->flags &= ~PF_NO_SETAFFINITY; > > > > + err = cgroup_attach_task_all(base_task, current); > > > > + if (err < 0) > > > > + return err; > > > > + > > > > + /* Set cred to base_task */ > > > > + base_cred = get_task_cred(base_task); > > > > + > > > > > > I think you can use prepare_kernel_cred here > > > > > prepare_kernel_cred() is called by caller of this > function(call_usermodehelper_exec_async()), > > and this function is a hook, which can set additional cred property like: > > call_usermodehelper_exec_async() > > { > > prepare_kernel_cred() > > this_function() > > { > > custom_cred struct > > } > > commit_creds(new); > > } > > > > So we can only change cred's container here. > > > > But seems we need rewrite this code when we move to above "double fork". > > > > Before rewrite, could you give me some suggestion, if current code > > already can custom to dump information of "container internal" and "parent > level container", > > is this meet your request? > > > > Thanks > > Zhaolei > > > > > > + new->uid = base_cred->uid; > > > > + new->gid = base_cred->gid; > > > > + new->suid = base_cred->suid; > > > > + new->sgid = base_cred->sgid; > > > > + new->euid = base_cred->euid; > > > > + new->egid = base_cred->egid; > > > > + new->fsuid = base_cred->fsuid; > > > > + new->fsgid = base_cred->fsgid; > > > > + > > > > + new->securebits = base_cred->securebits; > > > > + > > > > + new->cap_inheritable = base_cred->cap_inheritable; > > > > + new->cap_permitted = > base_cred->cap_permitted; > > > > + new->cap_effective = base_cred->cap_effective; > > > > + new->cap_bset = base_cred->cap_bset; > > > > + new->cap_ambient = base_cred->cap_ambient; > > > > + > > > > + security_cred_free(new); > > > > +#ifdef CONFIG_SECURITY > > > > + new->security = NULL; > > > > +#endif > > > > + err = security_prepare_creds(new, base_cred, > > > GFP_KERNEL); > > > > + if (err < 0) { > > > > + put_cred(base_cred); > > > > + return err; > > > > + } > > > > + > > > > + free_uid(new->user); > > > > + new->user = base_cred->user; > > > > + get_uid(new->user); > > > > + > > > > + put_user_ns(new->user_ns); > > > > + new->user_ns = base_cred->user_ns; > > > > + get_user_ns(new->user_ns); > > > > + > > > > + put_group_info(new->group_info); > > > > + new->group_info = base_cred->group_info; > > > > + get_group_info(new->group_info); > > > > + > > > > + put_cred(base_cred); > > > > + > > > > + validate_creds(new); > > > > + } > > > > + > > > > + return 0; > > > > } > > > > > > > > void do_coredump(const siginfo_t *siginfo) > > > > @@ -590,6 +661,7 @@ void do_coredump(const siginfo_t *siginfo) > > > > > > > > if (ispipe) { > > > > int dump_count; > > > > + struct task_struct *vinit_task; > > > > char **helper_argv; > > > > struct subprocess_info *sub_info; > > > > > > > > @@ -631,6 +703,14 @@ void do_coredump(const siginfo_t *siginfo) > > > > goto fail_dropcount; > > > > } > > > > > > > > + rcu_read_lock(); > > > > + vinit_task = find_task_by_vpid(1); > > > > + rcu_read_unlock(); > > > > + if (!vinit_task) { > > > > + printk(KERN_WARNING "failed getting init > task > > > info, skipping core dump\n"); > > > > + goto fail_dropcount; > > > > + } > > > > + > > > > helper_argv = argv_split(GFP_KERNEL, cn.corename, > > > NULL); > > > > if (!helper_argv) { > > > > printk(KERN_WARNING "%s failed to > allocate > > > memory\n", > > > > @@ -638,6 +718,10 @@ void do_coredump(const siginfo_t *siginfo) > > > > goto fail_dropcount; > > > > } > > > > > > > > + get_task_struct(vinit_task); > > > > + > > > > + cprm.base_task = vinit_task; > > > > + > > > > retval = -ENOMEM; > > > > sub_info = > call_usermodehelper_setup(helper_argv[0], > > > > helper_argv, > > > NULL, GFP_KERNEL, > > > > @@ -646,6 +730,7 @@ void do_coredump(const siginfo_t *siginfo) > > > > retval = > call_usermodehelper_exec(sub_info, > > > > > > > UMH_WAIT_EXEC); > > > > > > > > + put_task_struct(vinit_task); > > > > argv_free(helper_argv); > > > > if (retval) { > > > > printk(KERN_INFO "Core dump to |%s pipe > > > failed\n", > > > > diff --git a/include/linux/binfmts.h b/include/linux/binfmts.h > > > > index 314b3ca..0c9a72c 100644 > > > > --- a/include/linux/binfmts.h > > > > +++ b/include/linux/binfmts.h > > > > @@ -59,6 +59,7 @@ struct linux_binprm { > > > > > > > > /* Function parameter for binfmt->coredump */ > > > > struct coredump_params { > > > > + struct task_struct *base_task; > > > > const siginfo_t *siginfo; > > > > struct pt_regs *regs; > > > > struct file *file; > > > > -- > > > > 1.8.5.1 > > > > > > > > > > > > > > > > _______________________________________________ > > > > Containers mailing list > > > > Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx > > > > https://lists.linuxfoundation.org/mailman/listinfo/containers > > > > > > > > > > > _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linuxfoundation.org/mailman/listinfo/containers