On 03.05.2017 00:13, Eric W. Biederman wrote: > Kirill Tkhai <ktkhai@xxxxxxxxxxxxx> writes: > >> On 02.05.2017 19:33, Oleg Nesterov wrote: >>> sorry for delay, vacation... >>> >>> On 04/28, Kirill Tkhai wrote: >>>> >>>> On 27.04.2017 19:22, Oleg Nesterov wrote: >>>>> >>>>> Ah, OK, I didn't notice the ns->child_reaper check in pidns_for_children_get(). >>>>> >>>>> But note that it doesn't need tasklist_lock too. >>>> >>>> Hm, are there possible strange situations with memory ordering, when we see >>>> ns->child_reaper of already died ns, which was placed in the same memory? >>>> Do we have to use some memory barriers here? >>> >>> Could you spell please? I don't understand your concerns... >>> >>> I don't see how, say, >>> >>> static struct ns_common *pidns_for_children_get(struct task_struct *task) >>> { >>> struct ns_common *ns = NULL; >>> struct pid_namespace *pid_ns; >>> >>> task_lock(task); >>> if (task->nsproxy) { >>> pid_ns = task->nsproxy->pid_ns_for_children; >>> if (pid_ns->child_reaper) { > ^^^^^^^^^^^^^^^^^^^^ > Oleg my apologies I missed this line earlier. > This does look like a valid way to skip read_lock(&tasklist_lock); >>> ns = &pid_ns->ns; >>> get_pid_ns(ns); > ^^^^^^^^^^^^^ This needs to be: > get_pid_ns(pid_ns); > >>> } >>> } >>> task_unlock(task); >>> >>> return ns; >>> } >>> >>> can be wrong. It also looks more clean to me. >>> >>> ->child_reaper is not stable without tasklist, it can be dead/etc, but >>> we do not care? >> >> I mean the following. We had a pid_ns1 with a child_reaper set. Then >> it became dead, and a new pid_ns2 were allocated in the same memory. > > task->nsproxy->pid_ns_for_children is always changed with > task_lock(task) held. See switch_task_namespaces (used by unshare and > setns). This also gives us the guarantee that the pid_ns reference > won't be freed/reused in any for until task_lock(task) is dropped. Now I've checked kmem_cache_zalloc() and it looks like it zeroes cache memory content synchronous on allocation (it seems there is no pre-zeroed memory for GFP_ZERO cases). So, the zeroing happens before switch_task_namespaces() (and task_unlock()) and we're really safe after task_lock() in pidns_for_children_get(). Ok, I'll send new version of the patchset.