Serge E. Hallyn wrote: > Quoting Oren Laadan (orenl@xxxxxxxxxxxxxxx): >> Applied. >> >> Serge E. Hallyn wrote: >>> The nsproxy restore path recognizes that an objref of 0 for >>> ipc or uts ns means don't unshare it. But the checkpoint side >>> forgot to write down 0 when the ipc or uts ns isn't unshared! >>> >>> Fix that. >>> >>> To test, run a program with a private pidns but shared utsns >>> which does >>> >>> sleep(5); >>> sethostname("serge", 6); >>> >>> checkpoint it, reset your hostname (if you let the program >>> complete), then restart the program: without this patch, it >>> will not reset your hostname. It should, and with this patch >>> it will. >>> >>> Signed-off-by: Serge E. Hallyn <serue@xxxxxxxxxx> >>> --- >>> kernel/nsproxy.c | 19 +++++++++++++------ >>> 1 files changed, 13 insertions(+), 6 deletions(-) >>> >>> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c >>> index 0da0d83..dcb502c 100644 >>> --- a/kernel/nsproxy.c >>> +++ b/kernel/nsproxy.c >>> @@ -280,13 +280,20 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy) >>> if (!h) >>> return -ENOMEM; >>> - ret = checkpoint_obj(ctx, nsproxy->uts_ns, CKPT_OBJ_UTS_NS); >>> - if (ret <= 0) >>> - goto out; >>> + ret = 0; >>> + if (nsproxy->uts_ns != ctx->root_nsproxy->uts_ns) { >>> + ret = checkpoint_obj(ctx, nsproxy->uts_ns, CKPT_OBJ_UTS_NS); >>> + if (ret <= 0) >>> + goto out; >>> + } >>> h->uts_objref = ret; >>> - ret = checkpoint_obj(ctx, nsproxy->ipc_ns, CKPT_OBJ_IPC_NS); >>> - if (ret < 0) >>> - goto out; >>> + >>> + ret = 0; >>> + if (nsproxy->ipc_ns != ctx->root_nsproxy->ipc_ns) { >>> + ret = checkpoint_obj(ctx, nsproxy->ipc_ns, CKPT_OBJ_IPC_NS); >>> + if (ret < 0) >>> + goto out; >>> + } >>> h->ipc_objref = ret; >>> /* FIXME: for now, only marked visited to pacify leaks */ > > All right, tihs patch was not right. What we should be checking > is whether nsproxy->uts_ns != ctx->root_task->parent->nsproxy->uts_ns. > But I don't want to just send the patch to do that until we discuss > whether that is the right thing to do. > > Let me give a precise definition: I call an 'ambient namespace' a > namespace which was not unshared when the container was created. > Unfortunately there isn't really a reliable way to tell whether that > was the case. Checking container_init->parent may depend upon the > container init not having been reparented. Hmm... yeah, I should have looked at it more carefully - My original idea is that someone (e.g. userspace) could zero out, e.g., the h->uts_objref, and that way allow a restart to "inherit" the uts-ns of the parent. I didn't not do it at checkpoint because (a) I wanted to allow flexibility by letting the user choose later, and (b) as you pointed out already, it's hard to figure out this property at checkpoint anyway. Using a leak detection is tricky, because if we are doing full container checkpoint, we disallow leaks anyway, and if we are doing a subtree, then leaks are allowed. > > So as I see it we can do three things: > > 1. always unshare any namespace which was not empty at checkpoint. > So if the container was not unshared from host, and we checkpoint > members of that namespace, then at restart we will restart in an > unshared namespace and recreate the objects. That basically means > undo the patch I originally sent. > > That means that if the restarted task does 'hostname' it may end > up not affecting the hosts's hostname, even if it was originally > started on the host without separate utsns. Maybe that's what we > want? > This is the default we have used so far, and I'm quite happy with it. > 2. use the simple 'nsproxy->uts_ns != ctx->root_task->parent->nsproxy->uts_ns' > test. I think that would be pretty reliable. > > 3. for each namespace in ctx->root_nsproxy, check whether there are > any leaks, and, if so, mark it in the checkpoing image header so that > we can give restart a hint that it might not want to unshare those. If we leave some work to userspace anyway, an alternative to doing the accounting in the kernel (remember: this scenario only makes sense for non-container checkpoint), is to simply also save the ref count of the nsproxy with the nsproxy data (not even each namespace). Then user space can figure out if there is a "leak". Finally, if we do want to allow such a leak (e.g. only the uts-ns of the root) in a full container checkpoint, then we will need some way (flag ?) to request that when doing the checkpoint. So for now, I simply revert the patch (unless you object). Oren. _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linux-foundation.org/mailman/listinfo/containers