Quoting Oren Laadan (orenl@xxxxxxxxxxxxxxx): > > > Serge E. Hallyn wrote: > > Quoting Oren Laadan (orenl@xxxxxxxxxxxxxxx): > >> Applied. > >> > >> Serge E. Hallyn wrote: > >>> The nsproxy restore path recognizes that an objref of 0 for > >>> ipc or uts ns means don't unshare it. But the checkpoint side > >>> forgot to write down 0 when the ipc or uts ns isn't unshared! > >>> > >>> Fix that. > >>> > >>> To test, run a program with a private pidns but shared utsns > >>> which does > >>> > >>> sleep(5); > >>> sethostname("serge", 6); > >>> > >>> checkpoint it, reset your hostname (if you let the program > >>> complete), then restart the program: without this patch, it > >>> will not reset your hostname. It should, and with this patch > >>> it will. > >>> > >>> Signed-off-by: Serge E. Hallyn <serue@xxxxxxxxxx> > >>> --- > >>> kernel/nsproxy.c | 19 +++++++++++++------ > >>> 1 files changed, 13 insertions(+), 6 deletions(-) > >>> > >>> diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c > >>> index 0da0d83..dcb502c 100644 > >>> --- a/kernel/nsproxy.c > >>> +++ b/kernel/nsproxy.c > >>> @@ -280,13 +280,20 @@ static int do_checkpoint_ns(struct ckpt_ctx *ctx, struct nsproxy *nsproxy) > >>> if (!h) > >>> return -ENOMEM; > >>> - ret = checkpoint_obj(ctx, nsproxy->uts_ns, CKPT_OBJ_UTS_NS); > >>> - if (ret <= 0) > >>> - goto out; > >>> + ret = 0; > >>> + if (nsproxy->uts_ns != ctx->root_nsproxy->uts_ns) { > >>> + ret = checkpoint_obj(ctx, nsproxy->uts_ns, CKPT_OBJ_UTS_NS); > >>> + if (ret <= 0) > >>> + goto out; > >>> + } > >>> h->uts_objref = ret; > >>> - ret = checkpoint_obj(ctx, nsproxy->ipc_ns, CKPT_OBJ_IPC_NS); > >>> - if (ret < 0) > >>> - goto out; > >>> + > >>> + ret = 0; > >>> + if (nsproxy->ipc_ns != ctx->root_nsproxy->ipc_ns) { > >>> + ret = checkpoint_obj(ctx, nsproxy->ipc_ns, CKPT_OBJ_IPC_NS); > >>> + if (ret < 0) > >>> + goto out; > >>> + } > >>> h->ipc_objref = ret; > >>> /* FIXME: for now, only marked visited to pacify leaks */ > > > > All right, tihs patch was not right. What we should be checking > > is whether nsproxy->uts_ns != ctx->root_task->parent->nsproxy->uts_ns. > > But I don't want to just send the patch to do that until we discuss > > whether that is the right thing to do. > > > > Let me give a precise definition: I call an 'ambient namespace' a > > namespace which was not unshared when the container was created. > > Unfortunately there isn't really a reliable way to tell whether that > > was the case. Checking container_init->parent may depend upon the > > container init not having been reparented. > > Hmm... yeah, I should have looked at it more carefully - Me too :) > My original idea is that someone (e.g. userspace) could zero out, > e.g., the h->uts_objref, and that way allow a restart to "inherit" > the uts-ns of the parent. > > I didn't not do it at checkpoint because (a) I wanted to allow > flexibility by letting the user choose later, and (b) as you pointed > out already, it's hard to figure out this property at checkpoint > anyway. > > Using a leak detection is tricky, because if we are doing full > container checkpoint, we disallow leaks anyway, and if we are > doing a subtree, then leaks are allowed. > > > > > So as I see it we can do three things: > > > > 1. always unshare any namespace which was not empty at checkpoint. > > So if the container was not unshared from host, and we checkpoint > > members of that namespace, then at restart we will restart in an > > unshared namespace and recreate the objects. That basically means > > undo the patch I originally sent. > > > > That means that if the restarted task does 'hostname' it may end > > up not affecting the hosts's hostname, even if it was originally > > started on the host without separate utsns. Maybe that's what we > > want? > > > > This is the default we have used so far, and I'm quite happy with > it. > > > 2. use the simple 'nsproxy->uts_ns != ctx->root_task->parent->nsproxy->uts_ns' > > test. I think that would be pretty reliable. > > > > 3. for each namespace in ctx->root_nsproxy, check whether there are > > any leaks, and, if so, mark it in the checkpoing image header so that > > we can give restart a hint that it might not want to unshare those. > > If we leave some work to userspace anyway, an alternative to doing > the accounting in the kernel (remember: this scenario only makes > sense for non-container checkpoint), is to simply also save the ref > count of the nsproxy with the nsproxy data (not even each namespace). > Then user space can figure out if there is a "leak". > > Finally, if we do want to allow such a leak (e.g. only the uts-ns > of the root) in a full container checkpoint, then we will need some > way (flag ?) to request that when doing the checkpoint. > > So for now, I simply revert the patch (unless you object). Nope. I think it's best. thanks, -serge _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linux-foundation.org/mailman/listinfo/containers