Oren Laadan wrote: > > Serge E. Hallyn wrote: >> Quoting Sukadev Bhattiprolu (sukadev@xxxxxxxxxxxxxxxxxx): >> >> Bear with me as I try to get this straight... >> >> Let's say we have the following task being checkpointed (btw, the >> actual checkpoint code hopefully will, as Oren mentioned, not >> be always checkpointing all of a task's pids, but only the >> pids from the checkpointer's pidns level an up - but that doesn't >> matter to you): >> >> Either: >> >> level no | pid >> 0 1001 >> 1 50 >> 2 3 >> >> or >> >> level no | pid >> 0 1001 >> 1 50 >> 2 1 >> >> Now we want to restart that inside a container. So for the >> first task we have a parent task: >> >> level no | pid >> 0 5009 >> 1 1000 >> 2 49 >> 3 2 >> >> calling clone_with_pid with upids {1001,50,3} to produce task: >> >> level no | pid >> 0 5010 >> 1 1001 >> 2 50 >> 3 3 >> >> So the numbers in your code for this case become: >> >>> + unum_pids = pid_set.num_pids; >> unum_pids = 3 >>> + knum_pids = task_pid(current)->level + 1; >> knum_pids = 3 + 1 = 4 >> >>> + target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL); >> XXX First problem: >> target_pids gets room for 5 pids, one more than we need. > > The two issues are related, and this is intentional. The idea > was that when you use CLONE_NEWPID, you imply a new nesting level > and a pid==1 there. So you are not supposed to specify the pid > of that new level. > > IOW, the parent specify the pid's of the child from the _parent's_ > level and up (of the desired depth). CLONE_NEWPID creates a new > pidns level below the parent's, and that is not covered in the > array of pids. > > By allocation an extra slot and forcing it to be 0, we ensure that > the case of CLONE_NEWPID is covered correctly. Clearly, if this > flag isn't set, then the extra slot is redundant (but doesn't hurt). > >>> + j = knum_pids - unum_pids; >> j = 1, so we copy the 3 pids in the right place. >> >>> + rc = copy_from_user(&target_pids[j], pid_set.target_pids, size); >>> + if (rc) { >>> + rc = -EFAULT; >>> + goto out_free; >>> + } >>> + >>> + return target_pids; >> >> For the second one, we have a parent task >> >> level no | pid >> 0 5009 >> 1 1000 >> 2 49 >> >> calling clone_with_pid with CLONE_NEWPID and {1001,50,1} to produce: >> >> level no | pid >> 0 5010 >> 1 1001 >> 2 50 >> 3 1 >> >> So the numbers in your code become: >> >>> + unum_pids = pid_set.num_pids; >> unum_pids = 3 > > This is a "bug" of the parent. The parent should specify the pids > from the parent's level only and up, and not include the new level > below that will be created. (After all, it will have to be 1). Suka: maybe state this explicitly,too, in the patch description and in the comments in the code ? Like: "The syscall operates relative to the parent's nesting level. If CLONE_NEWPID is used as well, it will create an additional level below. In this case the syscall indicates the desired pids at the nesting levels seen by the parent." Oren. _______________________________________________ Containers mailing list Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx https://lists.linux-foundation.org/mailman/listinfo/containers