> >> The ability to stream the checkpoint image IMHO is invaluable.
> >> It's the unix way (TM) of doing things; it makes the process pipe-able.
> >>
> >> You can do many nice things when the checkpoint can be streamed: you
> >> can compress, sign, encrypt etc on the fly without taking additional
> >> diskspace. You can transfer over the network (e.g. for migration),
> >> or store remotely without explicit file system support. You can easily
> >> transform the stream from one c/r version to another etc.
> >>
> >> This should be a design principle. In my experience I never hit a wall
> >> that forced me to "sacrifice" this decision.
> >>
> >>> sacrificed (read: child can ptrace parent)
> >> Hmmm... if all tasks are created in user space, then this specific
> >> case becomes a no-brainer !
> >
> > No!
>
> Actually yes :)
>
> >
> > A ptraces B. Container is checkpointed.
> >
> > Kernel realizes ptrace is going on. A and B in theory can have any
> > relationship.
> >
> > Consequently, kernel doesn't know in which order to dump A and B.
> >
> > And there is no such order:
> > *) A can be parent of B (you dump A, B),
> > *) A can be child of B (you want to dump B, A, but this conflicts with
> >    ->real_parent order)
> > *) A and B just tasks (any order).
>
> Current code does not support ptrace() - which has a multitude
> of tidy-bits issues to solve during restart regardless.
>
> However, creating tasks in userspace uses (and will use) only
> "real" process relationships, not ptrace-relationships, when it
> comes to deciding on the fork/clone order.
>
> Technically, that can be done in checkpoint (dumping the task tree)
> or in restart-user-space (rearranging the data before fork/clone).
>
> >
> > I'm showing that whole issue can be avoided:
>
> If the issue can be avoided, then why would you need to sacrifice
> the stream-ability of the checkpoint image ?
>
> > *) all tasks are simply created regardless of who is parent of whom
> >    (see kernel_thread())
> > *) Every task_struct image among other things contains references to
> >    ->real_parent and ->parent.
> > *) After every task is created it's time to change references:
> >    **) lookup who is ->real_parent, change ->real_parent _by hand_
> >        not with some "correct clone(2)" order.
> >    **) lookup who is ->parent, change ->parent.
> >
> > You're probably escaping all of this with object numbers?
>
> (Will be) escaping this by arranging to fork/clone in the proper order.

task_struct and reparenting is just an example.

There is another loop:

	struct user_struct => struct user_namespace => struct user_namespace::creator

Before the actual dump, each struct user_struct gets a unique id (objref,
whatever) and is simply dumped regardless of order. The image of struct
user_namespace contains the id of the creator user and is dumped as well.

On restart:

	restart user_ns
	restart user
	lookup object by creator id
		if found, rewrite ->creator
		if not found, restore creator user, and rewrite ->creator.

So, yes, if the object number is dumped on disk, you get streamability
in the presence of loops. Clever.

Just needs a way to quickly look up file position by object id.

BTW, this is why OpenVZ code has a "section" concept. I hoped it wouldn't
be needed.
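
To make the objref idea concrete, here is a minimal userspace sketch of the
user_struct / user_namespace loop, in Python for brevity. It is not taken from
either patch set: the record layout (objref, kind, refs), the one-JSON-record-
per-line encoding and the dump()/restore() names are all assumptions made for
illustration. Instead of looking up a file position by object id, this variant
parks references to objects it has not read yet and patches them when the
target arrives, so a single forward pass over the stream is enough.

```python
import io
import json


def dump(records, stream):
    """Write one JSON record per line, in whatever order the caller supplies."""
    for rec in records:
        stream.write(json.dumps(rec) + "\n")


def restore(stream):
    """Rebuild objects in a single forward pass over the stream.

    A reference to an objref that has not been read yet is parked in
    `pending` and patched as soon as the target object shows up.
    """
    objs = {}     # objref -> restored object (a plain dict in this sketch)
    pending = {}  # objref -> list of (object, field) waiting for it

    for line in stream:
        rec = json.loads(line)
        obj = {"kind": rec["kind"], "objref": rec["objref"]}
        objs[rec["objref"]] = obj

        # Resolve this object's outgoing references, or defer them.
        for field, target in rec["refs"].items():
            if target in objs:
                obj[field] = objs[target]
            else:
                obj[field] = None
                pending.setdefault(target, []).append((obj, field))

        # Patch every earlier object that was waiting for this one.
        for waiter, field in pending.pop(rec["objref"], []):
            waiter[field] = obj

    if pending:
        raise ValueError(f"dangling objrefs: {sorted(pending)}")
    return objs


# Hypothetical image: the namespace is dumped before its creator user.
records = [
    {"objref": 2, "kind": "user_namespace", "refs": {"creator": 1}},
    {"objref": 1, "kind": "user_struct", "refs": {"user_ns": 2}},
]
buf = io.StringIO()
dump(records, buf)
objs = restore(io.StringIO(buf.getvalue()))
```

After restore(), objs[2]["creator"] and objs[1]["user_ns"] point at each
other, whichever record came first in the stream. The cost is the table of
pending fixups held in memory. A lookup by file position, as mentioned above,
avoids that table but needs a seekable image.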