Re: [RFC v14][PATCH 00/54] Kernel based checkpoint/restart

Oren Laadan <orenl@xxxxxxxxxxxxxxx> · Mon, 04 May 2009 16:13:59 -0400

Serge E. Hallyn wrote:
> Quoting Oren Laadan (orenl@xxxxxxxxxxxxxxx):
>>> I see one drawback with this approach if you allow checkpoint of
>>> application that is not isolated in a container. In that case, you may
>>> want to select which IPC objects to dump to not dump all the IPC objects
>>> living in the system. Indeed, this is why we have chosen in Kerrighed to
>>> checkpoint IPC objects independently of tasks, since we have no
>>> container/namespaces support currently.
>> I assume that in this case it will be the application itself that
>> will somehow tell the system which specific sysvipc objects (ids) it
>> cares about.
>>
>> (I'm not sure how would the system otherwise know what to dump and
>> what to leave out).
>>
>> I originally proposed the construct of cradvise() syscall to handle
>> exactly those cases where the application would like to advise the
>> kernel about certain resources. So, extending the previous example,
>> a task may call something like:
>>
>>    cradvise(CHECKPOINT_SYSVIPC_SHM, false);  /* generally skip shm */
>>    cradvise(CHECKPOINT_SYSVIPC_SHMID, id, true);  /* but include this */
>>
>> or:
>>    cradvise(CHECKPOINT_SYSVIPC_SHM, true);  /* generally include shm */
>>    cradvise(CHECKPOINT_SYSVIPC_SHMID, id, false);  /* but skip this */
>>
>> Anyway, these are just examples of the concept and what sort of generic
>> interface can be used to implement it; don't pick on the details...
>>
>> Oren.
> 
> Oren, I have to be honest:  I could of course be wrong, but imo there
> is 0 chance of such a bigger-and-uglier-than-ioctl syscall as cradvise
> being accepted upstream.  There may be good uses for it, but I think
> it's worthwhile thinking of ways around it whenever possible.

Clearly there is a tradeoff between the flexibility and granularity
of control that one can have over how checkpoint/restart is done, vs.
the complexity of the interface.

Unlike ioctl(), which is a dumping ground for any _type_ of device, what
I'd expect from a cradvise()-like mechanism is to allow control over any
_class_ of resource in the kernel. One can easily enumerate the classes
that exist in the kernel now: mostly open file descriptors, namespaces,
sysvipc, memory descriptors, memory contents, etc. I don't expect
cradvise() to be specific to any particular device - that will be
userspace's responsibility.

IOW, while we need to think carefully about what the interface would be,
I don't expect it to be bigger and uglier than ioctl(), because of its
focused scope, besides the fact that ioctl() is hard to compete with to
begin with...
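
To make the "class of resource" idea a bit more concrete, here is a
rough userspace model of how such advice could be resolved: a default
per class, plus per-object overrides. This is only a sketch of the
concept, not a proposed kernel interface; every name in it (cr_class,
cr_policy, cr_advise_class(), cr_advise_object(),
cr_should_checkpoint()) is hypothetical, as is the choice to include
everything by default.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical resource classes, for illustration only. */
enum cr_class { CR_CLASS_FILES, CR_CLASS_SYSVIPC_SHM, CR_CLASS_NR };

#define CR_MAX_OVERRIDES 16

/* Advice for one specific object (e.g. one shmid) within a class. */
struct cr_override {
	enum cr_class cls;
	int id;
	bool include;
};

struct cr_policy {
	bool class_default[CR_CLASS_NR];
	struct cr_override overrides[CR_MAX_OVERRIDES];
	size_t nr_overrides;
};

/* Assumed default: checkpoint everything unless told otherwise. */
void cr_policy_init(struct cr_policy *p)
{
	for (int c = 0; c < CR_CLASS_NR; c++)
		p->class_default[c] = true;
	p->nr_overrides = 0;
}

/* Class-wide advice, e.g. "generally skip shm". */
void cr_advise_class(struct cr_policy *p, enum cr_class cls, bool include)
{
	p->class_default[cls] = include;
}

/* Per-object advice, e.g. "but include this shmid".
 * Returns 0 on success, -1 if the override table is full. */
int cr_advise_object(struct cr_policy *p, enum cr_class cls, int id,
		     bool include)
{
	for (size_t i = 0; i < p->nr_overrides; i++) {
		if (p->overrides[i].cls == cls && p->overrides[i].id == id) {
			p->overrides[i].include = include;
			return 0;
		}
	}
	if (p->nr_overrides == CR_MAX_OVERRIDES)
		return -1;
	p->overrides[p->nr_overrides++] =
		(struct cr_override){ .cls = cls, .id = id, .include = include };
	return 0;
}

/* A per-object override wins over the class default. */
bool cr_should_checkpoint(const struct cr_policy *p, enum cr_class cls,
			  int id)
{
	for (size_t i = 0; i < p->nr_overrides; i++) {
		if (p->overrides[i].cls == cls && p->overrides[i].id == id)
			return p->overrides[i].include;
	}
	return p->class_default[cls];
}
```

The point is that the set of classes is small and enumerable, while
anything object-specific is just an id within a class.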

> 
> In this particular case, wouldn't it be better to do something like:
> 
> 	1. freeze + checkpoint full application + container (== C1)
> 	2. continue application, which does a clone(CLONE_COPYIPC) (*1)
> 	3. application removes all shms except the one to be
> 	checkpointed
> 	4. freeze + checkpoint application again ( == C2)
> 	5. restart application from C1
> 
> This requires an ability to clone an ipc namespace while copying its
> contents, but that seems more viable upstream, and more generally
> useful, than yet another use for cradvise().

Sure, and indeed possibly useful outside the c/r domain.

Note that for performance (speed, memory) reasons it will require
that the clone be done in COW style - not trivial for SHM.
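
For reference, the COW behavior I have in mind can be illustrated from
userspace with an ordinary file mapping: a MAP_PRIVATE view gets its own
copy of a page only when it writes to it. This is just an analogy using
standard mmap(), not sysvipc code, and the helper name
cow_write_is_isolated() is made up for the example. The hard part for
SHM is the other direction, which this sketch does not address: keeping
the copy stable while the original keeps being written.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if a write through the private (COW) view left the shared
 * view untouched, 0 if it leaked through, -1 on setup failure. */
int cow_write_is_isolated(void)
{
	const size_t len = 4096;
	FILE *f = tmpfile();	/* backing object standing in for a segment */
	if (!f)
		return -1;
	int fd = fileno(f);
	if (ftruncate(fd, (off_t)len) != 0) {
		fclose(f);
		return -1;
	}

	/* The "original": writes go to the backing object. */
	char *shared = mmap(NULL, len, PROT_READ | PROT_WRITE,
			    MAP_SHARED, fd, 0);
	if (shared == MAP_FAILED) {
		fclose(f);
		return -1;
	}
	strcpy(shared, "original");

	/* The "clone": pages are copied only when this view writes. */
	char *copy = mmap(NULL, len, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE, fd, 0);
	if (copy == MAP_FAILED) {
		munmap(shared, len);
		fclose(f);
		return -1;
	}

	/* This write triggers the copy; the shared view must not change. */
	strcpy(copy, "modified");
	int isolated = strcmp(shared, "original") == 0 &&
		       strcmp(copy, "modified") == 0;

	munmap(copy, len);
	munmap(shared, len);
	fclose(f);
	return isolated;
}
```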

Oren.

_______________________________________________
Containers mailing list
Containers@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linux-foundation.org/mailman/listinfo/containers