On Sun, Apr 17, 2011 at 7:41 AM, Tharindu Rukshan Bamunuarachchi <btharindu@xxxxxxxxx> wrote:
> hi all,
>
> has anyone heard about or used task snapshot mechanism for Linux ?
>
> what i mean by process hibernation ... stop process , take snapshot of
> current state and later start/continue from the point of snapshot. (in
> case of failure of original process)
>

Hello Tharindu,

I work in a research group in the HPC field. Our group develops many tools that use process checkpoint/restart (CR). The people here use three CR mechanisms that I'm aware of:

1- Berkeley Lab's Checkpoint/Restart - BLCR (https://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml):

Pros:
- Probably the most robust CR framework for Linux. It is a hybrid kernel-space/user-space implementation.
- You can compile the OpenMPI message-passing library with BLCR support to checkpoint distributed applications, which is very useful in HPC (http://osl.iu.edu/research/ft/ompi-cr/).

Cons:
- Its development seems to be slowing down. The last official release is 0.82 (June 16, 2009) and it supports kernel 2.6.30 (pretty old). There are some patches circulating on the development mailing list for building against newer kernels, but I think they only add support up to 2.6.34.
- You need root permissions to insert the blcr kernel module. One of our tools used BLCR and we couldn't run it on many clusters because the sysadmins were skeptical about inserting a kernel module with a few random patches published on a mailing list.

2- DMTCP: Distributed MultiThreaded CheckPointing (http://dmtcp.sourceforge.net/)

Pros:
- A completely user-space solution. You don't need to bother the sysadmins to install kernel modules.
- Can checkpoint distributed computations (we already tried it with OpenMPI and it also checkpoints the orte daemon).
- There is ongoing development to integrate DMTCP into OpenMPI, so that parallel applications can be checkpointed from OpenMPI as an alternative to BLCR (https://bitbucket.org/jsquyres/ompi-dmtcp2/).

Cons:
- Since it is implemented in user space, it needs a lot of workarounds to maintain process state there.
- It duplicates process information that is already kept in kernel space.
- It only works with socket-based communication (it doesn't work with proprietary InfiniBand protocols, for example).

3- Linux-cr checkpoint/restart mechanism (https://ckpt.wiki.kernel.org/index.php/Main_Page)

Pros:
- The checkpoint/restart mechanism is implemented in the kernel as syscalls, plus some user-space tools.
- Their intention is to push the mechanism upstream for kernel inclusion.
- Since the implementation is kernel-based, it is very robust.

Cons:
- The patch set has not yet been accepted into the kernel, and the whole subject is complicated. Not all kernel developers agree that implementing CR in the kernel is a good idea (http://lwn.net/Articles/414264/).
- You need a custom kernel that has linux-cr support.

So which CR mechanism you choose will depend on many factors (whether you control the machine, whether your application only uses sockets, whether you can boot a custom kernel, etc.).

Hope it helps.

Regards,

-----------------------------------------
Javier Martínez Canillas
(+34) 682 39 81 69
PhD Student in High Performance Computing
Computer Architecture and Operating System Department (CAOS)
Universitat Autònoma de Barcelona
Barcelona, Spain

_______________________________________________
Kernelnewbies mailing list
Kernelnewbies@xxxxxxxxxxxxxxxxx
http://lists.kernelnewbies.org/mailman/listinfo/kernelnewbies
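
The closing advice in the message above (the choice depends on whether you control the machine, use only sockets, or can boot a custom kernel) can be written down as a small filter. This is only a rough sketch: it assumes each factor is a simple yes/no, and the function and parameter names are made up for illustration. They are not part of BLCR, DMTCP, or linux-cr.

```python
def candidate_cr_mechanisms(can_load_kernel_module, sockets_only, can_boot_custom_kernel):
    """Return the CR mechanisms from the message that remain usable.

    Hypothetical helper: each flag mirrors one constraint named in the message.
    """
    candidates = []
    # BLCR needs root permissions to insert its kernel module.
    if can_load_kernel_module:
        candidates.append("BLCR")
    # DMTCP is user-space only, but works only with socket-based communication.
    if sockets_only:
        candidates.append("DMTCP")
    # linux-cr requires booting a custom kernel with linux-cr support.
    if can_boot_custom_kernel:
        candidates.append("linux-cr")
    return candidates


if __name__ == "__main__":
    # A shared cluster where the sysadmins will not touch the kernel,
    # running an application that communicates over sockets only.
    print(candidate_cr_mechanisms(
        can_load_kernel_module=False,
        sockets_only=True,
        can_boot_custom_kernel=False,
    ))  # ['DMTCP']
```

An empty result means none of the three mechanisms fits the stated constraints. The sketch ignores the softer trade-offs from the message, such as robustness and release age.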