Hi,

Following Andreas Dilger's reply (http://lkml.org/lkml/2010/3/17/410),
I'm (re)posting the subset of the checkpoint-restart patch set that is
related to linux-fsdevel. (I'm unsure why those weren't sent before.)
Altogether there are 17 patches here (out of the 96 total).

For the original post/thread see: http://lkml.org/lkml/2010/3/17/232.

As Matt Helsley put it briefly, checkpoint-restart mainly saves the
critical pieces of kernel information from the struct file needed to
restart the open file descriptors. It does not save the file (system)
contents in the checkpoint image. That's left for proper filesystem
freezing, snapshotting, or rsync (for example), depending on the tools
and/or filesystems userspace has chosen.

Oren.

---
Here is the introduction to the original post:
---

Following up on the thread on the checkpoint-restart patch set
(http://lkml.org/lkml/2010/3/1/422), the following series is the
latest checkpoint/restart, based on 2.6.33.

The first 20 patches are cleanups and preparation for c/r; they
are followed by the actual c/r code.

Please apply to -mm, and let us know if there is any way we can
help.

---
Linux Checkpoint-Restart:
  web, wiki:   http://www.linux-cr.org
  bug track:   https://www.linux-cr.org/redmine

The repositories for the project are in:
  kernel:      http://www.linux-cr.org/git/?p=linux-cr.git;a=summary
  user tools:  http://www.linux-cr.org/git/?p=user-cr.git;a=summary
  test suite:  http://www.linux-cr.org/git/?p=tests-cr.git;a=summary

---
CHANGELOG:

[2010-Mar-16] v20
  BUG FIXES (only)
  - [Serge Hallyn] Fix unlabeled restore case
  - [Serge Hallyn] Always restore msg_msg label
  - [Serge Hallyn] Selinux prevents msgrcv on restore message queues?
  - [Serge Hallyn] save_access_regs for self-checkpoint
  - [Serge Hallyn] send uses_interp=1 to arch_setup_additional_pages
  - Fix "scheduling in atomic" while restoring ipc (sem, shm, msg)
  - Cleanup: no need to restore perm->{id,key,seq}
  - Fix sysvipc=n compile
  - Make uts_ns=n compile
  - Only use arch_setup_additional_pages() if supported by arch
  - Export key symbols to enable c/r from kernel modules
  - Avoid crash if incoming object doesn't have .restore
  - Replace error_sem with an event completion
  - [Serge Hallyn] Change sysctl and default for unprivileged use
  - [Nathan Lynch] Use syscall_get_error
  - Add entry for checkpoint/restart in MAINTAINERS

[2010-Feb-19] v19
  NEW FEATURES
  - Support for x86-64 architecture
  - Support for c/r of LSM (smack, selinux)
  - Support for c/r of task fs_root and pwd
  - Support for c/r of epoll
  - Support for c/r of eventfd
  - Enable C/R while executing over NFS
  - Preliminary c/r of mounts namespace
  - Add @logfd argument to sys_{checkpoint,restart} prototypes
  - Define new api for error and debug logging
  - Restart to handle checkpoint images lacking {uts,ipc}-ns
  - Refuse to checkpoint if monitoring directories with dnotify
  - Refuse to checkpoint if file locks and leases are held
  - Refuse to checkpoint files with f_owner

  OTHER CHANGES
  - Rebase to kernel 2.6.33-rc8
  - Settled version of new sys_eclone()
  - [Serge Hallyn] Fix potential use-before-set return (vdso)
  - Update documentation and examples for new syscalls API (doc)
  - [Liu Alexander] Fix typos (doc)
  - [Serge Hallyn] Update checkpoint image format (doc)
  - [Serge Hallyn] Use ckpt_err() for bad header values
  - sys_{checkpoint,restart} to use ptregs prototype
  - Set ctx->errno in do_ckpt_msg() if needed
  - Fix up headers so we can munge them for use by userspace
  - Multiple fixes to _ckpt_write_err() and friends
  - [Matt Helsley] Add cpp definitions for enums
  - [Serge Hallyn] Add global section container to image format
  - [Matt Helsley] Fix total byte read/write count for large images
  - ckpt_read_buf_type() to accept max payload (excludes ckpt_hdr)
  - [Serge Hallyn] Use ckpt_err() for arch incompatibilities
  - Introduce walk_task_subtree() to iterate through descendants
  - Call restore_notify_error for restart (not checkpoint!)
  - Make kread/kwrite() abort if CKPT_CTX_ERROR is set
  - [Serge Hallyn] Move init_completion(&ctx->complete) to ctx_alloc
  - Simplify logic of tracking restarting tasks (->ctx)
  - Coordinator kills descendants on failure for proper cleanup
  - Prepare descendants needs PTRACE_MODE_ATTACH permissions
  - Threads wait for entire thread group before restoring
  - Add debug process-tree status during restart
  - Fix handling of bogus pid arg to sys_restart
  - In reparent_thread() test for PF_RESTARTING on parent
  - Keep __u32s in even groups for 32-64 bit compatibility
  - Define ckpt_obj_try_fetch
  - Disallow zero or negative objref during restart
  - Check for valid destructor before calling it (deferqueue)
  - Fix false negative of test for unlinked files at checkpoint
  - [Serge Hallyn] Rename fs_mnt to root_fs_path
  - Restore thread/cpu state early
  - Ensure null-termination of file names read from image
  - Fix compile warning in restore_open_fname()
  - Introduce FOLL_DIRTY to follow_page() for "dirty" pages
  - [Serge Hallyn] Checkpoint saved_auxv as u64s
  - Export filemap_checkpoint()
  - [Serge Hallyn] Disallow checkpoint of tasks with aio requests
  - Fix compilation failure when !CONFIG_CHECKPOINT (regression)
  - Expose page write functions
  - Do not hold mmap_sem while checkpointing vma's
  - Do not hold mmap_sem when reading memory pages on restart
  - Move consider_private_page() to mm/memory.c:__get_dirty_page()
  - [Serge Hallyn] move destroy_mm into mmap.c and remove size check
  - [Serge Hallyn] fill vdso (syscall32_setup_pages) for TIF_IA32/x86_64
  - [Serge Hallyn] Fix return value of read_pages_contents()
  - [Serge Hallyn] Change m_type to long, not int (ipc)
  - Don't free sma if it's an error on restore
  - Use task->saved_sigmask and drop task->checkpoint_data
  - [Serge Hallyn] Handle saved_sigmask at checkpoint
  - Defer restore of blocked signals mask during restart
  - Self-restart to tolerate missing PGIDs
  - [Serge Hallyn] skb->tail can be offset
  - Export and leverage sock_alloc_file()
  - [Nathan Lynch] Fix net/checkpoint.c for 64-bit
  - [Dan Smith] Unify skb read/write functions and handle fragmented
    buffers
  - [Dan Smith] Update buffer restore code to match the new format
  - [Dan Smith] Fix compile issue with CONFIG_CHECKPOINT=n
  - [Dan Smith] Remove an unnecessary check on socket restart
  - [Dan Smith] Pass the stored sock->protocol into sock_create() on
    restore
  - Relax tcp.window_clamp value in INET restore
  - Restore gso_type fields on sockets and buffers for proper operation
  - Fix broken compilation for no-c/r architectures
  - Return -EBUSY (not BUG_ON) if fd is gone on restart
  - Fix the chunk size instead of auto-tune (epoll)

  ARCH: x86 (32,64)
  - Use PTREGSCALL4 for sys_{checkpoint,restart}
  - Remove debug-reg support (need to redo with perf_events)
  - [Serge Hallyn] Support for ia32 (checkpoint, restart)
  - Split arch/x86/checkpoint.c to generic and 32bit specific parts
  - sys_{checkpoint,restore} to use ptregs
  - Allow X86_EFLAGS_RF on restart
  - [Serge Hallyn] Only allow 'restart' with same bit-ness as image.
  - Move checkpoint.c from arch/x86/mm->arch/x86/kernel

  ARCH: s390 [Serge Hallyn]
  - Define s390x sys_restart wrapper
  - Fixes to restart-blocks logic and signal path
  - Fix checkpoint and restart compat wrappers
  - sys_{checkpoint,restore} to use ptregs
  - Use simpler test_task_thread to test current ti flags
  - Fix 31-bit s390 checkpoint/restart wrappers
  - Update sys_checkpoint (do_sys_checkpoint on all archs)
  - [Oren Laadan] Move checkpoint.c from arch/s390/mm->arch/s390/kernel

  ARCH: powerpc [Nathan Lynch]
  - [Serge Hallyn] Add hook task_has_saved_sigmask()
  - Warn if full register state unavailable
  - Fix up checkpoint syscall, tidy restart
  - [Oren Laadan] Move checkpoint.c from arch/powerpc/{mm->kernel}

[2009-Sep-22] v18
  NEW FEATURES
  - [Nathan Lynch] Re-introduce powerpc support
  - Save/restore pseudo-terminals
  - Save/restore (pty) controlling terminals
  - Save/restore PGIDs
  - [Dan Smith] Save/restore unix domain sockets
  - Save/restore FIFOs
  - Save/restore pending signals
  - Save/restore rlimits
  - Save/restore itimers
  - [Matt Helsley] Handle many non-pseudo file-systems

  OTHER CHANGES
  - Rename headerless struct ckpt_hdr_* to struct ckpt_*
  - [Nathan Lynch] discard const from struct cred * where appropriate
  - [Serge Hallyn][s390] Set return value for self-checkpoint
  - Handle kmalloc failure in restore_sem_array()
  - [IPC] Collect files used by shm objects
  - [IPC] Use file (not inode) as shared object on checkpoint of shm
  - More ckpt_write_err()s to give information on checkpoint failure
  - Adjust format of pipe buffer to include the mandatory pre-header
  - [LEAKS] Mark the backing file as visited at checkpoint
  - Tighten checks on supported vma to checkpoint or restart
  - [Serge Hallyn] Export filemap_checkpoint() (used for ext4)
  - Introduce ckpt_collect_file() that also uses file->collect method
  - Use ckpt_collect_file() instead of ckpt_obj_collect() for files
  - Fix leak-detection issue in collect_mm() (test for first-time obj)
  - Invoke set_close_on_exec() unconditionally on restart
  - [Dan Smith] Export fill_fname() as ckpt_fill_fname()
  - Interface to pass simple pointers as data with deferqueue
  - [Dan Smith] Fix ckpt_obj_lookup_add() leak detection logic
  - Replace EAGAIN with EBUSY where necessary
  - Introduce CKPT_OBJ_VISITED in leak detection
  - ckpt_obj_collect() returns objref for new objects, 0 otherwise
  - Rename ckpt_obj_checkpointed() to ckpt_obj_visited()
  - Introduce ckpt_obj_visit() to mark objects as visited
  - Set the CHECKPOINTED flag on objects before calling checkpoint
  - Introduce ckpt_obj_reserve()
  - Change ref_drop() to accept a @lastref argument (for cleanup)
  - Disallow multiple objects with same objref in restart
  - Allow _ckpt_read_obj_type() to read header only (w/o payload)
  - Fix leak of ckpt_ctx when restoring zombie tasks
  - Fix race of prepare_descendant() with an ongoing fork()
  - Track and report the first error if restart fails
  - Tighten logic to protect against bogus pids in input
  - [Matt Helsley] Improve debug output from ckpt_notify_error()
  - [Nathan Lynch] fix compilation errors with CONFIG_COMPAT=y
  - Detect error-headers in input data on restart, and abort.
  - Standard format for checkpoint error strings (and documentation)
  - [Dan Smith] Add an errno validation function
  - Add ckpt_read_payload(): read a variable-length object (no header)
  - Add ckpt_read_string(): same for strings (ensures null-terminated)
  - Add ckpt_read_consume(): consumes next object without processing
  - [John Dykstra] Fix no-dot-config-targets pattern in linux/Makefile

[2009-Jul-21] v17
  - Introduce syscall clone_with_pids() to restore original pids
  - Support threads and zombies
  - Save/restore task->files
  - Save/restore task->sighand
  - Save/restore futex
  - Save/restore credentials
  - Introduce PF_RESTARTING to skip notifications on task exit
  - restart(2) allow caller to ask to freeze tasks after restart
  - restart(2) isn't idempotent: return -EINTR if interrupted
  - Improve debugging output handling
  - Make multi-process restart logic more robust and complete
  - Correctly select return value for restarting tasks on success
  - Tighten ptrace test for checkpoint to PTRACE_MODE_ATTACH
  - Use CHECKPOINTING state for frozen checkpointed tasks
  - Fix compilation without CONFIG_CHECKPOINT
  - Fix compilation with CONFIG_COMPAT
  - Fix headers includes and exports
  - Leak detection performed in two steps
  - Detect "inverse" leaks of objects (dis)appearing unexpectedly
  - Memory: save/restore mm->{flags,def_flags,saved_auxv}
  - Memory: only collect sub-objects of mm once (leak detection)
  - Files: validate f_mode after restore
  - Namespaces: leak detection for nsproxy sub-components
  - Namespaces: proper restart from namespace(s) without namespace(s)
  - Save global constants in header instead of per-object
  - IPC: replace sys_unshare() with create_ipc_ns()
  - IPC: restore objects in suitable namespace
  - IPC: correct behavior under !CONFIG_IPC_NS
  - UTS: save/restore all fields
  - UTS: replace sys_unshare() with create_uts_ns()
  - X86_32: sanitize cpu, debug, and segment registers on restart
  - cgroup_freezer: add CHECKPOINTING state to safeguard checkpoint
  - cgroup_freezer: add interface to freeze a cgroup (given a task)

[2009-May-27] v16
  - Privilege checks for IPC checkpoint
  - Fix error string generation during checkpoint
  - Use kzalloc for header allocation
  - Restart blocks are arch-independent
  - Redo pipe c/r using splice
  - Fixes to s390 arch
  - Remove powerpc arch (temporary)
  - Explicitly restore ->nsproxy
  - All objects in image are preceded by 'struct ckpt_hdr'
  - Fix leak detection (and leaks)
  - Reorder of patchset
  - Misc bugs and compilation fixes

[2009-Apr-12] v15
  - Minor fixes

[2009-Apr-28] v14
  - Tested against kernel v2.6.30-rc3 on x86_32.
  - Refactor files checkpoint to use f_ops (file operations)
  - Refactor mm/vma to use vma_ops
  - Explicitly handle VDSO vma (and require compat mode)
  - Added code to c/r restart-blocks (restart timeout related syscalls)
  - Added code to c/r namespaces: uts, ipc (with Dan Smith)
  - Added code to c/r sysvipc (shm, msg, sem)
  - Support for VM_CLONE shared memory
  - Added resource leak detection for whole-container checkpoint
  - Added sysctl gauge to allow unprivileged restart/checkpoint
  - Improve and simplify the code and logic of shared objects
  - Rework image format: shared objects appear prior to their use
  - Merge checkpoint and restart functionality into same files
  - Massive renaming of functions: prefix "ckpt_" for generics,
    "checkpoint_" for checkpoint, and "restore_" for restart.
  - Report checkpoint errors as a valid (string) record in the output
  - Merged PPC architecture (by Nathan Lynch)
  - Requires updates to userspace tools too.
  - Misc nits and bug fixes

[2009-Mar-31] v14-rc2
  - Change along Dave's suggestion to use f_ops->checkpoint() for files
  - Merge patch simplifying Kconfig, with CONFIG_CHECKPOINT_SUPPORT
  - Merge support for PPC arch (Nathan Lynch)
  - Misc cleanups and fixes in response to comments

[2009-Mar-20] v14-rc1:
  - The 'h.parent' field of 'struct cr_hdr' isn't used - discard
  - Check whether calls to cr_hbuf_get() succeed or fail.
  - Fixes to pipe c/r code
  - Prevent deadlock by refusing c/r when a pipe inode == ctx->file
    inode
  - Refuse non-self checkpoint if a task isn't frozen
  - Use unsigned fields in checkpoint headers unless otherwise required
  - Rename functions in files c/r to better reflect their role
  - Add support for anonymous shared memory
  - Merge support for s390 arch (Dan Smith, Serge Hallyn)

[2008-Dec-03] v13:
  - Cleanups of 'struct cr_ctx' - remove unused fields
  - Misc fixes for comments

[2008-Dec-17] v12:
  - Fix re-alloc/reset of pgarr chain to correctly reuse buffers
    (empty pgarr are saved in a separate pool chain)
  - Add a couple of missed calls to cr_hbuf_put()
  - cr_kwrite/cr_kread() again use vfs_read(), vfs_write() (safer)
  - Split cr_write/cr_read() to two parts: _cr_write/read() helper
  - Befriend with sparse: explicit conversion to 'void __user *'
  - Redefine 'pr_fmt' and replace cr_debug() with pr_debug()

[2008-Dec-05] v11:
  - Use contents of 'init->fs->root' instead of pointing to it
  - Ignore symlinks (there is no such thing as an open symlink)
  - cr_scan_fds() retries from scratch if it hits size limits
  - Add missing test for VM_MAYSHARE when dumping memory
  - Improve documentation about: behavior when tasks aren't frozen,
    life span of the object hash, references to objects in the hash

[2008-Nov-26] v10:
  - Grab vfs root of container init, rather than current process
  - Acquire dcache_lock around call to __d_path() in cr_fill_name()
  - Force end-of-string in cr_read_string() (fix possible DoS)
  - Introduce cr_write_buffer(), cr_read_buffer() and cr_read_buf_type()

[2008-Nov-10] v9:
  - Support multiple processes c/r
  - Extend checkpoint header with architecture dependent header
  - Misc bug fixes (see individual changelogs)
  - Rebase to v2.6.28-rc3.

[2008-Oct-29] v8:
  - Support "external" checkpoint
  - Include Dave Hansen's 'deny-checkpoint' patch
  - Split docs in Documentation/checkpoint/..., and improve contents

[2008-Oct-17] v7:
  - Fix save/restore state of FPU
  - Fix argument given to kunmap_atomic() in memory dump/restore

[2008-Oct-07] v6:
  - Balance all calls to cr_hbuf_get() with matching cr_hbuf_put()
    (even though it's not really needed)
  - Add assumptions and what's-missing to documentation
  - Misc fixes and cleanups

[2008-Sep-11] v5:
  - Config is now 'def_bool n' by default
  - Improve memory dump/restore code (following Dave Hansen's comments)
  - Change dump format (and code) to allow chunks of <vaddrs, pages>
    instead of one long list of each
  - Fix use of follow_page() to avoid faulting in non-present pages
  - Memory restore now maps user pages explicitly to copy data into
    them, instead of reading directly to user space; got rid of
    mprotect_fixup()
  - Remove preempt_disable() when restoring debug registers
  - Rename headers files s/ckpt/checkpoint/
  - Fix misc bugs in files dump/restore
  - Fixes and cleanups on some error paths
  - Fix misc coding style

[2008-Sep-09] v4:
  - Various fixes and clean-ups
  - Fix calculation of hash table size
  - Fix header structure alignment
  - Use standard list_... for cr_pgarr

[2008-Aug-29] v3:
  - Various fixes and clean-ups
  - Use standard hlist_... for hash table
  - Better use of standard kmalloc/kfree

[2008-Aug-20] v2:
  - Added Dump and restore of open files (regular and directories)
  - Added basic handling of shared objects, and improve handling of
    'parent tag' concept
  - Added documentation
  - Improved ABI, 64bit padding for image data
  - Improved locking when saving/restoring memory
  - Added UTS information to header (release, version, machine)
  - Cleanup extraction of filename from a file pointer
  - Refactor to allow easier reviewing
  - Remove requirement for CAP_SYS_ADMIN until we come up with a
    security policy (this means that file restore may fail)
  - Other cleanup and response to comments for v1

[2008-Jul-29] v1:
  - Initial version: support a single task with address space of only
    private anonymous or file-mapped VMAs; syscalls ignore pid/crid
    argument and act on current process.