Hi,

This patchset implements persistent over-kexec memory storage, or PRAM, which
is intended for saving memory pages of the currently executing kernel and
restoring them after a kexec in the newly booted one. This can be used to
speed up reboot by leaving process memory and/or FS caches in place.

The patchset introduces the PRAM kernel API serving that purpose and uses
this API to make tmpfs 'persistent', i.e. it makes it possible to save a
tmpfs tree on unmount and restore it on the next mount even if the system is
kexec'd between the unmount and the next mount.

For further details, please see below.

-- The problem --

If Ksplice is not available or cannot be applied, a kernel update requires
restarting the system, which implies reinitialization of all running
applications. Since this is a disk-bound operation, it can take quite a lot
of time. What is worse, if the host is a web server, a database server, or
any other kind of server, then apart from the long downtime the reboot will
drop every existing connection, which may not always be tolerable.

Although the kernel boot can be sped up significantly by employing kexec,
which jumps directly to the new kernel skipping the BIOS and boot loader
stages, it does nothing for running applications, which still need to be
restarted.

-- The solution --

There is the rapidly developing criu project (www.criu.org), which aims at
saving the state of running applications to disk so that it can be restored
later. It is already accepted by the community and hopefully it will soon be
able to dump and restore every Linux process.

Obviously criu can be used to avoid full application reinitialization on
reboot, but criu'ing may still take a lot of time. To illustrate, imagine a
database server that has cached 100 GB of data in its internal buffers.
Writing the image of that process sequentially at 100 MB/s will take more
than 15 minutes.
Multiplied by two, since the image must be read back after reboot, this gives
half an hour of downtime! The server's clients will probably time out and
disconnect before the system is up and running again, which cancels all the
benefits of criu'ing.

However, the disk read/write, which is the bottleneck in the criu scheme, can
be avoided if kexec is used for rebooting. The point is that kexec does not
reset the RAM state, leaving all data written to memory intact. This fact is
already used by kdump to gather the full memory image on kernel panic. If it
were possible to save arbitrary data and restore it after kexec, this could
be used to avoid disk accesses completely when criu'ing.

This patchset implements the kernel API for saving data to be restored after
kexec and employs it to make tmpfs 'persistent' as described below.

-- Usage --

 1) Boot the kernel with the 'pram_banned=MEMRANGE' boot option.

    MEMRANGE=MEMMIN-MEMMAX specifies the memory range where kexec will load
    the new kernel code. It is used to avoid conflicts with persistent memory
    as described in the implementation details. MEMRANGE=0-128M should be
    enough.

 2) Mount tmpfs with the 'pram=NAME' option.

    NAME is an arbitrary string specifying the persistent memory node.
    Different tmpfs trees may be saved to PRAM if different names are passed.

    # mkdir -p /mnt/crdump
    # mount -t tmpfs -o pram=mytmpfs none /mnt/crdump

 3) Checkpoint the process tree you want to carry over kexec into tmpfs.

    # criu dump -D /mnt/crdump -t $PID

 4) Unmount tmpfs. It will be automatically saved to PRAM on unmount.

    # umount /mnt/crdump

 5) Load the new kernel image.

    Kexec needs some tweaking for PRAM to work. First, one should pass the
    PRAM super block pfn via the 'pram' boot option. The pfn is exported via
    the sysfs file /sys/kernel/pram. Second, kexec must be forced to load the
    kernel code into MEMRANGE (see p.1).

    # kexec --load /vmlinuz --initrd=initrd.img \
            --append="$(cat /proc/cmdline | sed -e 's/pram=[^ ]*//g') pram=$(cat /sys/kernel/pram)" \
            --mem-min=$MEMMIN --mem-max=$MEMMAX

 6) Boot to the new kernel.

    # reboot

 7) Mount tmpfs with the 'pram=NAME' option. It should find the PRAM node
    with the tmpfs tree saved on the previous unmount and restore it.

    # mount -t tmpfs -o pram=mytmpfs none /mnt/crdump

 8) Restore the process saved in p.3.

    # criu restore -d -D /mnt/crdump

 9) Remove the dump and unmount tmpfs.

    # rm -f /mnt/crdump/*
    # umount /mnt/crdump

-- Implementation details --

 * Saving a memory page simply increments its refcount, so the page will not
   be freed when the last user puts it. The data saved to PRAM may therefore
   be safely used as usual.

 * To preserve persistent memory in the newly booted kernel, PRAM marks all
   the saved pages as reserved at early boot so that they will not be
   recycled. For the new kernel to find the persistent memory metadata, one
   should pass the PRAM super block pfn, which is exported via
   /sys/kernel/pram, in the 'pram' boot param.

 * Since some memory is required for completing the boot sequence, PRAM
   tracks all memory regions that have ever been reserved by other parts of
   the kernel and avoids using them for persistent memory. Since the device
   configuration cannot change during kexec, and the newly booted kernel is
   likely to have the same set of device drivers, this should work in most
   cases.

 * Since kexec may load the new kernel code to any memory region, it can
   destroy persistent memory. To rule this out, kexec should be forced to
   load the new kernel code to a memory region that is banned for PRAM. For
   that purpose, there are the 'pram_banned' boot param and the --mem-min
   and --mem-max options of the kexec utility.

 * If a conflict still happens, it will be detected and all persistent
   memory will be discarded to prevent further errors. Detection is
   guaranteed by checksumming all data saved to PRAM.

 * tmpfs is saved to PRAM on unmount and loaded on mount if the 'pram=NAME'
   mount option is passed. NAME specifies the PRAM node to save data to.
   This is to allow saving several tmpfs trees.

 * Saving tmpfs to PRAM is not fully worked out at present and serves rather
   as a proof of concept. Namely, only regular files without multiple hard
   links are supported, and tmpfs may not be swapped out. If these
   requirements are not met, the save to PRAM will be aborted and a message
   will be written to the kernel log. This is not very difficult to fix, but
   at present one should turn off swap to test the feature.

-- Future plans --

What we'd like to do:

 * Implement swap entries 'freezing' to allow saving a swapped out tmpfs.
 * Implement full support of tmpfs including saving dirs, special files, etc.
 * Implement SPLICE_F_MOVE, SPLICE_F_GIFT flags for splicing data from/to
   shmem. This would allow avoiding memory copying on checkpoint/restore.
 * Save uptodate fs cache on umount to be restored on mount after kexec.

Thanks,

Vladimir Davydov (13):
  mm: add PRAM API stubs and Kconfig
  mm: PRAM: implement node load and save functions
  mm: PRAM: implement page stream operations
  mm: PRAM: implement byte stream operations
  mm: PRAM: link nodes by pfn before reboot
  mm: PRAM: introduce super block
  mm: PRAM: preserve persistent memory at boot
  mm: PRAM: checksum saved data
  mm: PRAM: ban pages that have been reserved at boot time
  mm: PRAM: allow to ban arbitrary memory ranges
  mm: PRAM: allow to free persistent memory from userspace
  mm: shmem: introduce shmem_insert_page
  mm: shmem: enable saving to PRAM

 arch/x86/kernel/setup.c  |    2 +
 arch/x86/mm/init_32.c    |    5 +
 arch/x86/mm/init_64.c    |    5 +
 include/linux/pram.h     |   62 +++
 include/linux/shmem_fs.h |   29 ++
 mm/Kconfig               |   14 +
 mm/Makefile              |    1 +
 mm/bootmem.c             |    4 +
 mm/memblock.c            |    7 +-
 mm/pram.c                | 1279 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/shmem.c               |   97 +++-
 mm/shmem_pram.c          |  378 ++++++++++++++
 12 files changed, 1878 insertions(+), 5 deletions(-)
 create mode 100644 include/linux/pram.h
 create mode 100644 mm/pram.c
 create mode 100644 mm/shmem_pram.c

-- 
1.7.10.4
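
A note on step 5 of the usage above: the sed expression there strips any
stale 'pram=' option from the current command line before the fresh pfn is
appended, so that the new kernel sees exactly one 'pram=' option. Below is a
minimal shell sketch of just that rewrite, so it can be tried without
touching kexec. The function name pram_cmdline and the sample values are
made up for illustration and are not part of the patchset.

```shell
#!/bin/sh
# pram_cmdline is a hypothetical helper, not part of the patchset.
#   $1: current kernel command line (e.g. the contents of /proc/cmdline)
#   $2: PRAM super block pfn (e.g. the contents of /sys/kernel/pram)
pram_cmdline() {
    # Same sed expression as in step 5: remove every existing pram=VALUE
    # token. 'pram_banned=...' does not contain the string 'pram=' and is
    # therefore left untouched. The space that preceded a removed token
    # stays in place, which is harmless on a kernel command line.
    printf '%s pram=%s\n' \
        "$(printf '%s\n' "$1" | sed -e 's/pram=[^ ]*//g')" \
        "$2"
}

# Example with made-up values:
pram_cmdline 'root=/dev/sda1 pram=1234 pram_banned=0-128M quiet' 5678
```

The result is what would be passed to kexec via --append; on a real system
the two arguments would come from /proc/cmdline and /sys/kernel/pram as shown
in step 5.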