Overview
--------

This patch series adds a new bpf program type, CRIB (Checkpoint/Restore
In eBPF), for better checkpoint/restore of processes.

CRIB provides a new way to dump/restore process information, with better
performance, more flexibility, more extensibility (easier support for
dumping/restoring more information), and a more elegant implementation.

Motivation
----------

The original goal of the CRIU (Checkpoint/Restore In Userspace) project
was to implement most of the checkpoint/restore functionality in
userspace [0], avoiding placing most of the implementation in the
kernel.

The CRIU project achieves this goal, is currently widely used for live
migration in the cloud, and works well in most scenarios.

However, the current technology that CRIU relies on is not optimal and
has some problems.

[0]: https://lwn.net/Articles/451916/

1. CRIU relies heavily on procfs to get process information (checkpoint)

Procfs is not really a good place to use for checkpointing processes
(the same goes for sysfs).

- Lots of system calls, lots of context switches (each file must be
  opened, read, and closed)

- Variety of formats (each file has a different format, and a parser
  must be implemented for each one)

- Fixed return information (if the information needed is not currently
  exposed by procfs, even if it is just one struct member, the upstream
  kernel code still needs to be modified to add it)

- Non-extensible formats (the format of some procfs files cannot be
  extended without breaking backward compatibility)

- Lots of extra information, slow to read (not all information in some
  files is useful for checkpointing, and text parsing is inefficient)

A more detailed summary of why procfs is not suitable for checkpointing
can be found in [1].
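
To make the per-file cost in the list above concrete, here is a minimal
userspace sketch of fetching one numeric field from /proc/<pid>/status.
It is not code from CRIU or from this series; the helper name
procfs_status_field() is made up for illustration. Every call performs
an open, one or more reads, a close, and a line-by-line text scan, all
for a single integer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Illustrative helper (hypothetical name): look up one "<key>:" line in
 * /proc/<pid>/status and parse its value as a long.
 * Returns 0 on success, -1 if the file or the field is not found.
 */
static int procfs_status_field(pid_t pid, const char *key, long *out)
{
	char path[64];
	char line[256];
	size_t keylen;
	FILE *fp;
	int ret = -1;

	assert(key != NULL && out != NULL);
	keylen = strlen(key);

	snprintf(path, sizeof(path), "/proc/%d/status", (int)pid);
	fp = fopen(path, "r");		/* open(2) */
	if (!fp)
		return -1;

	/* read(2) plus a text scan; lines look like "Pid:\t1234\n" */
	while (fgets(line, sizeof(line), fp)) {
		if (strncmp(line, key, keylen) == 0 && line[keylen] == ':') {
			if (sscanf(line + keylen + 1, "%ld", out) == 1)
				ret = 0;
			break;
		}
	}

	fclose(fp);			/* close(2) */
	return ret;
}
```

Calling procfs_status_field(getpid(), "Pid", &v) yields the caller's own
pid. Other files under /proc/<pid>/ use different layouts, so each one
needs its own variant of this parser.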

[1]: https://criu.org/Task-diag

Andrey tried to replace the insufficient procfs with a netlink-based
interface (task_diag) [2], but it was not accepted upstream for the
following reasons [3][4][5][6]:

- netlink is unable to elegantly obtain the pidns and userns of
  processes

- Since the namespace issue cannot be resolved elegantly, obtaining
  process information via netlink can lead to credential security
  issues.

[2]: https://lwn.net/Articles/650243/
[3]: https://lore.kernel.org/linux-kernel//CALCETrVg5AyeXW_AGguFoGCPK9_2zeobEgT9JJFsakH6PyQf_A@xxxxxxxxxxxxxx/
[4]: https://lore.kernel.org/linux-kernel//CALCETrVSRkMSAVPz9JW4XCV7DmrgkyGK54HRUrue2R756f5C=Q@xxxxxxxxxxxxxx/
[5]: https://lore.kernel.org/linux-kernel//CALCETrW4LU3M2OAWjnckFR-rqenBjV+ROBi8B3eOo=Y_mCWfGQ@xxxxxxxxxxxxxx/
[6]: https://lore.kernel.org/linux-kernel//CALCETrUzOBybH0-rcgvzMNazjadZpuxkBZLkoUDY30X_-cqBzg@xxxxxxxxxxxxxx/

2. Some process status information is difficult to dump/restore through
   normal interfaces

One example is checkpoint/restore for TCP sockets, where we are unable
to get the underlying protocol information for TCP sockets through
procfs (or sysfs), or through the normal socket API. Here we need to add
TCP repair mode [7][8], which works but is not an elegant approach.

In TCP repair mode, we need to change (hijack) the behaviour of system
calls, including recvmsg and sendmsg, so that they can be used to
dump/restore packets in the socket write/receive queues.

In TCP repair mode, additional getsockopt/setsockopt optnames need to be
introduced to dump/restore the underlying TCP socket information, such
as sequence number, send window, receive window, and max window.

[7]: https://lwn.net/Articles/495304/
[8]: https://criu.org/TCP_connection

The above approach of extending system calls may be feasible, but it is
not good practice:

- The structure of the data returned by each system call API is roughly
  fixed at the moment it is added. If we need to add new members, we
  may need data structures V1 and V2. If we want to remove members we no
  longer need, it is painful because we must maintain backward
  compatibility. More often we need new extensions to system calls, such
  as new getsockopt optnames.

- We need case-by-case extensions to system calls. As more and more
  features are added to the kernel (e.g. io_uring, bpf),
  checkpointing/restoring these features via the normal API will become
  more and more difficult (or even impossible).

We have had to keep adding (extending) lots of single-purpose interfaces
(perhaps used only for checkpoint/restore) for various kernel features:
more xxx repair modes, ioctl commands, and getxxxopt/setxxxopt optnames.

Obviously, these interfaces are not elegant and may even be considered
cumbersome.

CRIB introduction
-----------------

CRIB is a new bpf program type that is not attached to any hooks
(similar to BPF_PROG_TYPE_SYSCALL), runs through BPF_PROG_RUN, and is
called by userspace programs as an eBPF API for dumping/restoring
process information.

CRIB as a whole consists of three parts: CRIB kfuncs, CRIB ebpf
programs, and the CRIB userspace program.

- CRIB kfuncs provide low-level APIs. Each kfunc is responsible for only
  one small task, such as getting a specific file object based on the
  file descriptor of a process.

- CRIB ebpf programs provide high-level APIs. Each CRIB ebpf program
  obtains process information in the kernel by calling the CRIB kfuncs
  and returns the data to the userspace program through ringbuf. Each
  CRIB ebpf API is responsible for a relatively complex task, such as
  getting all the socket information of a process.

- The CRIB userspace program is responsible for loading the CRIB ebpf
  programs and calling the CRIB ebpf APIs, deciding what needs to be
  dumped and what needs to be restored, and saving the dumped
  information so that it can be read during restoration.
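
As a rough illustration of the userspace side described above, the
following sketch shows how a CRIB userspace program might consume binary
records delivered through a BPF ring buffer. The record header, the
record types, and every crib_* name below are hypothetical and are not
taken from this series. The handler only has the same shape as libbpf's
ring_buffer_sample_fn callback; it does not include libbpf, so that the
sketch compiles on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record types; not defined by this patch series. */
enum crib_rec_type {
	CRIB_REC_TASK = 1,
	CRIB_REC_SOCK = 2,
};

/* Hypothetical fixed-layout header placed in front of every record. */
struct crib_rec_hdr {
	uint32_t type;	/* one of enum crib_rec_type */
	uint32_t len;	/* number of payload bytes after the header */
};

/* Hypothetical payload for a task record. */
struct crib_rec_task {
	int32_t pid;
	int32_t ppid;
	uint32_t uid;
	uint32_t gid;
};

/* Userspace dump state: record counters plus the last task seen. */
struct crib_dump_ctx {
	unsigned int nr_task;
	unsigned int nr_sock;
	struct crib_rec_task last_task;
};

/*
 * Same signature as libbpf's ring_buffer_sample_fn. Returns 0 to keep
 * consuming records and a negative value on a malformed record.
 */
static int crib_handle_record(void *ctx, void *data, size_t size)
{
	struct crib_dump_ctx *dump = ctx;
	struct crib_rec_hdr hdr;
	const unsigned char *payload;

	assert(ctx != NULL && data != NULL);

	if (size < sizeof(hdr))
		return -1;
	memcpy(&hdr, data, sizeof(hdr));
	if (hdr.len > size - sizeof(hdr))
		return -1;
	payload = (const unsigned char *)data + sizeof(hdr);

	switch (hdr.type) {
	case CRIB_REC_TASK:
		if (hdr.len != sizeof(dump->last_task))
			return -1;
		/* Binary record: a single copy, no text parsing. */
		memcpy(&dump->last_task, payload, sizeof(dump->last_task));
		dump->nr_task++;
		return 0;
	case CRIB_REC_SOCK:
		/* A real program would save the socket payload here. */
		dump->nr_sock++;
		return 0;
	default:
		return -1;
	}
}
```

In a real program a handler like this would be passed to
ring_buffer__new() and invoked from ring_buffer__poll(). Because the
ebpf programs and the userspace program are maintained together, a
record layout of this kind can change freely between versions.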

With the above CRIB design, the CRIB kfunc API in the kernel can be kept
simple enough that it should not require much modification in the
future. Each kfunc can easily be kept reliable without a lot of
complicated code.

Complex ebpf programs and userspace programs are maintained outside the
kernel, and CRIB ebpf programs are maintained together with CRIB
userspace programs.

My current positioning of CRIB is that CRIU acts as the CRIB userspace
program, and the CRIB ebpf programs can be used as a new engine for
CRIU: a new and better way to dump/restore processes that has higher
performance and can dump/restore more information.

Why is CRIB better?
-------------------

1. More elegant way to get process information

If xxx repair mode, ioctl, getxxxopt, and setxxxopt are like using a
gastroscope, colonoscope, or nasal endoscope, where we need to keep
looking for (adding) more "holes" in the kernel for the physical
examination (dumping/restoring information), then using CRIB is like
putting an intelligent micro examination robot (ebpf) into the kernel
and letting it work inside the kernel to collect all the information and
return it.

We no longer need to open more inelegant "holes" in the kernel, and we
no longer need to add more interfaces that are only used for
checkpoint/restore.

2. More flexible and extensible

CRIB ebpf programs are maintained together with CRIB userspace programs,
which means that CRIB ebpf programs do not need to provide stable APIs,
do not need stable structures, and can keep changing flexibly with the
needs of CRIB userspace programs.

Most of the information in kernel data structures can be obtained
through BPF_CORE_READ, so there is no need to add trivial CRIB kfuncs,
and the trivial code for reading structure members can be kept outside
the kernel in the CRIB ebpf programs. This means that this part of the
code can be added or removed flexibly.

CRIB kfuncs focus on implementing the parts of dump/restore that cannot
be done by simple data structure operations.

3. Higher performance

- Since CRIB is very flexible (CRIB ebpf programs are changeable), we
  can dump/restore exactly the information we need and nothing more.

- CRIB ebpf programs can return binary data (not text) via ringbuf,
  which means no additional conversion or parsing is required.

- With BPF ringbuf, we avoid lots of system calls, lots of context
  switches, and lots of memory copying (between kernel space and user
  space).

4. Better support for namespaces and credentials

Since CRIB ebpf programs can access the task_struct of a process, it is
simple for CRIB ebpf programs to know the current namespaces (e.g.,
pidns, userns) and credentials of a process, and there is no situation
where CRIB cannot know that a process has dropped privileges.

The problems with the netlink method mentioned earlier do not exist in
CRIB.

Proof of Concept
----------------

I have currently added three selftest programs to demonstrate the
functionality of CRIB.

- dump_task compares the performance of CRIB and procfs. CRIB takes only
  20-30% of the time procfs takes to obtain the same process
  information.

- dump_all_socket shows that CRIB does not need to rely on procfs to get
  all the socket information of a process, and can get the underlying
  protocol information (e.g., sequence number, send window) of TCP
  sockets without using getsockopt.

- restore_udp_socket shows that CRIB can dump/restore packets from the
  write queue and receive queue of UDP sockets without adding additional
  system call interfaces and without a UDP repair mode.

Shortcoming?
------------

Yes, obviously, loading the ebpf programs takes time.

However, in most scenarios, CRIU runs as a service and is integrated
into other software (via RPC or the C API) such as OpenVZ, Docker, and
k8s, rather than as a standalone tool.
This means that in most scenarios CRIU will handle multiple
checkpoints/restores, while the CRIB ebpf programs only need to be
loaded once and can subsequently be used like normal APIs.

Overall, it is worth it.

More?
-----

In restore_udp_socket I had to add a struct bpf_crib_skb_info for
restoring packets, because there is currently no BPF_CORE_WRITE.

I am not sure what the current attitude of the kernel community towards
BPF_CORE_WRITE is. Personally, I think it is well worth adding, as we
need a portable way to change values in the kernel.

This would not only allow more of the complexity in the CRIB restoring
part to be moved from CRIB kfuncs to CRIB ebpf programs, but would also
allow ebpf to unlock more possible application scenarios.

At the end
----------

This is not the final patch series. It is still a proof of concept,
incomplete in functionality and probably buggy, but I think it is enough
to show the power of CRIB, which is a meaningful innovation.

(I know I did not pay attention to the coding style of the test cases in
selftests, as these are only a proof of concept, not real tests.)

This is not only a new checkpoint/restore method, but also a chance to
think about what more eBPF might be able to do, and what more we can
unlock with eBPF.

I would like to get some feedback, and discussion is welcome!

(This resend fixes the mail thread that was messed up by Outlook.)

Signed-off-by: Juntong Deng <juntong.deng@xxxxxxxxxxx>

Juntong Deng (16):
  bpf: Introduce BPF_PROG_TYPE_CRIB
  bpf: Add KF_ITER_GETTER and KF_ITER_SETTER flags
  bpf: Improve bpf kfuncs pointer arguments chain of trust
  bpf: Add bpf_task_from_vpid() kfunc
  bpf/crib: Add struct file related CRIB kfuncs
  bpf/crib: Introduce task_file open-coded iterator kfuncs
  bpf/crib: Add struct sock related CRIB kfuncs
  bpf/crib: Add CRIB kfuncs for getting pointer to often-used
    socket-related structures
  bpf/crib: Add CRIB kfuncs for getting socket source/destination
    addresses
  bpf/crib: Add struct sk_buff related CRIB kfuncs
  bpf/crib: Introduce skb open-coded iterator kfuncs
  bpf/crib: Introduce skb_data open-coded iterator kfuncs
  bpf/crib: Add CRIB kfuncs for restoring data in skb
  selftests/crib: Add test for getting basic information of the process
  selftests/crib: Add test for getting all socket information of the
    process
  selftests/crib: Add test for dumping/restoring UDP socket packets

 include/linux/bpf_crib.h                      |  62 +++
 include/linux/bpf_types.h                     |   4 +
 include/linux/btf.h                           |   5 +-
 include/uapi/linux/bpf.h                      |   1 +
 kernel/bpf/Kconfig                            |   2 +
 kernel/bpf/Makefile                           |   2 +
 kernel/bpf/btf.c                              |  34 +-
 kernel/bpf/crib/Kconfig                       |  14 +
 kernel/bpf/crib/Makefile                      |   3 +
 kernel/bpf/crib/bpf_checkpoint.c              | 360 ++++++++++++++++
 kernel/bpf/crib/bpf_crib.c                    | 397 ++++++++++++++++++
 kernel/bpf/crib/bpf_restore.c                 |  80 ++++
 kernel/bpf/helpers.c                          |  21 +
 kernel/bpf/syscall.c                          |   1 +
 kernel/bpf/verifier.c                         |  15 +-
 tools/include/uapi/linux/bpf.h                |   1 +
 tools/lib/bpf/libbpf.c                        |   2 +
 tools/lib/bpf/libbpf_probes.c                 |   1 +
 tools/testing/selftests/crib/.gitignore       |   1 +
 tools/testing/selftests/crib/Makefile         | 136 ++++++
 tools/testing/selftests/crib/config           |   7 +
 .../selftests/crib/test_dump_all_socket.bpf.c | 252 +++++++++++
 .../selftests/crib/test_dump_all_socket.c     | 375 +++++++++++++++++
 .../selftests/crib/test_dump_all_socket.h     |  69 +++
 .../selftests/crib/test_dump_task.bpf.c       | 125 ++++++
 tools/testing/selftests/crib/test_dump_task.c | 337 +++++++++++++++
 tools/testing/selftests/crib/test_dump_task.h |  90 ++++
 .../crib/test_restore_udp_socket.bpf.c        | 311 ++++++++++++++
 .../selftests/crib/test_restore_udp_socket.c  | 333 +++++++++++++++
 .../selftests/crib/test_restore_udp_socket.h  |  51 +++
 30 files changed, 3080 insertions(+), 12 deletions(-)
 create mode 100644 include/linux/bpf_crib.h
 create mode 100644 kernel/bpf/crib/Kconfig
 create mode 100644 kernel/bpf/crib/Makefile
 create mode 100644 kernel/bpf/crib/bpf_checkpoint.c
 create mode 100644 kernel/bpf/crib/bpf_crib.c
 create mode 100644 kernel/bpf/crib/bpf_restore.c
 create mode 100644 tools/testing/selftests/crib/.gitignore
 create mode 100644 tools/testing/selftests/crib/Makefile
 create mode 100644 tools/testing/selftests/crib/config
 create mode 100644 tools/testing/selftests/crib/test_dump_all_socket.bpf.c
 create mode 100644 tools/testing/selftests/crib/test_dump_all_socket.c
 create mode 100644 tools/testing/selftests/crib/test_dump_all_socket.h
 create mode 100644 tools/testing/selftests/crib/test_dump_task.bpf.c
 create mode 100644 tools/testing/selftests/crib/test_dump_task.c
 create mode 100644 tools/testing/selftests/crib/test_dump_task.h
 create mode 100644 tools/testing/selftests/crib/test_restore_udp_socket.bpf.c
 create mode 100644 tools/testing/selftests/crib/test_restore_udp_socket.c
 create mode 100644 tools/testing/selftests/crib/test_restore_udp_socket.h

--
2.39.2