We are working on extending support for CRIU checkpoint/restore in the amdgpu driver to support new use cases and ROCm applications that increasingly use render node ioctls for memory management. In the longer term this may also allow checkpoint/restore of graphics applications.

With this patch series we are soliciting feedback on our design and proof-of-concept implementation. Our intention is to upstream a future version of these changes in DRM, amdgpu and CRIU. Following the philosophy of CRIU (where “IU” stands for “in userspace”), much of the complexity of saving and restoring memory sharing relationships is handled in user mode.

We believe our work can serve as a blueprint for supporting CRIU in other DRM and dmabuf drivers in the future.

# Motivation

Applications managing GPU/compute resources on AMD platforms use the amdgpu driver. This driver exposes one DRM /dev/dri/renderD interface per AMD device and one KFD /dev/kfd device file per machine for all devices. AMD's CRIU plugin currently deals with both /dev/kfd and the renderD device files. For /dev/kfd it uses a CRIU-exclusive ioctl on the device file to dump and restore device data. For the renderD device files, it only records the device ID and topology.

dmabuf is a kernel component that allows users to share memory between devices. A user can ask a device's driver to export a file descriptor (a dmabuf_fd) representing a region of memory allocated on that device. That dmabuf_fd can then be imported on another device (by the same process or a different one) to create shared memory between the two devices. When two processes are involved, both end up holding handles to the same physical memory. Further details about dmabuf can be found at https://docs.kernel.org/driver-api/dma-buf.html

AMD is moving towards dmabuf-based buffer object management to support inter-process communication, as well as other features. AMD's CRIU plugin needs to support processes that have shared memory created with dmabuf.
Currently, all applications that use this feature allocate memory in /dev/kfd, export it from that device, and then import it on a renderD device. Our CRIU solution should also be able to support future use cases, including allocating and exporting from renderD, importing on /dev/kfd, and even sharing memory with non-AMD devices.

As much as possible, we would like to contain these changes in the CRIU plugin, although changes in core CRIU, amdgpu, amdkfd, and core DRM are also necessary.

Please find attached the relevant kernel and CRIU patches.

# Access relevant data

The amdgpu CRIU plugin already uses CRIU ioctls of the kfd device to extract driver state from /dev/kfd, and to recreate that state during restore. The renderD devices have their own driver state, and to access it we need CRIU renderD ioctls. The amdgpu data that needs to be stored during dump and restored during restore is all BO-related: the BOs' addresses, sizes, and flags, and similar data for the associated virtual memory mappings.

Typically, applications access renderD DRM ioctls through a wrapper library called libdrm. However, since these ioctls will only ever be used by CRIU itself, we have chosen to have CRIU call them directly. This means that the relevant header file, amdgpu_drm.h, must be added to CRIU.

# Identifying shared BOs (buffer objects)

During checkpoint, the amdgpu CRIU plugin must identify which buffer objects across the renderD and /dev/kfd devices in all checkpointed processes correspond to shared data, so that the sharing relationship can later be restored.

An easy way to do this is to use the memory's GEM handle. Because all the checkpointing happens in a single CRIU process, shared memory will always share a GEM handle. GEM handles are intended to identify memory and require no additional code to use. Extracting the GEM handles requires the libdrm function amdgpu_device_get_fd, which was added in libdrm 2.4.109 (November 2021).
This makes CRIU depend on a sufficiently recent libdrm.

# Restoring shared BOs - The Mechanism

During restore, CRIU must recreate the shared memory relationships that existed before checkpoint. The plan is to mimic the means by which the memory was shared in the first place. During checkpoint, one of the processes that holds a shared piece of memory will be designated the exporter and the rest importers. During restore, the exporter process will get a dmabuf_fd from the driver and transfer it to the importer processes. Those processes will pass the dmabuf_fd to the driver, where it will be imported. This approach minimizes kernel changes.

The renderD dmabuf import code cannot be reused as it is. When a dmabuf_fd is imported normally, it is assigned a GEM handle arbitrarily. However, the checkpointed process uses the GEM handle to identify this data, so the restored memory must have the same GEM handle as before. This necessitates adding two new DRM interfaces for creating and importing buffer objects with a specified GEM handle.

Transferring dmabuf_fds between CRIU processes requires a socket. Each process creates its own socket and registers it under a name containing its pid. When a process exports a dmabuf_fd, it sends it to the sockets of all other processes. Before a process begins restoring a device file, it checks its socket for incoming messages.

In the current version of the patchset, this is done in core CRIU, parallel to the existing socket for exchanging file descriptors for shared files. It could be done within the amdgpu plugin instead, but this would require an additional plugin callback to ensure that all sockets are created before any plugin files are restored.
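
As an illustration of the transfer step, the sketch below passes a file descriptor between two endpoints of a Unix domain socket with SCM_RIGHTS ancillary data. The helper names send_fd and recv_fd are hypothetical and are not taken from the patches; the socket naming scheme and the message format used by the patchset are not shown here.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: send one file descriptor over a connected Unix
 * domain socket. Returns 0 on success, -1 on error.
 */
int send_fd(int sock, int fd)
{
    char payload = 'F'; /* one byte of ordinary data carries the fd along */
    struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
    union {
        struct cmsghdr align; /* forces correct alignment of buf */
        char buf[CMSG_SPACE(sizeof(int))];
    } ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/*
 * Hypothetical helper: receive one file descriptor sent by send_fd().
 * Returns the new fd (a fresh number in the receiving process), or -1.
 */
int recv_fd(int sock)
{
    char payload;
    struct iovec iov = { .iov_base = &payload, .iov_len = 1 };
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } ctrl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd = -1;

    memset(&ctrl, 0, sizeof(ctrl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The kernel picks the fd number on the receiving side, so the received descriptor generally has a different number than the one that was sent.
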
The sockets themselves and the dmabuf_fds received over the sockets must both be assigned fds. These fds cannot conflict with each other or with the fds of other restored files. We intend to alter find_unused_fd_pid to allow the allocation of multiple regions of unused fds; this work is not yet complete.

# Restoring shared BOs - The Logic

CRIU already has a mechanism for handling files that have prerequisites for being restored. Files that can't fully restore on the first try can return 1, indicating that they are incomplete and will need to be retried. We extend this mechanism to plugin files to handle the shared BO case. Any device file that needs to restore imported BOs will retry if the corresponding exports haven't happened yet. To support this, files-ext.c is modified to allow plugin files to retry.

Another option would be to do the plugin restore in two phases, with all the exported BOs in the first phase and all the imports in the second phase. That would be more efficient but would require more core CRIU changes, in particular additional synchronization between restoring processes.

# TODO

1) Backwards compatibility

It is important to us that older versions of CRIU continue to work with new kernels that include these changes. The existing amdkfd CRIU ioctl UAPI must be extended in a way that does not break the ABI.

2) Checkpoint and restore of dmabuf fds themselves

The usual approach to IPC taken by AMD libraries is to export a dmabuf, immediately send it to another process, and then close it. Therefore, it is unlikely that a process will have a dmabuf_fd open during checkpoint. We would still like to handle this case, but that work has not been done yet. It would require handling of dmabuf file descriptors in the amdgpu CRIU plugin. It would support only dmabuf fds that were exported by amdgpu.
For each such dmabuf fd, the plugin would have to store an identifier of the underlying amdgpu BO that can be used during restore to recreate the dmabuf fd from the restored BO.

The CRIU patch can also be found at https://github.com/fdavid-amd/criu/tree/dmabuf-post