[no subject]

David Francis <David.Francis@xxxxxxx> · Fri, 31 Jan 2025 13:58:28 -0500

We are working on extending support for CRIU checkpoint/restore in the amdgpu
driver to support new use cases and ROCm applications increasingly using
render node ioctls for memory management. In the longer term this may also
allow checkpoint/restore of graphics applications. With this patch series we
are soliciting feedback about our design and proof-of-concept implementation.
Our intention is to upstream a future version of these changes in DRM, amdgpu
and CRIU.

Following the philosophy of CRIU (where “IU” stands for “in userspace”) much of
the complexity of saving and restoring memory sharing relationships is handled
in user mode. We believe our work can serve as a blueprint for supporting CRIU
in other DRM and dmabuf drivers in the future.

# Motivation

Applications managing GPU/Compute resources on AMD platforms use the
amdgpu driver. This driver exposes one DRM /dev/dri/renderD interface per
AMD device and one KFD /dev/kfd device file per machine for all devices.

AMD's CRIU plugin currently deals with both /dev/kfd and renderD device files.
For /dev/kfd it uses a CRIU-exclusive ioctl on the device file to dump and
restore device data. For the renderD device files, it only records the
device ID and topology.

dmabuf is a kernel component that allows users to share memory between
devices. A user can request export of a file descriptor (a dmabuf_fd)
representing a region of memory allocated on a device from that device's
driver. That dmabuf_fd can be passed to another device's driver (by the
same process or a different one) and imported there, creating memory shared
between the two devices. Both the exporter and the importer then hold handles
to the same physical memory.

Further details about dmabuf can be found at
https://docs.kernel.org/driver-api/dma-buf.html

AMD is moving towards dmabuf-based buffer object management to support
inter-process communication, as well as other features.

AMD's CRIU plugin needs to support processes that have shared memory created
with dmabuf. Currently, all applications that use this feature allocate
memory in /dev/kfd, export it from that device, and then import it into a renderD
device. This CRIU solution should be able to support future use cases
including allocating and exporting from renderD, importing on /dev/kfd, and
even sharing memory with non-AMD devices. As much as possible, we would like to
contain these changes in the CRIU plugin, although changes in core CRIU,
amdgpu, amdkfd, and core drm are also necessary.

Please find attached the relevant kernel and CRIU patches.

# Access relevant data

The amdgpu CRIU plugin already uses CRIU ioctls of the kfd device to
extract driver state from /dev/kfd, and to recreate that state during restore.
The renderD devices have their own driver state, and to access it we need
CRIU renderD ioctls.

The amdgpu data that needs to be stored during dump and restored during
restore is all related to buffer objects (BOs): the BOs' addresses, sizes, and
flags, and the corresponding data for their associated virtual memory mappings.

Typically, applications access renderD DRM ioctls through a wrapper library
called libdrm. However, since these ioctls will only ever be used by CRIU
itself, we have chosen to have CRIU call them directly. This means that the
relevant header file, amdgpu_drm.h, must be added to CRIU.

# Identifying shared BOs (buffer objects)

During checkpoint, the amdgpu CRIU plugin must identify which buffer objects
across the renderD and /dev/kfd devices in all checkpointed processes
correspond to shared data, so that the sharing relationship can be restored
later. An easy way to do this is to compare GEM handles. Because all the
checkpointing happens in a single CRIU process, BOs that share memory will
always share a GEM handle. GEM handles are intended to identify memory and
require no additional code to use.

Extracting the GEM handles requires the libdrm function
amdgpu_device_get_fd, which was added in libdrm 2.4.109 (November 2021).
This introduces a CRIU dependency on a sufficiently new libdrm.

# Restoring shared BOs - The Mechanism

During restore, CRIU must recreate the shared memory relationships that existed
before checkpoint. The plan is to mimic the means by which the memory was
shared in the first place. During checkpoint, one of the processes which holds
a shared piece of memory will be designated the exporter and the rest
importers. The exporter process will get a dmabuf_fd from the driver and
transfer it to the importer processes. Those processes will send the dmabuf_fd
to the driver, where it will be imported. This approach minimizes kernel
changes.

The renderD dmabuf import code cannot be reused as is. When a dmabuf_fd is
imported normally, it is assigned an arbitrary GEM handle. However, the
checkpointed process uses the GEM handle to identify this data, so the
restored memory must have the same GEM handle as before. This necessitates
adding two new DRM interfaces for creating and importing buffer objects with
a specified GEM handle.

Transferring dmabuf_fds between CRIU processes requires a socket.

Each process creates its own socket and registers it to a name containing its
pid. When a process exports a dmabuf_fd, it sends it to the sockets of all
other processes. Before a process begins restoring a device file, it checks its
socket for incoming messages. In the current version of the patchset, this is
done in core CRIU, parallel to the existing socket for exchanging file
descriptors for shared files. It could be done within the amdgpu plugin
instead, but this would require an additional plugin callback to ensure that
all sockets are created before any plugin files are restored.
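
The fd transfer itself relies on the standard SCM_RIGHTS ancillary-data
mechanism of Unix domain sockets. The following is a minimal sketch of that
mechanism only, not the code in the patchset: send_fd, recv_fd and
dmabuf_sock_name are illustrative names, and the "dmabuf-xfer-<pid>" naming
scheme is invented for the example.

```c
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical naming scheme: one socket name per process, derived from
 * its pid. Returns 0 on success, -1 if the buffer is too small. */
int dmabuf_sock_name(char *buf, size_t len, int pid)
{
    int n = snprintf(buf, len, "dmabuf-xfer-%d", pid);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Send one file descriptor over a Unix socket using SCM_RIGHTS. */
int send_fd(int sock, int fd)
{
    char byte = 'F'; /* at least one byte of regular data must be sent */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr hdr; /* forces correct alignment of buf */
        char buf[CMSG_SPACE(sizeof(int))];
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor; returns the new fd or -1 on error. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&ctl, 0, sizeof(ctl));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The received descriptor gets whatever number the kernel picks in the receiving
process, so it may still need to be moved to a number that does not collide
with other restored files.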

The sockets themselves and the dmabuf_fds received over sockets both must be
assigned fds. These fds cannot conflict with each other or with fds of other
restored files. We intend to alter find_unused_fd_pid to allow the allocation
of multiple regions of unused fds; this work is not yet complete.

# Restoring shared BOs - The Logic

CRIU already has a mechanism for handling files that have prerequisites for
being restored. Files that cannot be fully restored on the first try can
return 1, indicating that they are incomplete and need to be retried. We
extend this mechanism to plugin files to handle the shared BO case. Any device
file that needs to restore imported BOs will retry if the corresponding
exports haven't happened yet.
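
The retry convention can be illustrated with a simplified, single-process
model. Everything here is hypothetical: struct restore_item, restore_all and
the two example callbacks are not CRIU code, and the "give up when a pass makes
no progress" rule is an assumption added to keep the sketch from looping
forever. In the real restore, exporter and importer live in different
processes and the export arrives over the socket.

```c
#include <stddef.h>

/* Hypothetical file entry. restore() returns 0 when the file is fully
 * restored, 1 when it must be retried later, and <0 on a hard error. */
struct restore_item {
    int (*restore)(struct restore_item *item);
    int done;
    void *priv;
};

/* Keep retrying incomplete items until all are restored. Returns 0 on
 * success, -1 on a hard error or if a full pass makes no progress. */
int restore_all(struct restore_item *items, size_t n)
{
    size_t remaining = n;

    for (size_t i = 0; i < n; i++)
        items[i].done = 0;

    while (remaining > 0) {
        size_t progressed = 0;

        for (size_t i = 0; i < n; i++) {
            int ret;

            if (items[i].done)
                continue;
            ret = items[i].restore(&items[i]);
            if (ret < 0)
                return -1;
            if (ret == 0) {
                items[i].done = 1;
                progressed++;
            }
        }
        if (progressed == 0)
            return -1; /* nothing can complete: unresolved dependency */
        remaining -= progressed;
    }
    return 0;
}

/* Example callbacks: priv points at an int flag meaning "BO exported". */
int restore_exporter(struct restore_item *item)
{
    *(int *)item->priv = 1; /* stands in for exporting the dmabuf_fd */
    return 0;
}

int restore_importer(struct restore_item *item)
{
    /* Retry until the matching export has happened. */
    return *(int *)item->priv ? 0 : 1;
}
```

With an importer listed before its exporter, the first pass restores only the
exporter and the second pass completes the importer.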

Another option would be to do the plugin restore in two phases, with all the
exported BOs in the first phase and all the imports in the second phase. That
would be more efficient but would require more core CRIU changes, in particular
additional synchronization between restoring processes.

To support the retry approach, files-ext.c is modified to allow plugin files
to retry.

# TODO

1) Backwards compatibility

It is important to us that older versions of CRIU continue to work with new
kernels that include these changes. The existing amdkfd CRIU ioctl UAPI must
be extended in a way that does not break the ABI.

2) Checkpoint and restore of dmabuf fds themselves

The usual approach to IPC taken by AMD libraries is to export a dmabuf,
immediately send it to another process, and then close it. Therefore, it is
unlikely that a process will have a dmabuf_fd open during checkpoint. We
would still like to handle this case, however. That work has not been done yet.

This would require handling of dmabuf file descriptors in the amdgpu CRIU
plugin. It would support only dmabuf fds that were exported by amdgpu. For
each such dmabuf fd, it would have to store an identifier of the underlying
amdgpu BO that can be used during restore to recreate the dmabuf fd from the
restored BO.

The CRIU patch can also be found at
https://github.com/fdavid-amd/criu/tree/dmabuf-post