Re: Proposal to add CRIU support to DRM render nodes

Felix Kuehling <felix.kuehling@xxxxxxx> · Mon, 15 Jan 2024 13:58:18 -0500

I haven't seen any replies to this proposal. Either it got lost in the 
pre-holiday noise, or there is genuinely no interest in this.

If it's the latter, I would look for an AMDGPU driver-specific solution 
with minimally invasive changes in DRM and DMABuf code, if needed. It 
could be generalized later if there is interest.

Regards,
  Felix

On 2023-12-06 16:23, Felix Kuehling wrote:
Executive Summary: We need to add CRIU support to DRM render nodes in 
order to maintain CRIU support for ROCm applications once they start 
relying on render nodes for more GPU memory management. In this email 
I'm providing some background on why we are doing this, and outlining 
some of the problems we need to solve to checkpoint and restore render 
node state and shared memory (DMABuf) state. I have some thoughts on 
the API design, leaning on what we did for KFD, but would like to get 
feedback from the DRI community regarding that API and to what extent 
there is interest in making that generic.

We are working on using DRM render nodes for virtual address mappings 
in ROCm applications to implement the CUDA11-style VM API and improve 
interoperability between graphics and compute. This uses DMABufs for 
sharing buffer objects between KFD and multiple render node devices, 
as well as between processes. In the long run this also provides a 
path to moving all or most memory management from the KFD ioctl API to 
libdrm.

Once ROCm user mode starts using render nodes for virtual address 
management, that creates a problem for checkpointing and restoring 
ROCm applications with CRIU. Currently there is no support for 
checkpointing and restoring render node state, other than CPU virtual 
address mappings. Support will be needed for checkpointing GEM buffer 
objects and handles, their GPU virtual address mappings and memory 
sharing relationships between devices and processes.

Eventually, if full CRIU support for graphics applications is desired, 
more state would need to be captured, including scheduler contexts and 
BO lists. Most of this state is driver-specific.

After some internal discussions we decided to take our design process 
public as this potentially touches DRM GEM and DMABuf APIs and may 
have implications for other drivers in the future.

One basic question before going into any API details: Is there a 
desire to have CRIU support for other DRM drivers?

With that out of the way, some considerations for a possible DRM CRIU 
API (either generic or AMDGPU driver-specific): The API goes through 
several phases during checkpoint and restore:

Checkpoint:

 1. Process-info (enumerates objects and sizes so user mode can
    allocate memory for the checkpoint, stops execution on the GPU)
 2. Checkpoint (store object metadata for BOs, queues, etc.)
 3. Unpause (resumes execution after the checkpoint is complete)

Restore:

 1. Restore (restore objects, VMAs are not in the right place at this
    time)
 2. Resume (final fixups after the VMAs are sorted out, resume execution)
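
To make the ordering of these phases concrete, here is a rough sketch of them as a small state machine. It is only an illustration: the phase names mirror the lists above, while `CriuSession`, `ALLOWED` and `advance` are hypothetical names and not part of any existing KFD or DRM interface.

```python
from enum import Enum, auto


class Phase(Enum):
    IDLE = auto()
    PROCESS_INFO = auto()  # enumerate objects and sizes, stop GPU execution
    CHECKPOINT = auto()    # store object metadata for BOs, queues, etc.
    UNPAUSE = auto()       # resume execution after the checkpoint
    RESTORE = auto()       # recreate objects; VMAs are not in place yet
    RESUME = auto()        # final fixups, then resume execution


# Hypothetical transition table: which phase may follow which.
ALLOWED = {
    Phase.IDLE: {Phase.PROCESS_INFO, Phase.RESTORE},
    Phase.PROCESS_INFO: {Phase.CHECKPOINT},
    Phase.CHECKPOINT: {Phase.UNPAUSE},
    Phase.UNPAUSE: set(),
    Phase.RESTORE: {Phase.RESUME},
    Phase.RESUME: set(),
}


class CriuSession:
    """Toy model of one checkpoint or restore sequence on a render node."""

    def __init__(self, gpu_running):
        self.phase = Phase.IDLE
        self.gpu_running = gpu_running

    def advance(self, nxt):
        if nxt not in ALLOWED[self.phase]:
            raise ValueError(f"{nxt.name} not allowed after {self.phase.name}")
        if nxt in (Phase.PROCESS_INFO, Phase.RESTORE):
            self.gpu_running = False  # execution is stopped in these phases
        elif nxt in (Phase.UNPAUSE, Phase.RESUME):
            self.gpu_running = True   # execution resumes
        self.phase = nxt
```

A checkpoint would walk PROCESS_INFO, CHECKPOINT, UNPAUSE; a restore would walk RESTORE, RESUME. Anything else is rejected in this model.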

For some more background about our implementation in KFD, you can 
refer to this whitepaper: 
https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md

Potential objections to a KFD-style CRIU API in DRM render nodes (I'll 
address each of them in more detail below):

  * Opaque information in the checkpoint data that user mode can't
    interpret or do anything with
  * A second API for creating objects (e.g. BOs) that is separate from
    the regular BO creation API
  * Kernel mode would need to be involved in restoring BO sharing
    relationships rather than replaying BO creation, export and import
    from user mode

# Opaque information in the checkpoint

This comes out of ABI compatibility considerations. Adding any new 
objects or attributes to the driver/HW state that needs to be 
checkpointed could potentially break the ABI of the CRIU 
checkpoint/restore ioctl if the plugin needs to parse that 
information. Therefore, much of the information in our KFD CRIU ioctl 
API is opaque. It is written by kernel mode in the checkpoint, it is 
consumed by kernel mode when restoring the checkpoint, but user mode 
doesn't care about the contents or binary layout, so there is no user 
mode ABI to break. This is how we were able to maintain CRIU support 
when we added the SVM API to KFD without changing the CRIU plugin and 
without breaking our ABI.

Opaque information may also lend itself to API abstraction, if this 
becomes a generic DRM API with driver-specific callbacks that fill in 
HW-specific opaque data.
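
As a rough illustration of why opaque data keeps the user mode ABI stable, the sketch below has a plugin walk checkpoint records using only a small fixed header and copy the payload without interpreting it. The record layout (`HEADER`, an object type plus a payload size) and the function names are assumptions made up for this example; they are not the actual KFD CRIU layout.

```python
import struct

# Hypothetical record layout: a small fixed header that user mode may parse,
# followed by an opaque payload that only kernel mode interprets.
HEADER = struct.Struct("<II")  # (object_type, opaque_size)


def pack_record(object_type, opaque):
    """Stand-in for what kernel mode would write at checkpoint time."""
    return HEADER.pack(object_type, len(opaque)) + opaque


def split_records(image):
    """What a plugin could do: walk records using only the header."""
    records, offset = [], 0
    while offset < len(image):
        object_type, size = HEADER.unpack_from(image, offset)
        offset += HEADER.size
        # The payload is copied as-is; its contents are never inspected.
        records.append((object_type, image[offset:offset + size]))
        offset += size
    return records
```

If kernel mode later grows the payload for an object type, `split_records` keeps working unchanged, because it depends only on the size field and never on the payload's layout.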

# Second API for creating objects

Creating BOs and other objects when restoring a checkpoint needs more 
information than the usual BO alloc and similar APIs provide. For 
example, we need to restore BOs with the same GEM handles so that user 
mode can continue using those handles after resuming execution. If BOs 
are shared through DMABufs without dynamic attachment, we need to 
restore pinned BOs as pinned. Validation of virtual addresses and 
handling of MMU notifiers must be suspended until the virtual address 
space is restored. For user mode queues we need to save and restore a 
lot of queue execution state so that execution can resume cleanly.
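
The GEM handle requirement can be illustrated with a toy handle table. The regular creation path lets the table pick the handle, whereas a restore path must be able to dictate it. `HandleTable`, `create` and `restore_at` are hypothetical names for this sketch and do not correspond to existing DRM functions.

```python
class HandleTable:
    """Toy model of a per-file GEM handle namespace (handles start at 1)."""

    def __init__(self):
        self._objects = {}

    def create(self, bo):
        """Regular path: the table picks the lowest free handle."""
        handle = 1
        while handle in self._objects:
            handle += 1
        self._objects[handle] = bo
        return handle

    def restore_at(self, handle, bo):
        """Restore path: the caller dictates the handle from the checkpoint."""
        if handle < 1:
            raise ValueError("invalid handle")
        if handle in self._objects:
            raise FileExistsError(f"handle {handle} already in use")
        self._objects[handle] = bo
        return handle

    def lookup(self, handle):
        return self._objects[handle]
```

Replaying `create` from user mode would only reproduce the original handles if objects were created and destroyed in exactly the same order as before the checkpoint, which is why the regular allocation API alone is not enough here.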

# Restoring buffer sharing relationships

Different GEM handles in different render nodes and processes can 
refer to the same underlying shared memory, either by directly 
pointing to the same GEM object, or by creating an import attachment 
that may get its SG tables invalidated and updated dynamically through 
dynamic attachment callbacks. In the latter case it's obvious who is 
the exporter and who is the importer. In the former case, either one 
could be the exporter, and it's not clear who would need to create the 
BO and who would need to import it when restoring the checkpoint. To 
further complicate things, multiple processes in a checkpoint get 
restored concurrently. So there is no guarantee that an exporter has 
restored a shared BO at the time an importer is trying to restore its 
import.

A proposal to deal with these problems would be to treat importers and 
exporters the same. Whoever restores first ends up creating the BO 
and potentially attaching to it. The other process(es) can find BOs 
that were already restored by another process by looking them up with 
a unique ID that could be based on the DMABuf inode number. An 
alternative would be a two-pass approach that restores BOs in two 
passes:

 1. Restore exported BOs
 2. Restore imports

This would need some inter-process synchronization in CRIU itself 
between these two passes, which may require changes in core CRIU, 
outside our plugin. Both approaches depend on identifying BOs with 
some unique ID 
that could be based on the DMABuf inode number in the checkpoint. 
However, we would need to identify the processes in the same restore 
session, possibly based on parent/child process relationships, to 
create a scope where those IDs are valid during restore.
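
The "whoever restores first creates the BO" idea can be sketched as a lookup-or-create registry scoped to one restore session. Threads stand in for the concurrently restored processes here, and `RestoreSession` and `get_or_create` are hypothetical names; how the real scope would be established is exactly the open question above.

```python
import threading


class RestoreSession:
    """Toy model of a restore-session scope in which shared-BO IDs are valid."""

    def __init__(self):
        self._lock = threading.Lock()
        self._bos = {}  # unique ID (e.g. derived from DMABuf inode) -> BO

    def get_or_create(self, unique_id, create):
        """Return (bo, created). Whoever arrives first creates the BO."""
        with self._lock:
            if unique_id in self._bos:
                return self._bos[unique_id], False
            bo = create()
            self._bos[unique_id] = bo
            return bo, True
```

With this shape, importers and exporters run the same code and no ordering between them is needed. The two-pass alternative would instead need a barrier between "restore exported BOs" and "restore imports".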

Finally, we would also need to checkpoint and restore DMABuf file 
descriptors themselves. These are anonymous file descriptors. The CRIU 
plugin could probably be taught to recreate them from the original 
exported BO based on the inode number that could be queried with fstat 
in the checkpoint. It would need help from the render node CRIU API to 
find the right BO from the inode, which may be from a different 
process in the same restore session.
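
For illustration, querying the inode behind a file descriptor with fstat and grouping descriptors that refer to the same file could look like the sketch below. It works on any file descriptor; using it on DMABuf fds assumes, as suggested above, that the inode number is a usable identity for the underlying buffer. The helper names are made up for this example.

```python
import os


def fd_inode(fd):
    """Return the inode number behind a file descriptor, via fstat."""
    return os.fstat(fd).st_ino


def group_fds_by_inode(fds):
    """Group descriptors that refer to the same underlying file."""
    groups = {}
    for fd in fds:
        groups.setdefault(fd_inode(fd), []).append(fd)
    return groups
```

At checkpoint time the plugin could record `fd_inode(fd)` for each DMABuf fd. At restore time that recorded number would only serve as a lookup key, since the recreated file will generally get a different inode.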

Regards,
  Felix