Re: Proposal to add CRIU support to DRM render nodes

Tvrtko Ursulin <tursulin@xxxxxxxxxxx> · Tue, 16 Apr 2024 15:04:35 +0100

On 01/04/2024 18:58, Felix Kuehling wrote:

On 2024-04-01 12:56, Tvrtko Ursulin wrote:

On 01/04/2024 17:37, Felix Kuehling wrote:
On 2024-04-01 11:09, Tvrtko Ursulin wrote:

On 28/03/2024 20:42, Felix Kuehling wrote:

On 2024-03-28 12:03, Tvrtko Ursulin wrote:

Hi Felix,

I had one more thought while browsing around the amdgpu CRIU 
plugin. It appears it relies on the KFD support being compiled in 
and /dev/kfd present, correct? AFAICT at least, it relies on that 
to figure out the amdgpu DRM node.

In would be probably good to consider designing things without 
that dependency. So that checkpointing an application which does 
not use /dev/kfd is possible. Or if the kernel does not even have 
the KFD support compiled in.

Yeah, if we want to support graphics apps that don't use KFD, we 
should definitely do that. Currently we get a lot of topology 
information from KFD, not even from the /dev/kfd device but from 
the sysfs nodes exposed by KFD. We'd need to get GPU device info 
from the render nodes instead. And if KFD is available, we may need 
to integrate both sources of information.

It could perhaps mean no more than adding some GPU discovery code 
into CRIU. Which shuold be flexible enough to account for things 
like re-assigned minor numbers due driver reload.

Do you mean adding GPU discovery to the core CRIU, or to the 
plugin. I was thinking this is still part of the plugin.

Yes I agree. I was only thinking about adding some DRM device 
discovery code in a more decoupled fashion from the current plugin, 
for both the reason discussed above (decoupling a bit from reliance 
on kfd sysfs), and then also if/when a new DRM driver might want to 
implement this the code could be move to some common plugin area.

I am not sure how feasible that would be though. The "gpu id" 
concept and it's matching in the current kernel code and CRIU plugin 
- is that value tied to the physical GPU instance or how it works?

The concept of the GPU ID is that it's stable while the system is up, 
even when devices get added and removed dynamically. It was baked 
into the API early on, but I don't think we ever fully validated 
device hot plug. I think the closest we're getting is with our latest 
MI GPUs and dynamic partition mode change.

Doesn't it read the saved gpu id from the image file while doing 
restore and tries to open the render node to match it? Maybe I am 
misreading the code.. But if it does, does it imply that in practice 
it could be stable across reboots? Or that it is not possible to 
restore to a different instance of maybe the same GPU model installed 
in a system?

Ah, the idea is, that when you restore on a different system, you may 
get different GPU IDs. Or you may checkpoint an app running on GPU 1 but 
restore it on GPU 2 on the same system. That's why we need to translate 
GPU IDs in restored applications. User mode still uses the old GPU IDs, 
but the kernel mode driver translates them to the actual GPU IDs of the 
GPUs that the process was restored on.

I see.. I think. Normal flow is ppd->user_gpu_id set during client init, 
but for restored clients it gets overriden during restore so that any 
further ioctls can actually not instantly fail.

And then in amdgpu_plugin_restore_file, when it is opening the render 
node, it relies on the kfd topology to have filled in (more or less) the 
target_gpu_id corresponding to the render node gpu id of the target GPU 
- the one associated with the new kfd gpu_id?

I am digging into this because I am trying to see if some part of GPU 
discovery could somehow be decoupled.. to offer you to work on at least 
that until you start to tackle the main body of the feature. But it 
looks properly tangled up.

Do you have any suggestions with what I could help with? Maybe 
developing some sort of drm device enumeration library if you see a way 
that would be useful in decoupling the device discovery from kfd. We 
would need to define what sort of information you would need to be 
queryable from it.

This also highlights another aspect on those spatially partitioned 
GPUs. GPU IDs identify device partitions, not devices. Similarly, 
each partition has its own render node, and the KFD topology info in 
sysfs points to the render-minor number corresponding to each GPU ID.

I am not familiar with this. This is not SR-IOV but some other kind of 
partitioning? Would you have any links where I could read more?

Right, the bare-metal driver can partition a PF spatially without SRIOV. 
SRIOV can also use spatial partitioning and expose each partition 
through its own VF, but that's not useful for bare metal. Spatial 
partitioning is new in MI300. There is some high-level info in this 
whitepaper: 
https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf.

From the outside (userspace) this looks simply like multiple DRM render 
nodes or how exactly?

Regards,

Tvrtko

Regards,
   Felix

Regards,

Tvrtko

Otherwise I am eagerly awaiting to hear more about the design 
specifics around dma-buf handling. And also seeing how to extend 
to other DRM related anonymous fds.

I've been pretty far under-water lately. I hope I'll find time to 
work on this more, but it's probably going to be at least a few weeks.

Got it.

Regards,

Tvrtko

Regards,
   Felix

Regards,

Tvrtko

On 15/03/2024 18:36, Tvrtko Ursulin wrote:

On 15/03/2024 02:33, Felix Kuehling wrote:

On 2024-03-12 5:45, Tvrtko Ursulin wrote:

On 11/03/2024 14:48, Tvrtko Ursulin wrote:

Hi Felix,

On 06/12/2023 21:23, Felix Kuehling wrote:
Executive Summary: We need to add CRIU support to DRM render 
nodes in order to maintain CRIU support for ROCm application 
once they start relying on render nodes for more GPU memory 
management. In this email I'm providing some background why 
we are doing this, and outlining some of the problems we need 
to solve to checkpoint and restore render node state and 
shared memory (DMABuf) state. I have some thoughts on the API 
design, leaning on what we did for KFD, but would like to get 
feedback from the DRI community regarding that API and to 
what extent there is interest in making that generic.

We are working on using DRM render nodes for virtual address 
mappings in ROCm applications to implement the CUDA11-style 
VM API and improve interoperability between graphics and 
compute. This uses DMABufs for sharing buffer objects between 
KFD and multiple render node devices, as well as between 
processes. In the long run this also provides a path to 
moving all or most memory management from the KFD ioctl API 
to libdrm.

Once ROCm user mode starts using render nodes for virtual 
address management, that creates a problem for checkpointing 
and restoring ROCm applications with CRIU. Currently there is 
no support for checkpointing and restoring render node state, 
other than CPU virtual address mappings. Support will be 
needed for checkpointing GEM buffer objects and handles, 
their GPU virtual address mappings and memory sharing 
relationships between devices and processes.

Eventually, if full CRIU support for graphics applications is 
desired, more state would need to be captured, including 
scheduler contexts and BO lists. Most of this state is 
driver-specific.

After some internal discussions we decided to take our design 
process public as this potentially touches DRM GEM and DMABuf 
APIs and may have implications for other drivers in the future.

One basic question before going into any API details: Is 
there a desire to have CRIU support for other DRM drivers?

This sounds like a very interesting feature on the overall, 
although I cannot answer on the last question here.

I forgot to finish this thought. I cannot answer / don't know 
of any concrete plans, but I think feature is pretty cool and 
if amdgpu gets it working I wouldn't be surprised if other 
drivers would get interested.

Thanks, that's good to hear!

Funnily enough, it has a tiny relation to an i915 feature I 
recently implemented on Mesa's request, which is to be able to 
"upload" the GPU context from the GPU hang error state and 
replay the hanging request. It is kind of (at a stretch) a 
very special tiny subset of checkout and restore so I am not 
mentioning it as a curiosity.

And there is also another partical conceptual intersect with 
the (at the moment not yet upstream) i915 online debugger. 
This part being in the area of discovering and enumerating GPU 
resources beloning to the client.

I don't see an immediate design or code sharing opportunities 
though but just mentioning.

I did spend some time reading your plugin and kernel 
implementation out of curiousity and have some comments and 
questions.

With that out of the way, some considerations for a possible 
DRM CRIU API (either generic of AMDGPU driver specific): The 
API goes through several phases during checkpoint and restore:

Checkpoint:

 1. Process-info (enumerates objects and sizes so user mode 
can allocate
    memory for the checkpoint, stops execution on the GPU)
 2. Checkpoint (store object metadata for BOs, queues, etc.)
 3. Unpause (resumes execution after the checkpoint is complete)

Restore:

 1. Restore (restore objects, VMAs are not in the right place 
at this time)
 2. Resume (final fixups after the VMAs are sorted out, 
resume execution)

Btw is check-pointing guaranteeing all relevant activity is 
idled? For instance dma_resv objects are free of fences which 
would need to restored for things to continue executing 
sensibly? Or how is that handled?

In our compute use cases, we suspend user mode queues. This can 
include CWSR (compute-wave-save-restore) where the state of 
in-flight waves is stored in memory and can be reloaded and 
resumed from memory later. We don't use any fences other than 
"eviction fences", that are signaled after the queues are 
suspended. And those fences are never handed to user mode. So we 
don't need to worry about any fence state in the checkpoint.

If we extended this to support the kernel mode command 
submission APIs, I would expect that we'd wait for all current 
submissions to complete, and stop new ones from being sent to 
the HW before taking the checkpoint. When we take the checkpoint 
in the CRIU plugin, the CPU threads are already frozen and 
cannot submit any more work. If we wait for all currently 
pending submissions to drain, I think we don't need to save any 
fence state because all the fences will have signaled. (I may be 
missing some intricacies and I'm afraid it may not be that 
simple in reality, but that's my opening bid. ;)

It feels feasible to me too, for the normally behaving clients at 
least.

Presumably, given that the whole checkpointing is not instant, it 
would be okay to wait a second or two longer for the in-progress 
submissions complete. After which kernel would need to prune all 
signalled fences from the respective container objects before 
checkpointing.

For the "misbehaving" clients who have perhaps queued up too much 
work, either still in the scheduler with unsatisfied 
dependencies, or already submitted to the hardware and/or driver 
backend, is there a timeout concept in CRIU so it would be 
possible to say something like "try to checkpoint but if the 
kernel says no time period t then give up"?

For some more background about our implementation in KFD, you 
can refer to this whitepaper: 
https://github.com/checkpoint-restore/criu/blob/criu-dev/plugins/amdgpu/README.md

Potential objections to a KFD-style CRIU API in DRM render 
nodes, I'll address each of them in more detail below:

  * Opaque information in the checkpoint data that user mode 
can't
    interpret or do anything with
  * A second API for creating objects (e.g. BOs) that is 
separate from
    the regular BO creation API
  * Kernel mode would need to be involved in restoring BO 
sharing
    relationships rather than replaying BO creation, export 
and import
    from user mode

# Opaque information in the checkpoint

This comes out of ABI compatibility considerations. Adding 
any new objects or attributes to the driver/HW state that 
needs to be checkpointed could potentially break the ABI of 
the CRIU checkpoint/restore ioctl if the plugin needs to 
parse that information. Therefore, much of the information in 
our KFD CRIU ioctl API is opaque. It is written by kernel 
mode in the checkpoint, it is consumed by kernel mode when 
restoring the checkpoint, but user mode doesn't care about 
the contents or binary layout, so there is no user mode ABI 
to break. This is how we were able to maintain CRIU support 
when we added the SVM API to KFD without changing the CRIU 
plugin and without breaking our ABI.

Opaque information may also lend itself to API abstraction, 
if this becomes a generic DRM API with driver-specific 
callbacks that fill in HW-specific opaque data.

This feels sound in principle to me. Fundamentally the state 
is very hardware specfic, and/or driver version specific, so I 
don't see anything could be gained in practice by making it 
much less opaque. (Apart from making things more complicated.)

I was however unsure of the current split of how you dump 
buffer objects with some data in the defined bo structure, and 
some in completely opaque private data. Is there a benefit to 
that split, or maybe in other words, is there a benefit on 
having part transparent and part opaque for buffer objects?

Some of the buffer object state is needed by the plugin. E.g. 
the size and mmap offset are needed to match VMAs with BOs. I'd 
have to review the plugin in detail to prove that all the fields 
are, in fact, needed by the plugin, but that was the intent. 
Anything that the plugin doesn't need to know should be in the 
opaque data structures.

Right, got it.

Would it make sense to make the opaque data in the same block as 
the defined one? I mean instead of separating the two in the 
binary image for instance have struct kfd_criu_bo_bucket have a 
trailing priv_data blob? Maybe it is too late now if the image 
format is not versioned or something.

To slightly touch upon the question of whether this could 
become a generic DRM API. It feels it would be hard to do it 
from the start. What sounds more feasible is if/when generic 
looking helpers can be spotted while developing the RFC then 
potentially structure the code they can easily be promoted to 
shared/common at some future moment.

Yes, that's how this usually goes, in my experience. Thanks for 
confirming.

# Second API for creating objects

Creating BOs and other objects when restoring a checkpoint 
needs more information than the usual BO alloc and similar 
APIs provide. For example, we need to restore BOs with the 
same GEM handles so that user mode can continue using those 
handles after resuming execution. If BOs are shared through 
DMABufs without dynamic attachment, we need to restore pinned 
BOs as pinned. Validation of virtual addresses and handling 
MMU notifiers must be suspended until the virtual address 
space is restored. For user mode queues we need to save and 
restore a lot of queue execution state so that execution can 
resume cleanly.

This also sounds justified to me. Restore creating all 
internal objects is definitely special and sounds better to 
add uapi to create them directly with the correct properties, 
than to add uapi to adjust internal properties after creation. 
And in case you would always need some new uapi - so at least 
to adjust after creation. At which point you may have both in 
one. Internally implementation can be split or common, 
whatever makes sense for a given object type, but new uapi 
definitely sounds is required.
# Restoring buffer sharing relationships

Different GEM handles in different render nodes and processes 
can refer to the same underlying shared memory, either by 
directly pointing to the same GEM object, or by creating an 
import attachment that may get its SG tables invalidated and 
updated dynamically through dynamic attachment callbacks. In 
the latter case it's obvious, who is the exporter and who is 
the importer. In the first case, either one could be the 
exporter, and it's not clear who would need to create the BO 
and who would need to 

To see if I follow the former case correctly.

This could be two clients A and B, where B has imported a 
dma-buf shared BO from A and has since closed the dma-buf fd? 
Which results in a single BO with reference count of 2 and 
obj->import_attach unset. History of who created the object is 
lost.

Yes. In the amdgpu driver this happens when the exporter and 
import device are the same.

In fact it could even be that two imported objects remain 
(clients A, B and C) and A, who originally created the BO, has 
since fully released it. So any kind of "creator" tracking if 
added wouldn't be fully reliable either.

That's a good point.

import it when restoring the checkpoint. To further 
complicate things, multiple processes in a checkpoint get 
restored concurrently. So there is no guarantee that an 
exporter has restored a shared BO at the time an importer is 
trying to restore its import.

A proposal to deal with these problems would be to treat 
importers and exporters the same. Whoever restores first, 
ends up creating the BO and potentially attaching to it. The 
other process(es) can find BOs that were already restored by 
another process by looking it up with a unique ID that could 
be based on the DMABuf inode number. An alternative would be 
a two-pass approach that needs to restore BOs on two passes:

 1. Restore exported BOs
 2. Restore imports

With some inter-process synchronization in CRIU itself 
between these two passes. This may require changes in the 
core CRIU, outside our plugin. Both approaches depend on 
identifying BOs with some unique ID that could be based on 
the DMABuf inode number in the checkpoint. However, we would 
need to identify the processes in the same restore session, 
possibly based on parent/child process relationships, to 
create a scope where those IDs are valid during restore.

If my understanding above is on the right track, then I think 
this is the only thing which can be done (for all scenarios).

I presented two alternatives. I think you're in favor of the 
first one, where it doesn't matter who is the importer and 
exporter. I think the two-pass approach requires that you can 
identify an exporter. And as you pointed out, the exporter may 
already have dropped their reference to the BO.

Yep.

I also *think* it would be safe. At least at the moment I 
cannot think what could go wrong. Semantics are that it 
doesn't really matter who created the object.

I would agree. What matters is that the object is recreated on 
the correct device, and that all the direct references and 
import attachments pointing to it are restored.

Finally, we would also need to checkpoint and restore DMABuf 
file descriptors themselves. These are anonymous file 
descriptors. The CRIU plugin could probably be taught to 
recreate them from the original exported BO based on the 
inode number that could be queried with fstat in the 
checkpoint. It would need help from the render node CRIU API 
to find the right BO from the inode, which may be from a 
different process in the same restore session.

This part feels like it is breaking the component separation a 
bit because even for buffers fully owned by amdgpu, strictly 
speaking the dma-buf fd is not. At least my understanding from 
the above is that you propose to attempt to import the fd, 
from the kernel side, during the checkpoint process? Like:

Checkpoint:

CRIU for each anon fd:
   amdgpu_plugin(fd)
     -> attempt in kernel dma buf import (passes fd to kernel 
via ioctl?)
         -> is it ours? (no -> error)
             -> create a record mapping fd number to amdgpu BO

Restore:

for each dma-buf fd record:
    create BO if does not exists
    export BO to same fd
    close BO handle if not in regular BO handle records

Or since you mention lookup by inode, that would need to be 
dmabuf_plugin so it can lookup inodes in the private mount 
space. However how would it co-operate with amdgpu_plugin is 
not clear to me.

The way I think about the ownership is, whichever driver created 
the underlying BO owns the checkpointing of the dmabuf. You need 
driver-specific information to link the dmabuf with the driver's 
BO and you need the right driver to recreate the BO and the 
dmabuf fd when restoring the checkpoint.

It gets really interesting if you have an amdgpu plugin and an 
i915 plugin, and they checkpoint an application that shares BOs 
between the two devices through DMABufs. E.g. if i915 created a 
BO and amdgpu imported it, then during restore, i915 needs to 
restore the dmabuf before the amdgpu import of it can be 
restored. I think that brings us back to a two-phase approach to 
restoring the memory sharing relationships. Uff.

I think this part of the discussion somewhat depends on the 
previous part about idling. If it is feasible to completely idle 
and prune, and fail if that is not happening quickly enough, then 
maybe there wouldn't be too much hierarchical state to save.

Otherwise my idea was that there is a top-level drm_plugin.so 
which understands amdgpu fds, i915, syncobj, sync_file, and uses 
some new uapi to uniquely identify each, associate with the 
correct driver, and then internally dispatches to 
amdgpu|i915|dmabuf|..._plugin.so. Building the in memory 
representation of their relationships. As long as all objects and 
their relationships have been recorded I think everything could 
then be correctly restored.

It is possible there is flaw in my thinking and that something in 
CRIU design would make this impossible? I think it would require 
the top-level drm_plugin.so to hold all state in memory until the 
whole checkpointing is done, and then verify something is not 
incomplete, failing it all if it was. (For instance one plugin 
discovered an reference to an object which was not discoverd by 
any other plugin or things like that.) May need some further 
tweaks to CRIU common code.

Maybe I need to better understand how exactly you mean to query 
the DRM driver about random anonymous fds. I see it as a problem 
in the design, possibly even implementation, but maybe I am 
missing something which makes it not so. I mean even with my 
general idea I don't know how would one determine which driver to 
query about a particular anonymous inode.

I later also realised that I was maybe increasing the scope for 
you here. :) You did state focus is ROCm applications which 
possibly doesn't care about dma-resv, fences, syncobjs etc?

That's my focus for now. But I don't want to engineer a solution 
that would preclude your use cases in the future.

But I think the "how to handle dma-bufs" design question is 
still relevant and interesting. For example I had this thought 
that perhaps what would be needed is a CRIU plugin hierarchy.

Because fundamentally we would be snapshoting a hierarcy of 
kernel objects belonging to different drivers (kfd, amdgpu, 
dma-buf, ...). And if one day someone would to try to handle 
dma fences and drm syncobjs, the argument for a hierarchial 
design would be even stronger I think.

Something like a drm_plugin.so could call sub-plugins (amdgpu, 
dma-buf, sync file, ...) and internally build the 
representation of the whole state and how the relationship 
between the objects.

Maybe. I guess a structure similar to libdrm makes sense. I'm 
not sure it's strictly a hierarchy. Maybe more like some common 
code shared by multiple GPU driver plugins. I think the common 
checkpoint state is quite limited and restoring it requires the 
GPU-specific drivers anyway.

Also the idea of building a representation of the whole state 
doesn't work well with the CRIU design, because "the whole 
state" can include multiple processes that restore themselves 
concurrently and only synchronize with each other in a few 
places in the restore process. I feel, if we can work out how to 
checkpoint and restore shared objects between processes, we can 
do the same for shared objects between drivers without imposing 
a strict hierarchy and some omniscient entity that needs to know 
"the whole state".

Okay, this continues on the same problem space as above. And you 
obviously know how CRIU works much better than me.

With that kind of design there probably would be a need to 
define some common kernel side api and uapi, so all involved 
objects can be enumerated with some unique ids etc.

Now.. the counter argument.. the more state from different 
drivers would one want to handle the bigger this project would 
get. Would it even be feasible is the question, to the point 
that it may be simpler to just run the workload in a VM via 
SR-IOV and simply hibernate the whole thin guest. :)

Well, CRIU kind of tries to do that, but with containers instead 
of VMs. ;)

It would definitely be useful for hardware and drivers without 
SR-IOV support so lets hope it is doable. :)

Regards,

Tvrtko