Re: [PATCH] drm/ttm: Don't inherit GEM object VMAs in child process

Christian König <ckoenig.leichtzumerken@xxxxxxxxx> · Wed, 5 Jan 2022 09:08:57 +0100

Am 04.01.22 um 19:08 schrieb Felix Kuehling:
[+Adrian]

Am 2021-12-23 um 2:05 a.m. schrieb Christian König:

Am 22.12.21 um 21:53 schrieb Daniel Vetter:
On Mon, Dec 20, 2021 at 01:12:51PM -0500, Bhardwaj, Rajneesh wrote:

[SNIP]
Still sounds funky. I think minimally we should have an ack from CRIU
developers that this is officially the right way to solve this
problem. I
really don't want to have random one-off hacks that don't work across
the
board, for a problem where we (drm subsystem) really shouldn't be the
only
one with this problem. Where "this problem" means that the mmap space is
per file description, and not per underlying inode or real device or
whatever. That part sounds like a CRIU problem, and I expect CRIU folks
want a consistent solution across the board for this. Hence please
grab an
ack from them.
Unfortunately it's a KFD design problem. AMD used a single device
node, then mmaped different objects from the same offset to different
processes and expected it to work the rest of the fs subsystem without
churn.
This may be true for mmaps in the KFD device, but not for mmaps in the
DRM render nodes.

Correct, yes.

So yes, this is indeed because the mmap space is per file descriptor
for the use case here.
No. This is a different problem.

I was already wondering which mmaps through the KFD node we have left 
which cause problems here.

The problem has to do with the way that DRM manages mmap permissions. In
order to be able to mmap an offset in the render node, there needs to be
a BO that was created in the same render node. If you fork a process, it
inherits the VMA.

Yeah, so far it works like designed.

But KFD doesn't know anything about the inherited BOs
from the parent process.

Ok, why that? When the KFD is reinitializing it's context why shouldn't 
it cleanup those VMAs?

Therefore those BOs don't get checkpointed and
restored in the child process. When the CRIU checkpoint is restored, our
CRIU plugin never creates a BO corresponding to the VMA in the child
process' render node FD. We've also lost the relationship between the
parent and child-process' render node FDs. After "fork" the render node
FD points to the same struct file in parent and child. After restoring
the CRIU checkpoint, they are separate struct files, created by separate
"open" system calls. Therefore the mmap call that restores the VMA fails
in the child process.

At least for KFD, there is no point inheriting BOs from a child process,
because the GPU has no way of accessing the BOs in the child process.
The child process has no GPU address space, no user mode queues, no way
to do anything with the GPU before it completely reinitializes its KFD
context.

We can workaround this issue in user mode with madvise(...,
MADV_DONTFORK). In fact we've already done this for some BOs to avoid a
memory leak in the parent process while a child process exists. But it's
slightly racy because there is a short time window where VMA exists
without the VM_DONTCOPY flag. A fork during that time window could still
create a child process with an inherited VMA.

Therefore a safer solution is to set the vm_flags in the VMA in the
driver when the VMA is first created.

Thanks for the full explanation, it makes much more sense now.

Regards,
Christian.

Regards,
   Felix

And thanks for pointing this out, this indeed makes the whole change
extremely questionable.

Regards,
Christian.

Cheers, Daniel