Am 2022-01-07 um 3:56 a.m. schrieb Christian König: > Am 06.01.22 um 17:51 schrieb Felix Kuehling: >> Am 2022-01-06 um 11:48 a.m. schrieb Christian König: >>> Am 06.01.22 um 17:45 schrieb Felix Kuehling: >>>> Am 2022-01-06 um 4:05 a.m. schrieb Christian König: >>>> [SNIP] >>>> Also, why does your ACK or NAK depend on this at all. If it's the >>>> right >>>> thing to do, it's the right thing to do regardless of who benefits >>>> from >>>> it. In addition, how can a child process that doesn't even use the GPU >>>> be in violation of any GPU-driver related specifications. >>> The argument is that the application is broken and needs to be fixed >>> instead of worked around inside the kernel. >> I still don't get how they the application is broken. Like I said, the >> child process is not using the GPU. How can the application be fixed in >> this case? > > Sounds like I'm still missing some important puzzle pieces for the > full picture to figure out why this doesn't work. > >> Are you saying, any application that forks and doesn't immediately call >> exec is broken? > > More or less yes. We essentially have three possible cases here: > > 1. Application is already using (for example) OpenGL or OpenCL to do > some rendering on the GPU and then calls fork() and expects to use > OpenGL both in the parent and the child at the same time. > As far as I know that's illegal from the Khronos specification > point of view and working around inside the kernel for something not > allowed in the first place is seen as futile effort. > > 2. Application opened the file descriptor, forks and then initializes > OpenGL/Vulkan/OpenCL. > That's what some compositors are doing to drop into the backround > and is explicitly legal. > > 3. Application calls fork() and then doesn't use the GPU in the child. > Either by not using it or calling exec. > That should be legal and not cause any problems in the first place. > > But from your description I still don't get why we are running into > problems here. > > I was assuming that you have case #1 because we previously had some > examples of this with this python library, but from your description > it seems to be case #3. Correct. #3 has at least one issue we previously worked around in the Thunk: The inherited VMAs prevent BOs from being freed in the parent process. This manifests as an apparent memory leak. Therefore the Thunk calls madvise(..., MADV_DONTFORK) on all its VRAM allocation. The BOs that are causing problems with CRIU are GTT BOs that weren't covered by this previous workaround. The new issue with CRIU is, that CRIU saves and restores all the VMAs. When trying to restore the inherited VMAs in the child process, the mmap call fails because the child process' render node FD is no longer inherited from the parent, but is instead created by its own "open" system call. The mmap call in the child fails for at least two reasons: * The child process' render node FD doesn't have permission to map the parent process' BO any more * The parent BO doesn't get restored in the child process, so its mmap offset doesn't get updated to the new mmap offset of the restored parent BO by the amdgpu CRIU plugin We could maybe add a whole bunch of complexity in CRIU and our CRIU plugin to fix this. But it's pointless because like you said, actually doing anything with the BO in the child process is illegal anyway (scenario #1 above). The easiest solution seems to be, to just not inherit the VMA in the first place. Regards, Felix > >> Or does an application that forks need to be aware that some other part >> of the application used the GPU and explicitly free any GPU resources? > > Others might fill that information in, but I think that was the plan > for newer APIs like Vulkan. > > Regards, > Christian. > >> >> Thanks, >> Felix >> >> >>> Regards, >>> Christian. >>> >>>> Regards, >>>> Felix >>>> >>>> >>>>> Let's talk about this on Mondays call. Thanks for giving the whole >>>>> context. >>>>> >>>>> Regards, >>>>> Christian. >>>>> >>>>>> Regards, >>>>>> Felix >>>>>> >