On 2021-05-01 at 1:03 p.m., Adrian Reber wrote:
> On Fri, Apr 30, 2021 at 09:57:45PM -0400, Felix Kuehling wrote:
>> We have been working on a prototype supporting CRIU (Checkpoint/Restore
>> In Userspace) for accelerated compute applications running on AMD GPUs
>> using ROCm (Radeon Open Compute Platform). We're happy to finally share
>> this work publicly to solicit feedback and advice. The end-goal is to
>> get this work included upstream in Linux and CRIU. A short whitepaper
>> describing our design and intention can be found on Github:
>> https://github.com/RadeonOpenCompute/criu/tree/criu-dev/test/others/ext-kfd/README.md
>>
>> We have RFC patch series for the kernel (based on Alex Deucher's
>> amd-staging-drm-next branch) and for CRIU including a new plugin and a
>> few core CRIU changes. I will send those to the respective mailing lists
>> separately in a minute. They can also be found on Github.
>>
>> CRIU+plugin: https://github.com/RadeonOpenCompute/criu/commits/criu-dev
>> Kernel (KFD):
>> https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/commits/fxkamd/criu-wip
>>
>> At this point this is very much a work in progress and not ready for
>> upstream inclusion. There are still several missing features, known
>> issues, and open questions that we would like to start addressing with
>> your feedback.
>>
>> What's working and tested at this point:
>>
>> * Checkpoint and restore accelerated machine learning apps: PyTorch
>>   running Bert on systems with 1 or 2 GPUs (MI50 or MI100), 100%
>>   unmodified user mode stack
>> * Checkpoint on one system, restore on a different system
>> * Checkpoint on one GPU, restore on a different GPU
> This is very impressive. As far as I know this is the first larger
> plugin written for CRIU and publicly published. It is also the first GPU
> supported and people have been asking for this for many years. It is in
> fact the first hardware device supported through a plugin.
>
>> Major Known issues:
>>
>> * The KFD ioctl API is not final: Needs a complete redesign to allow
>>   future extension without breaking the ABI
>> * Very slow: Need to implement DMA to dump VRAM contents
>>
>> Missing or incomplete features:
>>
>> * Support for the new KFD SVM API
>> * Check device topology during restore
>> * Checkpoint and restore multiple processes
>> * Support for applications using Mesa for video decode/encode
>> * Testing with more different GPUs and workloads
>>
>> Big Open questions:
>>
>> * What's the preferred way to publish our CRIU plugin? In-tree or
>>   out-of-tree?
> I would do it in-tree.
>
>> * What's the preferred way to distribute our CRIU plugin? Source?
>>   Binary .so? Whole CRIU? Just in-box support?
> As you are planning to publish the source I would make it part of the
> CRIU repository and this way it will find its way to the packages in the
> different distributions.

Thanks. These are the answers I was hoping for.

>
> Does the plugin require any additional dependencies? If there is no
> additional dependency on a library, the plugin can easily be part of
> the existing packages.

The DMA solution we're considering for saving VRAM contents would add a
dependency on libdrm and libdrm-amdgpu.

>
>> * If our plugin can be upstreamed in the CRIU tree, what would be the
>>   right directory?
> I would just put it into criu/plugins/

Sounds good.

>
> It would also be good to have your patchset submitted as a PR on github
> to have our normal CI test coverage of the changes.

We'll probably have to recreate our repository to start as a fork of the
upstream CRIU repository, so that we can easily send pull requests.
We're probably not going to be ready for upstreaming for a few more
months. Do you want to get occasional pull requests anyway, just to
run CI on our work-in-progress code?

Regards,
  Felix

>
> Adrian