Re: [RFC PATCH 0/4] Add support for DRM cgroup memory accounting.

Maarten Lankhorst <maarten.lankhorst@xxxxxxxxxxxxxxx> · Wed, 10 May 2023 16:59:01 +0200

    Hey,

    On 2023-05-05 21:50, Tejun Heo wrote:

      Hello,

On Wed, May 03, 2023 at 10:34:56AM +0200, Maarten Lankhorst wrote:

        RFC as I'm looking for comments.

For long running compute, it can be beneficial to partition the GPU memory
between cgroups, so each cgroup can use its maximum amount of memory without
interfering with other scheduled jobs. Done properly, this can alleviate the
need for eviction, which might result in a job being terminated if the GPU
doesn't support mid-thread preemption or recoverable page faults.

This is done by adding a bunch of knobs to cgroup:
drm.capacity: Shows maximum capacity of each resource region.
drm.max: Display or limit max amount of memory.
drm.current: Current amount of memory in use.

TTM has not been made cgroup aware yet, so instead of evicting from
the current cgroup to stay within the cgroup limits, it simply returns
the error -ENOSPC to userspace.

I've used Tvrtko's cgroup controller series as a base, but it implemented
scheduling weight, not memory accounting, so I only ended up keeping the
base patch.

Xe is not upstream yet, so the driver specific patch will only apply on
https://gitlab.freedesktop.org/drm/xe/kernel

      Some high-level feedbacks.

* There have been multiple attempts at this but the track record is kinda
  poor. People don't seem to agree what should constitute DRM memory and how
  they should be accounted / controlled.

    Thanks for the feedback.

I think for a lot of drivers, what is VRAM might have different meaning, but the intention
is it being accounted in the same way. Most drivers use TTM, which has a standard way
of allocating memory, and a standard way of evicting VRAM.

This makes it very useful for the usecase which I'm looking at, long running compute.
When you have long running jobs, you don't want them to be interrupted because a completely
unrelated process needs some VRAM, and one of the compute jobs buffers are being evicted.

Some hardware does not support mid-thread preemption or page fault recovery, this means that
when memory is evicted, the compute job is terminated.

The full problem statement is in drm-compute.rst in the memory accounting patch.

      * I like Tvrtko's scheduling patchset because it exposes a generic interface
  which makes sense regardless of hardware details and then each driver can
  implement the configured control in whatever way they can. However, even
  for that, there doesn't seem much buy-in from other drivers.

    Yeah, that is correct. But it tries to solve a different part of the problem.

      * This proposal seems narrowly scoped trying to solve a specific problem
  which may not translate to different hardware configurations. Please let
  me know if I got that wrong, but if that's the case, I think a better and
  easier approach might be just being a part of the misc controller. That
  doesn't require much extra code and should be able to provide everything
  necessary for statically limiting specific resources.

    The misc controller is not granular enough. A single computer may have any number of
graphics cards, some of them with multiple regions of vram inside a single card.

For compute and shared hosting you might want to limit the usage of a single memory
region on a single card, and then limit the same limits for the rest too, to prevent
triggering eviction.

The current version doesn't handle eviction correctly, because I was still working
on it and I wanted to post a RFC. As a result, the case where resource limit is hit
will evict the device's entire memory or get stuck in a loop. With some changes, the
next version will not have this bug. This results in a few changes to the core code. [1]

In the next version, I will move all the code for handling the resource limit to
TTM's eviction layer, because otherwise it cannot handle the resource limit correctly.

The effect of moving the code to TTM, is that it will make the code even more generic
for drivers that have vram and use TTM. When using TTM, you only have to describe your
VRAM, update some fields in the TTM manager and (un)register your device with the
cgroup handler on (un)load. It's quite trivial to add vram accounting to amdgpu and
nouveau. [2]

If you want to add a knob for scheduling weight for a process, it makes sense to
also add resource usage as a knob, otherwise the effect of that knob is very
limited. So even for Tvrtko's original proposed usecase, it would make sense.

Cheers,
~Maarten

--------
[1] Compared to this version:
 static inline int drmcg_try_charge(struct drmcgroup_state **drmcs,
+                                  struct drmcgroup_state **limitcs,
                                   struct drmcgroup_device *cgdev,
                                   u32 index, u64 size)

This now returns which cgroup's limit is hit on -EAGAIN.

+bool drmcs_grouped(struct drmcgroup_state *limitcs,
+                  struct drmcgroup_state *testcs);
Tells if testcs is the same as limitcs, or a subgroup of it. This allows us to
skip evicting when it's unneeded. If we want to add a min, it will make sense
to pass the size too, to skip some subcgroups below min.

+void drmcs_put(struct drmcgroup_state *drmcs);
Drops the limitcs ref.
-------------------
[2] With the next version, I can very easily implement the cgroup handling on amdgpu too:
- embed a struct drmcgroup_device inside amdgpu_device.
- In amdgpu_vram_mgr_init, populate the struct drmcgroup_device.regions[0] for vram,
  and set ttm_resource_manager->cg to &adev->drmcgroup_device
- Call drmcg_register_device after, and drmcg_unregister_device after cleaning up vram.

So if anyone wants to limit VRAM on amdgpu or qxl or nouveau (left as exercise for reader)
afterwards, it will work as intended, while the driver doesn't have to be cgroups aware.