Re: [RFC PATCH v3 09/11] drm, cgroup: Add per cgroup bw measure and control

Daniel Vetter <daniel@xxxxxxxx> · Thu, 27 Jun 2019 08:11:53 +0200

On Thu, Jun 27, 2019 at 12:34:05AM -0400, Kenny Ho wrote:
> On Wed, Jun 26, 2019 at 12:25 PM Daniel Vetter <daniel@xxxxxxxx> wrote:
> >
> > On Wed, Jun 26, 2019 at 11:05:20AM -0400, Kenny Ho wrote:
> > > The bandwidth is measured by keeping track of the amount of bytes moved
> > > by ttm within a time period.  We defined two type of bandwidth: burst
> > > and average.  Average bandwidth is calculated by dividing the total
> > > amount of bytes moved within a cgroup by the lifetime of the cgroup.
> > > Burst bandwidth is similar except that the byte and time measurement is
> > > reset after a user configurable period.
> >
> > So I'm not too sure exposing this is a great idea, at least depending upon
> > what you're trying to do with it. There's a few concerns here:
> >
> > - I think bo movement stats might be useful, but they're not telling you
> >   everything. Applications can also copy data themselves and put buffers
> >   where they want them, especially with more explicit apis like vk.
> >
> > - which kind of moves are we talking about here? Eviction related bo moves
> >   seem not counted here, and if you have lots of gpus with funny
> >   interconnects you might also get other kinds of moves, not just system
> >   ram <-> vram.
> Eviction move is counted but I think I placed the delay in the wrong
> place (the tracking of byte moved is in previous patch in
> ttm_bo_handle_move_mem, which is common to all move as far as I can
> tell.)
> 
> > - What happens if we slow down, but someone else needs to evict our
> >   buffers/move them (ttm is atm not great at this, but Christian König is
> >   working on patches). I think there's lots of priority inversion
> >   potential here.
> >
> > - If the goal is to avoid thrashing the interconnects, then this isn't the
> >   full picture by far - apps can use copy engines and explicit placement,
> >   again that's how vulkan at least is supposed to work.
> >
> > I guess these all boil down to: What do you want to achieve here? The
> > commit message doesn't explain the intended use-case of this.
> Thrashing prevention is the intent.  I am not familiar with Vulkan so
> I will have to get back to you on that.  I don't know how those
> explicit placement translate into the kernel.  At this stage, I think
> it's still worth while to have this as a resource even if some
> applications bypass the kernel.  I certainly welcome more feedback on
> this topic.

The trouble with thrashing prevention like this is that either you don't
limit all the bo moves, and then you don't count everything. Or you limit
them all, and then you create priority inversions in the ttm eviction
handler, essentially rate-limiting everyone who's thrashing. Or at least
you run the risk of that happening.

Not what you want I think :-)

I also think that the blkcg people are still trying to figure out how to
make this work fully reliable (it's the same problem really), and a
critical piece is knowing/estimating the overall bandwidth. Without that
the admin can't really do something meaningful. The problem with that is
you don't know, not just because of vk, but any userspace that has buffers
in the pci gart uses the same interconnect just as part of its rendering
job. So if your goal is to guaranteed some minimal amount of bo move
bandwidth, then this wont work, because you have no idea how much bandwith
there even is for bo moves.

Getting thrashing limited is very hard.

I feel like a better approach would by to add a cgroup for the various
engines on the gpu, and then also account all the sdma (or whatever the
name of the amd copy engines is again) usage by ttm_bo moves to the right
cgroup. I think that's a more meaningful limitation. For direct thrashing
control I think there's both not enough information available in the
kernel (you'd need some performance counters to watch how much bandwidth
userspace batches/CS are wasting), and I don't think the ttm eviction
logic is ready to step over all the priority inversion issues this will
bring up. Managing sdma usage otoh will be a lot more straightforward (but
still has all the priority inversion problems, but in the scheduler that
might be easier to fix perhaps with the explicit dependency graph - in the
i915 scheduler we already have priority boosting afaiui).
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch