Re: [PATCH v2 00/25] AMDKFD kernel driver

Jerome Glisse <j.glisse@xxxxxxxxx> · Wed, 23 Jul 2014 16:25:54 -0400

On Wed, Jul 23, 2014 at 03:49:57PM -0400, Alex Deucher wrote:
> On Wed, Jul 23, 2014 at 10:56 AM, Jerome Glisse <j.glisse@xxxxxxxxx> wrote:
> > On Wed, Jul 23, 2014 at 09:04:24AM +0200, Christian König wrote:
> >> Am 23.07.2014 08:50, schrieb Oded Gabbay:
> >> >On 22/07/14 14:15, Daniel Vetter wrote:
> >> >>On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote:
> >> >>>On 22/07/14 12:21, Daniel Vetter wrote:
> >> >>>>On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay <oded.gabbay@xxxxxxx>
> >> >>>>wrote:
> >> >>>>>>Exactly, just prevent userspace from submitting more. And if you
> >> >>>>>>have
> >> >>>>>>misbehaving userspace that submits too much, reset the gpu and
> >> >>>>>>tell it
> >> >>>>>>that you're sorry but won't schedule any more work.
> >> >>>>>
> >> >>>>>I'm not sure how you intend to know if a userspace misbehaves or
> >> >>>>>not. Can
> >> >>>>>you elaborate ?
> >> >>>>
> >> >>>>Well that's mostly policy, currently in i915 we only have a check for
> >> >>>>hangs, and if userspace hangs a bit too often then we stop it. I guess
> >> >>>>you can do that with the queue unmapping you've describe in reply to
> >> >>>>Jerome's mail.
> >> >>>>-Daniel
> >> >>>>
> >> >>>What do you mean by hang ? Like the tdr mechanism in Windows (checks
> >> >>>if a
> >> >>>gpu job takes more than 2 seconds, I think, and if so, terminates the
> >> >>>job).
> >> >>
> >> >>Essentially yes. But we also have some hw features to kill jobs quicker,
> >> >>e.g. for media workloads.
> >> >>-Daniel
> >> >>
> >> >
> >> >Yeah, so this is what I'm talking about when I say that you and Jerome
> >> >come from a graphics POV and amdkfd come from a compute POV, no offense
> >> >intended.
> >> >
> >> >For compute jobs, we simply can't use this logic to terminate jobs.
> >> >Graphics are mostly Real-Time while compute jobs can take from a few ms to
> >> >a few hours!!! And I'm not talking about an entire application runtime but
> >> >on a single submission of jobs by the userspace app. We have tests with
> >> >jobs that take between 20-30 minutes to complete. In theory, we can even
> >> >imagine a compute job which takes 1 or 2 days (on larger APUs).
> >> >
> >> >Now, I understand the question of how do we prevent the compute job from
> >> >monopolizing the GPU, and internally here we have some ideas that we will
> >> >probably share in the next few days, but my point is that I don't think we
> >> >can terminate a compute job because it is running for more than x seconds.
> >> >It is like you would terminate a CPU process which runs more than x
> >> >seconds.
> >>
> >> Yeah, that's why one of the first things I did was making the timeout
> >> configurable in the radeon module.
> >>
> >> But it doesn't necessarily need to be a timeout; we should also kill a running
> >> job submission if the CPU process associated with the job is killed.
> >>
> >> >I think this is a *very* important discussion (detecting a misbehaved
> >> >compute process) and I would like to continue it, but I don't think moving
> >> >the job submission from userspace control to kernel control will solve
> >> >this core problem.
> >>
> >> We need to get this topic solved, otherwise the driver won't make it
> >> upstream. Allowing userspace to monopolize resources, whether memory, CPU or
> >> GPU time, or special things like counters etc., is a strict no-go for a
> >> kernel module.
> >>
> >> I agree that moving the job submission from userspace to kernel wouldn't
> >> solve this problem. As Daniel and I have pointed out multiple times now, it's
> >> rather easy to prevent further job submissions from userspace, in
> >> the worst case by unmapping the doorbell page.
> >>
> >> Moving it to an IOCTL would just make it a bit less complicated.
> >>
> >
> > It is not only complexity. My main concern is not really the amount of memory
> > pinned (well, it would be if it was vram, which by the way means you need to
> > remove the api that allows allocating vram, just so that it clearly shows that
> > vram is not allowed).
> >
> > The issue is GPU address space fragmentation: a new process's hsa queue might
> > be allocated in the middle of gtt space and stay there for so long that it will
> > forbid any big buffer from being bound to gtt. Though with virtual address space
> > for graphics this is less of an issue and only the kernel suffers, it might
> > still block the kernel from evicting some VRAM because it cannot bind a big
> > enough system buffer to GTT, since some GTT space is taken by some HSA queue.
> >
> > To mitigate this, at the very least you need to implement special memory
> > allocation inside ttm and radeon to force these per-queue buffers to be
> > allocated, for instance, from the top of GTT space. Like reserve the top 8M of
> > GTT and have it grow/shrink depending on the number of queues.
> 
> This same sort of thing can already happen with gfx, although it's
> less likely since the workloads are usually shorter.  That said, we
> can issue compute jobs right today with the current CS ioctl and we
> may end up with a buffer pinned in an inopportune spot.

I thought compute was using virtual address space (well on > cayman at least).

> I'm not sure
> reserving a static pool at init really helps that much.  If you aren't
> using any HSA apps, it just wastes gtt space.  So you have a trade
> off: waste memory for a possibly unused MQD descriptor pool or
> allocate MQD descriptors on the fly, but possibly end up with a long
> running one stuck in a bad location.  Additionally, we already have a
> ttm flag for whether we want to allocate from the top or bottom of the
> pool.  We use it today for gfx depending on the buffer (e.g., buffers
> smaller than 512k are allocated from the bottom and buffers larger
> than 512k are allocated from the top).  So we can't really re-size a
> static buffer easily as there may already be other buffers pinned up
> there.

Again, IIRC only the kernel uses the GTT space here; everything else (userspace)
uses virtual address space. Or am I forgetting something?

My point was not so much for it to be static, but to enforce allocating it from
one end of the address space and to have it shrink/grow depending on usage,
forcing anything else out of that range.
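
To make the grow/shrink idea concrete, here is a rough sketch in plain C. It is
a toy model only, not ttm or radeon code: struct gtt_model, queue_slot_alloc(),
queue_slot_free() and general_alloc() are hypothetical names, the 512M aperture
and 4K per-queue footprint are assumed numbers, and general buffers are
simplified to a single range growing up from offset 0. Queue slots are always
taken as close to the top as possible, so the queue region only extends as far
down as the lowest slot in use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: every name and size below is an assumption. */
#define GTT_SIZE   (512ull << 20)  /* assumed GTT aperture size */
#define QUEUE_SLOT 4096ull         /* assumed per-queue footprint */
#define QUEUE_MAX  2048u           /* 2048 * 4K = 8M ceiling */

struct gtt_model {
	uint64_t general_end;        /* general buffers occupy [0, general_end) */
	bool slot_used[QUEUE_MAX];   /* slot 0 is the topmost slot */
};

static uint64_t slot_offset(unsigned int i)
{
	return GTT_SIZE - (uint64_t)(i + 1) * QUEUE_SLOT;
}

/* Lowest address currently claimed by the queue region. */
static uint64_t queue_region_start(const struct gtt_model *g)
{
	for (unsigned int i = QUEUE_MAX; i-- > 0;)
		if (g->slot_used[i])
			return slot_offset(i);
	return GTT_SIZE;             /* no queues: region is empty */
}

/* Take the free slot closest to the top so the region stays compact. */
static bool queue_slot_alloc(struct gtt_model *g, uint64_t *offset)
{
	for (unsigned int i = 0; i < QUEUE_MAX; i++) {
		if (g->slot_used[i])
			continue;
		if (slot_offset(i) < g->general_end)
			return false;    /* a real driver would have to evict here */
		g->slot_used[i] = true;
		*offset = slot_offset(i);
		return true;
	}
	return false;                /* hit the 8M ceiling */
}

static void queue_slot_free(struct gtt_model *g, uint64_t offset)
{
	unsigned int i = (unsigned int)((GTT_SIZE - offset) / QUEUE_SLOT) - 1;

	assert(i < QUEUE_MAX && g->slot_used[i]);
	g->slot_used[i] = false;     /* region shrinks once the lowest slots free up */
}

/* General buffers may use everything below the current queue region. */
static bool general_alloc(struct gtt_model *g, uint64_t size)
{
	if (g->general_end + size > queue_region_start(g))
		return false;
	g->general_end += size;
	return true;
}
```

With no HSA queues nothing is reserved, and a long-lived queue can only ever sit
in the top 8M, never in the middle of the space that big buffers need.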

On GPUs with VM, the only thing left using the "global" GTT is the kernel, which
uses it for rings and for moving buffers around. I would assume that pinning ring
buffers at the beginning of the address space, no matter what their size is, would
be a good idea, since those will not cause fragmentation anyway, i.e. their
lifetime is the lifetime of the driver.

My point is that all the HSA queue buffers can have a lifetime far longer than
anything we have now. Right now we can bind/unbind any buffer between CS
submissions, modulo OpenCL tasks.

> 
> If we add sysfs controls to limit the number of hsa processes and
> queues per process, you could use this to dynamically limit the max
> amount of gtt memory that would be in use for MQD descriptors.

No, this cannot be set dynamically. Once a process has created its queue
it has it, and I see no channel to tell userspace: "sorry buddy but no
more room for you".
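
A small sketch of what I mean, again as a toy model in plain C and not amdkfd
code (struct hsa_limits, queue_create(), set_max_queues() and the choice of
-ENOSPC are all made up for illustration). A limit like this can only be
enforced when a queue is created; lowering it afterwards leaves the existing
queues, and the gtt they pin, exactly where they are.

```c
#include <errno.h>

/* Toy model only: names and error code are assumptions. */
struct hsa_limits {
	unsigned int max_queues;  /* value of the hypothetical sysfs knob */
	unsigned int nr_queues;   /* queues currently alive */
};

/* The only point where the limit can be enforced. */
static int queue_create(struct hsa_limits *l)
{
	if (l->nr_queues >= l->max_queues)
		return -ENOSPC;
	l->nr_queues++;
	return 0;
}

static void queue_destroy(struct hsa_limits *l)
{
	if (l->nr_queues > 0)
		l->nr_queues--;
}

/* Writing the knob changes future admissions, nothing else. */
static void set_max_queues(struct hsa_limits *l, unsigned int new_max)
{
	l->max_queues = new_max;  /* existing queues are not touched */
}

/* Queues still alive beyond the current limit. */
static unsigned int queues_over_limit(const struct hsa_limits *l)
{
	return l->nr_queues > l->max_queues ?
	       l->nr_queues - l->max_queues : 0;
}
```

After set_max_queues() drops the limit below nr_queues, queues_over_limit() stays
non-zero until userspace destroys queues on its own.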

> 
> Alex
