Re: [PATCH v2 00/25] AMDKFD kernel driver

Christian König <deathsimple@xxxxxxxxxxx> · Wed, 23 Jul 2014 09:04:24 +0200

On 23.07.2014 08:50, Oded Gabbay wrote:
On 22/07/14 14:15, Daniel Vetter wrote:
On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote:
On 22/07/14 12:21, Daniel Vetter wrote:
On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay <oded.gabbay@xxxxxxx> wrote:
Exactly, just prevent userspace from submitting more. And if you have
misbehaving userspace that submits too much, reset the gpu and tell it
that you're sorry but won't schedule any more work.

I'm not sure how you intend to know if a userspace misbehaves or not.
Can you elaborate?

Well that's mostly policy, currently in i915 we only have a check for
hangs, and if userspace hangs a bit too often then we stop it. I guess
you can do that with the queue unmapping you've described in reply to
Jerome's mail.
-Daniel

What do you mean by hang? Like the tdr mechanism in Windows (checks if a
gpu job takes more than 2 seconds, I think, and if so, terminates the
job).

Essentially yes. But we also have some hw features to kill jobs quicker,
e.g. for media workloads.
-Daniel

Yeah, so this is what I'm talking about when I say that you and Jerome 
come from a graphics POV and amdkfd comes from a compute POV, no 
offense intended.

For compute jobs, we simply can't use this logic to terminate jobs. 
Graphics are mostly Real-Time while compute jobs can take from a few 
ms to a few hours!!! And I'm not talking about an entire application 
runtime but about a single submission of jobs by the userspace app. We 
have tests with jobs that take between 20 and 30 minutes to complete. In 
theory, we can even imagine a compute job which takes 1 or 2 days (on 
larger APUs).

Now, I understand the question of how we prevent the compute job 
from monopolizing the GPU, and internally here we have some ideas that 
we will probably share in the next few days, but my point is that I 
don't think we can terminate a compute job just because it has been 
running for more than x seconds. That would be like terminating a CPU 
process because it runs for more than x seconds.

Yeah, that's why one of the first things I did was to make the timeout 
configurable in the radeon module.

But it doesn't necessarily need to be a timeout; we should also kill a 
running job submission if the CPU process associated with the job is killed.
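
To make the two criteria above concrete, here is a minimal sketch of such a
policy check in plain C. It is not code from radeon or amdkfd; every name in
it (job_policy, job_state, job_check, the verdict values) is hypothetical,
and it only assumes that a timeout of 0 means "no time limit" and that the
driver can tell whether the owning CPU process is still alive.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical policy knob: a timeout of 0 disables the time-based check. */
struct job_policy {
    uint64_t timeout_ms;
};

/* Hypothetical snapshot of a single job submission. */
struct job_state {
    uint64_t runtime_ms;  /* how long this submission has been running */
    bool owner_alive;     /* is the submitting CPU process still alive? */
};

enum job_verdict { JOB_KEEP, JOB_KILL_TIMEOUT, JOB_KILL_ORPHANED };

/* Decide what to do with one running job submission. */
enum job_verdict job_check(const struct job_policy *p,
                           const struct job_state *j)
{
    /* A job whose owner is gone is killed regardless of its runtime. */
    if (!j->owner_alive)
        return JOB_KILL_ORPHANED;

    /* The time limit applies only if the administrator configured one. */
    if (p->timeout_ms != 0 && j->runtime_ms > p->timeout_ms)
        return JOB_KILL_TIMEOUT;

    return JOB_KEEP;
}
```

With the timeout set to 0, a 30 minute compute job is left alone as long as
its owner lives, while a graphics-style setup could still configure a limit
of a few seconds.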

I think this is a *very* important discussion (detecting a misbehaved 
compute process) and I would like to continue it, but I don't think 
moving the job submission from userspace control to kernel control 
will solve this core problem.

We need to get this topic solved, otherwise the driver won't make it 
upstream. Allowing userspace to monopolize resources, be it memory, 
CPU or GPU time, or special things like counters etc., is a strict no-go 
for a kernel module.

I agree that moving the job submission from userspace to kernel wouldn't 
solve this problem. As Daniel and I have now pointed out multiple times, 
it's rather easily possible to prevent further job submissions from 
userspace, in the worst case by unmapping the doorbell page.

Moving it to an IOCTL would just make it a bit less complicated.
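
As a rough illustration of the doorbell idea, here is a small userspace
model in C. It is only a sketch under assumptions: user_queue,
queue_submit and queue_revoke_doorbell are hypothetical names, not amdkfd
interfaces, and a revoked mapping is modeled as a NULL pointer with a
failed call, whereas on real hardware the write to an unmapped page would
fault.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-process queue; doorbell is NULL once it is unmapped. */
struct user_queue {
    volatile uint32_t *doorbell;  /* stand-in for the mapped doorbell page */
    uint32_t write_ptr;           /* number of submissions made so far */
};

/* Kernel side of the model: take the doorbell away from the process. */
void queue_revoke_doorbell(struct user_queue *q)
{
    q->doorbell = NULL;
}

/* Userspace side of the model: try to submit one more job. */
bool queue_submit(struct user_queue *q)
{
    /* Without a doorbell there is no way to tell the hardware about work. */
    if (q->doorbell == NULL)
        return false;

    q->write_ptr++;
    *q->doorbell = q->write_ptr;  /* "ring" the doorbell */
    return true;
}
```

Jobs that are already running are not touched by this; it only stops new
submissions, which is the part Daniel and I were talking about.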

Christian.

    Oded
