>-----Original Message-----
>From: Christian König [mailto:deathsimple@xxxxxxxxxxx]
>Sent: Wednesday, July 23, 2014 3:04 AM
>To: Gabbay, Oded; Jerome Glisse; David Airlie; Alex Deucher; Andrew
>Morton; Bridgman, John; Joerg Roedel; Lewycky, Andrew; Daenzer, Michel;
>Goz, Ben; Skidanov, Alexey; linux-kernel@xxxxxxxxxxxxxxx;
>dri-devel@xxxxxxxxxxxxxxxxxxxxx; linux-mm; Sellek, Tom
>Subject: Re: [PATCH v2 00/25] AMDKFD kernel driver
>
>On 23.07.2014 08:50, Oded Gabbay wrote:
>> On 22/07/14 14:15, Daniel Vetter wrote:
>>> On Tue, Jul 22, 2014 at 12:52:43PM +0300, Oded Gabbay wrote:
>>>> On 22/07/14 12:21, Daniel Vetter wrote:
>>>>> On Tue, Jul 22, 2014 at 10:19 AM, Oded Gabbay <oded.gabbay@xxxxxxx>
>>>>> wrote:
>>>>>>> Exactly, just prevent userspace from submitting more. And if you
>>>>>>> have misbehaving userspace that submits too much, reset the gpu
>>>>>>> and tell it that you're sorry but won't schedule any more work.
>>>>>>
>>>>>> I'm not sure how you intend to know if a userspace misbehaves or
>>>>>> not. Can you elaborate?
>>>>>
>>>>> Well that's mostly policy, currently in i915 we only have a check
>>>>> for hangs, and if userspace hangs a bit too often then we stop it.
>>>>> I guess you can do that with the queue unmapping you've described in
>>>>> reply to Jerome's mail.
>>>>> -Daniel
>>>>>
>>>> What do you mean by hang? Like the tdr mechanism in Windows (checks
>>>> if a gpu job takes more than 2 seconds, I think, and if so,
>>>> terminates the job).
>>>
>>> Essentially yes. But we also have some hw features to kill jobs
>>> quicker, e.g. for media workloads.
>>> -Daniel
>>>
>>
>> Yeah, so this is what I'm talking about when I say that you and Jerome
>> come from a graphics POV and amdkfd comes from a compute POV, no
>> offense intended.
>>
>> For compute jobs, we simply can't use this logic to terminate jobs.
>> Graphics are mostly Real-Time while compute jobs can take from a few
>> ms to a few hours!!!
>> And I'm not talking about an entire application
>> runtime but about a single submission of jobs by the userspace app. We
>> have tests with jobs that take between 20-30 minutes to complete. In
>> theory, we can even imagine a compute job which takes 1 or 2 days (on
>> larger APUs).
>>
>> Now, I understand the question of how do we prevent the compute job
>> from monopolizing the GPU, and internally here we have some ideas that
>> we will probably share in the next few days, but my point is that I
>> don't think we can terminate a compute job because it is running for
>> more than x seconds. It would be like terminating a CPU process
>> because it runs for more than x seconds.
>
>Yeah that's why one of the first things I did was make the timeout
>configurable in the radeon module.
>
>But it doesn't necessarily need to be a timeout; we should also kill a
>running job submission if the CPU process associated with the job is killed.
>
>> I think this is a *very* important discussion (detecting a misbehaved
>> compute process) and I would like to continue it, but I don't think
>> moving the job submission from userspace control to kernel control
>> will solve this core problem.
>
>We need to get this topic solved, otherwise the driver won't make it
>upstream. Allowing userspace to monopolize resources, whether memory,
>CPU or GPU time, or special things like counters etc., is a strict no-go
>for a kernel module.
>
>I agree that moving the job submission from userspace to kernel wouldn't
>solve this problem. As Daniel and I have now pointed out multiple times,
>it's rather easily possible to prevent further job submissions from
>userspace, in the worst case by unmapping the doorbell page.
>
>Moving it to an IOCTL would just make it a bit less complicated.

Hi Christian;

HSA uses usermode queues so that programs running on GPU can dispatch work to themselves or to other GPUs with a consistent dispatch mechanism for CPU and GPU code.
We could potentially use s_msg and trap every GPU dispatch back through CPU code but that gets slow and ugly very quickly.

>
>Christian.
>
>>
>> Oded