>-----Original Message----- >From: dri-devel [mailto:dri-devel-bounces@xxxxxxxxxxxxxxxxxxxxx] On Behalf >Of Jerome Glisse >Sent: Monday, July 21, 2014 7:06 PM >To: Gabbay, Oded >Cc: Lewycky, Andrew; Pinchuk, Evgeny; Daenzer, Michel; linux- >kernel@xxxxxxxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx; linux-mm; >Skidanov, Alexey; Andrew Morton >Subject: Re: [PATCH v2 00/25] AMDKFD kernel driver > >On Tue, Jul 22, 2014 at 12:56:13AM +0300, Oded Gabbay wrote: >> On 21/07/14 22:28, Jerome Glisse wrote: >> > On Mon, Jul 21, 2014 at 10:23:43PM +0300, Oded Gabbay wrote: >> >> On 21/07/14 21:59, Jerome Glisse wrote: >> >>> On Mon, Jul 21, 2014 at 09:36:44PM +0300, Oded Gabbay wrote: >> >>>> On 21/07/14 21:14, Jerome Glisse wrote: >> >>>>> On Mon, Jul 21, 2014 at 08:42:58PM +0300, Oded Gabbay wrote: >> >>>>>> On 21/07/14 18:54, Jerome Glisse wrote: >> >>>>>>> On Mon, Jul 21, 2014 at 05:12:06PM +0300, Oded Gabbay wrote: >> >>>>>>>> On 21/07/14 16:39, Christian König wrote: >> >>>>>>>>> Am 21.07.2014 14:36, schrieb Oded Gabbay: >> >>>>>>>>>> On 20/07/14 20:46, Jerome Glisse wrote: >> >>>>>>>>>>> On Thu, Jul 17, 2014 at 04:57:25PM +0300, Oded Gabbay wrote: >> >>>>>>>>>>>> Forgot to cc mailing list on cover letter. Sorry. >> >>>>>>>>>>>> >> >>>>>>>>>>>> As a continuation to the existing discussion, here is a >> >>>>>>>>>>>> v2 patch series restructured with a cleaner history and >> >>>>>>>>>>>> no totally-different-early-versions of the code. >> >>>>>>>>>>>> >> >>>>>>>>>>>> Instead of 83 patches, there are now a total of 25 >> >>>>>>>>>>>> patches, where 5 of them are modifications to radeon driver >and 18 of them include only amdkfd code. >> >>>>>>>>>>>> There is no code going away or even modified between >patches, only added. >> >>>>>>>>>>>> >> >>>>>>>>>>>> The driver was renamed from radeon_kfd to amdkfd and >> >>>>>>>>>>>> moved to reside under drm/radeon/amdkfd. This move was >> >>>>>>>>>>>> done to emphasize the fact that this driver is an >> >>>>>>>>>>>> AMD-only driver at this point. Having said that, we do >> >>>>>>>>>>>> foresee a generic hsa framework being implemented in the >future and in that case, we will adjust amdkfd to work within that >framework. >> >>>>>>>>>>>> >> >>>>>>>>>>>> As the amdkfd driver should support multiple AMD gfx >> >>>>>>>>>>>> drivers, we want to keep it as a seperate driver from >> >>>>>>>>>>>> radeon. Therefore, the amdkfd code is contained in its >> >>>>>>>>>>>> own folder. The amdkfd folder was put under the radeon >> >>>>>>>>>>>> folder because the only AMD gfx driver in the Linux >> >>>>>>>>>>>> kernel at this point is the radeon driver. Having said >> >>>>>>>>>>>> that, we will probably need to move it (maybe to be directly >under drm) after we integrate with additional AMD gfx drivers. >> >>>>>>>>>>>> >> >>>>>>>>>>>> For people who like to review using git, the v2 patch set is >located at: >> >>>>>>>>>>>> http://cgit.freedesktop.org/~gabbayo/linux/log/?h=kfd-nex >> >>>>>>>>>>>> t-3.17-v2 >> >>>>>>>>>>>> >> >>>>>>>>>>>> Written by Oded Gabbayh <oded.gabbay@xxxxxxx> >> >>>>>>>>>>> >> >>>>>>>>>>> So quick comments before i finish going over all patches. >> >>>>>>>>>>> There is many things that need more documentation >> >>>>>>>>>>> espacialy as of right now there is no userspace i can go look at. >> >>>>>>>>>> So quick comments on some of your questions but first of >> >>>>>>>>>> all, thanks for the time you dedicated to review the code. >> >>>>>>>>>>> >> >>>>>>>>>>> There few show stopper, biggest one is gpu memory pinning >> >>>>>>>>>>> this is a big no, that would need serious arguments for >> >>>>>>>>>>> any hope of convincing me on that side. >> >>>>>>>>>> We only do gpu memory pinning for kernel objects. There are >> >>>>>>>>>> no userspace objects that are pinned on the gpu memory in >> >>>>>>>>>> our driver. If that is the case, is it still a show stopper ? >> >>>>>>>>>> >> >>>>>>>>>> The kernel objects are: >> >>>>>>>>>> - pipelines (4 per device) >> >>>>>>>>>> - mqd per hiq (only 1 per device) >> >>>>>>>>>> - mqd per userspace queue. On KV, we support up to 1K >> >>>>>>>>>> queues per process, for a total of 512K queues. Each mqd is >> >>>>>>>>>> 151 bytes, but the allocation is done in >> >>>>>>>>>> 256 alignment. So total *possible* memory is 128MB >> >>>>>>>>>> - kernel queue (only 1 per device) >> >>>>>>>>>> - fence address for kernel queue >> >>>>>>>>>> - runlists for the CP (1 or 2 per device) >> >>>>>>>>> >> >>>>>>>>> The main questions here are if it's avoid able to pin down >> >>>>>>>>> the memory and if the memory is pinned down at driver load, >> >>>>>>>>> by request from userspace or by anything else. >> >>>>>>>>> >> >>>>>>>>> As far as I can see only the "mqd per userspace queue" might >> >>>>>>>>> be a bit questionable, everything else sounds reasonable. >> >>>>>>>>> >> >>>>>>>>> Christian. >> >>>>>>>> >> >>>>>>>> Most of the pin downs are done on device initialization. >> >>>>>>>> The "mqd per userspace" is done per userspace queue creation. >> >>>>>>>> However, as I said, it has an upper limit of 128MB on KV, and >> >>>>>>>> considering the 2G local memory, I think it is OK. >> >>>>>>>> The runlists are also done on userspace queue >> >>>>>>>> creation/deletion, but we only have 1 or 2 runlists per device, so >it is not that bad. >> >>>>>>> >> >>>>>>> 2G local memory ? You can not assume anything on userside >> >>>>>>> configuration some one might build an hsa computer with 512M >> >>>>>>> and still expect a functioning desktop. >> >>>>>> First of all, I'm only considering Kaveri computer, not "hsa" >computer. >> >>>>>> Second, I would imagine we can build some protection around it, >> >>>>>> like checking total local memory and limit number of queues >> >>>>>> based on some percentage of that total local memory. So, if >> >>>>>> someone will have only 512M, he will be able to open less queues. >> >>>>>> >> >>>>>> >> >>>>>>> >> >>>>>>> I need to go look into what all this mqd is for, what it does >> >>>>>>> and what it is about. But pinning is really bad and this is an >> >>>>>>> issue with userspace command scheduling an issue that >> >>>>>>> obviously AMD fails to take into account in design phase. >> >>>>>> Maybe, but that is the H/W design non-the-less. We can't very >> >>>>>> well change the H/W. >> >>>>> >> >>>>> You can not change the hardware but it is not an excuse to allow >> >>>>> bad design to sneak in software to work around that. So i would >> >>>>> rather penalize bad hardware design and have command submission >> >>>>> in the kernel, until AMD fix its hardware to allow proper scheduling >by the kernel and proper control by the kernel. >> >>>> I'm sorry but I do *not* think this is a bad design. S/W >> >>>> scheduling in the kernel can not, IMO, scale well to 100K queues and >10K processes. >> >>> >> >>> I am not advocating for having kernel decide down to the very last >> >>> details. I am advocating for kernel being able to preempt at any >> >>> time and be able to decrease or increase user queue priority so >> >>> overall kernel is in charge of resources management and it can handle >rogue client in proper fashion. >> >>> >> >>>> >> >>>>> Because really where we want to go is having GPU closer to a CPU >> >>>>> in term of scheduling capacity and once we get there we want the >> >>>>> kernel to always be able to take over and do whatever it wants >behind process back. >> >>>> Who do you refer to when you say "we" ? AFAIK, the hw scheduling >> >>>> direction is where AMD is now and where it is heading in the future. >> >>>> That doesn't preclude the option to allow the kernel to take over >> >>>> and do what he wants. I agree that in KV we have a problem where >> >>>> we can't do a mid-wave preemption, so theoretically, a long >> >>>> running compute kernel can make things messy, but in Carrizo, we >> >>>> will have this ability. Having said that, it will only be through >> >>>> the CP H/W scheduling. So AMD is _not_ going to abandon H/W >> >>>> scheduling. You can dislike it, but this is the situation. >> >>> >> >>> We was for the overall Linux community but maybe i should not >> >>> pretend to talk for anyone interested in having a common standard. >> >>> >> >>> My point is that current hardware do not have approriate hardware >> >>> support for preemption hence, current hardware should use ioctl to >> >>> schedule job and AMD should think a bit more on commiting to a >> >>> design and handwaving any hardware short coming as something that >> >>> can be work around in the software. The pinning thing is broken by >> >>> design, only way to work around it is through kernel cmd queue >scheduling that's a fact. >> >> >> >>> >> >>> Once hardware support proper preemption and allows to move >> >>> around/evict buffer use on behalf of userspace command queue then >> >>> we can allow userspace scheduling but until then my personnal >> >>> opinion is that it should not be allowed and that people will have >> >>> to pay the ioctl price which i proved to be small, because really >> >>> if you 100K queue each with one job, i would not expect that all >> >>> those 100K job will complete in less time than it takes to execute >> >>> an ioctl ie by even if you do not have the ioctl delay what ever you >schedule will have to wait on previously submited jobs. >> >> >> >> But Jerome, the core problem still remains in effect, even with >> >> your suggestion. If an application, either via userspace queue or >> >> via ioctl, submits a long-running kernel, than the CPU in general >> >> can't stop the GPU from running it. And if that kernel does >> >> while(1); than that's it, game's over, and no matter how you >> >> submitted the work. So I don't really see the big advantage in your >> >> proposal. Only in CZ we can stop this wave (by CP H/W scheduling >> >> only). What are you saying is basically I won't allow people to use >> >> compute on Linux KV system because it _may_ get the system stuck. >> >> >> >> So even if I really wanted to, and I may agree with you >> >> theoretically on that, I can't fulfill your desire to make the >> >> "kernel being able to preempt at any time and be able to decrease >> >> or increase user queue priority so overall kernel is in charge of >> >> resources management and it can handle rogue client in proper >> >> fashion". Not in KV, and I guess not in CZ as well. >> >> >> >> Oded >> > >> > I do understand that but using kernel ioctl provide the same kind of >> > control as we have now ie we can bind/unbind buffer on per command >> > buffer submission basis, just like with current graphic or compute stuff. >> > >> > Yes current graphic and compute stuff can launch a while and never >> > return back and yes currently we have nothing against that but we >> > should and solution would be simple just kill the gpu thread. >> > >> OK, so in that case, the kernel can simple unmap all the queues by >> simply writing an UNMAP_QUEUES packet to the HIQ. Even if the queues >> are userspace, they will not be mapped to the internal CP scheduler. >> Does that satisfy the kernel control level you want ? > >This raises questions, what does happen to currently running thread when >you unmap queue ? Do they keep running until done ? If not than this means >this will break user application and those is not an acceptable solution. > >Otherwise, infrastructre inside radeon would be needed to force this queue >unmap on bo_pin failure so gfx pinning can be retry. > >Also how do you cope with doorbell exhaustion ? Do you just plan to error >out ? >In which case this is another DDOS vector but only affecting the gpu. > >And there is many other questions that need answer, like my kernel memory >map question because as of right now i assume that kfd allow any thread on >the gpu to access any kernel memory. > >Otherthings are how ill formated packet are handled by the hardware ? I do >not see any mecanism to deal with SIGBUS or SIGFAULT. > > >Also it is a worrisome prospect of seeing resource management completely >ignore for future AMD hardware. Kernel exist for a reason ! Kernel main >purpose is to provide resource management if AMD fails to understand that, >this is not looking good on long term and i expect none of the HSA >technology will get momentum and i would certainly advocate against any >use of it inside product i work on. Hi Jerome; I was following along until the above comment. It seems to be the exact opposite of what Oded has been saying, which is that future AMD hardware *does* have more capabilities for resource management and that we do have some capabilities today. Can you help me understand what the comment it was based on ? Thanks, JB > >Cheers, >Jérôme > >> >> Oded >> >> >> >>> >> >>>>> >> >>>>>>>>> >> >>>>>>>>>>> >> >>>>>>>>>>> It might be better to add a drivers/gpu/drm/amd directory >> >>>>>>>>>>> and add common stuff there. >> >>>>>>>>>>> >> >>>>>>>>>>> Given that this is not intended to be final HSA api AFAICT >> >>>>>>>>>>> then i would say this far better to avoid the whole kfd module >and add ioctl to radeon. >> >>>>>>>>>>> This would avoid crazy communication btw radeon and kfd. >> >>>>>>>>>>> >> >>>>>>>>>>> The whole aperture business needs some serious >> >>>>>>>>>>> explanation. Especialy as you want to use userspace >> >>>>>>>>>>> address there is nothing to prevent userspace program from >> >>>>>>>>>>> allocating things at address you reserve for lds, scratch, >> >>>>>>>>>>> ... only sane way would be to move those lds, scratch inside >the virtual address reserved for kernel (see kernel memory map). >> >>>>>>>>>>> >> >>>>>>>>>>> The whole business of locking performance counter for >> >>>>>>>>>>> exclusive per process access is a big NO. Which leads me >> >>>>>>>>>>> to the questionable usefullness of user space command ring. >> >>>>>>>>>> That's like saying: "Which leads me to the questionable >> >>>>>>>>>> usefulness of HSA". I find it analogous to a situation >> >>>>>>>>>> where a network maintainer nacking a driver for a network >> >>>>>>>>>> card, which is slower than a different network card. >> >>>>>>>>>> Doesn't seem reasonable this situation is would happen. He >> >>>>>>>>>> would still put both the drivers in the kernel because people >want to use the H/W and its features. So, I don't think this is a valid reason to >NACK the driver. >> >>>>>>> >> >>>>>>> Let me rephrase, drop the the performance counter ioctl and >> >>>>>>> modulo memory pinning i see no objection. In other word, i am >> >>>>>>> not NACKING whole patchset i am NACKING the performance ioctl. >> >>>>>>> >> >>>>>>> Again this is another argument for round trip to the kernel. >> >>>>>>> As inside kernel you could properly do exclusive gpu counter >> >>>>>>> access accross single user cmd buffer execution. >> >>>>>>> >> >>>>>>>>>> >> >>>>>>>>>>> I only see issues with that. First and foremost i would >> >>>>>>>>>>> need to see solid figures that kernel ioctl or syscall has >> >>>>>>>>>>> a higher an overhead that is measurable in any meaning >> >>>>>>>>>>> full way against a simple function call. I know the >> >>>>>>>>>>> userspace command ring is a big marketing features that >> >>>>>>>>>>> please ignorant userspace programmer. But really this only >brings issues and for absolutely not upside afaict. >> >>>>>>>>>> Really ? You think that doing a context switch to kernel >> >>>>>>>>>> space, with all its overhead, is _not_ more expansive than >> >>>>>>>>>> just calling a function in userspace which only puts a buffer on a >ring and writes a doorbell ? >> >>>>>>> >> >>>>>>> I am saying the overhead is not that big and it probably will >> >>>>>>> not matter in most usecase. For instance i did wrote the most >> >>>>>>> useless kernel module that add two number through an ioctl >> >>>>>>> (http://people.freedesktop.org/~glisse/adder.tar) and it takes >> >>>>>>> ~0.35microseconds with ioctl while function is ~0.025microseconds >so ioctl is 13 times slower. >> >>>>>>> >> >>>>>>> Now if there is enough data that shows that a significant >> >>>>>>> percentage of jobs submited to the GPU will take less that >> >>>>>>> 0.35microsecond then yes userspace scheduling does make sense. >> >>>>>>> But so far all we have is handwaving with no data to support any >facts. >> >>>>>>> >> >>>>>>> >> >>>>>>> Now if we want to schedule from userspace than you will need >> >>>>>>> to do something about the pinning, something that gives >> >>>>>>> control to kernel so that kernel can unpin when it wants and >> >>>>>>> move object when it wants no matter what userspace is doing. >> >>>>>>> >> >>>>>>>>>>> >> >>> >> >>> -- >> >>> To unsubscribe, send a message with 'unsubscribe linux-mm' in the >> >>> body to majordomo@xxxxxxxxx. For more info on Linux MM, >> >>> see: http://www.linux-mm.org/ . >> >>> Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a> >> >>> >> >> >> >> _______________________________________________ >> dri-devel mailing list >> dri-devel@xxxxxxxxxxxxxxxxxxxxx >> http://lists.freedesktop.org/mailman/listinfo/dri-devel >_______________________________________________ >dri-devel mailing list >dri-devel@xxxxxxxxxxxxxxxxxxxxx >http://lists.freedesktop.org/mailman/listinfo/dri-devel -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href