Hi Russell, et al.,

First, let me say I'm sorry that I have been less responsive than some of
you would have hoped. I'm trying to get better at that, but juggling a
large pile of components, none of which are currently mainline, isn't an
easy task, especially with downstream priorities not quite lining up with
the mainlining efforts. I'm trying to put better time management in
place, with slots reserved for the mainline work.

To make this a bit easier for me, I would like to ask you to please keep
the relevant discussions on e-mail. Following another communication
channel like IRC just makes things harder for me. I'm prepared to take
criticism on the public list, no need to keep things hidden. ;)

On Thursday, 28.05.2015 at 00:03 +0100, Russell King - ARM Linux wrote:
> On Wed, May 27, 2015 at 02:49:17PM +0200, Lucas Stach wrote:
> > Hi Alexander,
> >
> > On Wednesday, 27.05.2015 at 14:45 +0200, Alexander Holler wrote:
> > > Hello,
> > >
> > > I've just built and booted the Etnaviv driver as a module with
> > > kernel 4.0.4.
> > >
> > > When I unload the driver with rmmod, an oops happens:
> >
> > Thanks for the report.
> >
> > I'm currently working on the patch stack to get it into shape for
> > another submission. I'll take a look at this problem.
>
> Lucas,
>
> We definitely need to talk about this... please can you take a look at
> my patch stack which I mentioned in my reply to Alexander. I know that
> you already based your patch set off one of my out-dated patch stacks,
> but unless you grab a more recent one, you're going to be solving a lot
> of bugs that I've fixed.

Right, thanks for the pointer. I took a look at this and it doesn't seem
to clash with the things I have done so far, so it should be pretty easy
for me to resync your changes into a common tree.

> Also, I've heard via IRC that you've updated my DDX - which is something
> I've also done (I think I did mention that I would be doing this... and
> I also have a bunch of work on the Xrender backend.) I suspect that
> makes your work there redundant.

I've minimally beaten it into shape to work on top of the new kernel
driver and fixed some XV acceleration bugs that showed up when used with
GStreamer. I'm also looking at accelerating rotated blits to speed up the
rotated display case. I consider none of this stable and tested enough to
push out right now. I'll keep watching the things you do to the driver
and ping you if I have anything worthwhile to integrate.

> There are at least two more issues that need to be discussed, concerning
> the command buffers and their DMA-coherent nature, which makes the
> kernel command parser expensive. In my perf measurements, it's right at
> the top of the list as the most expensive function.
>
> I've made some improvements to the parser which reduce its perf figure
> a little, but it's still up there as the number one hot function, in
> spite of the code being as tight as the compiler can manage.
>
> This is probably because we're reading from uncached memory: the CPU
> can't speculatively prefetch any of the data from memory into the cache.

Yes, I've already seen this coming up in the traces. I was thinking about
exposing cached memory to userspace for the command streams and then
copying the commands into a write-combined buffer while validating them.
This isn't far from the idea you describe below, and I think it should
let us optimize the CPU access for validation and buffer reloc patching,
while avoiding the need for additional cache synchronization and plugging
the submit-then-change attack vector.
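To illustrate, this is roughly the flow I have in mind. It's just a
sketch to sort my own thoughts; the etna_* names, the struct layout and
the size limit are all made up and not actual driver code, and none of
this has seen a compiler:

#include <linux/slab.h>
#include <linux/string.h>
#include <linux/types.h>
#include <linux/uaccess.h>

/* All names below are invented for illustration only. */
struct etna_gpu {
	void *cmdbuf_wc;	/* write-combined buffer the FE executes from */
};

int etna_validate_and_patch(struct etna_gpu *gpu, u32 *stream, size_t size);
int etna_kick_fe(struct etna_gpu *gpu, size_t size);

#define ETNA_MAX_STREAM_SIZE	(64 * 1024)	/* arbitrary limit */

static int etna_submit_stream(struct etna_gpu *gpu,
			      const void __user *user_stream, size_t size)
{
	u32 *staging;
	int ret;

	if (!size || size > ETNA_MAX_STREAM_SIZE)
		return -EINVAL;

	/* Copy into cacheable kernel memory first, so the parser gets
	 * fast, prefetchable reads instead of uncached ones. */
	staging = kmalloc(size, GFP_KERNEL);
	if (!staging)
		return -ENOMEM;

	if (copy_from_user(staging, user_stream, size)) {
		ret = -EFAULT;
		goto out;
	}

	/* Validate and patch relocs in the cached copy. Userspace never
	 * touches the memory the GPU actually executes from, so the
	 * stream can't be changed after validation. */
	ret = etna_validate_and_patch(gpu, staging, size);
	if (ret)
		goto out;

	/* Only now transfer the validated stream into the WC buffer and
	 * kick the front end. */
	memcpy(gpu->cmdbuf_wc, staging, size);
	ret = etna_kick_fe(gpu, size);
out:
	kfree(staging);
	return ret;
}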
> It's _probably_ (I haven't benchmarked it) going to be faster to
> copy_from_user() the command buffers into the kernel, either directly
> into DMA coherent memory or into cacheable memory, and then run them
> through the command parser, followed by either a copy to DMA coherent
> memory or using the DMA streaming API to push the cache lines out.
>
> This has other advantages: by not directly exposing the memory from
> which the GPU executes its command stream to userspace, we prevent
> submit-then-change "attacks" bypassing the kernel command parser.
>
> Another issue is that we incur quite a heavy overhead if we allocate an
> etnadrm buffer and then map it into userspace. We end up allocating all
> pages for the buffer as soon as any page is faulted in (consider if
> it's a 1080p buffer...) and then set up the scatterlists. That's
> useless overhead if we later decide that we're not going to pass it to
> the GPU (e.g. because Xorg allocated a pixmap which it then only
> performed CPU operations on.) I have "changes" for this which aren't
> prepared as proper patches yet (in other words, they currently exist as
> a playground of changes.)

I think this is a userspace problem. Userspace should not allocate any
GEM objects until it is clear that the buffer is going to be used by the
GPU. Experience with the Intel DDX shows that a crucial part of good X
acceleration performance is having a proper strategy in place for
migrating pixmaps between the CPU and the GPU, even with UMA. (To make
sure we're talking about the same thing on the kernel side, I've
appended a rough sketch of how I picture the per-page fault-in below my
signature.)

> I have other improvements pending which need proper perf analysis
> before I can sort them out properly. These change the flow for a
> submitted command set - reading all data from userspace before taking
> dev->struct_mutex, since holding this lock over a page fault isn't
> particularly graceful. Do we really want to stall graphics on the
> entire GPU while some process' page happens to be swapped out to disk?
> I suspect not.

As long as those changes don't incur any userspace-visible changes, it
should be safe to keep them in the playground until the basics are
mainline. Nonetheless, I would be grateful if you could point me to
those patches, so I can get a picture of what those changes would look
like. (A second sketch of how I currently imagine the reordered submit
path is also appended below.)

Regards,
Lucas

-- 
Pengutronix e.K.             | Lucas Stach                 |
Industrial Linux Solutions   | http://www.pengutronix.de/  |
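To avoid misunderstandings about the deferred allocation, here is how I
understand the per-page fault-in from your playground would roughly
look. Again purely a sketch with invented names (etna_bo, bo->pages);
bounds checking and locking are omitted, and I haven't compiled it:

#include <linux/mm.h>

struct etna_bo {
	struct page **pages;	/* sparse array, one slot per page */
};

static int etna_gem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct etna_bo *bo = vma->vm_private_data;
	struct page *page = bo->pages[vmf->pgoff];

	/* Allocate only the page that was actually touched, instead of
	 * populating the whole object on the first fault. The
	 * scatterlist would then be built lazily, on the first GPU
	 * submit that references this object. */
	if (!page) {
		page = alloc_page(GFP_KERNEL | __GFP_ZERO);
		if (!page)
			return VM_FAULT_OOM;
		bo->pages[vmf->pgoff] = page;
	}

	switch (vm_insert_page(vma, (unsigned long)vmf->virtual_address,
			       page)) {
	case 0:
	case -EBUSY:	/* raced with another fault, mapping already exists */
		return VM_FAULT_NOPAGE;
	case -ENOMEM:
		return VM_FAULT_OOM;
	default:
		return VM_FAULT_SIGBUS;
	}
}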
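And this is how I imagine the reordered submit path you describe: all
copy_from_user() happens before dev->struct_mutex is taken, so a page
fault (even one that has to wait for swap) only ever stalls the
submitting process. The drm_etna_submit struct and etna_queue_cmds()
are again just placeholders I made up:

#include <drm/drmP.h>
#include <linux/err.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/types.h>

struct drm_etna_submit {
	__u64 stream;	/* userspace pointer to the command stream */
	__u32 size;	/* stream size in bytes */
};

int etna_queue_cmds(struct drm_device *dev, struct drm_file *file,
		    void *cmds, size_t size);

static int etna_submit_ioctl(struct drm_device *dev, void *data,
			     struct drm_file *file)
{
	struct drm_etna_submit *args = data;
	void *cmds;
	int ret;

	/* Pull everything out of userspace first; a fault here only
	 * affects the submitting process. */
	cmds = memdup_user((void __user *)(uintptr_t)args->stream,
			   args->size);
	if (IS_ERR(cmds))
		return PTR_ERR(cmds);

	/* Nothing below this point can fault on user memory, so the
	 * GPU as a whole is never stalled on a swapped-out page. */
	mutex_lock(&dev->struct_mutex);
	ret = etna_queue_cmds(dev, file, cmds, args->size);
	mutex_unlock(&dev->struct_mutex);

	kfree(cmds);
	return ret;
}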