On Wed, Feb 13, 2019 at 09:35:30AM +0100, Christian König wrote:
> On 13.02.19 at 08:59, Daniel Vetter wrote:
> > On Wed, Feb 13, 2019 at 2:44 AM Rob Herring <robh@xxxxxxxxxx> wrote:
> > > On Tue, Feb 12, 2019 at 7:00 PM Eric Anholt <eric@xxxxxxxxxx> wrote:
> > > > Rob Herring <robh@xxxxxxxxxx> writes:
> > > > > On Thu, Feb 7, 2019 at 9:51 AM Daniel Vetter <daniel@xxxxxxxx> wrote:
> > > > > > On Thu, Feb 07, 2019 at 11:21:52PM +0800, Qiang Yu wrote:
> > > > > > > On Thu, Feb 7, 2019 at 5:09 PM Daniel Vetter <daniel@xxxxxxxx> wrote:
> > > > > > > > On Wed, Feb 06, 2019 at 09:14:55PM +0800, Qiang Yu wrote:
> > > > > > > > > Kernel DRM driver for ARM Mali 400/450 GPUs.
> > > > > > > > >
> > > > > > > > > Since the last RFC, all feedback has been addressed. Most Mali DTS
> > > > > > > > > changes are already upstreamed by the SoC maintainers. The kernel
> > > > > > > > > driver and user-kernel interface have been stable for several
> > > > > > > > > months, so I think it's ready to be upstreamed.
> > > > > > > > >
> > > > > > > > > This implementation mainly takes the amdgpu DRM driver as reference.
> > > > > > > > >
> > > > > > > > > - Mali 4xx GPUs have two kinds of processors, GP and PP. GP is for
> > > > > > > > >   OpenGL vertex shader processing and PP is for fragment shader
> > > > > > > > >   processing. Each processor has its own MMU, so processors work in
> > > > > > > > >   a virtual address space.
> > > > > > > > > - There's only one GP but multiple PPs (max 4 for Mali 400 and 8
> > > > > > > > >   for Mali 450) in the same Mali 4xx GPU. All PPs are grouped
> > > > > > > > >   together to handle a single fragment shader task divided by
> > > > > > > > >   FB output tiled pixels. The Mali 400 user space driver is
> > > > > > > > >   responsible for assigning target tiled pixels to each PP, but
> > > > > > > > >   Mali 450 has a HW module called DLBU to dynamically balance each
> > > > > > > > >   PP's load.
> > > > > > > > > - The user space driver allocates a buffer object and maps it into
> > > > > > > > >   the GPU virtual address space, uploads the command stream and draw
> > > > > > > > >   data through a CPU mmap of the buffer object, then submits a task
> > > > > > > > >   to GP/PP with a register frame indicating where the command stream
> > > > > > > > >   is plus misc settings.
> > > > > > > > > - There's no command stream validation/relocation because each user
> > > > > > > > >   process has its own GPU virtual address space. The GP/PP MMUs
> > > > > > > > >   switch virtual address space before running two tasks from
> > > > > > > > >   different user processes. Buggy or malicious user space code just
> > > > > > > > >   gets an MMU fault or a GP/PP error IRQ, after which the HW/SW is
> > > > > > > > >   recovered.
> > > > > > > > > - Use TTM as the MM. TTM_PL_TT type memory is used as the content of
> > > > > > > > >   a lima buffer object, which is allocated from the TTM page pool.
> > > > > > > > >   All lima buffer objects get pinned with TTM_PL_FLAG_NO_EVICT at
> > > > > > > > >   allocation, so there's no buffer eviction and swap for now.
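(Aside, for anyone not familiar with the TTM placements mentioned in the last
bullet above: the pinned setup being described boils down to something like
the sketch below. This is just the generic TTM pattern, not lima's actual
code; such a placement would be handed to ttm_bo_init()/ttm_bo_validate() at
BO creation.)

/*
 * Sketch only -- not lima's code. A BO created with this placement lives
 * in TTM_PL_TT (system pages mapped for the device) and is marked
 * NO_EVICT, so TTM never moves or swaps it out.
 */
#include <drm/ttm/ttm_placement.h>

static const struct ttm_place pinned_tt_place = {
	.fpfn  = 0,	/* no placement range restriction */
	.lpfn  = 0,
	.flags = TTM_PL_FLAG_TT | TTM_PL_FLAG_WC | TTM_PL_FLAG_NO_EVICT,
};

static const struct ttm_placement pinned_tt_placement = {
	.num_placement      = 1,
	.placement          = &pinned_tt_place,
	.num_busy_placement = 1,
	.busy_placement     = &pinned_tt_place,
};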
> > > > > > > > All other render GPU drivers that have unified memory (i.e. are on the
> > > > > > > > SoC) use GEM directly, with some of the helpers we have. So msm,
> > > > > > > > etnaviv, vc4 (and i915 is kinda the same too really). TTM makes sense
> > > > > > > > if you have some discrete memory to manage, but imo not in any other
> > > > > > > > place really.
> > > > > > > >
> > > > > > > > What's the design choice behind this?
> > > > > > > To be honest, it's just because TTM offers more helpers. I did implement
> > > > > > > a GEM way with cma alloc at the beginning. But when implementing paged
> > > > > > > mem, I found TTM has mem pool alloc, sync and mmap related helpers which
> > > > > > > cover much of my existing code. It's totally possible with GEM, but not
> > > > > > > as easy as TTM to me. And virtio-gpu seems to be an example of using TTM
> > > > > > > without discrete mem. Shouldn't TTM be a superset of both unified mem
> > > > > > > and discrete mem?
> > > > > > virtio does have fake vram and migration afaiui. And sure, you can use TTM
> > > > > > without the vram migration, it's just that most of the complexity of TTM
> > > > > > is due to buffer placement and migration and all that stuff. If you never
> > > > > > need to move buffers, then you don't need that ever.
> > > > > >
> > > > > > Wrt the lack of helpers, what exactly are you looking for? A big part of
> > > > > > why TTM has these is that TTM is a bit of a midlayer, so it reinvents a
> > > > > > bunch of things provided by e.g. the dma-api. It's cleaner to use the
> > > > > > dma-api directly. Basing the lima kernel driver on vc4, freedreno or
> > > > > > etnaviv (the last one is probably closest, since it doesn't have a
> > > > > > display block either) would be better I think.
> > > > > FWIW, I'm working on the panfrost driver and am using the shmem
> > > > > helpers from Noralf. It's the early stages though. I started a patch
> > > > > for etnaviv to use it too, but found I need to rework it to sub-class
> > > > > the shmem GEM object.
> > > > Did you just convert the shmem helpers over to doing alloc_coherent? If
> > > > so, I'd be interested in picking them up for v3d, and that might help
> > > > get another patch out of your stack.
> > > I haven't really fully addressed that yet, but yeah, my plan is just
> > > to switch to WC alloc and mappings. I was going to try to make it
> > > configurable, but there is a comment in the ARM dma mapping code which
> > > makes me wonder if tinydrm using streaming DMA for SPI is
> > > fundamentally broken (and maybe CMA is less broken?). If not broken,
> > > it's at least not guaranteed to work.
> > >
> > > /*
> > >  * The whole dma_get_sgtable() idea is fundamentally unsafe - it seems
> > >  * that the intention is to allow exporting memory allocated via the
> > >  * coherent DMA APIs through the dma_buf API, which only accepts a
> > >  * scattertable.  This presents a couple of problems:
> > >  *  1. Not all memory allocated via the coherent DMA APIs is backed by
> > >  *     a struct page
> > >  *  2. Passing coherent DMA memory into the streaming APIs is not allowed
> > >  *     as we will try to flush the memory through a different alias to
> > >  *     that actually being used (and the flushes are redundant.)
> > >  */
> > The sg table is only for device access, which avoids both of these
> > issues. That's the idea at least, except all ttm-based drivers don't
> > care; instead they expect a struct page and then use that to build a
> > ttm_bo. And then they use all the ttm cpu side access functions, instead
> > of using the dma-buf interfaces (which need to exist to avoid the above
> > issues).
>
> Actually that is not correct any more. I've fixed this while working on
> directly sharing BOs between amdgpu devices.
>
> TTM now uses the DMA addresses from the sg table, and I've actually wanted
> to remove the pages for imported DMA-buf BOs for a while now.

Nice! And yeah, it's been a while since I looked at this ... So just a bit of
cleanup work left to do, the fundamentals are in place. Shouldn't be too hard
to get rid of the pages, since the dma-buf cpu accessor functions have been
modelled after the ttm_bo interfaces.
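Roughly, the end state for a driver consuming an imported dma-buf would then
look something like the sketch below: only sg_dma_address()/sg_dma_len() from
the attachment's sg table are used, and struct pages are never touched. The
lima_vm names are made up here, purely for illustration.

/*
 * Sketch of mapping an imported dma-buf into a GPU VM using only the DMA
 * side of the sg_table. struct lima_vm and lima_vm_map_page() are
 * hypothetical placeholders for whatever writes the GPU's page tables.
 */
#include <linux/mm.h>
#include <linux/scatterlist.h>

struct lima_vm;						/* hypothetical */
int lima_vm_map_page(struct lima_vm *vm, dma_addr_t dma,
		     u32 gpu_va);			/* hypothetical */

static int map_imported_sgt(struct lima_vm *vm, struct sg_table *sgt,
			    u32 gpu_va)
{
	struct scatterlist *sg;
	unsigned int i;
	int ret;

	/* Walk the DMA-mapped entries (nents, not orig_nents). */
	for_each_sg(sgt->sgl, sg, sgt->nents, i) {
		dma_addr_t dma = sg_dma_address(sg);
		unsigned int len = sg_dma_len(sg);	/* assumed page aligned */

		while (len) {
			ret = lima_vm_map_page(vm, dma, gpu_va);
			if (ret)
				return ret;
			dma    += PAGE_SIZE;
			gpu_va += PAGE_SIZE;
			len    -= PAGE_SIZE;
		}
	}

	return 0;
}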
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch