Re: [PATCH 0/2] Lima DRM driver

On 13.02.19 at 10:38, Daniel Vetter wrote:
On Wed, Feb 13, 2019 at 09:35:30AM +0100, Christian König wrote:
On 13.02.19 at 08:59, Daniel Vetter wrote:
On Wed, Feb 13, 2019 at 2:44 AM Rob Herring <robh@xxxxxxxxxx> wrote:
On Tue, Feb 12, 2019 at 7:00 PM Eric Anholt <eric@xxxxxxxxxx> wrote:
Rob Herring <robh@xxxxxxxxxx> writes:

On Thu, Feb 7, 2019 at 9:51 AM Daniel Vetter <daniel@xxxxxxxx> wrote:
On Thu, Feb 07, 2019 at 11:21:52PM +0800, Qiang Yu wrote:
On Thu, Feb 7, 2019 at 5:09 PM Daniel Vetter <daniel@xxxxxxxx> wrote:
On Wed, Feb 06, 2019 at 09:14:55PM +0800, Qiang Yu wrote:
Kernel DRM driver for ARM Mali 400/450 GPUs.

Since the last RFC, all feedback has been addressed. Most Mali DTS
changes have already been upstreamed by SoC maintainers. The kernel
driver and user-kernel interface have been quite stable for several
months, so I think it's ready to be upstreamed.

This implementation mainly takes the amdgpu DRM driver as a reference.

- Mali 4xx GPUs have two kinds of processors, GP and PP. GP handles
    OpenGL vertex shader processing and PP handles fragment shader
    processing. Each processor has its own MMU, so processors work in
    their own virtual address space.
- There's only one GP but multiple PPs (max 4 for Mali 400 and 8
    for Mali 450) in the same Mali 4xx GPU. All PPs are grouped
    together to handle a single fragment shader task, divided by the
    FB output's tiled pixels. On Mali 400 the user space driver is
    responsible for assigning target tiled pixels to each PP, but Mali
    450 has a HW module called the DLBU to dynamically balance each
    PP's load.
- The user space driver allocates a buffer object and maps it into the
    GPU virtual address space, uploads the command stream and draw data
    through a CPU mmap of the buffer object, then submits a task to the
    GP/PP with a register frame indicating where the command stream is,
    plus misc settings (see the sketch after this list).
- There's no command stream validation/relocation, because each user
    process has its own GPU virtual address space. The GP/PP's MMU
    switches virtual address spaces before running two tasks from
    different user processes. Buggy or malicious user space code just
    gets an MMU fault or a GP/PP error IRQ, after which the HW/SW is
    recovered.
- TTM is used as the MM. TTM_PL_TT type memory is used as the content
    of a lima buffer object, allocated from the TTM page pool. All lima
    buffer objects get pinned with TTM_PL_FLAG_NO_EVICT at allocation
    time, so there's no buffer eviction or swap for now.
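
A minimal sketch of the user space flow described above (the ioctl names
and structs here are illustrative only, not the actual lima UAPI):

/*
 * Illustrative only: these ioctls/structs are invented to mirror the
 * flow described above; they are not the real lima user-kernel interface.
 */
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/types.h>

struct fake_lima_gem_create {
	uint32_t size;    /* BO size in bytes */
	uint32_t flags;
	uint32_t handle;  /* GEM handle returned by the kernel */
	uint32_t pad;
};

struct fake_lima_gem_submit {
	uint32_t pipe;       /* which processor: GP or PP */
	uint32_t frame_size;
	uint64_t frame;      /* register frame: command stream address, misc settings */
};

#define FAKE_LIMA_GEM_CREATE _IOWR('L', 0x00, struct fake_lima_gem_create)
#define FAKE_LIMA_GEM_SUBMIT _IOWR('L', 0x01, struct fake_lima_gem_submit)

static int submit_one_task(int drm_fd, const void *cmds, size_t len,
			   off_t mmap_offset)
{
	struct fake_lima_gem_create create = { .size = 0x10000 };
	struct fake_lima_gem_submit submit = { .pipe = 0 /* GP */ };
	void *cpu;

	/* 1. Allocate a BO; the kernel also maps it into this process's
	 *    private GPU virtual address space. */
	if (ioctl(drm_fd, FAKE_LIMA_GEM_CREATE, &create))
		return -1;

	/* 2. Upload the command stream through a CPU mmap of the BO
	 *    (the mmap offset lookup ioctl is elided here). */
	cpu = mmap(NULL, create.size, PROT_READ | PROT_WRITE, MAP_SHARED,
		   drm_fd, mmap_offset);
	if (cpu == MAP_FAILED)
		return -1;
	memcpy(cpu, cmds, len);
	munmap(cpu, create.size);

	/* 3. Submit the task to GP/PP with a register frame telling the
	 *    HW where the command stream lives. */
	return ioctl(drm_fd, FAKE_LIMA_GEM_SUBMIT, &submit);
}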
All other render GPU drivers that have unified memory (i.e. the GPU is
on the SoC) use GEM directly, with some of the helpers we have: msm,
etnaviv, vc4 (and i915 is kinda the same too really). TTM makes sense if
you have some discrete memory to manage, but imo not anywhere else.

What's the design choice behind this?
To be honest, it's just because TTM offers more helpers. I did implement
a GEM path with CMA alloc at the beginning. But when implementing paged
memory, I found TTM has mem pool alloc, sync and mmap related helpers
which cover much of my existing code. It's totally possible with GEM,
but not as easy as TTM for me. And virtio-gpu seems to be an example of
using TTM without discrete memory. Shouldn't TTM be a superset of both
unified mem and discrete mem?
virtio does have fake vram and migration afaiui. And sure, you can use TTM
without the vram migration, it's just that most of the complexity of TTM
is due to buffer placement and migration and all that stuff. If you never
need to move buffers, then you don't need that ever.

Wrt the lack of helpers, what exactly are you looking for? A big part of
these helpers exist in TTM because TTM is a bit of a midlayer, so it
reinvents a bunch of things provided by e.g. the dma-api. It's cleaner to
use the dma-api directly. Basing the lima kernel driver on vc4, freedreno
or etnaviv (the last is probably closest, since it doesn't have a display
block either) would be better I think.
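
For reference, the direct dma-api path in a GEM-based driver looks
roughly like this (a minimal kernel-side sketch, not lifted from
vc4/freedreno/etnaviv; error handling trimmed):

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Map a BO's shmem pages for device access, no TTM involved. The
 * streaming DMA API does whatever cache maintenance is required. */
static int example_map_bo_for_device(struct device *dev,
				     struct page **pages, int npages,
				     struct sg_table *sgt)
{
	int ret;

	ret = sg_alloc_table_from_pages(sgt, pages, npages, 0,
					(unsigned long)npages << PAGE_SHIFT,
					GFP_KERNEL);
	if (ret)
		return ret;

	if (!dma_map_sg(dev, sgt->sgl, sgt->nents, DMA_BIDIRECTIONAL)) {
		sg_free_table(sgt);
		return -ENOMEM;
	}
	return 0;
}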
FWIW, I'm working on the panfrost driver and am using the shmem
helpers from Noralf. It's still early stages though. I started a patch
for etnaviv to use them too, but found I need to rework them to sub-class
the shmem GEM object.
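
Sub-classing would presumably follow the usual GEM embedding pattern,
something like this sketch (struct and field names invented for
illustration; to_drm_gem_shmem_obj() is the helper's own cast macro):

#include <drm/drm_gem_shmem_helper.h>

/* Embed the shmem helper object so driver-private state rides along
 * with the BO. */
struct example_gem_object {
	struct drm_gem_shmem_object base;
	u64 gpu_va;	/* where the BO sits in the GPU VA space */
};

static inline struct example_gem_object *
to_example_bo(struct drm_gem_object *obj)
{
	return container_of(to_drm_gem_shmem_obj(obj),
			    struct example_gem_object, base);
}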
Did you just convert the shmem helpers over to doing alloc_coherent?  If
so, I'd be interested in picking them up for v3d, and that might help
get another patch out of your stack.
I haven't really fully addressed that yet, but yeah, my plan is just
to switch to WC alloc and mappings. I was going to try to make it
configurable, but there is a comment in the ARM dma-mapping code which
makes me wonder whether tinydrm using streaming DMA for SPI is
fundamentally broken (and maybe CMA is less broken?). If not broken,
it's at least not guaranteed to work.

/*
 * The whole dma_get_sgtable() idea is fundamentally unsafe - it seems
 * that the intention is to allow exporting memory allocated via the
 * coherent DMA APIs through the dma_buf API, which only accepts a
 * scattertable.  This presents a couple of problems:
 * 1. Not all memory allocated via the coherent DMA APIs is backed by
 *    a struct page
 * 2. Passing coherent DMA memory into the streaming APIs is not allowed
 *    as we will try to flush the memory through a different alias to that
 *    actually being used (and the flushes are redundant.)
 */
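
For context, dma_get_sgtable() is the call that comment sits above; the
exporter-side pattern it warns about is roughly this (a sketch, with the
two problems restated as comments):

#include <linux/dma-mapping.h>

/* Export a coherent DMA allocation as an sg_table so it can go out
 * through dma-buf - exactly the pattern the comment calls unsafe. */
static int export_coherent_as_sgt(struct device *dev, struct sg_table *sgt,
				  void *cpu_addr, dma_addr_t dma_addr,
				  size_t size)
{
	/* Problem 1: cpu_addr may not be backed by struct pages at all.
	 * Problem 2: the resulting table must never be fed back into the
	 * streaming API (dma_map_sg), since that would flush through a
	 * different alias than the one actually in use. */
	return dma_get_sgtable(dev, sgt, cpu_addr, dma_addr, size);
}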
The sg table is only for device access, which avoids both of these
issues. That's the idea at least, except that ttm-based drivers don't
care: they expect a struct page and then use that to build a ttm_bo,
and then use all the ttm cpu-side access functions instead of the
dma-buf interfaces (which exist precisely to avoid the above issues).
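
Those dma-buf CPU access interfaces are the begin/end_cpu_access bracket
plus vmap; kernel-side access through them looks roughly like this (a
sketch, assuming the dma_buf_vmap()/dma_buf_vunmap() signatures of this
era):

#include <linux/dma-buf.h>

/* Read an imported buffer through the dma-buf interfaces instead of
 * via struct pages, so the exporter can do any cache maintenance its
 * memory needs. */
static int read_imported_buf(struct dma_buf *buf)
{
	void *vaddr;
	int ret;

	ret = dma_buf_begin_cpu_access(buf, DMA_FROM_DEVICE);
	if (ret)
		return ret;

	vaddr = dma_buf_vmap(buf);
	if (vaddr) {
		/* ... read the contents ... */
		dma_buf_vunmap(buf, vaddr);
	}

	return dma_buf_end_cpu_access(buf, DMA_FROM_DEVICE);
}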
Actually that is not correct any more. I've fixed this while working on
directly sharing BOs between amdgpu devices.

TTM now uses the DMA addresses from the sg table, and I've actually
wanted to remove the pages for imported DMA-buf BOs for a while now.
Nice! And yeah it's been a while since I looked at this ... So just a bit
of cleanup work left to do, fundamentals are in place. Shouldn't be too
hard to get rid of the pages, since the dma-buf cpu accessor functions
have been modelled after the ttm_bo interfaces.

Well, at least in radeon and amdgpu, CPU mapping an imported BO is forbidden (user space directly maps the DMA-buf fd).

The only case left is mapping a BO in the kernel, and that in turn is only used in very few places in radeon/amdgpu:
1. Command stream patching.
2. CPU based page table updates.
3. Debugging

And I think none of them makes sense on a DMA-buf imported BO.

Regards,
Christian.

-Daniel

_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel



