Re: [linux.git drm/ttm]: NULL pointer dereference upon driver probe

On 10.08.20 at 22:05, Dave Airlie wrote:
On Tue, 11 Aug 2020 at 05:24, Christian König <christian.koenig@xxxxxxx> wrote:
On 10.08.20 at 20:51, Dave Airlie wrote:
On Mon, 10 Aug 2020 at 22:20, Christian König <christian.koenig@xxxxxxx> wrote:
On 07.08.20 at 09:02, Christian König wrote:
On 06.08.20 at 20:50, Roland Scheidegger wrote:
On 06.08.20 at 17:28, Christian König wrote:
My best guess is that you are facing two separate bugs here.

Crash #1 is somehow related to CRTCs and might even be caused by the
atomic-helper change you noted below.

Crash #2 is caused because vmw_bo_create_and_populate() tries to
manually populate a BO object instead of relying on TTM to do it when
necessary. This indeed doesn't work any more because of "drm/ttm: make
TT creation purely optional v3".
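
A minimal userspace sketch of why a manual populate can break once TT
creation becomes optional. Everything here is hypothetical (struct toy_bo,
struct toy_tt, toy_bo_ensure_tt(), toy_bo_populate()); it is not the TTM
or vmwgfx code, and it assumes the failure mode is a page-table pointer
that is still NULL when the driver touches it directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are NOT the real TTM structures. */
struct toy_tt {
	size_t num_pages;
	int populated;
};

struct toy_bo {
	size_t num_pages;
	struct toy_tt *tt;	/* stays NULL until the pages are really needed */
};

/* Create the page-table object on demand. Returns 0 on success, -1 on failure. */
static int toy_bo_ensure_tt(struct toy_bo *bo)
{
	if (bo->tt)
		return 0;

	bo->tt = calloc(1, sizeof(*bo->tt));
	if (!bo->tt)
		return -1;

	bo->tt->num_pages = bo->num_pages;
	return 0;
}

/* Populate through the helper, so a missing tt is created first. */
static int toy_bo_populate(struct toy_bo *bo)
{
	if (toy_bo_ensure_tt(bo))
		return -1;

	bo->tt->populated = 1;
	return 0;
}

/*
 * A driver-side shortcut that writes to bo->tt directly is only safe
 * when this returns non-zero; with optional creation it may return 0.
 */
static int toy_bo_direct_access_is_safe(const struct toy_bo *bo)
{
	return bo->tt != NULL;
}
```

In this model a freshly created toy_bo has tt == NULL, so code that skips
toy_bo_populate() and dereferences bo->tt directly would fault.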

Question is why vmwgfx is doing this?
Not really sure unfortunately, it's possible vmwgfx is doing it because
ttm lacked some capabilities at some point?
I think so as well, yes.

    Trying to figure this one out...
Problem is that what vmwgfx is doing here is questionable at best.

By definition BOs in the SYSTEM domain are not accessible by the GPU,
even if it is a virtual one.

And what vmwgfx does is allocate one in the SYSTEM domain as
non-evictable and then bypass TTM in filling and mapping it to the GPU.

That doesn't really make sense to me, why shouldn't that BO be put in
the GTT domain then in the first place?
Well I think I figured out what VMWGFX is doing here, but you won't like it.

See VMWGFX doesn't support TTM's GTT domain. So to implement the mob and
otable BOs it is allocating system domain BOs, pinning them and manually
filling them with pages.

The correct fix would be to audit VMWGFX and fix this handling so that
it doesn't mess any more with TTM internal object state.

Till that happens we can only revert the patch for now.
Probably good to do, at least we know the problem now.

However I found myself in the same place yesterday so we should
discuss how to fix it going forward.

At least on Intel IGPs you have GTT and PPGTT (per-process table). GTT
on later hw is only needed for certain objects, like scanout etc. Not
every object needs to be in the GTT domain.
We have the same situation on amdgpu. GART objects are only allocated
for scanout and VMID0 access.

See our amdgpu_gtt_mgr.c.

But when you get an execbuffer and you want to bind the PPGTT objects,
you need to move the object to the GTT domain, which is pointless and
suboptimal, since the GTT domain could fill up and start needing
evictions.
That is intentional behavior. The GTT domain is the overall memory
which is currently GPU accessible.

The GART can be much smaller than the GTT domain.

So the option is to get SYSTEM domain objects, only move them to
TTM_PL_TT when pinning for scanout etc, but otherwise generate the
pages lists from the objects. In my playing around I've hacked up a TT
create/populate path, with no bind.
We already tried this and it turned out to be a bad idea.

See amdgpu_ttm_alloc_gart() for how to easily do it with the GTT domain.
Okay I think this needs some commenting. Because it's not immediately
obvious what it means if you have an invalid address here and how that
comes about.

My reading is that you use lpfn to decide if something should be
allocated GTT space at all, and only set lpfn when you have requested
GART space explicitly?

Yes, exactly :)
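
The reading confirmed above can be modelled in a few lines of userspace C.
This is only a sketch under stated assumptions: struct toy_place,
struct toy_gart_space, toy_gtt_new(), toy_has_gart_addr() and
TOY_INVALID_OFFSET are all hypothetical names, and the bump allocator
stands in for whatever range manager a real driver would use. It is not
the amdgpu implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical marker meaning "in the GTT domain, but no GART address". */
#define TOY_INVALID_OFFSET UINT64_MAX

/* Placement request; lpfn == 0 means no GART range was asked for. */
struct toy_place {
	uint64_t lpfn;
};

/* Trivial bump allocator over a small GART aperture, counted in pages. */
struct toy_gart_space {
	uint64_t next_free;
	uint64_t size;
};

struct toy_node {
	uint64_t start;
};

/* Returns 0 on success, -1 if the requested GART range is exhausted. */
static int toy_gtt_new(struct toy_gart_space *gart,
		       const struct toy_place *place,
		       uint64_t num_pages, struct toy_node *node)
{
	uint64_t limit;

	if (!place->lpfn) {
		/* Accounted to the domain only; nothing is mapped. */
		node->start = TOY_INVALID_OFFSET;
		return 0;
	}

	/* GART space was requested explicitly, so carve out a range. */
	limit = place->lpfn < gart->size ? place->lpfn : gart->size;
	if (num_pages > limit || gart->next_free > limit - num_pages)
		return -1;

	node->start = gart->next_free;
	gart->next_free += num_pages;
	return 0;
}

/* An invalid start means the buffer was never given GART space. */
static int toy_has_gart_addr(const struct toy_node *node)
{
	return node->start != TOY_INVALID_OFFSET;
}
```

In this model a request with lpfn == 0 can be far larger than the
aperture and still succeeds, because it consumes no GART pages.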

To me it feels like we are working around TTM here, I think this sort
of feature should be more first-class in the TTM API instead of having
every driver figure it out on a discover-your-own-journey path.

Completely agree.

I suppose separating the concept of TTM_PL_TT from the concept of
being bound to a global TT is what we need to do here somehow.

Well how about completely removing the concept of a global TT from TTM?

What TTM should do is manage domains and help with the transitions between those domains.

That one of those domains maps the backing pages into a global TT is completely specific to that domain and shouldn't bother TTM in any way.

We can of course provide some default functions to manage AGP and classic GART, but TTM shouldn't enforce using those.
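
One way to picture that split, purely as an illustration: the core only tracks which domain a buffer is in and calls optional per-domain hooks on a transition, while binding into a global TT is something one particular domain does in its own hook. All names below (struct toy_domain, struct toy_buf, toy_move(), dom_system, dom_gart, dom_ppgtt) are hypothetical and do not reflect an existing or proposed TTM interface.

```c
#include <assert.h>
#include <stddef.h>

struct toy_buf;

/* A domain with optional hooks; the core knows nothing about their contents. */
struct toy_domain {
	const char *name;
	int (*enter)(struct toy_buf *buf);	/* may be NULL */
	void (*leave)(struct toy_buf *buf);	/* may be NULL */
};

struct toy_buf {
	const struct toy_domain *domain;
	int bound_in_global_tt;	/* only ever touched by the GART domain */
};

/* Domain-specific behaviour: only this domain maps pages into a global TT. */
static int gart_enter(struct toy_buf *buf)
{
	buf->bound_in_global_tt = 1;
	return 0;
}

static void gart_leave(struct toy_buf *buf)
{
	buf->bound_in_global_tt = 0;
}

static const struct toy_domain dom_system = { "system", NULL, NULL };
static const struct toy_domain dom_ppgtt = { "ppgtt", NULL, NULL };
static const struct toy_domain dom_gart = { "gart", gart_enter, gart_leave };

/* Core transition helper: no knowledge of GART, AGP or any global table. */
static int toy_move(struct toy_buf *buf, const struct toy_domain *to)
{
	int ret;

	if (buf->domain == to)
		return 0;

	/* Enter the new domain first so a failure leaves the buffer untouched. */
	if (to->enter) {
		ret = to->enter(buf);
		if (ret)
			return ret;
	}

	if (buf->domain && buf->domain->leave)
		buf->domain->leave(buf);

	buf->domain = to;
	return 0;
}
```

Here a buffer can sit in the "ppgtt" domain and be GPU accessible in the model without ever being bound in the global table; only a move into "gart" sets the flag.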

Regards,
Christian.


Dave.
