Re: It appears drm-next TTM cleanup broke something . . .

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Kevin,

OpenChrome DDX was sending TTM_PL_FLAG_* based flags to OpenChrome DRM.

Ugh, that would be an absolute no-go for upstreaming.

Is it too much to ask for using more BUG_ON null pointer assertions for TTM callbacks?

I don't think that this is useful at all. See a BUG_ON() has the same result as a NULL pointer dereference.

While that may be true, I do plan to implement acceleration later, and this is why I do not want to settle with the GEM VRAM helpers.

TTM is quite a mess and the effort here is essential to clean it up and kill most of the driver specific workarounds we have in there.

As the maintainer of all of this I would probably reject any newly added driver which is using the layer directly instead of the VRAM GEM wrapper.

Regards,
Christian.

Am 19.10.20 um 18:20 schrieb Kevin Brace:
Hi Christian,

I looked into a few more things, and figured out why OpenChrome DRM was not booting X Server.
Now the situation is under control in my side of the world (OpenChrome development world), and I got X Server working again with drm-next 5.10 code and OpenChrome DRM.
Code will be committed into OpenChrome DRM upstream repository's drm-next-5.10 branch in the next few days.
     I can see that OpenChrome DRM "just happened to" work without ttm_tt_create/destroy callbacks being specified.
I can fix this issue easily.
The other issue I was facing was that commit 48e07c23cbeba2a2cda7ca73be0015e727818536 (drm/ttm: nuke memory type flags) discontinued TTM_PL_FLAG_* flags.
This caused incompatibility between OpenChrome DDX and the updated OpenChrome DRM that incorporated the changes made with commit 48e07c2.
OpenChrome DDX was sending TTM_PL_FLAG_* based flags to OpenChrome DRM.
I changed OpenChrome DDX code to use TTM_PL_* instead for allocating memory via TTM, and OpenChrome DRM based Linux kernel was able to boot X Server properly.
Again, the code change to the DDX will happen in a few days.
     Before I moved on from this issue, I will like to ask for one request regarding TTM and its callbacks.
Is it too much to ask for using more BUG_ON null pointer assertions for TTM callbacks?
Although I did not bring it up with you back then, I did get hit with an issue similar to this 3 years ago.

https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcgit.freedesktop.org%2Fopenchrome%2Fdrm-openchrome%2Fcommit%2F%3Fh%3Ddrm-next-4.13%26id%3D3c08ec601bb1ccd5ff58a9101317b728aa7204bd&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=wmGsqFinBsTLJSgp6G5ygoW7J%2Fppph4CzgO9q7q4h34%3D&reserved=0
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcgit.freedesktop.org%2Fopenchrome%2Fdrm-openchrome%2Fcommit%2F%3Fh%3Ddrm-next-4.13%26id%3D2064175f977d9859831be653df16e3ea10415a8a&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=1Pl3PZ%2FY8P2aFE9in3oYCgl1CW6Nq%2FIQiHe75U8Dp0I%3D&reserved=0


I did not send you an e-mail about it, I do recall complaining to Alex Deucher about the above io_mem_pfn null pointer bug at XDC2017 (I did attend it, and I presented about the state of OpenChrome Project.).
I brought this up with Alex because I stuck for more than 2 weeks to figure out the issue (the commit for overcoming io_mem_pfn callback issue was made on 2017-09-13, but the commit prior to that was on 2017-08-28).

https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcgit.freedesktop.org%2Fopenchrome%2Fdrm-openchrome%2Flog%2F%3Fh%3Ddrm-next-4.13%26ofs%3D50&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=zsLWqjAPDFhokHptCAHl4XRCZJSjcWsnvAimoUyb7Mk%3D&reserved=0


Anyway, someone else got hit by the same issue a month or two later

https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Farchives%2Fdri-devel%2F2017-November%2F159002.html&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=wbkUKUL0XVOviw%2Bvt0WFGvSyJiOF%2Bdz%2FZVSMSPTwuwI%3D&reserved=0
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Farchives%2Fdri-devel%2F2017-December%2F161168.html&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=nJ8n0lFRTX%2FoxPsKCvV%2FYvtUvmvzxLGVikUWtBsZy5U%3D&reserved=0


These are the commits that fixed the issue.

https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcgit.freedesktop.org%2Fdrm%2Fdrm%2Fcommit%2F%3Fid%3Dea642c3216cb2a60d1c0e760ae47ee85c9c16447&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=aL1C28MKl5SPRsP%2F9A6XWFMmtXmxymmGic3vLZe5%2FNw%3D&reserved=0
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcgit.freedesktop.org%2Fdrm%2Fdrm%2Fcommit%2F%3Fid%3Dc67fa6edc8b11afe22c88a23963170bf5f151acf&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=%2BY1z8sgjIf%2FPiJG9PBIOnsHuRCFq1uHNSVJodWQz%2B5k%3D&reserved=0


If you compare the commit date, I figured it out io_mem_pfn null pointer bug 2 to 3 months earlier.
I think the problem we got with the latest code change to the TTM is, many TTM callbacks do not check for a null pointer for mandatory callbacks.
Also, the ttm_bo_driver struct comment section inside ttm_bo_driver.h does not make it 100% clear that what is mandatory and what is not.
If these were done, I am sure I would have left ttm_tt_create/destroy callbacks intact.

https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fcgit.freedesktop.org%2Fopenchrome%2Fdrm-openchrome%2Fcommit%2F%3Fh%3Ddrm-next-5.10%26id%3D64947142a1cf8b83faa73da7aa35a17f6a24568a&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=egysIQ%2BgxKi3ujKmzyRNRvI93SY%2BFI5dlW%2FvyGglAn8%3D&reserved=0
(scroll down to openchrome/openchrome_ttm.c)


     Regarding your suggestion that I should abandon OpenChrome DRM's current GEM / TTM implementation and instead I should use the GEM VRAM helpers, since I was able to figure out the issues with some suggestions from you and Dave, I do not think it is worth switching to GEM VRAM helpers at this point.
Obviously, the current implementation does not support acceleration, hence, it appears to be a good candidate for switching to GEM VRAM helpers.
While that may be true, I do plan to implement acceleration later, and this is why I do not want to settle with the GEM VRAM helpers.

Regards,

Kevin Brace
Brace Computer Laboratory blog
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbracecomputerlab.com%2F&data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&sdata=a11wNdmtVlgNBWHuBnxT%2BhCZRC%2BCtZw2xrVebNylSLU%3D&reserved=0


Sent: Monday, October 19, 2020 at 3:13 AM
From: "Christian König" <ckoenig.leichtzumerken@xxxxxxxxx>
To: "Kevin Brace" <kevinbrace@xxxxxxx>, "Dave Airlie" <airlied@xxxxxxxxx>, "dri-devel" <dri-devel@xxxxxxxxxxxxxxxxxxxxx>
Subject: Re: It appears drm-next TTM cleanup broke something . . .

Hi Kevin,

the basic problem you are facing is that ttm_tt_create/destroy is
mandatory (It always was). You need an implementation or otherwise you
won't be able to use the system domain (additional to the optional GTT
domain).

My best guess is that the difference is that we now force to initiate
the system domain for all drivers.

If that is correct you just that you never ran into because you never
correctly initialized TTM to support buffer moves.

I'm not sure what exactly the OpenChrome DRM driver is doing, but I
strongly suggest to just drop TTM support completely and use the GEM
VRAM helper layer instead.

Regards,
Christian.

Am 19.10.20 um 09:23 schrieb Kevin Brace:
Hi Dave,

Yeah, with the workaround I mentioned in my previous e-mail, OpenChrome DRM does not crash for "ttm_tt_create" member being null.
It is still not able to boot X Server due to some other TTM related memory allocation issue it is suffering from.
I think making huge changes to TTM during this development cycle broke OpenChrome DRM.
      Following up on the question I raised during the previous e-mail.
Shouldn't "use_tt" parameter being "false" for ttm_range_man_init() disable TTM TT functionality?
I feel like that should be the expected behavior.
Again, there is only 5 to 6 more days left until Linux 5.10-rc2, so I decided to contact you on Sunday (I consider this bug to be urgent.).
Assuming what I am asserting is correct, I think the reason why this was not discovered earlier was due to the following reasons.

1) nouveau, radeon, and amdgpu already use TTM TT functionality.
2) ast uses GEM VRAM helper that internally uses TTM. It populates "ttm_tt_create" and "ttm_tt_destroy" members, hence, the developers did not notice the breakage.
3) OpenChrome DRM is still not in the mainline tree, so no one other than myself noticed the problem until now.


Regarding the TTM TT functionality, OpenChrome DRM currently does not support acceleration, hence, I did not believe it was necessary to populate "ttm_tt_create" and "ttm_tt_destroy" members.
That implementation worked fine until the previous development cycle code.
Of course, I will eventually add support for acceleration, hence, TTM TT functionality will be utilized at some point.

Regards,

Kevin Brace
Brace Computer Laboratory blog
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbracecomputerlab.com%2F&amp;data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462965991&amp;sdata=a11wNdmtVlgNBWHuBnxT%2BhCZRC%2BCtZw2xrVebNylSLU%3D&amp;reserved=0


Sent: Sunday, October 18, 2020 at 12:50 PM
From: "Dave Airlie" <airlied@xxxxxxxxx>
To: "Kevin Brace" <kevinbrace@xxxxxxx>, "Christian König" <ckoenig.leichtzumerken@xxxxxxxxx>
Cc: "dri-devel" <dri-devel@xxxxxxxxxxxxxxxxxxxxx>, "Dave Airlie" <airlied@xxxxxxxxxx>
Subject: Re: It appears drm-next TTM cleanup broke something . . .

On Mon, 19 Oct 2020 at 05:15, Kevin Brace <kevinbrace@xxxxxxx> wrote:
Hi Dave,

It is a little urgent, so I am writing this right now.
As usual, I pulled in DRM repository code for an out of tree OpenChrome DRM repository a few days ago.
While going through the changes I need to make to OpenChrome DRM to compile with the latest Linux kernel, I noticed that ttm_bo_init_mm() was discontinued, and it was replaced with ttm_range_man_init().
ttm_range_man_init() has a parameter called "bool use_tt", but honestly, I do not think it is functioning correctly.
If I keep "ttm_tt_create" member of ttm_bo_driver struct null by not specifying it, TTM still tries to call it, and crashes due to a null pointer access.
The workaround I found so far is to specify the "ttm_tt_create" member by copying bo_driver_ttm_tt_create() from drm/drm_gem_vram_helper.c.
This is what the call trace looks like without specifying the "ttm_tt_create" member (i.e., this member is null).
cc'ing Christian,

I can't remember if we did this deliberately or if just worked by
accident previously.

Either way, you should probably need a ttm_tt_create going forward.

Dave.

_______________________________________________
. . .
kernel: [   34.310674] [drm:openchrome_bo_create [openchrome]] Entered openchrome_bo_create.
kernel: [   34.310697] [drm:openchrome_ttm_domain_to_placement [openchrome]] Entered openchrome_ttm_domain_to_placement.
kernel: [   34.310706] [drm:openchrome_ttm_domain_to_placement [openchrome]] Exiting openchrome_ttm_domain_to_placement.
kernel: [   34.310737] BUG: kernel NULL pointer dereference, address: 0000000000000000
kernel: [   34.310742] #PF: supervisor instruction fetch in kernel mode
kernel: [   34.310745] #PF: error_code(0x0010) - not-present page
. . .
kernel: [   34.310807] Call Trace:
kernel: [   34.310827]  ttm_tt_create+0x5f/0xa0 [ttm]
kernel: [   34.310839]  ttm_bo_validate+0xb8/0x140 [ttm]
kernel: [   34.310886]  ? drm_vma_offset_add+0x56/0x70 [drm]
kernel: [   34.310897]  ? openchrome_gem_create_ioctl+0x150/0x150 [openchrome]
. . .
_______________________________________________

The erroneous call to  "ttm_tt_create" member happens right after TTM placement is performed (openchrome_ttm_domain_to_placement()).
Currently, OpenChrome DRM's TTM implementation does not use "ttm_tt_create" member, and this arrangement worked fine until Linux 5.9's drm-next code.
It appears that Linux 5.10's drm-next code broke the code.

Regards,

Kevin Brace
Brace Computer Laboratory blog
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fbracecomputerlab.com%2F&amp;data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462975986&amp;sdata=oQr07H953zWryuBQJyPgFsqtOkXAtfprsxiA4ny%2F4LE%3D&amp;reserved=0

_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Fmailman%2Flistinfo%2Fdri-devel&amp;data=02%7C01%7Cchristian.koenig%40amd.com%7C053fff6809b14e2976d408d8744aee28%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637387212462975986&amp;sdata=gq2l57u0hZD2arzcWU2ppTuA0mgcshZHI%2BVvHYJ8hcs%3D&amp;reserved=0


_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/dri-devel




[Index of Archives]     [Linux DRI Users]     [Linux Intel Graphics]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]     [XFree86]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Linux Kernel]     [Linux SCSI]     [XFree86]
  Powered by Linux