On 11/09/2010 10:53 AM, Thomas Hellstrom wrote:
On 11/09/2010 10:29 AM, Markus Trippelsdorf wrote:
On Mon, Nov 08, 2010 at 11:29:16PM +0100, Thomas Hellstrom wrote:
On 11/08/2010 09:53 PM, Jerome Glisse wrote:
On Mon, Nov 8, 2010 at 2:02 PM, Markus Trippelsdorf
<markus@xxxxxxxxxxxxxxx> wrote:
On Mon, Nov 08, 2010 at 07:43:02PM +0100, Markus Trippelsdorf wrote:
On Mon, Nov 08, 2010 at 06:07:37PM +0100, Markus Trippelsdorf wrote:
On Mon, Nov 08, 2010 at 06:02:21PM +0100, Markus Trippelsdorf wrote:
I can trigger a kernel crash on my system by simply loading this png image with firefox:
http://mediaarchive.cern.ch/MediaArchive/Photo/Public/2010/1011251/1011251_01/1011251_01-A4-at-144-dpi.jpg
Sorry, the above link is wrong, this is the right one (that triggers the crash):
http://cdsweb.cern.ch/record/1305179/files/HI-150431-630470-huge.png
I triggered it a few more times and took the attached picture.
It points to the BUG() call at drivers/gpu/drm/ttm/ttm_bo.c:1628 .
(Sorry for the bad picture quality)
And here is the same BUG in plaintext (should be a bit easier to read):
Nov 8 19:28:23 arch kernel: ------------[ cut here ]------------
Nov 8 19:28:23 arch kernel: kernel BUG at drivers/gpu/drm/ttm/ttm_bo.c:1628!
Thomas, this bug seems to point to a case where we end up trying to add an entry at the same offset in the rb tree for addr_space_mm. After carefully reviewing the locking around the rb tree modification & addr_space_mm, I am fairly confident that no race can occur. Would you have any idea what might be going wrong here? I guess I would ultimately need to dump the mm & rb tree state when the BUG gets triggered to try to understand the state of things.
I agree there shouldn't be a race in this case.
The locking around these operations is simple and straightforward.
So this IMHO should either be a memory corruption or a bug in the
range manager. I've never seen this BUG trigger before. Dumping mm /
rb tree contents or bisecting should probably find the culprit.
OK I've found the buggy commit by bisection:
e376573f7267390f4e1bdc552564b6fb913bce76 is the first bad commit
commit e376573f7267390f4e1bdc552564b6fb913bce76
Author: Michel Dänzer <daenzer@xxxxxxxxxx>
Date: Thu Jul 8 12:43:28 2010 +1000
drm/radeon: fall back to GTT if bo creation/validation in VRAM fails.
This fixes a problem where on low VRAM cards we'd run out of space for validation.
[airlied: Tested on my M7, Thinkpad T42, compiz works with no problems.]
Signed-off-by: Michel Dänzer <daenzer@xxxxxxxxxx>
Cc: stable@xxxxxxxxxx
Signed-off-by: Dave Airlie <airlied@xxxxxxxxxx>
Please note that this is an old commit from 2.6.36-rc. When I revert it, the kernel no longer crashes. Instead I see the following in my dmesg:
Hmm, so this sounds like something in the Radeon eviction error path
is causing corruption.
I had a similar problem with vmwgfx when I tried to unref a BO _after_ ttm_bo_init() failed. ttm_bo_init() is really supposed to call unref itself, for various reasons, so calling unref() or kfree() after a failed ttm_bo_init() will cause corruption.
In any case, the error below also suggests something is a bit fragile
in the Radeon driver:
First, an accelerated eviction may fail, like in the message below,
but then there must always be a backup plan, like unaccelerated
eviction to system. On BO creation, there are a number of placement
strategies, but if all else fails, it should be possible to initially
place the BO in system memory.
Second, if bo validation fails during a command submission due to insufficient VRAM / TT, then the driver should retry the complete validation cycle after first blocking all other validators and then evicting everything not pinned, to avoid failures due to fragmentation.
/Thomas
Indeed, it seems like the commit you mention just retries ttm_bo_init()
after it previously failed. At that point the bo has been destroyed, so
that is probably what's causing the BUG you are seeing.
Admittedly, ttm_bo_init() calling unref on failure is not properly documented in the function description. The reason for doing so is to have a single path for freeing all BO resources already allocated at the point of failure.
/Thomas
[TTM] Failed to find memory space for buffer 0xffff880113e10e48 eviction.
[TTM] No space for ffff880113e10e48 (25650 pages, 102600K, 100M)
[TTM] placement[0]=0x00070002 (1)
[TTM] has_type: 1
[TTM] use_type: 1
[TTM] flags: 0x0000000A
[TTM] gpu_offset: 0xA0000000
[TTM] size: 131072
[TTM] available_caching: 0x00070000
[TTM] default_caching: 0x00010000
[TTM] 0x00000000-0x00000001: 1: used
[TTM] 0x00000001-0x00000011: 16: used
[TTM] 0x00000011-0x00000111: 256: used
[TTM] 0x00000111-0x00000211: 256: used
[TTM] 0x00000211-0x00000248: 55: free
[TTM] 0x00000248-0x0000024c: 4: used
[TTM] 0x0000024c-0x00001976: 5930: free
[TTM] 0x00001976-0x000021aa: 2100: used
[TTM] 0x000021aa-0x0000285f: 1717: free
[TTM] 0x0000285f-0x00002860: 1: used
[TTM] 0x00002860-0x00002873: 19: free
[TTM] 0x00002873-0x000029b3: 320: used
[TTM] 0x000029b3-0x00020000: 120397: free
[TTM] total: 131072, used 2954 free 128118
[drm:radeon_cs_ioctl] *ERROR* Failed to parse relocation -12!
radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
[drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object
(117555200, 4, 4096, -12)
radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
[drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object
(117555200, 4, 4096, -12)
radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
[drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object
(117555200, 4, 4096, -12)
radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
[drm:radeon_gem_object_create] *ERROR* Failed to allocate GEM object
(117555200, 4, 4096, -12)
radeon 0000:01:05.0: object_init failed for (117555200, 0x00000004)
...
And the following in the xorg log buffer:
Failed to alloc memory
Failed to allocat:
size: : 117555200 bytes
alignment : 0 bytes
domains : 4
...
_______________________________________________
dri-devel mailing list
dri-devel@xxxxxxxxxxxxxxxxxxxxx
http://lists.freedesktop.org/mailman/listinfo/dri-devel