SDMA out-of-bounds write access of tiled surface (was: Re: [amd-gfx] AMD Carrizo - GPU fault detected: 146 0x0842b714)

Hi Mads,

setting R600_DEBUG=nodma in the X server should work around your problem 
for now.

Marek, perhaps an out-of-bounds check for tiled texture memory access 
similar to the linear access check is necessary? I wonder if you've seen 
something about that in the docs.

I've annotated the sDMA IB dump. It's a linear-to-display-tiled copy on 
Carrizo. I tried to reproduce with the attached patch, but failed to do 
so even with amdgpu.vm_debug=1. With the patch, I get DMA copies that 
are identical to the one that causes the VM fault except for a different 
bank_height and macro_tile_aspect, so the issue is likely related to those.

Nicolai

On 21.06.2016 19:32, Nicolai Hähnle wrote:
> On 21.06.2016 19:16, Mads wrote:
>> I sent this 1.5 hours ago, but since it hasn't arrived on the
>> mailing list yet, I'll try again...
>
> It arrived, no worries :)
>
> I'll take a look later.
>
> Nicolai
>
>>
>> On 2016-06-21 17:48, Mads wrote:
>>
>>> On 2016-06-21 10:12, Mads wrote:
>>>
>>> On 2016-06-21 09:39, Nicolai Hähnle wrote:
>>>
>>> Thanks. However, I still don't think this is going to help. Your
>>> earlier trace experiments showed that the problematic SDMA commands
>>> came from the X server, _not_ from plasmashell.
>>>
>>> So what we see here is likely just the first set of GPU commands sent
>>> by plasmashell after the VM fault occurred. Since the plasmashell
>>> process is unable to tell who caused the VM fault, it takes the blame
>>> incorrectly. Are you sure the X server is using your self-compiled
>>> radeonsi_dri.so and has the environment variable set? If it creates a
>>> ddebug_dump, it might be somewhere else (it's based off the HOME
>>> environment variable, which may be different).
>>> I'll take a second look to see if there's an X dump there too, but
>>> unfortunately it'll be about 8 hours before I have the machine at
>>> hand again.
>>>
>>> And yes, I'm sure, everything is built through portage, so there is no
>>> "self-compiled" on the system per se. There's always just one lib
>>> available at any time :)
>>
>> You were right! X didn't have R600_DEBUG=check_vm in its environment
>> (no login shell/sourcing of /etc/profile).
>>
>> Here's what I ran:
>>
>>> $ XAUTHORITY=.Xauthority DISPLAY=:0 LIBGL_DEBUG=verbose dolphin
>>> libGL: pci id for fd 9: 1002:9874, driver radeonsi
>>> libGL: OpenDriver: trying /usr/lib64/dri/tls/radeonsi_dri.so
>>> libGL: OpenDriver: trying /usr/lib64/dri/radeonsi_dri.so
>>> si_vm_fault_occured: failed to parse line '                Either
>>> enable ECC checking or force module loading by setting
>>> 'ecc_enable_override'.
>>> '
>>> libGL: Using DRI3 for screen 0
>>> Trying to convert empty KLocalizedString to QString.
>>> Cannot creat accessible child interface for object:
>>> PlacesView(0x118d670)  index:  5
>>> QPixmap::scaled: Pixmap is a null pixmap
>>> QPixmap::scaled: Pixmap is a null pixmap
>>> (... etc ...)
>>> The X11 connection broke (error 1). Did the X11 server die?
>>
>> Attaching dmesg and ddebug_dump.
>>
>> - Mads

-------------- next part --------------
VM fault report.

Driver vendor: X.Org
Device vendor: AMD
Device name: AMD CARRIZO (DRM 3.1.0 / 4.6.2-gentoo, LLVM 3.9.0)

Failing VM page: 0x00101508

Buffer list (in units of pages = 4kB):
        Size    VM start page         VM end page           Usage
           8    0x0000000100035       0x000000010003d       IB1
         843    -- hole --
         975    0x0000000100388       0x0000000100757       SDMA_BUFFER
        2473    -- hole --
        1032    0x0000000101100       0x0000000101508       SDMA_BUFFER

Note: The holes represent memory not used by the IB.
      Other buffers can still be allocated there.

------------------ sDMA IB begin ------------------
 00000501 COPY, TILED_SUB_WINDOW
 01100000 tiled_address_lo
 00000001 tiled_address_hi
 001d0000 tiled_x = 0, tiled_y = 29
 00ab0000 tiled_z = 0, pitch_tile_max = 0xab = 171
 0000407f slice_tile_max = 0x407f = 16511
 02481822
 00388000 linear_address_lo
 00000001 linear_address_hi
 00000000 linear_x = 0, linear_y = 0
 057f0000 linear_z = 0, linear_pitch = 0x580 = 1408
 000f3b7f linear_slice_pitch = 0xf3b80 = 998272
 02c40555 copy_width_aligned = 0x556 = 1366, copy_height = 709
 00000000 copy_depth = 1
 00000000 NOP
------------------- sDMA IB end -------------------
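The annotations in the dump above can be reproduced with a small decoder. This is only a sketch: the field layout is inferred from the annotated values in this dump (simple 16-bit halves, with the pitch and size fields stored as "value minus one"), not taken from hardware documentation, and the real fields may be narrower. The names IB and decode_tiled_sub_window are made up for illustration.

```python
# Dwords of the faulting COPY/TILED_SUB_WINDOW packet, copied from the dump
# (the trailing NOP dword is left out).
IB = [
    0x00000501, 0x01100000, 0x00000001, 0x001d0000, 0x00ab0000,
    0x0000407f, 0x02481822, 0x00388000, 0x00000001, 0x00000000,
    0x057f0000, 0x000f3b7f, 0x02c40555, 0x00000000,
]


def lo16(dw):
    """Low 16 bits of a dword."""
    return dw & 0xFFFF


def hi16(dw):
    """High 16 bits of a dword."""
    return (dw >> 16) & 0xFFFF


def decode_tiled_sub_window(ib):
    """Decode the packet using the layout implied by the annotations.

    Assumption: 16-bit halves are wide enough for every field; pitch,
    slice pitch, width, height and depth are stored minus one.
    """
    return {
        "tiled_address": (ib[2] << 32) | ib[1],
        "tiled_x": lo16(ib[3]),
        "tiled_y": hi16(ib[3]),
        "tiled_z": lo16(ib[4]),
        "pitch_tile_max": hi16(ib[4]),
        "slice_tile_max": ib[5],
        "tile_bits": ib[6],
        "linear_address": (ib[8] << 32) | ib[7],
        "linear_x": lo16(ib[9]),
        "linear_y": hi16(ib[9]),
        "linear_z": lo16(ib[10]),
        "linear_pitch": hi16(ib[10]) + 1,
        "linear_slice_pitch": ib[11] + 1,
        "copy_width": lo16(ib[12]) + 1,
        "copy_height": hi16(ib[12]) + 1,
        "copy_depth": lo16(ib[13]) + 1,
    }


pkt = decode_tiled_sub_window(IB)
for name, value in pkt.items():
    print(f"{name:20s} = {value:#x} ({value})")
```

Shifting the two decoded addresses right by 12 bits gives the start pages of the two SDMA_BUFFER entries in the buffer list, which is a quick sanity check that the dwords were read in the right order.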

linear_height = 709
log(bpe) = 2, bpe = 4
array_mode = 4 (ARRAY_2D_TILED_THIN1)
micro_tile_mode = 0 (DISPLAY_MICRO_TILING)
log(tile_split) = 3
bank_width = 0
bank_height = 2
num_banks = 2
macro_tile_aspect = 2
pipe_config = 0

tiled_pitch = 172 * 8 = 1376
tiled_slice_pitch = 16512 * 64 = 1056768
-> tiled_height = 768
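The derived sizes can be cross-checked against the buffer list with a few lines of arithmetic. The sketch below assumes 8x8-pixel micro tiles (which is what the "* 8" and "* 64" factors above imply) and treats "VM end page" as exclusive, since start page plus size equals end page for every buffer in the list. All variable names are illustrative.

```python
PAGE_SIZE = 4096  # the buffer list is in units of 4 kB pages

# Values taken from the packet and the notes in this dump.
pitch_tile_max = 0xAB
slice_tile_max = 0x407F
bpe = 4
tiled_y = 29
copy_height = 709

# Tiled surface dimensions, assuming 8x8 micro tiles.
tiled_pitch = (pitch_tile_max + 1) * 8          # pixels per row
tiled_slice_pitch = (slice_tile_max + 1) * 64   # pixels per slice
tiled_height = tiled_slice_pitch // tiled_pitch
surface_pages = tiled_slice_pitch * bpe // PAGE_SIZE

# Destination buffer and fault address from the report.
buf_start_page = 0x101100
buf_end_page = 0x101508   # assumed exclusive
failing_page = 0x00101508

last_row = tiled_y + copy_height  # first row below the copied window

print("tiled_pitch      =", tiled_pitch)
print("tiled_height     =", tiled_height)
print("surface_pages    =", surface_pages)
print("copy rows        =", tiled_y, "..", last_row - 1)
print("window in bounds =", last_row <= tiled_height)
print("surface fills buffer =",
      buf_start_page + surface_pages == buf_end_page)
print("fault is first page past buffer =", failing_page == buf_end_page)
```

Under these assumptions the copy window ends at row 738 of 768, the surface exactly fills the 1032-page buffer, and the failing page is the first page after it. So the nominal coordinates are in bounds, and the write past the end has to come from how the engine maps them to addresses.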

My Carrizo: tile bits 01401822
bank_height = 0
num_banks = 2
macro_tile_aspect = 1
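To compare the two tile-info dwords field by field, a bitfield decoder along these lines may help. The bit positions are an assumption: they were chosen so that 0x02481822 decodes to the values listed above and 0x01401822 to the "My Carrizo" values, and have not been checked against hardware documentation. Values are the raw encoded fields, as in the dump. FIELDS, decode_tile_bits and diff_tile_bits are hypothetical names.

```python
# (name, shift, width); layout inferred from the annotated values above.
FIELDS = [
    ("log_bpe", 0, 3),
    ("array_mode", 3, 4),
    ("micro_tile_mode", 8, 3),
    ("log_tile_split", 11, 3),
    ("bank_width", 15, 2),
    ("bank_height", 18, 2),
    ("num_banks", 21, 2),
    ("macro_tile_aspect", 24, 2),
    ("pipe_config", 26, 5),
]


def decode_tile_bits(dw):
    """Split a tile-info dword into its raw (encoded) fields."""
    return {
        name: (dw >> shift) & ((1 << width) - 1)
        for name, shift, width in FIELDS
    }


def diff_tile_bits(a, b):
    """Return only the fields whose raw values differ between a and b."""
    da, db = decode_tile_bits(a), decode_tile_bits(b)
    return {name: (da[name], db[name]) for name in da if da[name] != db[name]}


faulting = 0x02481822      # from the faulting packet
reproduction = 0x01401822  # from the reproduction attempt

print(decode_tile_bits(faulting))
print(diff_tile_bits(faulting, reproduction))
```

With this layout the only differing fields are bank_height and macro_tile_aspect, which agrees with the observation in the message body.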

SDMA Dump Done.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: reproduction-attempt.patch
Type: text/x-patch
Size: 3783 bytes
Desc: not available
URL: <https://lists.freedesktop.org/archives/amd-gfx/attachments/20160622/c49aae43/attachment.bin>

