SDMA out-of-bounds write access of tiled surface

nhaehnle@xxxxxxxxx (Nicolai Hähnle) · Wed, 22 Jun 2016 11:15:56 +0200

On 22.06.2016 09:53, Christian König wrote:
> Hi Nicolai,
>
> If we don't already have an option for this, try doubling the size of
> the VM area allocated for each BO in userspace.
>
> That should give you a nice hole between each BO and so should help to
> catch cases when somebody writes over the end of a BO.

Tried that (+ forcing the buffer cache to re-use BOs only with the exact 
size), but no change in observed behavior.
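
For readers following along, the guard-hole idea can be sketched roughly as
below. This is only an illustration of the concept, not the actual winsys or
kernel code: the names (Mapping, place_with_holes, classify), the base
address and the 4 KiB alignment are all assumptions made for the example.
Each buffer reserves twice its padded size, so a write just past the end of
a buffer lands in an unmapped hole instead of in a neighbouring BO.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    name: str
    start: int     # first virtual address of the reserved range
    size: int      # bytes actually backed by the buffer (padded)
    reserved: int  # bytes of address space reserved (2 * size here)


def place_with_holes(sizes, base=0x100000, align=0x1000):
    """Lay out buffers back to back, reserving twice each padded size.

    The second half of every reservation stays unmapped, which is the
    'hole' that should turn an overrun into a visible fault.
    `base` and `align` are illustrative values, not hardware facts.
    """
    mappings = []
    cursor = base
    for name, size in sizes:
        padded = (size + align - 1) // align * align
        reserved = 2 * padded
        mappings.append(Mapping(name, cursor, padded, reserved))
        cursor += reserved
    return mappings


def classify(addr, mappings):
    """Say whether addr is inside a buffer, in the hole after one, or neither."""
    for m in mappings:
        if m.start <= addr < m.start + m.size:
            return ("inside", m.name)
        if m.start + m.size <= addr < m.start + m.reserved:
            return ("hole_after", m.name)
    return ("unmapped", None)
```

With such a layout, a faulting address reported in dmesg could be passed to
classify() to see which buffer, if any, was overrun.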

Cheers,
Nicolai

>
> Regards,
> Christian.
>
> On 22.06.2016 at 09:50, Nicolai Hähnle wrote:
>> Hi Mads,
>>
>> setting R600_DEBUG=nodma in the X server should work around your
>> problem for now.
>>
>> Marek, perhaps an out-of-bounds check for tiled texture memory access
>> similar to the linear access check is necessary? I wonder if you've
>> seen something about that in the docs.
>>
>> I've annotated the sDMA IB dump. It's a linear-to-display-tiled copy
>> on Carrizo. I tried to reproduce with the attached patch, but failed
>> to do so even with amdgpu.vm_debug=1. With the patch, I get DMA copies
>> that are identical to the one that causes the VM fault except for a
>> different bank_height and macro_tile_aspect, so the issue is likely
>> related to those.
>>
>> Nicolai
>>
>> On 21.06.2016 19:32, Nicolai Hähnle wrote:
>>> On 21.06.2016 19:16, Mads wrote:
>>>> I sent this 1.5 hours ago, but since it hasn't arrived on the
>>>> mailing list yet, I'll try again...
>>>
>>> It arrived, no worries :)
>>>
>>> I'll take a look later.
>>>
>>> Nicolai
>>>
>>>>
>>>> On 2016-06-21 17:48, Mads wrote:
>>>>
>>>>> On 2016-06-21 10:12, Mads wrote:
>>>>>
>>>>> On 2016-06-21 09:39, Nicolai Hähnle wrote:
>>>>>
>>>>> Thanks. However, I still don't think this is going to help. Your
>>>>> earlier trace experiments showed that the problematic SDMA commands
>>>>> came from the X server, _not_ from plasmashell.
>>>>>
>>>>> So what we see here is likely just the first set of GPU commands sent
>>>>> by plasmashell after the VM fault occurred. Since the plasmashell
>>>>> process is unable to tell who caused the VM fault, it takes the blame
>>>>> incorrectly. Are you sure the X server is using your self-compiled
>>>>> radeonsi_dri.so and has the environment variable set? If it creates a
>>>>> ddebug_dump, it might be somewhere else (it's based off the HOME
>>>>> environment variable, which may be different).
>>>>>
>>>>> I'll take a second look to see if there's an X dump there too, but
>>>>> unfortunately it'll be about 8 hours before I have the machine at
>>>>> hand again.
>>>>>
>>>>> And yes, I'm sure, everything is built through portage, so there is no
>>>>> "self-compiled" on the system per se. There's always just one lib
>>>>> available at any time :)
>>>>
>>>> You were right! X didn't have R600_DEBUG=check_vm in environment (no
>>>> login shell/sourcing of /etc/profile).
>>>>
>>>> Here's what I ran:
>>>>
>>>>> $ XAUTHORITY=.Xauthority DISPLAY=:0 LIBGL_DEBUG=verbose dolphin
>>>>> libGL: pci id for fd 9: 1002:9874, driver radeonsi
>>>>> libGL: OpenDriver: trying /usr/lib64/dri/tls/radeonsi_dri.so
>>>>> libGL: OpenDriver: trying /usr/lib64/dri/radeonsi_dri.so
>>>>> si_vm_fault_occured: failed to parse line ' Either
>>>>> enable ECC checking or force module loading by setting
>>>>> 'ecc_enable_override'.
>>>>> '
>>>>> libGL: Using DRI3 for screen 0
>>>>> Trying to convert empty KLocalizedString to QString.
>>>>> Cannot creat accessible child interface for object:
>>>>> PlacesView(0x118d670)  index:  5
>>>>> QPixmap::scaled: Pixmap is a null pixmap
>>>>> QPixmap::scaled: Pixmap is a null pixmap
>>>>> (... etc ...)
>>>>> The X11 connection broke (error 1). Did the X11 server die?
>>>>
>>>> Attaching dmesg and ddebug_dump.
>>>>
>>>> - Mads
>
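
The environment problem found earlier in the thread (the X server was started
without R600_DEBUG=check_vm) is the kind of thing that can be checked directly
on Linux by reading /proc/<pid>/environ, which holds the NUL-separated
environment the process was started with. The sketch below is a minimal
illustration; the helper names are made up for this example, and treating
R600_DEBUG as a comma-separated flag list is an assumption.

```python
from pathlib import Path


def parse_environ(raw: bytes) -> dict:
    """Parse the NUL-separated KEY=VALUE blob found in /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        if not entry:
            continue
        key, sep, value = entry.partition(b"=")
        if sep:
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def has_debug_flag(env: dict, flag: str, var: str = "R600_DEBUG") -> bool:
    """Return True if `flag` appears in the comma-separated variable `var`."""
    return flag in [f.strip() for f in env.get(var, "").split(",")]


def read_environ(pid: int) -> dict:
    """Read the initial environment of a running process (needs permission)."""
    return parse_environ(Path(f"/proc/{pid}/environ").read_bytes())
```

For example, has_debug_flag(read_environ(pid_of_x), "check_vm") would show
whether the X server (with pid_of_x being its process ID) was launched with
the flag set. Reading another user's process usually requires root.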