SDMA out-of-bounds write access of tiled surface (was: Re: [amd-gfx] AMD Carrizo - GPU fault detected: 146 0x0842b714)

alexdeucher@xxxxxxxxx (Alex Deucher) · Wed, 22 Jun 2016 09:33:56 -0400

On Wed, Jun 22, 2016 at 8:21 AM, Marek Olšák <maraeo at gmail.com> wrote:
> I don't think so.
>
> The VM faults can only occur when accessing the linear texture, and
> the Mesa code should use the correct workarounds already.
>
> The tiled texture is just a collection of 1D tiles (8x8 pixels) and
> SDMA operates on those 1D tiles. It doesn't access memory outside the
> boundaries of the 1D tiles it's supposed to access. 2D tiling is just a
> different ordering of 1D tiles with greater alignment requirements.
> The 2D tile parameters such as bank_height and macro_tile_aspect only
> affect that ordering. 1D tiles are always the same regardless of the
> higher tile mode. Given that, I don't see how SDMA can behave
> differently here.
>
> There are 2 possible explanations for VM faults from tiled access:
> - The tile parameters passed to SDMA don't agree with the parameters
> determined by addrlib. (or there can be a bug in passing those between
> processes)
> - Unknown or undiscovered SDMA bug.
>
> Note that no docs describe the VM fault bug from linear access.
>
> If you both have Carrizo, you should get the same 2D tile parameters.
> If you don't, it's weird.
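
To make the 1D-tile argument above concrete, here is a minimal sketch of a
1D-tiled address calculation. It assumes tiles are stored row-major and that
pixels inside a tile are also row-major; real hardware orders the pixels
inside a tile differently, so this is a model, not the SDMA or addrlib
algorithm. The function names are hypothetical.

```python
TILE_W = TILE_H = 8  # 1D tile dimensions in pixels, as described above


def tile_offset(x, y, pitch_px, bytes_per_px):
    """Byte offset of pixel (x, y) in a simplified 1D-tiled surface.

    Assumption: tiles are laid out row-major, and so are the pixels inside
    each tile. Real hardware uses a different order inside a tile.
    """
    if pitch_px % TILE_W:
        raise ValueError("pitch must be a multiple of the tile width")
    tiles_per_row = pitch_px // TILE_W
    tile_index = (y // TILE_H) * tiles_per_row + (x // TILE_W)
    tile_bytes = TILE_W * TILE_H * bytes_per_px
    in_tile = (y % TILE_H) * TILE_W + (x % TILE_W)
    return tile_index * tile_bytes + in_tile * bytes_per_px


def tile_byte_range(tx, ty, pitch_px, bytes_per_px):
    """Half-open byte range [start, end) covered by tile (tx, ty)."""
    tile_bytes = TILE_W * TILE_H * bytes_per_px
    start = (ty * (pitch_px // TILE_W) + tx) * tile_bytes
    return start, start + tile_bytes
```

In this model every pixel of a tile lands inside that tile's byte range, and
a 2D tile mode would only permute which range a given tile occupies.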

The row size varies based on the memory configuration and the number
of banks populated.  It might be worth adjusting the row size in
gfx_v8_0_gpu_early_init() to see if that helps reproduce the issue.
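
As a rough illustration of what "adjusting the row size" means, the sketch
below models how a row size in KB could be derived and encoded. Both the
formula and the encoding table are assumptions written from memory of the
discrete-GPU path, and an APU such as Carrizo may derive the value
differently, so check them against gfx_v8_0_gpu_early_init() before relying
on them. The Python names are hypothetical.

```python
# Row size in KB -> register field value (assumed encoding, verify in the driver).
ROW_SIZE_FIELD = {1: 0, 2: 1, 4: 2}


def row_size_kb_from_noofcols(noofcols):
    """Row size in KB from a column-count field, capped at 4 KB (assumed formula)."""
    return min((4 * (1 << (8 + noofcols))) // 1024, 4)


def row_size_field(row_size_kb):
    """Field value for a row size; unknown sizes fall back to the 1 KB encoding."""
    return ROW_SIZE_FIELD.get(row_size_kb, 0)


if __name__ == "__main__":
    # List the candidate row sizes one might force while trying to reproduce.
    for kb in sorted(ROW_SIZE_FIELD):
        print(f"row size {kb} KB -> field value {row_size_field(kb)}")
```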

Alex

>
> Marek
>
> On Wed, Jun 22, 2016 at 9:50 AM, Nicolai Hähnle <nhaehnle at gmail.com> wrote:
>> Hi Mads,
>>
>> setting R600_DEBUG=nodma in the X server should work around your problem for
>> now.
>>
>> Marek, perhaps an out-of-bounds check for tiled texture memory access
>> similar to the linear access check is necessary? I wonder if you've seen
>> something about that in the docs.
>>
>> I've annotated the SDMA IB dump. It's a linear-to-display-tiled copy on
>> Carrizo. I tried to reproduce with the attached patch, but failed to do so
>> even with amdgpu.vm_debug=1. With the patch, I get DMA copies that are
>> identical to the one that causes the VM fault except for a different
>> bank_height and macro_tile_aspect, so the issue is likely related to those.
>>
>> Nicolai
>>
>> On 21.06.2016 19:32, Nicolai Hähnle wrote:
>>>
>>> On 21.06.2016 19:16, Mads wrote:
>>>>
>>>> I sent this 1.5 hours ago, but since it hasn't arrived on the
>>>> mailing list yet, I'll try again...
>>>
>>>
>>> It arrived, no worries :)
>>>
>>> I'll take a look later.
>>>
>>> Nicolai
>>>
>>>>
>>>> On 2016-06-21 17:48, Mads wrote:
>>>>
>>>>> On 2016-06-21 10:12, Mads wrote:
>>>>>
>>>>> On 2016-06-21 09:39, Nicolai Hähnle wrote:
>>>>>
>>>>> Thanks. However, I still don't think this is going to help. Your
>>>>> earlier trace experiments showed that the problematic SDMA commands
>>>>> came from the X server, _not_ from plasmashell.
>>>>>
>>>>> So what we see here is likely just the first set of GPU commands sent
>>>>> by plasmashell after the VM fault occurred. Since the plasmashell
>>>>> process is unable to tell who caused the VM fault, it takes the blame
>>>>> incorrectly. Are you sure the X server is using your self-compiled
>>>>> radeonsi_dri.so and has the environment variable set? If it creates a
>>>>> ddebug_dump, it might be somewhere else (it's based off the HOME
>>>>> environment variable, which may be different).
>>>>> I'll take a second look to see if there's an X dump there too, but
>>>>> unfortunately it'll be about 8 hours before I have the machine at
>>>>> hand again.
>>>>>
>>>>> And yes, I'm sure, everything is built through portage, so there is no
>>>>> "self-compiled" on the system per se. There's always just one lib
>>>>> available at any time :)
>>>>
>>>>
>>>> You were right! X didn't have R600_DEBUG=check_vm in environment (no
>>>> login shell/sourcing of /etc/profile).
>>>>
>>>> Here's what I ran:
>>>>
>>>>> $ XAUTHORITY=.Xauthority DISPLAY=:0 LIBGL_DEBUG=verbose dolphin
>>>>> libGL: pci id for fd 9: 1002:9874, driver radeonsi
>>>>> libGL: OpenDriver: trying /usr/lib64/dri/tls/radeonsi_dri.so
>>>>> libGL: OpenDriver: trying /usr/lib64/dri/radeonsi_dri.so
>>>>> si_vm_fault_occured: failed to parse line '                Either
>>>>> enable ECC checking or force module loading by setting
>>>>> 'ecc_enable_override'.
>>>>> '
>>>>> libGL: Using DRI3 for screen 0
>>>>> Trying to convert empty KLocalizedString to QString.
>>>>> Cannot creat accessible child interface for object:
>>>>> PlacesView(0x118d670)  index:  5
>>>>> QPixmap::scaled: Pixmap is a null pixmap
>>>>> QPixmap::scaled: Pixmap is a null pixmap
>>>>> (... etc ...)
>>>>> The X11 connection broke (error 1). Did the X11 server die?
>>>>
>>>>
>>>> Attaching dmesg and ddebug_dump.
>>>>
>>>> - Mads
>>
>>
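
Since the problem above turned out to be the X server running without
R600_DEBUG in its environment, a quick check can save a round trip. The
sketch below reads a process's initial environment from /proc/<pid>/environ
on Linux and lists its R600_DEBUG flags. It assumes the flags are
comma-separated, and reading another user's process (X usually runs as root)
needs matching privileges. The helper names are hypothetical.

```python
def parse_environ(raw: bytes) -> dict:
    """Parse the NUL-separated KEY=VALUE entries of /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        key, sep, value = entry.partition(b"=")
        if sep:  # skip empty or malformed entries
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def debug_flags(env: dict) -> list:
    """R600_DEBUG flags as a list (assumes comma-separated values)."""
    return [flag for flag in env.get("R600_DEBUG", "").split(",") if flag]


def process_environ(pid: int) -> dict:
    """Environment the process was started with (Linux only)."""
    with open(f"/proc/{pid}/environ", "rb") as f:
        return parse_environ(f.read())
```

For example, debug_flags(process_environ(pid)) with the X server's PID
returns an empty list when the variable never reached the server.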