Re: [PATCH] drm/radeon: avoid page fault during gpu reset

Christian König <ckoenig.leichtzumerken@xxxxxxxxx> · Tue, 28 Jan 2020 17:22:31 +0100

    On 28.01.20 at 14:15, Andreas Messer wrote:

      On Sat, Jan 25, 2020 at 07:01:36PM +0000, Koenig, Christian wrote:

On 25.01.2020 at 19:47, Andreas Messer <andi@xxxxxxxxxxxx> wrote:
When backing up a ring, validate pointer to avoid page fault.
[ cut description / kernel messages ] 

NAK, that was suggested multiple times now and is essentially the wrong
approach.

The problem is that the value is invalid because the hardware is not
functional any more. Returning here without backing up the ring just
papers over the real problem.

This is just the first occurrence of this and you would need to fix a
couple of hundred register accesses (both inside and outside of the
driver) to make that really work reliably.
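
For illustration, here is a minimal user-space sketch of the kind of check under discussion. It is not the radeon code: `struct demo_ring`, `demo_ring_backup()` and the field names are hypothetical, and the ring is assumed to have a power-of-two size. It shows how a read pointer coming back from non-functional hardware can be rejected before it is used to index the ring, and also why this guards only one access out of many.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified ring; the size in dwords is a power of two. */
struct demo_ring {
	const uint32_t *buf;	/* ring contents */
	uint32_t ptr_mask;	/* number of dwords - 1 */
	uint32_t wptr;		/* write pointer, maintained by the CPU */
};

/*
 * Copy the dwords in [rptr, wptr) into out, wrapping at the ring end.
 * rptr stands in for a value read back from the hardware.
 * Returns the number of dwords copied, or -1 if rptr lies outside the
 * ring or out is too small.
 */
static int demo_ring_backup(const struct demo_ring *ring, uint32_t rptr,
			    uint32_t *out, size_t out_len)
{
	uint32_t count, i;

	/* A dead device tends to return garbage such as 0xFFFFFFFF. */
	if (rptr > ring->ptr_mask)
		return -1;

	count = (ring->wptr - rptr) & ring->ptr_mask;
	if (count > out_len)
		return -1;

	for (i = 0; i < count; i++) {
		out[i] = ring->buf[rptr];
		rptr = (rptr + 1) & ring->ptr_mask;
	}
	return (int)count;
}
```

With an 8-dword ring, `wptr = 2` and `rptr = 6`, the sketch copies four dwords across the wrap-around; with `rptr = 0xFFFFFFFF` it returns -1 instead of reading far outside the buffer.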

      Sure, it won't fix the hardware. But since the page fault is the most prominent
part of the kernel log, people will continue suggesting it. With that change,
the kernel messages are full of ring and atom bios timeouts, which might make
users more likely to consider a hardware issue in the first place.

    That is correct, but the problem is that we currently have 2209
    places where we read a register and usually expect the values to
    be in a valid range.

    If you really want to avoid all crashes you would need to audit and
    fix all occurrences where, for example, the register value is used
    as an index into an array or similar.
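
    As a sketch of what fixing a single such occurrence involves (in plain C, with a hypothetical table and function name, not taken from the driver): reads from a PCI device that no longer responds typically come back as all ones, so an unchecked value used as an index lands far outside the array.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical lookup table indexed by a value read from a register. */
static const uint32_t demo_clock_khz[] = { 100000, 300000, 600000, 900000 };

/*
 * Translate a register value into a table entry.
 * Returns 0 and stores the entry in *clock_khz on success, or -1 if the
 * register value cannot be a valid index (for example 0xFFFFFFFF).
 */
static int demo_lookup_clock(uint32_t reg_val, uint32_t *clock_khz)
{
	if (reg_val >= DEMO_ARRAY_SIZE(demo_clock_khz))
		return -1;

	*clock_khz = demo_clock_khz[reg_val];
	return 0;
}
```

    Every caller then has to handle the error return as well, which is part of why repeating this at each of the register reads counted above is so much work.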

    And the radeon code is only the beginning; the whole PCIe subsystem
    would need a similar audit. That is a huge amount of work we are
    not willing to do.

       Anyway:

        The only advice I can give you is to replace the hardware. From
experience those symptoms mean that your GPU will die rather soon.

      I think my hardware is fine. I have monitored the GPU temperature and fan
PWM for a while now and found the PWM driven at only ~60% even though the
GPU already got quite hot during gameplay. When forcing the PWM to ~80%,
no crash occurs anymore. I suppose it is not the GPU that is crashing but
the VRMs, which are not getting enough airflow.

I have compared the BIOS fan tables of my card with those from the BIOSes
of other cards (downloaded from the web) with the same GPU type and a
similar design. Although they differ in cooler construction and fan, all
of them except one model have exactly the same fan regulation points, with
PWMHigh at 80% for 90°C. The one model with different settings has 100% for
this temperature and a generally much saner-looking regulation curve.

I suppose most of the vendors just copied some reference design;
maybe the vendor's Windows driver adjusts the curve to a better one,
I don't know.

I think I'll add some sysfs attributes or a module parameter to adjust
the curve to my needs.
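
To make the idea of regulation points concrete, here is a rough sketch of a fan curve as piecewise-linear interpolation between (temperature, PWM) points. It is an assumption about how such a curve could be evaluated, not how the BIOS table or the driver actually does it; `struct demo_fan_point` and `demo_fan_pwm()` are hypothetical names, and only the 90°C point with 80% versus 100% comes from the comparison above.

```c
#include <stddef.h>

/* One regulation point: at temp_c degrees Celsius, drive the fan at pwm_pct. */
struct demo_fan_point {
	int temp_c;
	int pwm_pct;
};

/*
 * Evaluate a fan curve by linear interpolation between neighbouring
 * points. pts must be sorted by strictly increasing temp_c.
 * Below the first point the first PWM value is used, above the last
 * point the last one. An empty table yields 100% as a fail-safe.
 */
static int demo_fan_pwm(const struct demo_fan_point *pts, size_t n, int temp_c)
{
	size_t i;

	if (n == 0)
		return 100;
	if (temp_c <= pts[0].temp_c)
		return pts[0].pwm_pct;

	for (i = 1; i < n; i++) {
		if (temp_c <= pts[i].temp_c) {
			int dt = pts[i].temp_c - pts[i - 1].temp_c;
			int dp = pts[i].pwm_pct - pts[i - 1].pwm_pct;

			return pts[i - 1].pwm_pct +
			       dp * (temp_c - pts[i - 1].temp_c) / dt;
		}
	}
	return pts[n - 1].pwm_pct;
}
```

With made-up lower points of 20% at 40°C and 40% at 65°C, changing the top point from 80% to 100% at 90°C raises the whole upper segment of the curve, not just the value at 90°C.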

    The issue is that this is most likely not a temperature problem at
    all. If you have a temperature problem the ASIC usually just hangs
    in a shader or so, but the BIF is still fully functional (e.g. you
    can probe PCI-IDs etc...).

    That looks more like the ESD protection is kicking in for some
    reason. In other words, what you have here is a cold/broken solder
    joint on one of the SMD components, which loses contact because
    the material expands when it warms up.

    That is a serious hardware fault and a really good indicator that
    you should replace the faulty component ASAP.

    Regards,

    Christian.

        [ Patch cut out ]

      cheers,
Andreas

      _______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
