On Tue, Dec 12, 2017 at 5:36 PM, Christian König
<ckoenig.leichtzumerken at gmail.com> wrote:
> On 12.12.2017 at 15:57, Marek Olšák wrote:
>>
>> On Tue, Dec 12, 2017 at 10:01 AM, Christian König
>> <ckoenig.leichtzumerken at gmail.com> wrote:
>>>
>>> On 11.12.2017 at 22:29, Marek Olšák wrote:
>>>>
>>>> From: Marek Olšák <marek.olsak at amd.com>
>>>>
>>>> Signed-off-by: Marek Olšák <marek.olsak at amd.com>
>>>> ---
>>>>
>>>> Is this really correct? I have no easy way to test it.
>>>
>>> It's a step in the right direction, but I would rather vote for something
>>> else:
>>>
>>> Instead of disabling the timeout by default we only disable the GPU
>>> reset/recovery.
>>>
>>> The idea is to add a new parameter amdgpu_gpu_recovery which makes
>>> amdgpu_gpu_recover only prints out an error and doesn't touch the GPU at
>>> all (on bare metal systems).
>>>
>>> Then we finally set the amdgpu_lockup_timeout to a non zero value by
>>> default.
>>>
>>> Andrey could you take care of this when you have time?
>>
>> I don't understand this.
>>
>> Why can't we keep the previous behavior where amdgpu.lockup_timeout=0
>> disabled GPU reset? Why do we have to add another option for the same
>> thing?
>
> lockup_timeout=0 never disabled the GPU reset, it just disabled the timeout.

It disabled the automatic reset before we had those interrupt callbacks.

> You could still manually trigger a reset and also invalid commands, invalid
> register writes and requests from the SRIOV hypervisor could trigger this.

That's OK. Manual resets should always be allowed.

> And as Monk explained GPU resets are mandatory for SRIOV, you can't disable
> them at all in this case.

What is preventing Monk from setting amdgpu.lockup_timeout > 0, which
should be the default state anyway? Let's just say lockup_timeout=0
has undefined behavior with SRIOV.

> Additional to that we probably want the error message that something timed
> out, but not touching the hardware in any way.

Yes, that is a fair point.

Marek
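
For reference, the policy discussed in this thread could be sketched roughly as below. This is only a user-space model in plain C, not driver code: the struct, the enum and the function name decide_on_timeout are all made up for illustration, and gpu_recovery merely stands in for the proposed amdgpu_gpu_recovery parameter. It assumes that a timeout is always reported, that the hardware is touched only when recovery is enabled or the device runs under SRIOV, and that lockup_timeout=0 simply turns detection off (the SRIOV case with a zero timeout is left undefined in the thread).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical outcome of a timeout check; not a driver type. */
enum timeout_action {
    ACTION_NONE,     /* no timeout detected, nothing to do */
    ACTION_LOG_ONLY, /* report the timeout, leave the hardware alone */
    ACTION_RESET     /* report the timeout and reset the GPU */
};

/* Hypothetical bundle of the settings discussed in the thread. */
struct recovery_policy {
    unsigned lockup_timeout_ms; /* 0 disables timeout detection */
    bool gpu_recovery;          /* stand-in for the proposed parameter */
    bool sriov;                 /* resets are mandatory under SRIOV */
};

static enum timeout_action
decide_on_timeout(const struct recovery_policy *p, unsigned elapsed_ms)
{
    assert(p != NULL);

    /* Timeout disabled, or the job has not run long enough yet. */
    if (p->lockup_timeout_ms == 0 || elapsed_ms < p->lockup_timeout_ms)
        return ACTION_NONE;

    /* The error message is printed whether or not a reset follows. */
    fprintf(stderr, "job timed out after %u ms\n", elapsed_ms);

    /* Under SRIOV the reset cannot be switched off. */
    if (p->sriov || p->gpu_recovery)
        return ACTION_RESET;

    return ACTION_LOG_ONLY;
}
```

With a non-zero timeout and gpu_recovery off on bare metal, a hung job only produces the message; switching gpu_recovery on, or running under SRIOV, turns the same event into a reset.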