On 10/23/2021 11:36, Thomas Hellström wrote:
On 10/23/21 20:18, Matthew Brost wrote:
On Sat, Oct 23, 2021 at 07:46:48PM +0200, Thomas Hellström wrote:
On 10/22/21 20:09, John Harrison wrote:
And to be clear, the engine reset is not supposed to fail. Whether
issued by GuC or i915, the GDRST register is supposed to self clear
according to the bspec. If we are being sent the G2H notification
for an engine reset failure then the assumption is that the hardware
is broken.
This is not a situation that is ever intended to occur in a production
system. Therefore, it is not something we should spend huge amounts of
effort on making a perfect selftest for.
I don't agree. Selftests are there to verify that assumptions made and
contracts in the code hold and that hardware behaves as intended /
assumed.
No selftest should ideally trigger in a production driver / system. That
doesn't mean we can remove all selftests or ignore updating them for altered
assumptions / contracts. I think it's important here to acknowledge the fact
that this and the perf selftest have found two problems that need
consideration for fixing for a production system.
I'm confused - we are going down the rabbit hole here.
Back to this patch. This test was written for very specific execlists
behavior. It was updated to also support the GuC. In that update we
missed fixing the failure path, well, because it always passed. Now that it
has failed, we see that it doesn't fail gracefully and takes down the
machine. This patch fixes that. It also opened my eyes to the horror
show of reset locking that needs to be fixed long term.
Well the email above wasn't really about the correctness of this
particular patch (I should probably have altered the subject to
reflect that) but rather about the assumption that failures that
should never occur in a production system are not worth spending time
on selftests for.
My point is that we have to make assumptions that the hardware is
basically functional. Otherwise, where do you stop? Do you write a
selftest for every conceivable operation of the hardware and prove that
it still works every single day? No. That is pointless and we don't have
the resources to test everything that the hardware can possibly do. We
have to cope as gracefully as possible in the case where the hardware
does not behave as intended, such as not killing the entire OS when a
selftest fails. But I don't think we should be spending time on writing
a perfect test for something that is supposed to be impossible at the
hardware level. The purpose of the selftests is to test the driver
behaviour, not the hardware.
John.
As for the patch itself, I'll take a deeper look and get back to you.
/Thomas