On 11/3/2022 02:38, Tvrtko Ursulin wrote:
On 03/11/2022 09:18, Tvrtko Ursulin wrote:
On 03/11/2022 01:33, John Harrison wrote:
On 11/2/2022 07:20, Tvrtko Ursulin wrote:
On 02/11/2022 12:12, Jani Nikula wrote:
On Tue, 01 Nov 2022, John.C.Harrison@xxxxxxxxx wrote:
From: John Harrison <John.C.Harrison@xxxxxxxxx>
At the end of each test, IGT does a drop caches call via sysfs with
sysfs?
Sorry, that was meant to say debugfs. I've also been working on some
sysfs IGT issues and evidently got my wires crossed!
special flags set. One of the possible paths waits for idle with an
infinite timeout. That causes problems for debugging issues when CI
catches a "can't go idle" test failure. Best case, the CI system
times
out (after 90s), attempts a bunch of state dump actions and then
reboots the system to recover it. Worst case, the CI system can't do
anything at all and then times out (after 1000s) and simply reboots.
Sometimes a serial port log of dmesg might be available,
sometimes not.
So rather than making life hard for ourselves, change the timeout to
be 10s rather than infinite. Also, trigger the standard
wedge/reset/recover sequence so that testing can continue with a
working system (if possible).
Signed-off-by: John Harrison <John.C.Harrison@xxxxxxxxx>
---
drivers/gpu/drm/i915/i915_debugfs.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/i915_debugfs.c
b/drivers/gpu/drm/i915/i915_debugfs.c
index ae987e92251dd..9d916fbbfc27c 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -641,6 +641,9 @@ DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
DROP_RESET_ACTIVE | \
DROP_RESET_SEQNO | \
DROP_RCU)
+
+#define DROP_IDLE_TIMEOUT (HZ * 10)
I915_IDLE_ENGINES_TIMEOUT is defined in i915_drv.h. It's also only
used
here.
So move here, dropping i915 prefix, next to the newly proposed one?
Sure, can do that.
I915_GEM_IDLE_TIMEOUT is defined in i915_gem.h. It's only used in
gt/intel_gt.c.
Move there and rename to GT_IDLE_TIMEOUT?
I915_GT_SUSPEND_IDLE_TIMEOUT is defined and used only in
intel_gt_pm.c.
No action needed, maybe drop i915 prefix if wanted.
These two are totally unrelated and in code not being touched by
this change. I would rather not conflate changing random other
things with fixing this specific issue.
I915_IDLE_ENGINES_TIMEOUT is in ms, the rest are in jiffies.
Add _MS suffix if wanted.
My head spins.
I follow and raise that the newly proposed DROP_IDLE_TIMEOUT
applies to DROP_ACTIVE and not only DROP_IDLE.
My original intention for the name was 'drop caches timeout for
intel_gt_wait_for_idle'. Which is quite a mouthful and hence
abbreviated to DROP_IDLE_TIMEOUT. But yes, I realised later that the
name can be conflated with the DROP_IDLE flag. Will rename.
Things get refactored, code moves around, bits get left behind, who
knows. No reason to get too worked up. :) As long as people are
taking a wider view when touching the code base, and are not afraid
to send cleanups, things should be good.
On the other hand, if every patch gets blocked in code review
because someone points out some completely unrelated piece of code
could be a bit better then nothing ever gets fixed. If you spot
something that you think should be improved, isn't the general idea
that you should post a patch yourself to improve it?
There are two maintainers per branch and an order of magnitude or two
more developers, so it'd be nice if cleanups would just be incoming on
a self-initiative basis. ;)
For the actual functional change at hand - it would be nice if the code
paths in question could handle SIGINT, and then we could punt the
decision on how long someone wants to wait purely to userspace. But
it's probably hard and it's only debugfs so whatever.
The code paths in question will already abort on a signal, won't
they? Both intel_gt_wait_for_idle() and
intel_guc_wait_for_pending_msg(), which is where the
uc_wait_for_idle eventually ends up, have an 'if (signal_pending)
return -EINTR;' check. Beyond that, it sounds like what you are
asking for is a change in the IGT libraries and/or CI framework to
start sending signals after some specific timeout. That seems like a
significantly more complex change (in terms of the number of
entities affected and number of groups involved) and unnecessary.
If you say so, I haven't looked at them all. But if the code path in
question already aborts on signals then I am not sure what the patch
is fixing. I assumed you were trying to avoid the write being stuck in
D state forever, which then prevents driver unload and everything,
requiring the test runner to eventually reboot. If you say SIGINT
works then you can already recover from userspace, no?
Whether or not 10s is enough, CI will hopefully tell us. I'd
probably err on the side of safety and make it longer, but at most
half of the test runner timeout.
This is supposed to be test clean up. This is not about how long a
particular test takes to complete but about how long it takes to
declare the system broken after the test has already finished. I
would argue that even 10s is massively longer than required.
I am not convinced that wedging is correct though. Conceptually it
could be just that the timeout is too short. What does wedging
really give us, on top of limiting the wait, when the latter AFAIU is
the key factor which would prevent the need to reboot the machine?
It gives us a system that knows what state it is in. If we can't
idle the GT then something is very badly wrong. Wedging indicates
that. It also ensures that a full GT reset will be attempted before
the next test is run, helping to prevent a failure on test X from
propagating into failures of unrelated tests X+1, X+2, ... And if
the GT reset does not work because the system is really that badly
broken then future tests will not run rather than report erroneous
failures.
This is not about getting a more stable system for end users by
sweeping issues under the carpet and pretending all is well. End
users don't run IGTs or explicitly call dodgy debugfs entry points.
The sole motivation here is to get more accurate results from CI.
That is, correctly identifying which test has hit a problem, getting
valid debug analysis for that test (logs and such) and allowing
further testing to complete correctly in the case where the system
can be recovered.
I don't really oppose shortening of the timeout in principle, I just
want a clear statement on whether this is something IGT / the test
runner could already do or not. It can apply a timeout, it can also
send SIGINT, and it could even trigger a reset from outside. Sure,
these are debugfs hacks so the general "kernel should not implement
policy" rule need not be strictly followed, but let's be clear about
what the options are.
One conceptual problem with applying this policy is that the code is:
if (val & (DROP_IDLE | DROP_ACTIVE)) {
	ret = intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
	if (ret)
		return ret;
}

if (val & DROP_IDLE) {
	ret = intel_gt_pm_wait_for_idle(gt);
	if (ret)
		return ret;
}
So if someone passes in DROP_IDLE, why would only the first branch
have a short timeout and wedge? Yeah, some bug happens to be there
at the moment, but put a bug in a different place and you hang on
the second branch and then need another patch. Versus perhaps making
it all respect SIGINT and handling it from outside.
The pm_wait_for_idle can only be called after gt_wait_for_idle has
completed successfully. There is no route to skip the GT idle or to do
the PM idle even if the GT idle fails. So the chances of the PM idle
failing are greatly reduced. There would have to be something outside
of a GT keeping the GPU awake, and there isn't a whole lot of hardware
left at that point!
Regarding signals, the PM idle code ends up at
wait_var_event_killable(). I assume that is interruptible via at least
a KILL signal, if not any signal, although it's not entirely clear when
trying to follow through the implementation of this code. Also, I have
no idea if there is a safe way to add a timeout to that code (or why it
wasn't already written with a timeout included). Someone more familiar
with the wakeref internals would need to comment.
However, I strongly disagree that we should not fix the driver just
because it is possible to work around the issue by re-writing the CI
framework. Feel free to bring a redesign plan to the IGT WG and
whatever equivalent CI meetings in parallel. But we absolutely should
not have infinite waits in the kernel if there is a trivial way to not
have infinite waits.
Also, sending a signal does not result in the wedge happening. I
specifically did not want to change that code path because I was
assuming there was a valid reason for it. If you have been interrupted
then you are in the territory of maybe it would have succeeded if you
just left it for a moment longer. Whereas, hitting the timeout says that
someone very deliberately said this is too long to wait and therefore
the system must be broken.
Plus, infinite wait is not a valid code path in the first place so any
change in behaviour is not really a change in behaviour. Code can't be
relying on a kernel call to never return for its correct operation!
And if you don't wedge then you don't recover. Each subsequent test
would just hit the infinite timeout, get killed by the CI framework's
shiny new kill signal and be marked as yet another unrelated bug that
will be logged separately. Whereas, using a sensible timeout and then
wedging will at least attempt to recover the situation. And if it can be
recovered, future tests will pass. If it can't then future testing will
be aborted.
John.
Regards,
Tvrtko