Re: [Intel-gfx] [PATCH] drm/i915: Don't wait forever in drop_caches

John Harrison <john.c.harrison@xxxxxxxxx> · Wed, 9 Nov 2022 22:20:22 -0800

On 11/9/2022 03:35, Tvrtko Ursulin wrote:
On 08/11/2022 19:37, John Harrison wrote:
On 11/8/2022 01:08, Tvrtko Ursulin wrote:
On 07/11/2022 19:45, John Harrison wrote:
On 11/7/2022 06:09, Tvrtko Ursulin wrote:
On 04/11/2022 17:45, John Harrison wrote:
On 11/4/2022 03:01, Tvrtko Ursulin wrote:
On 03/11/2022 19:16, John Harrison wrote:
On 11/3/2022 02:38, Tvrtko Ursulin wrote:
On 03/11/2022 09:18, Tvrtko Ursulin wrote:
On 03/11/2022 01:33, John Harrison wrote:
On 11/2/2022 07:20, Tvrtko Ursulin wrote:
On 02/11/2022 12:12, Jani Nikula wrote:
On Tue, 01 Nov 2022, John.C.Harrison@xxxxxxxxx wrote:
From: John Harrison <John.C.Harrison@xxxxxxxxx>

At the end of each test, IGT does a drop caches call via 
sysfs with

sysfs?
Sorry, that was meant to say debugfs. I've also been working 
on some sysfs IGT issues and evidently got my wires crossed!


special flags set. One of the possible paths waits for 
idle with an
infinite timeout. That causes problems for debugging 
issues when CI
catches a "can't go idle" test failure. Best case, the CI 
system times
out (after 90s), attempts a bunch of state dump actions 
and then
reboots the system to recover it. Worst case, the CI 
system can't do
anything at all and then times out (after 1000s) and 
simply reboots.
Sometimes a serial port log of dmesg might be available, 
sometimes not.

So rather than making life hard for ourselves, change the 
timeout to
be 10s rather than infinite. Also, trigger the standard
wedge/reset/recover sequence so that testing can continue 
with a
working system (if possible).

Signed-off-by: John Harrison <John.C.Harrison@xxxxxxxxx>
---
  drivers/gpu/drm/i915/i915_debugfs.c | 7 ++++++-
  1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c 
b/drivers/gpu/drm/i915/i915_debugfs.c
index ae987e92251dd..9d916fbbfc27c 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -641,6 +641,9 @@ 
DEFINE_SIMPLE_ATTRIBUTE(i915_perf_noa_delay_fops,
            DROP_RESET_ACTIVE | \
            DROP_RESET_SEQNO | \
            DROP_RCU)
+
+#define DROP_IDLE_TIMEOUT    (HZ * 10)

I915_IDLE_ENGINES_TIMEOUT is defined in i915_drv.h. It's 
also only used
here.

So move here, dropping i915 prefix, next to the newly 
proposed one?
Sure, can do that.


I915_GEM_IDLE_TIMEOUT is defined in i915_gem.h. It's only 
used in
gt/intel_gt.c.

Move there and rename to GT_IDLE_TIMEOUT?

I915_GT_SUSPEND_IDLE_TIMEOUT is defined and used only in 
intel_gt_pm.c.

No action needed, maybe drop i915 prefix if wanted.

These two are totally unrelated and in code not being 
touched by this change. I would rather not conflate changing 
random other things with fixing this specific issue.

I915_IDLE_ENGINES_TIMEOUT is in ms, the rest are in jiffies.

Add _MS suffix if wanted.

My head spins.

I follow and raise that the newly proposed 
DROP_IDLE_TIMEOUT applies to DROP_ACTIVE and not only 
DROP_IDLE.
My original intention for the name was that is the 'drop 
caches timeout for intel_gt_wait_for_idle'. Which is quite 
the mouthful and hence abbreviated to DROP_IDLE_TIMEOUT. But 
yes, I realised later that name can be conflated with the 
DROP_IDLE flag. Will rename.



Things get refactored, code moves around, bits get left 
behind, who knows. No reason to get too worked up. :) As 
long as people are taking a wider view when touching the 
code base, and are not afraid to send cleanups, things 
should be good.
On the other hand, if every patch gets blocked in code 
review because someone points out some completely unrelated 
piece of code could be a bit better then nothing ever gets 
fixed. If you spot something that you think should be 
improved, isn't the general idea that you should post a 
patch yourself to improve it?

There's two maintainers per branch and an order of magnitude 
or two more developers so it'd be nice if cleanups would just 
be incoming on self-initiative basis. ;)

For the actual functional change at hand - it would be nice 
if code paths in question could handle SIGINT and then we 
could punt the decision on how long someone wants to wait 
purely to userspace. But it's probably hard and it's only 
debugfs so whatever.

The code paths in question will already abort on a signal 
won't they? Both intel_gt_wait_for_idle() and 
intel_guc_wait_for_pending_msg(), which is where the 
uc_wait_for_idle eventually ends up, have an 
'if(signal_pending) return -EINTR;' check. Beyond that, it 
sounds like what you are asking for is a change in the IGT 
libraries and/or CI framework to start sending signals after 
some specific timeout. That seems like a significantly more 
complex change (in terms of the number of entities affected 
and number of groups involved) and unnecessary.

If you say so, I haven't looked at them all. But if the code 
path in question already aborts on signals then I am not sure 
what is the patch fixing? I assumed you are trying to avoid 
the write stuck in D forever, which then prevents driver 
unload and everything, requiring the test runner to 
eventually reboot. If you say SIGINT works then you can 
already recover from userspace, no?

Whether or not 10s is enough CI will hopefully tell us. I'd 
probably err on the side of safety and make it longer, but 
at most half from the test runner timeout.
This is supposed to be test clean up. This is not about how 
long a particular test takes to complete but about how long 
it takes to declare the system broken after the test has 
already finished. I would argue that even 10s is massively 
longer than required.


I am not convinced that wedging is correct though. 
Conceptually could be just that the timeout is too short. 
What does wedging really give us, on top of limiting the 
wait, when latter AFAIU is the key factor which would 
prevent the need to reboot the machine?

It gives us a system that knows what state it is in. If we 
can't idle the GT then something is very badly wrong. 
Wedging indicates that. It also ensure that a full GT reset 
will be attempted before the next test is run. Helping to 
prevent a failure on test X from propagating into failures 
of unrelated tests X+1, X+2, ... And if the GT reset does 
not work because the system is really that badly broken then 
future tests will not run rather than report erroneous 
failures.

This is not about getting a more stable system for end users 
by sweeping issues under the carpet and pretending all is 
well. End users don't run IGTs or explicitly call dodgy 
debugfs entry points. The sole motivation here is to get 
more accurate results from CI. That is, correctly 
identifying which test has hit a problem, getting valid 
debug analysis for that test (logs and such) and allowing 
further testing to complete correctly in the case where the 
system can be recovered.

I don't really oppose shortening of the timeout in principle, 
just want a clear statement if this is something IGT / test 
runner could already do or not. It can apply a timeout, it 
can also send SIGINT, and it could even trigger a reset from 
outside. Sure it is debugfs hacks so general "kernel should 
not implement policy" need not be strictly followed, but lets 
have it clear what are the options.

One conceptual problem with applying this policy is that the 
code is:

    if (val & (DROP_IDLE | DROP_ACTIVE)) {
        ret = intel_gt_wait_for_idle(gt, MAX_SCHEDULE_TIMEOUT);
        if (ret)
            return ret;
    }

    if (val & DROP_IDLE) {
        ret = intel_gt_pm_wait_for_idle(gt);
        if (ret)
            return ret;
    }

So if someone passes in DROP_IDLE and then why would only the 
first branch have a short timeout and wedge. Yeah some bug 
happens to be there at the moment, but put a bug in a 
different place and you hang on the second branch and then 
need another patch. Versus perhaps making it all respect 
SIGINT and handle from outside.

The pm_wait_for_idle is can only called after gt_wait_for_idle 
has completed successfully. There is no route to skip the GT 
idle or to do the PM idle even if the GT idle fails. So the 
chances of the PM idle failing are greatly reduced. There would 
have to be something outside of a GT keeping the GPU awake and 
there isn't a whole lot of hardware left at that point!

Well "greatly reduced" is beside my point. Point is today bug is 
here and we add a timeout, tomorrow bug is there and then the 
same dance. It can be just a sw bug which forgets to release the 
pm ref in some circumstances, doesn't really matter.

Huh?

Greatly reduced is the whole point. Today there is a bug and it 
causes a kernel hang which requires the CI framework to reboot 
the system in an extremely unfriendly way which makes it very 
hard to work out what happened. Logs are likely not available. We 
don't even necessarily know which test was being run at the time. 
Etc. So we replace the infinite timeout with a meaningful 
timeout. CI now correctly marks the single test as failing, 
captures all the correct logs, creates a useful bug report and 
continues on testing more stuff.

So what is preventing CI to collect logs if IGT is forever stuck 
in interruptible wait? Surely it can collect the logs at that 
point if the kernel is healthy enough. If it isn't then I don't 
see how wedging the GPU will make the kernel any healthier.

Is i915 preventing better log collection or could test runner be 
improved?

Sure, there is still the chance of hitting an infinite timeout. 
But that one is significantly more complicated to remove. And the 
chances of hitting that one are significantly smaller than the 
chances of hitting the first one.

This statement relies on intimate knowledge implementation details 
and a bit too much white box testing approach but that's okay, 
lets move past this one.

So you are arguing that because I can't fix the last 0.1% of 
possible failures, I am not allowed to fix the first 99.9% of the 
failures?

I am clearly not arguing for that. But we are also not talking 
about "fixing failures" here. Just how to make CI cope better with 
a class of i915 bugs.

Regarding signals, the PM idle code ends up at 
wait_var_event_killable(). I assume that is interruptible via 
at least a KILL signal if not any signal. Although it's not 
entirely clear trying to follow through the implementation of 
this code. Also, I have no idea if there is a safe way to add a 
timeout to that code (or why it wasn't already written with a 
timeout included). Someone more familiar with the wakeref 
internals would need to comment.

However, I strongly disagree that we should not fix the driver 
just because it is possible to workaround the issue by 
re-writing the CI framework. Feel free to bring a redesign plan 
to the IGT WG and whatever equivalent CI meetings in parallel. 
But we absolutely should not have infinite waits in the kernel 
if there is a trivial way to not have infinite waits.

I thought I was clear that I am not really opposed to the timeout.

The rest of the paragraph I don't really care - point is moot 
because it's debugfs so we can do whatever, as long as it is not 
burdensome to i915, which this isn't. If either wasn't the case 
then we certainly wouldn't be adding any workarounds in the 
kernel if it can be achieved in IGT.

Also, sending a signal does not result in the wedge happening. 
I specifically did not want to change that code path because I 
was assuming there was a valid reason for it. If you have been 
interrupted then you are in the territory of maybe it would 
have succeeded if you just left it for a moment longer. 
Whereas, hitting the timeout says that someone very 
deliberately said this is too long to wait and therefore the 
system must be broken.

I wanted to know specifically about wedging - why can't you 
wedge/reset from IGT if DROP_IDLE times out in quiescent or 
wherever, if that's what you say is the right thing? 
Huh?

DROP_IDLE has two waits. One that I am trying to change from 
infinite to finite + wedge. One that would take considerable 
effort to change and would be quite invasive to a lot more of the 
driver and which can only be hit if the first timeout actually 
completed successfully and is therefore of less importance 
anyway. Both of those time outs appear to respect signal interrupts.

That's a policy decision so why would i915 wedge if an arbitrary 
timeout expired? I915 is not controlling how much work there is 
outstanding at the point the IGT decides to call DROP_IDLE.

Because this is a debug test interface that is used solely by IGT 
after it has finished its testing. This is not about wedging the 
device at some random arbitrary point because an AI compute 
workload takes three hours to complete. This is about a very 
specific test framework cleaning up after testing is completed 
and making sure the test did not fry the system.

And even if an IGT test was calling DROP_IDLE in the middle of a 
test for some reason, it should not be deliberately pushing 10+ 
seconds of work through and then calling a debug only interface 
to flush it out. If a test wants to verify that the system can 
cope with submitting a minutes worth of rendering and then 
waiting for it to complete then the test should be using official 
channels for that wait.


Plus, infinite wait is not a valid code path in the first place 
so any change in behaviour is not really a change in behaviour. 
Code can't be relying on a kernel call to never return for its 
correct operation!

Why infinite wait wouldn't be valid? Then you better change the 
other one as well. ;P
In what universe is it ever valid to wait forever for a test to 
complete?

Well above you claimed both paths respect SIGINT. If that is so 
then the wait is as infinite as the IGT wanted it to be.

See above, the PM code would require much more invasive changes. 
This was low hanging fruit. It was supposed to be a two minute 
change to a very self contained section of code that would 
provide significant benefit to debugging a small class of very 
hard to debug problems.

Sure, but I'd still like to know why can't you do what you want 
from the IGT framework.

Have the timeout reduction in i915, again that's fine assuming 10 
seconds it enough to not break something by accident.
CI showed no regressions. And if someone does find a valid reason 
why a post test drop caches call should legitimately take a 
stupidly long time then it is easy to track back where the ETIME 
error came from and bump the timeout.


With that change you already have broken the "infinite wait". It 
makes the debugfs write return -ETIME in time much shorter than 
the test runner timeout(s). What is the thing that you cannot do 
from IGT at that point is my question? You want to wedge then? 
Send DROP_RESET_ACTIVE to do it for you? If that doesn't work add 
a new flag which will wedge explicitly.

We are again degrading into a huge philosophical discussion and 
all I wanted to start with is to hear how exactly things go bad.

I have no idea what you are wanting. I am trying to have a 
technical discussion about improving the stability of the driver 
during CI testing. I have no idea if you are arguing that this 
change is good, bad, broken, wrong direction or what.

Things go bad as explained in the commit message. The CI framework 
does not use signals. The IGT framework does not use signals. There 
is no watchdog that sends a TERM or KILL signal after a specified 
timeout. All that happens is the IGT sits there forever waiting for 
the drop caches IOCTL to return. The CI framework eventually gives 
up waiting for the test to complete and tries to recover. There are 
many different CI frameworks in use across Intel. Some timeout 
quickly, some timeout slowly. But basically, they all eventually 
give up and don't bother trying any kind of remedial action but 
just hit the reset button (sometimes by literally power cycling the 
DUT). As result, background processes that are saving dmesg, 
stdout, etc do not necessarily terminate cleanly. That results in 
logs that are at best truncated, at worst missing entirely. It also 
results in some frameworks aborting testing at that point. So no 
results are generated for all the other tests that have yet to be 
run. Some frameworks also run tests in batches. All they log is 
that something, somewhere in the batch died. So you don't even know 
which specific test actually hit the problem.

Can the CI frameworks be improved? Undoubtedly. In very many ways. 
Is that something we have the ability to do with a simple patch? 
No. Would re-writing the IGT framework to add watchdog mechanisms 
improve things? Yes. Can it be done with a simple patch? No. Would 
a simple patch to i915 significantly improve the situation? Yes. 
Will it solve every possible CI hang? No. Will it fix any actual 
end user visible bugs? No. Will it introduce any new bugs? No. Will 
it help us to debug at least some CI failures? Yes.

To unblock, I suggest you go with the patch which caps the wait 
only, and propose a wedging as an IGT patch to gem_quiescent_gpu(). 
That should involve the CI/IGT folks into discussion on what logs 
will be, or will not be collected once gem_quiescent_gpu() fails due 
-ETIME. In fact probably you should copy CI/IGT folks on the v2 of 
the i915 patch as well since I now think their acks would be good to 
have - from the point of view of the current test runner behaviour 
with hanging tests.

Simply returning -ETIME without wedging will actually make the 
situation worse. At the moment, you get 'all testing stopped due to 
machine not responding' bugs being logged. Which is a right pain and 
has very little useful information, but at least is not claiming 
random tests are broken when they are not. If you return ETIME 
without wedging then test A will 

Several times I asked why can't you wedge from gem_quiescent_gpu() 
since that is done on driver open. So the chain of failing tests 
describing below is not relevant to my question.
Actually, no. You have mentioned gem_quiescent_gpu() once and as an IGT 
patch. Which presumably means an entire new API between IGT and i915.


Whole point is why add policy to i915 if it can be done from 
userspace. Current API is called "wait for idle", not "wait for idle 
ten seconds max" (although fine since IGT will fail on timeout 
already), and not "wait for idle or wedge, sometimes".

Yes it's only debugfs, I said that multiple times already, so it could 
be whatever, but in principle adding code to kernel should always be 
the 2nd option. Especially since the implementation is only a 50% 
kludge (I am referring to the 2nd DROP_IDLE branch where the proposal 
does not add a timeout or wedging). So a half-policy even. Wedge if 
this stage of DROP_IDLE timed out, but don't wedge if this other stage 
of DROP_IDLE timed out or failed.

Which is why I was saying, if signals would be respected anyway, why 
couldn't you do the whole thing in IGT to start with.. "wrap" 
gem_quiescent_gpu with alarm(2), send SIGINT, wedge, whatever. If that 
works it would be the same effect. And policy where it belongs, zero 
kernel code required. If it works.. I haven't checked, you said it 
would though so what would be wrong with this approach?
Finding someone to do it. If you are familiar with the IGT framework 
internals then feel free. I am not. Whereas, this was a trivial change 
that could improve the situation while having no bad side effects 
(because if the alternative is hanging forever then any change is a good 
change).


And completely separate from the above line of discussion I am not 
even sure how the "no logs" relates to this all. Sure some bugs result 
in no logs since kernel crashes so badly. This patch will not improve 
that.
I never said it would solve every 'missing log' situation. I said it 
would help with the situation where the CI framework times out because 
of one specific class of failures. And in that case it does currently 
reboot with little or no attempt at recovery and therefore little or no 
log capture.


And if the kernel is not badly broken test runner will detect a 
timeout and sure all further testing will be invalid. It's not the 
first, or last, or the only way that can happen. There will be logs 
though so it can be debugged and fixed. (Unless there can't be logs 
anyway.) So you find the first failing test and fix the issue. How 
often does it happen anyway.

Or if I totally got this wrong please paste some CI or CIbuglog links 
to put me on the correct path.
As stated, there are very many bug reports of 'test timed out, 
rebooted'. It is impossible to know exactly how each particular instance 
got into that situation. So no, there is no CI report where I can 
categorically say this is exactly what happened. However, while 
debugging one such issue, I spotted this particular route into that 
situation and realised that it was something that could be trivially fixed.

Except apparently I'm not allowed to. So I give up. I don't have time to 
pursue this any further.

John.


Regards,

Tvrtko

hang and return ETIME. CI will log an ETIME bug against test A. CI 
will then try test B, which will fail with ETIME because the system 
is still broken but claiming to be working. So log a new bug against 
test B. Move on to test C, oh look, ETIME - log another bug and move 
on to test D... That is far worse, a whole slew of pointless and 
incorrect bugs have just been logged.

And how is it possibly considered a backwards breaking or dangerous 
change to wedge instead of hanging forever? Reboot versus wedge. 
Absolutely no defined behaviour at all because the system has simply 
stopped versus marking the system as broken and having a best effort 
at handling the situation. Yup, that's definitely a very dangerous 
change that could break all sorts of random user applications.

Re 'IGT folks' - whom? Ashutosh had already agreed to the original 
patch.

And CI folks are certainly aware of such issues. There are any number 
of comments in Jiras about 'no logs available, cannot analyse'.

John.


Regards,

Tvrtko