Am 06.07.20 um 14:42 schrieb Luben Tuikov:
On 2020-06-30 3:01 a.m., Christian König wrote:
Am 30.06.20 um 00:46 schrieb Luben Tuikov:
On 2020-06-26 1:04 p.m., Christian König wrote:
Am 26.06.20 um 18:12 schrieb Alex Jivin:
[SNIP]
Adding a delay here is probably just postponing that. Do we have some
note in the documentation that this is necessary?
Flushing the write buffer (by issuing a read if necessary) is different
than waiting for the hardware to process the request.
It's not flushing the write buffer. The hardware is often build in a way
that the next read is triggering the action instead of the write.
I don't think that the "read is triggering the action", in this case.
The write is triggering the action, when it makes it to the hardware.
Also, often enough, the coherency unit is built with dependency detection.
This means that a read following a write to the same "hardware" address,
would cause the read op to trigger the flushing of the write buffer
if the address is still in the write buffer. And then the read
would be issued, following the flush, usually by the bus unit (BU).
(So the bus would show a bunch of WRITES followed by our READ.)
You seem to agree as you said that in (A) above.
Yeah, but this is not what I meant here.
See a bit below for more description on what our hardware often does.
That's done because writes should be accepted by the hardware block
immediately while reads have a timeout associated with it.
That's is true but irrelevant here.
No it isn't. See the read of the register acts as a delay as well.
The read doesn't comes back to the CPU immediately when there is an
ongoing action to wait for.
Some of the SDMA, TC and HDP registers even go as far as delaying the
return of the read until some memory is flushed.
The maximum timeout here is rather large, something in the range of 65ms
IIRC.
As the code is present right now in the kernel driver,
if the hardware changes state on the last delay of 10 ms,
you're going to miss that as no checks are done after the delay.
The code clearly shows this.
All I'm saying is that to avoid this "race" condition,
the code should be modified, as there is no point in delaying 10
ms only to quit after that, when count gets to equal MAX_COUNT.
Well, that is indeed a good point!
The current code does flush the write buffer by reading back,
and then delaying (both in the loop), which does achieve this,
but it leaves a corner case as I wrote in my review.
The corner case is that if the status
register changes to "done" while in the last "delay()"
we then check the loop invariant and exit, as opposed to reading
the register and determining that it is done successfully.
Current, all in pseudo-code:
write()
for (count = 0; count < MAX_COUNT; count++) {
res = read()
if (res is success) break
mdelay(10) <-- if it changes here, we miss it on the last iteration
}
Optimal:
write() ; assume write buffer flush
mdelay(9)
do {
mdelay(1)
res = read()
} while (res != success && ++count < MAX_COUNT)
This solves the corner case, and ensures a time delay of 10 for
the hardware to process its job, but a delay of 1 for polling
the status, as it could be done "anytime now."
That seems to assume that the ACK bits are not immediately cleared on
the write. How do you got to that conclusion?
No such assumption was taken.
This is about the wasted 10 ms delay
as pointed to by the current code quoted above; the comment
on the line with an arrow in it.
Yeah, now I understand what you are trying to do here. Problem is that
this messes up the read delay feature of the hardware with that suggestion.
In other words we should not have is a delay between the initial write
and the first read since the hardware will take care of this for us.
Regards,
Christian.
Processing the request can take even longer than 10ms depending on what
is done, which reference clock is selected, the temperature of the
hardware, sleep mode etc etc...
That's why the loop.
[snip!]
Assuming that WREG32_P() flushes the write buffer, as it seems
that it does, the idea here is to give the hardware some time to process
the request (from writing a value to it), but when polling to poll
a shorter amount of time.
Not a bad idea, but this is based on the diagnostic code and already
used for nearly a decade.
Yes, I understand that it does work right now. Surely,
if you set the loop limit high enough, you can show
that you'd iterate at least one over the number required
to complete or fail, and thus "used for nearly a decade".
But from a computational point of view it is not correct
in that the very last delay is a wasted delay:
write()
for (count = 0; count < MAX_COUNT; count++) {
res = read()
if (res is success) break
mdelay(10) <-- if it changes here, we miss it on the last iteration
}
It's not necessary to fix this, but it is good to point out
that this is currently the case.
In other words this code is what the hardware engineers recommended, is
used *much* more often outside the driver and known to be working well.
Ah, yes, I didn't want to resort to the "hardware engineers recommended",
but here is what I've been told by hardware engineers (as seen by their
interface):
1. Write your request to the hardware.
2. Wait M ms for the hardware to do its thing.
3. Read back and check status.
4. If done, then you're good.
5. If not done, wait a tiny bit of time, N, and
then check again, until done or timeout.
Usually N << M, since M is the actual time for
hardware do do something, and N is a lot smaller
than M, since it is just back-off time as
the hardware can be done "anytime now".
As a waveform it would look like this:
Immediate success: WRITE, 10, READ.
Delayed success: WRITE, 10, READ, 1, READ 1, READ 1, ..., READ.
Delayed failure: WRITE, 10, READ, 1, READ 1, READ 1, ..., READ.
The do-while loop above accomplishes this. The loop always ends on a read.
The current code would end on a delay on a failure, like this:
Delayed failure (count == MAX_COUNT): WRITE, READ, 10, READ, 10, READ, 10.
But, as it is working right now "for a decade",
it can be left as is, as long as we realize that
the "last" delay is not necessary, when we're failing.
I would've also liked to see the mutex business fixed
as well from my original patch review, as it is easy to prove
that it is always taken, so no need to embed it inside the if().
That is indeed a software problem which should be fixed.
I've not seen a patch fixing this yet.
Regards,
Luben
Regards,
Christian.
Regards,
Luben
Christian.
/* Wait for CTLACK and CTLACK2 to get asserted */
for (i = 0; i < SI_MAX_CTLACKS_ASSERTION_WAIT; ++i) {
uint32_t mask = UPLL_CTLACK_MASK | UPLL_CTLACK2_MASK;
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx