Re: [PATCH 2/2] drm/i915/guc: default to using GuC submission where possible

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 25/04/16 11:39, Chris Wilson wrote:
On Mon, Apr 25, 2016 at 11:07:13AM +0100, Dave Gordon wrote:
On 22/04/16 19:45, Chris Wilson wrote:
On Fri, Apr 22, 2016 at 07:22:55PM +0100, Dave Gordon wrote:
This patch simply changes the default value of "enable_guc_submission"
>from 0 (never) to -1 (auto). This means that GuC submission will be
used if the platform has a GuC, the GuC supports the request submission
protocol, and any required GuC firmwware was successfully loaded. If any
of these conditions are not met, the driver will fall back to using
execlist mode.

There are several shortcomings yet, in particular you've decided to add
new ABI...

i915_guc_wq_check_space(): Do not return ETIMEDOUT. This function's
return code is ABI.

Your choices are either EAGAIN if you think the hardware will catch up, or
more likely EIO and reset the hardware.

This layer doesn't have enough information to determine that we need
a reset, so it'll have to be EAGAIN. TDR should catch the case where
we continue to be unable to submit for an extended period, for
whatever reason.

You don't? This is an interface to the hardware, if you don't know if
the GuC is responding who does? The other layers only look at whether we
advancing through the requests (either breadcrumbs or timer ticks).

Well this particular function is only looking at the state of the WQ, and doesn't track how fast we're sending requests or how many requests are outstanding. So it can't really say whether a lack of space is a real issue, or just a transient condition that we'll recover from (but see below).

If you return EAGAIN, we busyspin through the entire execbuffer. That's
acceptable - as we will catch the reset at some point. If we busyspin
here, we have to hook into the reset to allow it to steal the
struct_mutex.

I don't like the spinwait anyway, so maybe we should just return the
immediate result of sampling the WQ space once, and let the upper
layers deal with how or whether to wait-and-retry.

I would have it busyspin for a few microseconds. None of the upper
layers have any more clue than you here, though one option is to keep
the list of requests and wait for the oldest to expire - and we should
be able to find that from the engine, I forget if there is a 1:1
correspondence between engine and workqueue though.

So, I tested the idea of just returning -EAGAIN immediately, and it seems that we *can* fill the WQ -- but only in one particular scenario. That was gem_ctx_switch -- presumably because that involves a lot of work by the GPU and extra overhead for the GuC. Otherwise, the WQ was never full. And in that test, the strategy of returning -EAGAIN worked perfectly well, with no other errors triggered as a result. So this may well be the right way to go -- it's certainly the simplest :)

guc_add_workqueue_item: cannot fail

It must be fixed such that it is a void function without possibility of
failure.

The code clearly indicates that this "shouldn't happen", and yes, it
would indicate an inconsistency in the internal state of the driver
if the WARN-and-return-error path were taken.

However, the reason for including such a check is because we're
crossing between logical submodules here. The GuC submission code
requires that the LRC code (which is NOT GuC-specific) has
previously called the precheck-for-space function and got the
go-ahead signal. So this is essentially checking that the internal
protocol has been followed correctly. If the LRC code were updated
such that it might miss the precheck-for-space in some paths, this
WARN-and-fail check would help identify the protocol error.

I can change it to a BUG() if you prefer!

Personally, yes. And if it is ever hit in practice and not by an igt,
that looks doubly bad.

add-request cannot fail as we cannot unwind the transactions for the
request. We could defer the global state updates until after the
transaction is committed, but don't today and we then have to find a way
to mitigate that cost to every request vs making the error path simpler.

Or we could make the global state updates reversible, as we're doing in the scheduler code. But one way or another, I think we should be moving towards a transactional model where it either works or fails with no side-effects.

Not that you even successfully hooked up the failure paths in
the current code.

The current code is broken at least because the whole callchain is
declared as returning an int error code, only to find that the
topmost level says that add_request/emit_request are not allowed to
fail!

It would be rather better if the top level handled all errors in
submission, even if it were only by dropping the request and
resetting the engine!

Same for guc_ring_doorbell.

The rationale for an error return is slightly different here, as
this would not indicate an internal logic error in the driver, but
rather a failure of communication between the driver and the
doorbell hardware (or possibly the GuC firmware, but I don't think a
GuC bug could cause more than one failure-to-update, hence the retry
count is only 2).

If the doorbell hardware is not working correctly, there is nothing
we can do about it here; the only option is to report the failure up
to some layer that can recover, probably by resetting the GPU.

Exactly. Just pretend it worked and we will reset the GPU when we find
out it didn't.

It's important to keep s/w state self-consistent, otherwise code will likely crash when its preconditions are not met; and it's good to keep s/w state consistent with that of the h/w, otherwise system behaviour is likely to be unpredictable.

So I'd rather actively report a detected problem than just let a background task nondeterministically find out later that "something went wrong". Much easier to pin down the first symptom of the failure that way, and therefore maybe easier to find the root cause.

And what exactly is that atomic64_cmpxchg() serialising with? There are
no other CPUs contending with the write, and neither does the GuC (and I
doubt it is taking any notice of the lock cmpxchg). Using cmpxchg where
a single WRITE_ONCE() of a 32bit value wins the perf prize for hotest
instruction and function in the kernel.

The doorbell controller hardware, I should think. The BSpec
describes using LOCK_CMPXCHG8B to update doorbells, so I think this
code is just based on what it says there. If the CPU hardware
doesn't implement it efficiently, surely the GPU h/w designers
wouldn't have mandated it in this way?

Wow, I'm surprised that they would put into the same domain. Still,
unless you are actually serialising with another writer, what is the
point of using lock cmpxchg? E.g. an xchg would be enough to enforce
ordering, and you should ask them again if this is not a little overkill
for one-way signaling.
-Chris

The doorbell controller is essentially a cache-snoop, looking for writes to the nominated line and sending an interrupt to the GuC when it sees one. If I were designing that, I might well think it would be simpler (for the hardware) only to watch for writes that were special in some way, such as having LOCK asserted. That seems a perfectly reasonable way of distinguishing the special "ring the doorbell, wake the GuC" write from just any old access that might at first look like a cache hit. (BTW, I don't think the current doorbell controller actually *does* watch only for locked cycles, but some version might). So I'd rather stick with the exact sequence that the BSpec suggests, at least until it's all been thoroughly tested ("correctness first, optimise later").

.Dave.
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx




[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]
  Powered by Linux