Re: [Intel-gfx] [RFC PATCH 60/97] drm/i915: Track 'serial' counts for virtual engines

Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> · Tue, 1 Jun 2021 10:31:29 +0100

On 27/05/2021 18:01, John Harrison wrote:
On 5/27/2021 01:53, Tvrtko Ursulin wrote:
On 26/05/2021 19:45, John Harrison wrote:
On 5/26/2021 01:40, Tvrtko Ursulin wrote:
On 25/05/2021 18:52, Matthew Brost wrote:
On Tue, May 25, 2021 at 11:16:12AM +0100, Tvrtko Ursulin wrote:

On 06/05/2021 20:14, Matthew Brost wrote:
From: John Harrison <John.C.Harrison@xxxxxxxxx>

The serial number tracking of engines happens at the backend of
request submission and was expecting to only be given physical
engines. However, in GuC submission mode, the decomposition of virtual
to physical engines does not happen in i915. Instead, requests are
submitted to their virtual engine mask all the way through to the
hardware (i.e. to GuC). This would mean that the heart beat code
thinks the physical engines are idle due to the serial number not
incrementing.

This patch updates the tracking to decompose virtual engines into
their physical constituents and tracks the request against each. This
is not entirely accurate as the GuC will only be issuing the request
to one physical engine. However, it is the best that i915 can do given
that it has no knowledge of the GuC's scheduling decisions.

Commit text sounds a bit defeatist. I think instead of making up the
serial counts, which has downsides (could you please document in the
commit what they are), we should think how to design things properly.

IMO, I don't think fixing serial counts is in the scope of this series.
We should focus on getting GuC submission in, not cleaning up all the
crap that is in the i915. Let's make a note of this though so we can
revisit later.

I will say again - commit message implies it is introducing an 
unspecified downside by not fully fixing an also unspecified issue. 
It is completely reasonable, and customary even, to ask for both to 
be documented in the commit message.
Not sure what exactly is 'unspecified'. I thought the commit message 
described both the problem (heartbeat not running when using virtual 
engines) and the result (heartbeat running on more engines than 
strictly necessary). But in greater detail...

The serial number tracking is a hack for the heartbeat code to know 
whether an engine is busy or idle, and therefore whether it should be 
pinged for aliveness. Whenever a submission is made to an engine, the 
serial number is incremented. The heartbeat code keeps a copy of the 
value. If the value has changed, the engine is busy and needs to be 
pinged.

This works fine for execlist mode where virtual engine decomposition 
is done inside i915. It fails miserably for GuC mode where the 
decomposition is done by the hardware. The reason being that the 
heartbeat code only looks at physical engines but the serial count is 
only incremented on the virtual engine. Thus, the heartbeat sees 
everything as idle and does not ping.

So hangcheck does not work. Or it works because GuC does it anyway. 
Either way, that's one thing to explicitly state in the commit message.

This patch decomposes the virtual engines for the sake of 
incrementing the serial count on each sub-engine in order to keep the 
heartbeat code happy. The downside is that now the heartbeat sees all 
sub-engines as busy rather than only the one the submission actually 
ends up on. There really isn't much that can be done about that. The 
heartbeat code is in i915 not GuC, the scheduler is in GuC not i915. 
The only way to improve it is to either move the heartbeat code into 
GuC as well and completely disable the i915 side, or add some way for 
i915 to interrogate GuC as to which engines are or are not active. 
Technically, we do have both. GuC has (or at least had) an option to 
force a context switch on every execution quantum pre-emption. 
However, that is much, much, more heavy weight than the heartbeat. 
For the latter, we do (almost) have the engine usage statistics for 
PMU and such like. I'm not sure how much effort it would be to wire 
that up to the heartbeat code instead of using the serial count.
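
The decomposition and its downside can be sketched roughly as follows. This is an illustration under assumptions, not the patch itself: the engine count, the `serial` array and `virtual_submit()` are made up for the example, and the virtual engine is modelled as a plain bitmask of physical engines.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PHYSICAL 4 /* assumed engine count for the sketch */

/* One serial count per physical engine (hypothetical layout). */
static unsigned int serial[NUM_PHYSICAL];

/*
 * Submission to a virtual engine: bump the serial of every physical
 * engine in the mask. i915 cannot know which single engine the GuC
 * will pick, so all candidates end up looking busy to the heartbeat.
 */
static void virtual_submit(uint32_t mask)
{
	assert((mask >> NUM_PHYSICAL) == 0);
	for (unsigned int i = 0; i < NUM_PHYSICAL; i++)
		if (mask & (UINT32_C(1) << i))
			serial[i]++;
}
```

A submission with mask `0x5` marks engines 0 and 2 as busy even though the request runs on only one of them; that over-marking is the inefficiency being discussed.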

In short, the serial count is ever so slightly inefficient in that it 
causes heartbeat pings on engines which are idle. On the other hand, 
it is way more efficient and simpler than the current alternatives.

And the hack to make hangcheck work creates this inefficiency where 
heartbeats are sent to idle engines. Which is probably fine, it just 
needs to be explained.

Does that answer the questions?

With the two points I re-raise clearly explained, possibly even patch 
title changed, yeah. I just want it to be more obvious to the patch 
reader what it is functionally about - not just what implementation 
details have been changed but why as well.

My understanding is that we don't explain every piece of code in minute 
detail in every checkin email that touches it. I thought my description 
was already pretty verbose. I've certainly seen way less informative 
checkins that apparently made it through review without issue.

Regarding the problem statement, I thought this was fairly clear that 
the heartbeat was broken for virtual engines:

    This would mean that the heart beat code
    thinks the physical engines are idle due to the serial number not
    incrementing.

Regarding the inefficiency about heartbeating all physical engines in a 
virtual engine, again, this seems clear to me:

    decompose virtual engines into
    their physical constituents and tracks the request against each. This
    is not entirely accurate as the GuC will only be issuing the request
    to one physical engine.

For the subject, I guess you could say "Track 'heartbeat serial' counts 
for virtual engines". However, the serial tracking count is not 
explicitly named for heartbeats so it seems inaccurate to rename it for 
a checkin email subject.

If you have a suggestion for better wording then feel free to propose 
something.

Sigh, I am not asking for more low-level detail but for more 
to-the-point high-level naming and a high-level description.

"drm/i915: Fix hangcheck for GuC virtual engines"

"..Blah blah, but hack because it is not ideal due to xyz which 
needlessly wakes up all engines which has an effect on power yes/no? 
Latency? Throughput when high prio pulse triggers pointless preemption?"

Also, can we fix it properly without introducing inefficiencies? Do we 
even need heartbeats when GuC is in charge of engine resets? And if we 
do can we make them work better?

Regards,

Tvrtko