Re: [PATCH 00/15] HuC loading for DG2

"Ye, Tony" <tony.ye@xxxxxxxxx> · Wed, 6 Jul 2022 13:11:42 -0700

On 7/6/2022 12:29 PM, Ceraolo Spurio, Daniele wrote:

On 7/6/2022 10:26 AM, Ye, Tony wrote:

On 7/5/2022 4:30 PM, Ceraolo Spurio, Daniele wrote:

On 6/15/2022 7:28 PM, Zhang, Carl wrote:
On From: Ye, Tony <tony.ye@xxxxxxxxx>
Sent: Thursday, June 16, 2022 12:15 AM

On 6/15/2022 3:13 AM, Tvrtko Ursulin wrote:
On 15/06/2022 00:15, Ye, Tony wrote:
On 6/14/2022 8:30 AM, Ceraolo Spurio, Daniele wrote:
On 6/14/2022 12:44 AM, Tvrtko Ursulin wrote:
On 13/06/2022 19:13, Ceraolo Spurio, Daniele wrote:
On 6/13/2022 10:39 AM, Tvrtko Ursulin wrote:
On 13/06/2022 18:06, Ceraolo Spurio, Daniele wrote:
On 6/13/2022 9:56 AM, Tvrtko Ursulin wrote:
On 13/06/2022 17:41, Ceraolo Spurio, Daniele wrote:
On 6/13/2022 9:31 AM, Tvrtko Ursulin wrote:
On 13/06/2022 16:39, Ceraolo Spurio, Daniele wrote:
On 6/13/2022 1:16 AM, Tvrtko Ursulin wrote:
On 10/06/2022 00:19, Daniele Ceraolo Spurio wrote:
On DG2, HuC loading is performed by the GSC, via a PXP
command. The load operation itself is relatively simple
(just send a message to the GSC with the physical 
address
of the HuC in LMEM), but there are timing changes that
requires special attention. In particular, to send a PXP
command we need to first export the GSC driver and then
wait for the mei-gsc and mei-pxp modules to start, which
means that HuC load will complete after i915 load is
complete. This means that there is a small window of 
time
after i915 is registered and before HuC is loaded during
which userspace could submit and/or checking the HuC 
load
status, although this is quite unlikely to happen 
(HuC is
usually loaded before kernel init/resume completes).
We've consulted with the media team in regards to how to
handle this and they've asked us to do the following:

1) Report HuC as loaded in the getparam IOCTL even if 
load
is still in progress. The media driver uses the IOCTL 
as a
way to check if HuC is enabled and then includes a
secondary check in the batches to get the actual status,
so doing it this way allows userspace to keep working
without changes.

2) Stall all userspace VCS submission until HuC is 
loaded.
Stalls are
expected to be very rare (if any), due to the fact that
HuC is usually loaded before kernel init/resume is
completed.
Motivation to add these complications into i915 are not
clear to me here. I mean there is no HuC on DG2 _yet_ is
the premise of the series, right? So no backwards
compatibility concerns. In this case why jump through the
hoops and not let userspace handle all of this by just
leaving the getparam return the true status?
The main areas impacted by the fact that we can't 
guarantee
that HuC load is complete when i915 starts accepting
submissions are boot and suspend/resume, with the latter
being the main problem; GT reset is not a concern because
HuC now survives it. A suspend/resume can be 
transparent to
userspace and therefore the HuC status can temporarily 
flip
from loaded to not without userspace knowledge, especially
if we start going into deeper suspend states and start
causing HuC resets when we go into runtime suspend. Note
that this is different from what happens during GT 
reset for
older platforms, because in that scenario we guarantee 
that
HuC reload is complete before we restart the submission
back-end, so userspace doesn't notice that the HuC status
change. We had an internal discussion about this problem
with both media and i915 archs and the conclusion was that
the best option is for i915 to stall media submission 
while
HuC (re-)load is in progress.
Resume is potentialy a good reason - I did not pick up on
that from the cover letter. I read the statement about the
unlikely and small window where HuC is not loaded during
kernel init/resume and I guess did not pick up on the 
resume
part.

Waiting for GSC to load HuC from i915 resume is not an 
option?
GSC is an aux device exported by i915, so AFAIU GSC resume
can't start until i915 resume completes.
I'll dig into this in the next few days since I want to
understand how exactly it works. Or someone can help explain.

If in the end conclusion will be that i915 resume indeed 
cannot
wait for GSC, then I think auto-blocking of queued up 
contexts
on media engines indeed sounds unavoidable. Otherwise, as you
explained, user experience post resume wouldn't be good.
Even if we could implement a wait, I'm not sure we should. GSC
resume and HuC reload takes ~300ms in most cases, I don't 
think
we want to block within the i915 resume path for that long.
Yeah maybe not. But entertaining the idea that it is 
technically
possible to block - we could perhaps add uapi for userspace to
mark contexts which want HuC access. Then track if there are 
any
such contexts with outstanding submissions and only wait in
resume if there are. If that would end up significantly less 
code
on the i915 side to maintain is an open.

What would be the end result from users point of view in case
where it suspended during video playback? The proposed solution
from this series sees the video stuck after resume. Maybe
compositor blocks as well since I am not sure how well they
handle one window not providing new data. Probably depends on
the
compositor.

And then with a simpler solution definitely the whole resume
would be delayed by 300ms.

With my ChromeOS hat the stalled media engines does sound 
like a
better solution. But with the maintainer hat I'd like all 
options
evaluated since there is attractiveness if a good enough 
solution
can be achieved with significantly less kernel code.

You say 300ms is typical time for HuC load. How long it is on
other platforms? If much faster then why is it so slow here?
The GSC itself has to come out of suspend before it can perform
the load, which takes a few tens of ms I believe. AFAIU the 
GSC is
also slower in processing the HuC load and auth compared to the
legacy path. The GSC FW team gave a 250ms limit for the time the
GSC FW needs from start of the resume flow to HuC load complete,
so I bumped that to ~300ms to account for all other SW
interactions, plus a bit of buffer. Note that a bit of the SW
overhead is caused by the fact that we have 2 mei modules in 
play
here: mei-gsc, which manages the GSC device itself (including
resume), and mei-pxp, which owns the pxp messaging, including 
HuC
load.
And how long on other platforms (not DG2) do you know? Presumably
there the wait is on the i915 resume path?
I don't have "official" expected load times at hand, but 
looking at
the BAT boot logs for this series for DG1 I see it takes ~10 ms to
load both GuC and HuC:

<7>[    8.157838] i915 0000:03:00.0: [drm:intel_huc_init 
[i915]] GSC
loads huc=no <6>[    8.158632] i915 0000:03:00.0: [drm] GuC 
firmware
i915/dg1_guc_70.1.1.bin version 70.1 <6>[ 8.158634] i915
0000:03:00.0: [drm] HuC firmware i915/dg1_huc_7.9.3.bin version 
7.9
<7>[    8.164255] i915 0000:03:00.0: [drm:guc_enable_communication
[i915]] GuC communication enabled <6>[ 8.166111] i915
0000:03:00.0: [drm] HuC authenticated

Note that we increase the GT frequency all the way to the max 
before
starting the FW load, which speeds things up.

However, do we really need to lie in the getparam? How about
extend or add a new one to separate the loading vs loaded
states? Since userspace does not support DG2 HuC yet this
should be doable.
I don't really have a preference here. The media team asked us
to do it this way because they wouldn't have a use for the
different "in progress" and "done" states. If they're ok with
having separate flags that's fine by me.
Tony, any feedback here?
We don't even have any docs in i915_drm.h in terms of what it
means:
#define I915_PARAM_HUC_STATUS         42

Seems to be a boolean. Status false vs true? Could you add some
docs?
There is documentation above intel_huc_check_status(), which is
also updated in this series. I can move that to i915_drm.h.
That would be great, thanks.

And with so rich return codes already documented and exposed via
uapi - would we really need to add anything new for DG2 apart for
userspace to know that if zero is returned (not a negative error
value) it should retry? I mean is there another negative error
missing which would prevent zero transitioning to one?
I think if the auth fails we currently return 0, because the uc
state in that case would be "TRANSFERRED", i.e. DMA complete 
but not
fully enabled. I don't have anything against changing the FW state
to "ERROR" in this scenario and leave the 0 to mean "not done 
yet",
but I'd prefer the media team to comment on their needs for this
IOCTL before committing to anything.

Currently media doesn't differentiate "delayed loading is in
progress" with "HuC is authenticated and running". If the HuC
authentication eventually fails, the user needs to check the 
debugfs
to know the reason. IMHO, it's not a big problem as this is what we
do even when the IOCTL returns non-zero values. + Carl to comment.
(Side note - debugfs can be assumed to not exist so it is not
interesting to users.)

There isn't currently a "delayed loading is in progress" state, 
that's
the discussion in this thread, if and how to add it.

Getparam it currently documents these states:

  -ENODEV if HuC is not present on this platform,
  -EOPNOTSUPP if HuC firmware is disabled,
  -ENOPKG if HuC firmware was not installed,
  -ENOEXEC if HuC firmware is invalid or mismatched,
  0 if HuC firmware is not running,
  1 if HuC firmware is authenticated and running.

This patch proposed to change this to:

  1 if HuC firmware is authenticated and running or if delayed 
load is
in progress,
  0 if HuC firmware is not running and delayed load is not in 
progress

Alternative idea is for DG2 (well in general) to add some more fine
grained states, so that i915 does not have to use 1 for both running
and loading. This may be adding a new error code for auth fails as
Daniele mentioned. Then UMD can know that if 0 is returned and
platform is DG2 it needs to query it again since it will 
transition to
either 1 or error eventually. This would mean the non error states
would be:

  0 not running (aka loading)
  1 running (and authenticated)

@Daniele - one more thing - can you make sure in the series (if you
haven't already) that if HuC status was in any error before suspend
reload is not re-tried on resume? My thinking is that the error is
likely to persist and we don't want to impose long delay on every
resume afterwards. Makes sense to you?

@Tony - one more question for the UMD. Or two.

How prevalent is usage of HuC on DG2 depending on what codecs 
need it?
Do you know in advance, before creating a GEM context, that HuC
commands will be sent to the engine or this changes at runtime?
HuC is needed for all codecs while HW bit rate control (CBR, VBR) 
is in use.
It's also used by content protection. And UMD doesn't know if it 
will be used
later at context creation time.

from UMD perspective, We don’t care much on the normal 
initialization process
because, I could not image that a system is boot up, and user 
select a crypted content
to playback, and huc is still not ready.
of course, We are  also ok to query the huc status twice, and wait 
if the status is "0 not running"
to avoid potential issue.

I suppose the main possible issue will happen in the 
hibernation/awake process, it is transparent to UMD.
UMD will not call ioctrl  to query huc status in this process, and 
will continue to send command buffer to KMD.

I think there is an agreement that it is ok to return 0 to mark the 
load still in progress and 1 for load & auth complete. However, 
double checking the code it turns out that we currently return 0 on 
load failure, even if that's not particularly clear from the 
comment. I can easily change that to be an error code, but not sure 
if that's considered an API breakage considering it's not a well 
documented behavior. I believe that on pre-DG2 userspace considers 1 
as ok and everything else as failure, so changing the ioctl to 
return an error code on failure and 0 for load pending (with the 
latter being a DG2-esclusive code for now) should be safe, but I'd 
like confirmation that I'm not breaking API before sending the 
relevant code.

The UMD code is like this:

    struct drm_i915_getparam gp;
    int32_t value;
    gp.param = I915_PARAM_HUC_STATUS;
    gp.value = &value;
    ret = ioctl(fd, DRM_IOCTL_I915_GETPARAM, &gp);
    if (ret != 0)
        hasHuC = 0
    else
        if (value == 0)
            hasHuC = 0;
        else
            hasHuC = 1;

Currently the behavior of i915 is:

    if there is an error, ioctl returns -1, and set errno as 
ENODEV/EOPNOTSUPP/ENOPKG/ENOEXEC;

Not exactly. The current map of FW states to ioctl behavior is:

FIRMWARE_NOT_SUPPORTED -> return -ENODEV;
FIRMWARE_DISABLED -> return -EOPNOTSUPP;
FIRMWARE_MISSING ->  return -ENOPKG;
FIRMWARE_ERROR -> return -ENOEXEC

For these 4 errors, ioctl is returning -1 and setting the errno as one 
of the error values above.

ioctl is not returning the error values. It only returns -1.

This is exactly the behavior of i915 observed from the user space. You 
can verify it with the code snippet.

The igt huc_copy test is also checking the errno instead of the ret value:

https://gitlab.freedesktop.org/drm/igt-gpu-tools/-/blob/master/tests/i915/gem_huc_copy.c#L66

FIRMWARE_INIT_FAIL -> return 0, value 0
FIRMWARE_LOAD_FAIL -> return 0, value 0

If you change these two conditions to return -EIO, then what the user 
space actually gets is going to be:

    ioctl returns -1, and errno==EIO.

It's not going to break the current UMD as you said below.

Thanks,

Tony

FIRMWARE_RUNNING -> return 0, value 1

So we do have 2 error state in which the ioctl succeeds.

    otherwise, set *(gp.value) as 0 if HuC is not running, or 1 if 
HuC is authenticated.

Hi Daniele, which value are you going to change - the "ret" or the 
"value"?

The idea is to change the 2 FAIL states above to return an error 
(probably -EIO) instead of setting value to 0. This would be 
compatible with your code snippet, because it'll hit the ret != 0 
condition. Value 0 can then be re-purposed for DG2 to indicate "load 
pending", which would not be compatible with your current code, but 
being a new addition for a new platform does not necessarily need to be.

This said, I'm not sure if changing the return behavior of INIT_FAIL 
and LOAD_FAIL is API breakage or not, given that it won't impact 
userspace expectations. Tvrtko, any feedback here?

Thanks,
Daniele

Thanks,

Tony

Thanks,
Daniele

Thanks,

Tony

Regards,

Tvrtko