Re: [char-misc-next 3/4] mei: pxp: re-enable client on errors

"Teres Alexis, Alan Previn" <alan.previn.teres.alexis@xxxxxxxxx> · Wed, 15 Nov 2023 15:58:37 +0000

On Wed, 2023-11-15 at 13:31 +0000, Tvrtko Ursulin wrote:
> On 14/11/2023 15:31, Teres Alexis, Alan Previn wrote:
> > On Tue, 2023-11-14 at 16:00 +0200, Ville Syrjälä wrote:
> > > On Wed, Oct 11, 2023 at 02:01:56PM +0300, Tomas Winkler wrote:
> > > 
> 
> Regardless of the mei_pxp_send_message being temporarily broken, doesn't 
> Ville's logs suggest the PXP detection is altogether messed up? AFAIR 
> the plan was exactly to avoid stalls during Mesa init and new uapi was 
> added to achieve that. But it doesn't seem to be working?!
> 
> commit 3b918f4f0c8b5344af4058f1a12e2023363d0097
> Author: Alan Previn <alan.previn.teres.alexis@xxxxxxxxx>
> Date:   Wed Aug 2 11:25:50 2023 -0700
> 
>      drm/i915/pxp: Optimize GET_PARAM:PXP_STATUS
> 
>      After recent discussions with Mesa folks, it was requested
>      that we optimize i915's GET_PARAM for the PXP_STATUS without
>      changing the UAPI spec.
> 
>      Add these additional optimizations:
>         - If any PXP initializatoin flow failed, then ensure that
>           we catch it so that we can change the returned PXP_STATUS
>           from "2" (i.e. 'PXP is supported but not yet ready')
>           to "-ENODEV". This typically should not happen and if it
>           does, we have a platform configuration issue.
>         - If a PXP arbitration session creation event failed
>           due to incorrect firmware version or blocking SOC fusing
>           or blocking BIOS configuration (platform reasons that won't
>           change if we retry), then reflect that blockage by also
>           returning -ENODEV in the GET_PARAM:PXP_STATUS.
>         - GET_PARAM:PXP_STATUS should not wait at all if PXP is
>           supported but non-i915 dependencies (component-driver /
>           firmware) we are still pending to complete the init flows.
>           In this case, just return "2" immediately (i.e. 'PXP is
>           supported but not yet ready').
> 
> AFAIU is things failed there shouldn't be long waits, repeated/constant 
> ones even less so.
> 
alan: I agree - but I don't think MESA is detecting for PXP for above case...
which is designed to be quick if using the GET_PARAM IOCTL - i believe MESA
may actually be opting in to enforce PXP. When this happens we are required
to be stringent when managing the protected-hw-sessions and for certain PXP
operations we retry when a failure occurs - one in particular is the PXP hw
session teardown or reset. We are expected to retry up to 3 times to ensure
that the session is properly torn down (requirements from architecture)
unless the error returned from FW indicated that we dont have proper fusing or
security assets in the platform (i.e. lower level aspects of the platform won't
permit PXP support). All of this is already coded.

So we have two problems here:

1. The code for enforcing PXP when PXP is explicitly requested by UMD is
expecting the mei-component driver to follow the mei-component-interface spec
but a change was introduced by Alex that caused a bug in this lowest layer spec
(see: https://elixir.bootlin.com/linux/latest/source/drivers/misc/mei/pxp/mei_pxp.c#L25)
"Return: 0 on Success, <0 on Failure" for send and "Return: bytes sent on
Success, <0 on Failure" for receive - the change they made was returning
a positive number on send's success and i915 was checking the according to spec:

         ret = pxp_component->ops->send(pxp_component->tee_dev, msg_in, msg_in_size,
                                        PXP_TRANSPORT_TIMEOUT_MS);
----->   if (ret) {
                 drm_err(&i915->drm, "Failed to send PXP TEE message\n");
                 goto unlock;
         }

         ret = pxp_component->ops->recv(pxp_component->tee_dev, msg_out, msg_out_max_size,
                                        PXP_TRANSPORT_TIMEOUT_MS);
         if (ret < 0) {
                 drm_err(&i915->drm, "Failed to receive PXP TEE message\n");
                 goto unlock;
         }

So i really cannot guarantee anything if the mei code itself is broken and not
complying to the spec we agreed on - i can change the above check for send
from "if (ret)" to "if (ret < 0)" but that would only be "working aroung"
the mei bug. As a side note, when we bail on a successful "send" thinking its
a failure, and try to "send" again, it triggers an link reset in the mei-hw
link because the receive was never picked up. IF i understand this correctly,
this is hw-fw design requirement, i.e. that the send-recv be an atomic FSM
sequence. That said, if the the send-failure was a real failure, the timeout
would have not been hit and the retry would have been much faster. As a side note,
Alex from mei team did reply to me offline with indication that the fix was
supposed to have been sent via "urgent queue" and he will follow up on that. 

2. I dont believe MESA is using the GET_PARAM ioctl for detecting PXP availibility,
I think MESA is explicitly opting into creating PXP contexts. That will of course
trigger the the backend PXP code that actually executes any PXP operation. Previously
in the older code MESA was creating PXP contexts on every context creation just to
test if PXP was available which is much worse - especially on MTL which can take
much longer for the underlying firmware to come up. I have an ongoing email thread
with the MESA folks which i will include you on to clarify this.

Problem #1 is resolved and the patch-fix will need to be merged. But Problem #2 is
still an open because Ville said he did not modify MESA to opt-in to enforce
protected sessions for all the PXP contexts.

...alan