Re: [PATCH v1] misc: fastrpc: Trigger a panic using BUG_ON in device release

Abhishek Singh <quic_abhishes@xxxxxxxxxxx> · Fri, 30 Aug 2024 14:14:15 +0530

On 8/13/2024 3:07 PM, Greg KH wrote:
> On Mon, Aug 05, 2024 at 04:36:28PM +0530, Abhishek Singh wrote:
>>
>> On 7/30/2024 12:46 PM, Greg KH wrote:
>>> On Tue, Jul 30, 2024 at 12:39:45PM +0530, Abhishek Singh wrote:
>>>> The user process on ARM closes the device node while closing the
>>>> session, which triggers a remote call to terminate the PD running on the
>>>> DSP. If the DSP is in an unstable state and cannot process the remote
>>>> request from the HLOS, glink fails to deliver the kill request to the
>>>> DSP, resulting in a timeout error. Currently, this error is ignored,
>>>> and the session is closed, causing all the SMMU mappings associated
>>>> with that specific PD to be removed. However, since the PD is still
>>>> operational on the DSP, any attempt to access these SMMU mappings
>>>> results in an SMMU fault, leading to a panic.  As the SMMU mappings
>>>> have already been removed, there is no available information on the
>>>> DSP to determine the root cause of its unresponsiveness to remote
>>>> calls. As the DSP is unresponsive to all process remote calls, use
>>>> BUG_ON to prevent the removal of SMMU mappings and to properly
>>>> identify the root cause of the DSP’s unresponsiveness to the remote
>>>> calls.
>>>>
>>>> Signed-off-by: Abhishek Singh <quic_abhishes@xxxxxxxxxxx>
>>>> ---
>>>>  drivers/misc/fastrpc.c | 4 ++++
>>>>  1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/drivers/misc/fastrpc.c b/drivers/misc/fastrpc.c
>>>> index 5204fda51da3..bac9c749564c 100644
>>>> --- a/drivers/misc/fastrpc.c
>>>> +++ b/drivers/misc/fastrpc.c
>>>> @@ -97,6 +97,7 @@
>>>>  #define FASTRPC_RMID_INIT_CREATE_STATIC	8
>>>>  #define FASTRPC_RMID_INIT_MEM_MAP      10
>>>>  #define FASTRPC_RMID_INIT_MEM_UNMAP    11
>>>> +#define PROCESS_KILL_SC 0x01010000
>>>>  
>>>>  /* Protection Domain(PD) ids */
>>>>  #define ROOT_PD		(0)
>>>> @@ -1128,6 +1129,9 @@ static int fastrpc_invoke_send(struct fastrpc_session_ctx *sctx,
>>>>  	fastrpc_context_get(ctx);
>>>>  
>>>>  	ret = rpmsg_send(cctx->rpdev->ept, (void *)msg, sizeof(*msg));
>>>> +	/* trigger panic if glink communication is broken and the message is for PD kill */
>>>> +	BUG_ON((ret == -ETIMEDOUT) && (handle == FASTRPC_INIT_HANDLE) &&
>>>> +			(ctx->sc == PROCESS_KILL_SC));
>>>
>>> You just crashed the machine completely, sorry, but no, properly handle
>>> the issue and clean up if you can detect it, do not break systems.
>>>
>> But the Glink communication with the DSP is already broken; we cannot
>> communicate with the DSP. The system will crash if we proceed with
>> cleanup on the ARM side. If we don't do the cleanup, a resource leak
>> will occur, and eventually the system will become unusable. That is
>> why I am crashing the device.
> 
> Then explicitly call panic() if you think you really want to shut the
> system down.
>
What does it mean to explicitly call panic()? Are you suggesting that we
use panic() instead of BUG_ON()?
> 
> greg k-h
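
For reference, the condition tested in the patch can be pulled out into a
small predicate, so that the caller decides what to do with it (handle the
error, or call panic() as suggested above) instead of burying the decision
inside BUG_ON(). This is only a user-space sketch under stated assumptions:
pd_kill_timed_out() and pd_kill_selfcheck() are hypothetical names, and the
value 1 for FASTRPC_INIT_HANDLE is assumed here rather than taken from the
diff. PROCESS_KILL_SC is the value the patch defines.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed value for illustration; the diff above does not show it. */
#define FASTRPC_INIT_HANDLE 1u
/* Scalar value the patch introduces for the PD kill request. */
#define PROCESS_KILL_SC 0x01010000u

/*
 * Hypothetical helper: true only when the send timed out AND the message
 * was the PD kill request on the init handle, which is the same three-way
 * test the patch passes to BUG_ON().
 */
bool pd_kill_timed_out(int ret, uint32_t handle, uint32_t sc)
{
	return ret == -ETIMEDOUT &&
	       handle == FASTRPC_INIT_HANDLE &&
	       sc == PROCESS_KILL_SC;
}

/* Small self-check showing which combinations match. */
void pd_kill_selfcheck(void)
{
	/* Timeout on the kill request: the case the patch targets. */
	assert(pd_kill_timed_out(-ETIMEDOUT, FASTRPC_INIT_HANDLE, PROCESS_KILL_SC));
	/* Successful send: never matches. */
	assert(!pd_kill_timed_out(0, FASTRPC_INIT_HANDLE, PROCESS_KILL_SC));
	/* Timeout on some other scalar: not a PD kill, so no match. */
	assert(!pd_kill_timed_out(-ETIMEDOUT, FASTRPC_INIT_HANDLE, 0));
}
```

In the driver, the result of such a check would be used right after
rpmsg_send() returns; whether the right reaction is an error path or an
explicit panic() is the open question in this thread.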