Re: [PATCH 12/14] accel/ivpu: Add handling of VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW

Jacek Lawrynowicz <jacek.lawrynowicz@xxxxxxxxxxxxxxx> · Thu, 9 Jan 2025 09:29:21 +0100

Reviewed-by: Jacek Lawrynowicz <jacek.lawrynowicz@xxxxxxxxxxxxxxx>

On 1/7/2025 6:32 PM, Maciej Falkowski wrote:
> From: Karol Wachowski <karol.wachowski@xxxxxxxxx>
> 
> Mark the context of a job that returned a HW context violation error as
> invalid and queue work that aborts jobs from the faulty context.
> Add an engine reset to the context abort thread handler so that it not
> only aborts currently executing jobs but also recovers the NPU from an
> invalid state.
> 
> Signed-off-by: Karol Wachowski <karol.wachowski@xxxxxxxxx>
> Signed-off-by: Maciej Falkowski <maciej.falkowski@xxxxxxxxxxxxxxx>
> ---
>  drivers/accel/ivpu/ivpu_job.c | 25 +++++++++++++++++++++++++
>  1 file changed, 25 insertions(+)
> 
> diff --git a/drivers/accel/ivpu/ivpu_job.c b/drivers/accel/ivpu/ivpu_job.c
> index c93ea37062d7..3c162ac41a1d 100644
> --- a/drivers/accel/ivpu/ivpu_job.c
> +++ b/drivers/accel/ivpu/ivpu_job.c
> @@ -533,6 +533,26 @@ static int ivpu_job_signal_and_destroy(struct ivpu_device *vdev, u32 job_id, u32
>  
>  	lockdep_assert_held(&vdev->submitted_jobs_lock);
>  
> +	job = xa_load(&vdev->submitted_jobs_xa, job_id);
> +	if (!job)
> +		return -ENOENT;
> +
> +	if (job_status == VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW) {
> +		guard(mutex)(&job->file_priv->lock);
> +
> +		if (job->file_priv->has_mmu_faults)
> +			return 0;
> +
> +		/*
> +		 * Mark context as faulty and defer destruction of the job to jobs abort thread
> +		 * handler to synchronize between both faults and jobs returning context violation
> +		 * status and ensure both are handled in the same way
> +		 */
> +		job->file_priv->has_mmu_faults = true;
> +		queue_work(system_wq, &vdev->context_abort_work);
> +		return 0;
> +	}
> +
>  	job = ivpu_job_remove_from_submitted_jobs(vdev, job_id);
>  	if (!job)
>  		return -ENOENT;
> @@ -946,6 +966,9 @@ void ivpu_context_abort_work_fn(struct work_struct *work)
>  	unsigned long ctx_id;
>  	unsigned long id;
>  
> +	if (vdev->fw->sched_mode == VPU_SCHEDULING_MODE_HW)
> +		ivpu_jsm_reset_engine(vdev, 0);
> +
>  	mutex_lock(&vdev->context_list_lock);
>  	xa_for_each(&vdev->context_xa, ctx_id, file_priv) {
>  		if (!file_priv->has_mmu_faults || file_priv->aborted)
> @@ -959,6 +982,8 @@ void ivpu_context_abort_work_fn(struct work_struct *work)
>  
>  	if (vdev->fw->sched_mode != VPU_SCHEDULING_MODE_HW)
>  		return;
> +
> +	ivpu_jsm_hws_resume_engine(vdev, 0);
>  	/*
>  	 * In hardware scheduling mode NPU already has stopped processing jobs
>  	 * and won't send us any further notifications, thus we have to free job related resources