Reviewed-by: Jacek Lawrynowicz <jacek.lawrynowicz@xxxxxxxxxxxxxxx> On 1/7/2025 6:32 PM, Maciej Falkowski wrote: > From: Karol Wachowski <karol.wachowski@xxxxxxxxx> > > Mark as invalid context of a job that returned HW context violation > error and queue work that aborts jobs from faulty context. > Add engine reset to the context abort thread handler to not only abort > currently executing jobs but also to ensure NPU invalid state recovery. > > Signed-off-by: Karol Wachowski <karol.wachowski@xxxxxxxxx> > Signed-off-by: Maciej Falkowski <maciej.falkowski@xxxxxxxxxxxxxxx> > --- > drivers/accel/ivpu/ivpu_job.c | 25 +++++++++++++++++++++++++ > 1 file changed, 25 insertions(+) > > diff --git a/drivers/accel/ivpu/ivpu_job.c b/drivers/accel/ivpu/ivpu_job.c > index c93ea37062d7..3c162ac41a1d 100644 > --- a/drivers/accel/ivpu/ivpu_job.c > +++ b/drivers/accel/ivpu/ivpu_job.c > @@ -533,6 +533,26 @@ static int ivpu_job_signal_and_destroy(struct ivpu_device *vdev, u32 job_id, u32 > > lockdep_assert_held(&vdev->submitted_jobs_lock); > > + job = xa_load(&vdev->submitted_jobs_xa, job_id); > + if (!job) > + return -ENOENT; > + > + if (job_status == VPU_JSM_STATUS_MVNCI_CONTEXT_VIOLATION_HW) { > + guard(mutex)(&job->file_priv->lock); > + > + if (job->file_priv->has_mmu_faults) > + return 0; > + > + /* > + * Mark context as faulty and defer destruction of the job to jobs abort thread > + * handler to synchronize between both faults and jobs returning context violation > + * status and ensure both are handled in the same way > + */ > + job->file_priv->has_mmu_faults = true; > + queue_work(system_wq, &vdev->context_abort_work); > + return 0; > + } > + > job = ivpu_job_remove_from_submitted_jobs(vdev, job_id); > if (!job) > return -ENOENT; > @@ -946,6 +966,9 @@ void ivpu_context_abort_work_fn(struct work_struct *work) > unsigned long ctx_id; > unsigned long id; > > + if (vdev->fw->sched_mode == VPU_SCHEDULING_MODE_HW) > + ivpu_jsm_reset_engine(vdev, 0); > + > mutex_lock(&vdev->context_list_lock); > xa_for_each(&vdev->context_xa, ctx_id, file_priv) { > if (!file_priv->has_mmu_faults || file_priv->aborted) > @@ -959,6 +982,8 @@ void ivpu_context_abort_work_fn(struct work_struct *work) > > if (vdev->fw->sched_mode != VPU_SCHEDULING_MODE_HW) > return; > + > + ivpu_jsm_hws_resume_engine(vdev, 0); > /* > * In hardware scheduling mode NPU already has stopped processing jobs > * and won't send us any further notifications, thus we have to free job related resources