On Thu, 24 Jun 2021 10:23:51 +0100 Steven Price <steven.price@xxxxxxx> wrote: > > static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status) > > { > > - int j; > > + struct panfrost_job *done[NUM_JOB_SLOTS][2] = {}; > > + struct panfrost_job *failed[NUM_JOB_SLOTS] = {}; > > + u32 js_state, js_events = 0; > > + unsigned int i, j; > > > > - dev_dbg(pfdev->dev, "jobslot irq status=%x\n", status); > > + while (status) { > > + for (j = 0; j < NUM_JOB_SLOTS; j++) { > > + if (status & JOB_INT_MASK_DONE(j)) { > > + if (done[j][0]) { > > + done[j][1] = panfrost_dequeue_job(pfdev, j); > > + WARN_ON(!done[j][1]); > > + } else { > > + done[j][0] = panfrost_dequeue_job(pfdev, j); > > + WARN_ON(!done[j][0]); > > NIT: I'd be tempted to move this WARN_ON into panfrost_dequeue_job() as > it's relevant for any call to the function. Makes sense. I'll move those WARN_ON()s. > > > + } > > + } > > > > - for (j = 0; status; j++) { > > - u32 mask = MK_JS_MASK(j); > > + if (status & JOB_INT_MASK_ERR(j)) { > > + /* Cancel the next submission. Will be submitted > > + * after we're done handling this failure if > > + * there's no reset pending. > > + */ > > + job_write(pfdev, JS_COMMAND_NEXT(j), JS_COMMAND_NOP); > > + failed[j] = panfrost_dequeue_job(pfdev, j); > > + } > > + } > > > > - if (!(status & mask)) > > + /* JS_STATE is sampled when JOB_INT_CLEAR is written. > > + * For each BIT(slot) or BIT(slot + 16) bit written to > > + * JOB_INT_CLEAR, the corresponding bits in JS_STATE > > + * (BIT(slot) and BIT(slot + 16)) are updated, but this > > + * is racy. If we only have one job done at the time we > > + * read JOB_INT_RAWSTAT but the second job fails before we > > + * clear the status, we end up with a status containing > > + * only the DONE bit and consider both jobs as DONE since > > + * JS_STATE reports both NEXT and CURRENT as inactive. > > + * To prevent that, let's repeat this clear+read steps > > + * until status is 0. > > + */ > > + job_write(pfdev, JOB_INT_CLEAR, status); > > + js_state = job_read(pfdev, JOB_INT_JS_STATE); > > This seems a bit dodgy. The spec says that JOB_INT_JS_STATE[1] is > updated only for the job slots which have bits set in the JOB_INT_CLEAR. > So there's potentially two problems: > > * The spec makes no gaurentee about the values of the bits for other > slots. But we're not masking off those bits. > > * If we loop (e.g. because the other slot finishes while handling the > first interrupt) then we may lose the state for the first slot. > > I'm not sure what the actual hardware returns in the bits which are > unrelated to the previous JOB_INT_CLEAR - kbase is careful only to > consider the bits relating to the slot it's currently dealing with. Hm, I see. How about something like that? struct panfrost_job *done[NUM_JOB_SLOTS][2] = {}; struct panfrost_job *failed[NUM_JOB_SLOTS] = {}; u32 js_state = 0, js_events = 0; unsigned int i, j; while (status) { u32 js_state_mask = 0; for (j = 0; j < NUM_JOB_SLOTS; j++) { if (status & MK_JS_MASK(j)) js_state_mask |= MK_JS_MASK(j); if (status & JOB_INT_MASK_DONE(j)) { if (done[j][0]) { done[j][1] = panfrost_dequeue_job(pfdev, j); WARN_ON(!done[j][1]); } else { done[j][0] = panfrost_dequeue_job(pfdev, j); WARN_ON(!done[j][0]); } } if (status & JOB_INT_MASK_ERR(j)) { /* Cancel the next submission. Will be submitted * after we're done handling this failure if * there's no reset pending. */ job_write(pfdev, JS_COMMAND_NEXT(j), JS_COMMAND_NOP); failed[j] = panfrost_dequeue_job(pfdev, j); } } /* JS_STATE is sampled when JOB_INT_CLEAR is written. * For each BIT(slot) or BIT(slot + 16) bit written to * JOB_INT_CLEAR, the corresponding bits in JS_STATE * (BIT(slot) and BIT(slot + 16)) are updated, but this * is racy. If we only have one job done at the time we * read JOB_INT_RAWSTAT but the second job fails before we * clear the status, we end up with a status containing * only the DONE bit and consider both jobs as DONE since * JS_STATE reports both NEXT and CURRENT as inactive. * To prevent that, let's repeat this clear+read steps * until status is 0. */ job_write(pfdev, JOB_INT_CLEAR, status); js_state &= ~js_state_mask; js_state |= job_read(pfdev, JOB_INT_JS_STATE) & js_state_mask; js_events |= status; status = job_read(pfdev, JOB_INT_RAWSTAT); } > > [1] Actually JOB_IRQ_JS_STATE - I sometimes think we should rename to > match the spec, I keep searching the docs for the wrong name ;) I can add a patch doing that :-). > > > > - status &= ~mask; > > + for (j = 0; j < NUM_JOB_SLOTS; j++) { > > + if (!(js_events & MK_JS_MASK(j))) > > + continue; > > + > > + if (!failed[j] || !pfdev->jobs[j][0]) > > + continue; > > + > > + if (pfdev->jobs[j][0]->jc == 0) { > > + /* The job was cancelled, signal the fence now */ > > + struct panfrost_job *canceled = panfrost_dequeue_job(pfdev, j); > > + > > + panfrost_mmu_as_put(pfdev, canceled->file_priv->mmu); > > + panfrost_devfreq_record_idle(&pfdev->pfdevfreq); > > + > > + dma_fence_set_error(canceled->done_fence, -ECANCELED); > > + dma_fence_signal_locked(canceled->done_fence); > > + pm_runtime_put_autosuspend(pfdev->dev); > > + } else if (!atomic_read(&pfdev->reset.pending)) { > > + /* Resume the job we stopped if no reset is pending */ > > + job_write(pfdev, JS_COMMAND_NEXT(j), JS_COMMAND_START); > > If I'm following correctly this is resubmitting a soft-stopped job. But > I haven't seen where the new JC pointer is written. Not exactly. It's submitting a job that was queued but not started. When we see a fault, we write NOP to COMMAND_NEXT so the next job is not started when the interrupt is cleared. And here we're just requeuing it.