On 24/06/2021 10:56, Boris Brezillon wrote: > On Thu, 24 Jun 2021 10:23:51 +0100 > Steven Price <steven.price@xxxxxxx> wrote: > >>> static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status) >>> { >>> - int j; >>> + struct panfrost_job *done[NUM_JOB_SLOTS][2] = {}; >>> + struct panfrost_job *failed[NUM_JOB_SLOTS] = {}; >>> + u32 js_state, js_events = 0; >>> + unsigned int i, j; >>> >>> - dev_dbg(pfdev->dev, "jobslot irq status=%x\n", status); >>> + while (status) { >>> + for (j = 0; j < NUM_JOB_SLOTS; j++) { >>> + if (status & JOB_INT_MASK_DONE(j)) { >>> + if (done[j][0]) { >>> + done[j][1] = panfrost_dequeue_job(pfdev, j); >>> + WARN_ON(!done[j][1]); >>> + } else { >>> + done[j][0] = panfrost_dequeue_job(pfdev, j); >>> + WARN_ON(!done[j][0]); >> >> NIT: I'd be tempted to move this WARN_ON into panfrost_dequeue_job() as >> it's relevant for any call to the function. > > Makes sense. I'll move those WARN_ON()s. > >> >>> + } >>> + } >>> >>> - for (j = 0; status; j++) { >>> - u32 mask = MK_JS_MASK(j); >>> + if (status & JOB_INT_MASK_ERR(j)) { >>> + /* Cancel the next submission. Will be submitted >>> + * after we're done handling this failure if >>> + * there's no reset pending. >>> + */ >>> + job_write(pfdev, JS_COMMAND_NEXT(j), JS_COMMAND_NOP); >>> + failed[j] = panfrost_dequeue_job(pfdev, j); >>> + } >>> + } >>> >>> - if (!(status & mask)) >>> + /* JS_STATE is sampled when JOB_INT_CLEAR is written. >>> + * For each BIT(slot) or BIT(slot + 16) bit written to >>> + * JOB_INT_CLEAR, the corresponding bits in JS_STATE >>> + * (BIT(slot) and BIT(slot + 16)) are updated, but this >>> + * is racy. If we only have one job done at the time we >>> + * read JOB_INT_RAWSTAT but the second job fails before we >>> + * clear the status, we end up with a status containing >>> + * only the DONE bit and consider both jobs as DONE since >>> + * JS_STATE reports both NEXT and CURRENT as inactive. >>> + * To prevent that, let's repeat this clear+read steps >>> + * until status is 0. >>> + */ >>> + job_write(pfdev, JOB_INT_CLEAR, status); >>> + js_state = job_read(pfdev, JOB_INT_JS_STATE); >> >> This seems a bit dodgy. The spec says that JOB_INT_JS_STATE[1] is >> updated only for the job slots which have bits set in the JOB_INT_CLEAR. >> So there's potentially two problems: >> >> * The spec makes no gaurentee about the values of the bits for other >> slots. But we're not masking off those bits. >> >> * If we loop (e.g. because the other slot finishes while handling the >> first interrupt) then we may lose the state for the first slot. >> >> I'm not sure what the actual hardware returns in the bits which are >> unrelated to the previous JOB_INT_CLEAR - kbase is careful only to >> consider the bits relating to the slot it's currently dealing with. > > Hm, I see. How about something like that? > > struct panfrost_job *done[NUM_JOB_SLOTS][2] = {}; > struct panfrost_job *failed[NUM_JOB_SLOTS] = {}; > u32 js_state = 0, js_events = 0; > unsigned int i, j; > > while (status) { > u32 js_state_mask = 0; > > for (j = 0; j < NUM_JOB_SLOTS; j++) { > if (status & MK_JS_MASK(j)) > js_state_mask |= MK_JS_MASK(j); > > if (status & JOB_INT_MASK_DONE(j)) { > if (done[j][0]) { > done[j][1] = panfrost_dequeue_job(pfdev, j); > WARN_ON(!done[j][1]); > } else { > done[j][0] = panfrost_dequeue_job(pfdev, j); > WARN_ON(!done[j][0]); > } > } > > if (status & JOB_INT_MASK_ERR(j)) { > /* Cancel the next submission. Will be submitted > * after we're done handling this failure if > * there's no reset pending. > */ > job_write(pfdev, JS_COMMAND_NEXT(j), JS_COMMAND_NOP); > failed[j] = panfrost_dequeue_job(pfdev, j); > } > } > > /* JS_STATE is sampled when JOB_INT_CLEAR is written. > * For each BIT(slot) or BIT(slot + 16) bit written to > * JOB_INT_CLEAR, the corresponding bits in JS_STATE > * (BIT(slot) and BIT(slot + 16)) are updated, but this > * is racy. If we only have one job done at the time we > * read JOB_INT_RAWSTAT but the second job fails before we > * clear the status, we end up with a status containing > * only the DONE bit and consider both jobs as DONE since > * JS_STATE reports both NEXT and CURRENT as inactive. > * To prevent that, let's repeat this clear+read steps > * until status is 0. > */ > job_write(pfdev, JOB_INT_CLEAR, status); > js_state &= ~js_state_mask; > js_state |= job_read(pfdev, JOB_INT_JS_STATE) & js_state_mask; > js_events |= status; > status = job_read(pfdev, JOB_INT_RAWSTAT); > } > That looks like it should work. >> >> [1] Actually JOB_IRQ_JS_STATE - I sometimes think we should rename to >> match the spec, I keep searching the docs for the wrong name ;) > > I can add a patch doing that :-). > >>> >>> - status &= ~mask; >>> + for (j = 0; j < NUM_JOB_SLOTS; j++) { >>> + if (!(js_events & MK_JS_MASK(j))) >>> + continue; >>> + >>> + if (!failed[j] || !pfdev->jobs[j][0]) >>> + continue; >>> + >>> + if (pfdev->jobs[j][0]->jc == 0) { >>> + /* The job was cancelled, signal the fence now */ >>> + struct panfrost_job *canceled = panfrost_dequeue_job(pfdev, j); >>> + >>> + panfrost_mmu_as_put(pfdev, canceled->file_priv->mmu); >>> + panfrost_devfreq_record_idle(&pfdev->pfdevfreq); >>> + >>> + dma_fence_set_error(canceled->done_fence, -ECANCELED); >>> + dma_fence_signal_locked(canceled->done_fence); >>> + pm_runtime_put_autosuspend(pfdev->dev); >>> + } else if (!atomic_read(&pfdev->reset.pending)) { >>> + /* Resume the job we stopped if no reset is pending */ >>> + job_write(pfdev, JS_COMMAND_NEXT(j), JS_COMMAND_START); >> >> If I'm following correctly this is resubmitting a soft-stopped job. But >> I haven't seen where the new JC pointer is written. > > Not exactly. It's submitting a job that was queued but not started. > When we see a fault, we write NOP to COMMAND_NEXT so the next job is > not started when the interrupt is cleared. And here we're just > requeuing it. > Ah that makes sense, then can you at least update the comment to not say "stopped". Perhaps "removed" instead? I think that's what confused me. Thanks, Steve