17.07.2023 11:59, Steven Price пишет: > On 17/07/2023 09:49, Boris Brezillon wrote: >> On Mon, 17 Jul 2023 09:06:56 +0100 >> Steven Price <steven.price@xxxxxxx> wrote: >> >>> On 17/07/2023 08:49, Boris Brezillon wrote: >>>> On Mon, 17 Jul 2023 10:20:02 +0300 >>>> Dmitry Osipenko <dmitry.osipenko@xxxxxxxxxxxxx> wrote: >>>> >>>>> Hi, >>>>> >>>>> On 7/17/23 10:05, Boris Brezillon wrote: >>>>>> Hi Dmitry, >>>>>> >>>>>> On Mon, 17 Jul 2023 09:52:54 +0300 >>>>>> Dmitry Osipenko <dmitry.osipenko@xxxxxxxxxxxxx> wrote: >>>>>> >>>>>>> Panfrost IRQ handler may stuck for a long time, for example this happens >>>>>>> when there is a bad HDMI connection and HDMI handler takes a long time to >>>>>>> finish processing, holding Panfrost. Make Panfrost's job timeout handler >>>>>>> to sync IRQ before checking fence signal status in order to prevent >>>>>>> spurious job timeouts due to a slow IRQ processing. >>>>>> >>>>>> Feels like the problem should be fixed in the HDMI encoder driver >>>>>> instead, so it doesn't stall the whole system when processing its >>>>>> IRQs (use threaded irqs, maybe). I honestly don't think blocking in the >>>>>> job timeout path to flush IRQs is a good strategy. >>>>> >>>>> The syncing is necessary to have for correctness regardless of whether >>>>> it's HDMI problem or something else, there could be other reasons for >>>>> CPU to delay IRQ processing. It's wrong to say that hw is hung, while >>>>> it's not. >>>> >>>> Well, hardware is effectively hung, if not indefinitely, at least >>>> temporarily. All you do here is block in the timeout handler path >>>> waiting for the GPU interrupt handlers to finish, handler that's >>>> probably waiting in the queue, because the raw HDMI handler is blocking >>>> it somehow. So, in the end, you might just be delaying the time of HWR a >>>> bit more. I know it's not GPU's fault in that case, and the job could >>>> have finished in time if the HDMI encoder hadn't stall the interrupt >>>> handling pipeline, but I'm not sure we should care for that specific >>>> situation. And more importantly, if it took more than 500ms to get a >>>> frame rendered (or, in that case, to get the event that a frame is >>>> rendered), you already lost, so I'm not sure correctness matters: >>>> rendering didn't make it in time, and the watchdog kicked in to try and >>>> unblock the situation. Feels like we're just papering over an HDMI >>>> encoder driver bug here, really. >>> >>> TLDR; I don't see any major downsides and it stops the GPU getting the >>> blame for something that isn't its fault. >> >> True, but doing that will also give the impression that things run fine, >> but very slowly, which would put the blame on the userspace driver :P. > > Maybe I'm tainted by years of the kernel driver getting the blame > because it was the one that printed the message ;p > >>> >>> I guess the question is whether panfrost should work on a system which >>> has terrible IRQ latency. At the moment we have a synchronize_irq() call >>> in panfrost_reset() which effectively does the same thing, but with all >>> the overhead/spew of resetting the GPU. >> >> Unless I'm mistaken, the synchronize_irq() in panfrost_reset() is >> mostly here to make sure there's no race between the interrupt >> handler and the reset logic (we mask interrupts, and then synchronize, >> guaranteeing that the interrupt handler won't be running after that >> point), and it happens after we've printed the error message, so the >> user knows something was blocked at least. > > Yes the synchronize_irq() in panfrost_reset() is there to avoid a real > race - but it has the side effect of flushing out the IRQ and therefore > the job gets completed successfully. And in the high IRQ latency > situation makes the actual reset redundant. > >>> >>> Of course in the case Dmitry is actually talking about - it does seem >>> like the HDMI encoder has a bug which needs fixing. There are plenty of >>> other things that will break if IRQ latency gets that bad. >> >> Yes, that's my point. The GPU driver is the only one to complain right >> now, but the HDMI encoder behavior could be impacting other parts of >> the system. Silently ignoring those weirdnesses sounds like a terrible >> idea. > > Agreed - but making it look like a GPU driver bug isn't good either. > >>> >>> I do wonder if it makes sense to only synchronize when it's needed, >>> e.g.: >>> >>> ----8<--- >>> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c >>> index dbc597ab46fb..d96266b74e5c 100644 >>> --- a/drivers/gpu/drm/panfrost/panfrost_job.c >>> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c >>> @@ -720,6 +720,12 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job >>> if (dma_fence_is_signaled(job->done_fence)) >>> return DRM_GPU_SCHED_STAT_NOMINAL; >>> >>> + /* Synchronize with the IRQ handler in case the IRQ latency is bad */ >>> + synchronize_irq(pfdev->js->irq); >>> + /* Recheck if the job is now complete */ >>> + if (dma_fence_is_signaled(job->done_fence)) >>> + return DRM_GPU_SCHED_STAT_NOMINAL; >>> + >>> dev_err(pfdev->dev, "gpu sched timeout, js=%d, config=0x%x, status=0x%x, head=0x%x, tail=0x%x, sched_job=%p", >>> js, >>> job_read(pfdev, JS_CONFIG(js)), >>> ----8<--- >>> >>> I don't have any data as to how often we hit the case where the DRM >>> scheduler calls the timeout but we've already signalled - so the extra >>> check might be overkill. >> >> Right, it's not so much about the overhead of the synchronize_irq() >> call (even though my first reply complained about that :-)), but more >> about silently ignoring system misbehaviors. So I guess I'd be fine with >> a version printing a dev_warn("Unexpectedly high interrupt latency") >> when synchronize_irq() unblocks the situation, which means you'd still >> have to do it in two steps. > > I like this idea - it still warns so it's obvious there's something > wrong with the system, and it makes it clear it's not the GPU's fault. Like that idea too, thanks for the suggestions! Will prepare v2 -- Best regards, Dmitry