On Wed, May 22, 2019 at 11:55:14AM +0200, Lucas Stach wrote: > The devcoredump needs to operate on a stable state of the MMU while > it is writing the MMU state to the coredump. The missing lock > allowed both the userspace submit, as well as the GPU job finish > paths to mutate the MMU state while a coredump is under way. This locking does little to solve this problem. We actually rely on the GPU being deadlocked at this point - we aren't expecting the GPU to make any progress. The only thing that can realistically happen is for userspace to submit a new job, but adding this locking does little to avoid that. > Fixes: a8c21a5451d8 (drm/etnaviv: add initial etnaviv DRM driver) > Reported-by: David Jander <david@xxxxxxxxxxx> > Signed-off-by: Lucas Stach <l.stach@xxxxxxxxxxxxxx> > Tested-by: David Jander <david@xxxxxxxxxxx> > --- > drivers/gpu/drm/etnaviv/etnaviv_dump.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/drivers/gpu/drm/etnaviv/etnaviv_dump.c b/drivers/gpu/drm/etnaviv/etnaviv_dump.c > index 33854c94cb85..515515ef24f9 100644 > --- a/drivers/gpu/drm/etnaviv/etnaviv_dump.c > +++ b/drivers/gpu/drm/etnaviv/etnaviv_dump.c > @@ -125,6 +125,8 @@ void etnaviv_core_dump(struct etnaviv_gpu *gpu) > return; > etnaviv_dump_core = false; > > + mutex_lock(&gpu->mmu->lock); > + > mmu_size = etnaviv_iommu_dump_size(gpu->mmu); > > /* We always dump registers, mmu, ring and end marker */ > @@ -167,6 +169,7 @@ void etnaviv_core_dump(struct etnaviv_gpu *gpu) > iter.start = __vmalloc(file_size, GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY, > PAGE_KERNEL); > if (!iter.start) { > + mutex_unlock(&gpu->mmu->lock); > dev_warn(gpu->dev, "failed to allocate devcoredump file\n"); > return; > } > @@ -234,6 +237,8 @@ void etnaviv_core_dump(struct etnaviv_gpu *gpu) > obj->base.size); > } > > + mutex_unlock(&gpu->mmu->lock); > + All that this lock does is prevent the lists from changing while we build up what we're going to write out. At this point, you drop the lock, before we've written anything out to the coredump. Things can now change, including the ring buffer. > etnaviv_core_dump_header(&iter, ETDUMP_BUF_END, iter.data); > > dev_coredumpv(gpu->dev, iter.start, iter.data - iter.start, GFP_KERNEL); Here we write out the data, which includes the command buffers, ring buffers, BOs, etc. The data in the ring buffer can still change because the lock has been dropped. However, all those objects should be pinned, so none of them should go away: things like the command buffers that have been submitted should be immutable at this point (if they aren't, it could well be a reason why the GPU has gone awol.) It is also not nice to hold the lock over the _big_ vmalloc() which may take some time. Have you actually seen problems here, or is this just theoretical? -- RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up According to speedtest.net: 11.9Mbps down 500kbps up _______________________________________________ dri-devel mailing list dri-devel@xxxxxxxxxxxxxxxxxxxxx https://lists.freedesktop.org/mailman/listinfo/dri-devel