On 10/11/17 09:18, Jon Hunter wrote:

...

> Thanks Ben. However, looking at next-20171109 this one is already in.
> So maybe the bisect is still not getting me to the current issue. When
> booting next-20171109 the last thing I see is ...
>
> [    2.228178] nouveau 57000000.gpu: NVIDIA GK20A (0ea000a1)
> [    2.233634] nouveau 57000000.gpu: imem: using IOMMU
> [    2.238572] nouveau 57000000.gpu: Direct firmware load for nvidia/gk20a/fecs_inst.bin failed with error -2
> [    2.248295] nouveau 57000000.gpu: Direct firmware load for nouveau/nvea_fuc409c failed with error -2
> [    2.257479] nouveau 57000000.gpu: Direct firmware load for nouveau/fuc409c failed with error -2
> [    2.266189] nouveau 57000000.gpu: gr: failed to load fuc409c
>
> So no crash. I did see the crash after the bisect, but not in top of
> tree. It appears to hang after the nouveau probe fails. Any thoughts
> on how to debug further?

So this is probably wrong, but here is a clue about what is happening. It
appears that the error code is not being propagated from gk20a_gr_new().
gk20a_gr_new() is returning -ENODEV due to the firmware loading failure ...

	if (gf100_gr_ctor_fw(gr, "fecs_inst", &gr->fuc409c) ||
	    gf100_gr_ctor_fw(gr, "fecs_data", &gr->fuc409d) ||
	    gf100_gr_ctor_fw(gr, "gpccs_inst", &gr->fuc41ac) ||
	    gf100_gr_ctor_fw(gr, "gpccs_data", &gr->fuc41ad))
		return -ENODEV;

... but this is ignored by nvkm_device_ctor() (probably for good reason).
If I make the following change the hang no longer occurs (although I
realise this is probably wrong as it has been there for years!) ...

diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/device/base.c b/drivers/gpu/drm/nouveau/nvkm/engine/device/base.c
index e14643615698..a611615d3ce7 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/device/base.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/device/base.c
@@ -2869,7 +2869,7 @@ struct nvkm_engine *
 			subdev = nvkm_device_subdev(device, (s));              \
 			nvkm_subdev_del(&subdev);                              \
 			device->m = NULL;                                      \
-			if (ret != -ENODEV) {                                  \
+			if (ret == -ENODEV) {                                  \
 				nvdev_error(device, "%s ctor failed, %d\n",    \
 					    nvkm_subdev_name[s], ret);         \
 				goto done;                                     \

So is gk20a_gr_new() returning the wrong error code when the firmware
load fails? I have not gone back to see what has changed in this regard,
but I can, probably next week.

Cheers
Jon

-- 
nvpublic
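
For readers following the thread, the check discussed in the message can be
modelled outside the kernel. The following is only a minimal sketch under
stated assumptions: probe_subdev(), ctor_fn and the three ctor_* helpers are
hypothetical names invented for illustration and are not nouveau functions.
It assumes, as the unpatched condition in the diff suggests, that a
constructor returning -ENODEV is read as "this unit is not present" and
skipped, while any other error aborts the device constructor.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for a subdev constructor such as gk20a_gr_new(). */
typedef int (*ctor_fn)(void);

/* Models a constructor that reports a firmware load failure as -ENODEV. */
static int ctor_fw_missing(void) { return -ENODEV; }

/* Models a constructor that succeeds. */
static int ctor_ok(void) { return 0; }

/* Models a constructor that fails with an error other than -ENODEV. */
static int ctor_nomem(void) { return -ENOMEM; }

/*
 * Simplified model (an assumption, not the real macro) of the unpatched
 * check: -ENODEV is skipped without a message, any other error is logged
 * and returned to the caller.
 *
 * Returns 0 when probing would continue, or the negative error code when
 * it would stop. *present is set to 1 only if the constructor succeeded.
 */
static int probe_subdev(const char *name, ctor_fn ctor, int *present)
{
	int ret = ctor();

	*present = (ret == 0);
	if (ret == 0)
		return 0;

	if (ret != -ENODEV) {
		fprintf(stderr, "%s ctor failed, %d\n", name, ret);
		return ret;
	}

	/* -ENODEV: treated as "not present", so the error is not propagated. */
	return 0;
}
```

In this model, calling probe_subdev("gr", ctor_fw_missing, &present) returns 0
with present cleared, so the caller carries on without the unit and never
sees the firmware failure. That matches the symptom described in the message,
but it says nothing about why the real driver hangs afterwards.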