On Fri, Jan 10, 2025 at 9:48 AM Christian König <christian.koenig@xxxxxxx> wrote:
>
> On 10.01.25 at 15:32, Philipp Reisner wrote:
> > [...]
> >> Take a look at those messages right before the crash:
> >>
> >> Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.2.1 is not ready, skipping
> >> Jän 10 07:58:14 ryzen9 kernel: [drm] scheduler comp_1.3.1 is not ready, skipping
> >>
> >> That is basically 100% confirmation that an application tries to
> >> use the device before those compute queues are resumed.
> >>
> >> Can I have a full dmesg? Maybe the resume is canceled or aborted for
> >> some reason.
> >>
> > Yes, of course. I have made the files available here:
> > https://drive.google.com/drive/folders/1W3M3bFEl0ZVv2rnqvmbveDFZBhc84BNa
>
> Ah! That suddenly makes much more sense.
>
> Here is the root cause:
>
> [111313.897796] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.1.0 test failed (-110)
> [111314.135761] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.2.0 test failed (-110)
> [111314.373786] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.0.1 test failed (-110)
> [111314.611722] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.1.1 test failed (-110)
> [111314.849647] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.2.1 test failed (-110)
> [111315.087658] amdgpu 0000:29:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.3.1 test failed (-110)
> [111315.207293] [drm] UVD and UVD ENC initialized successfully.
> [111315.308270] [drm] VCE initialized successfully.
> [111315.447494] PM: resume devices took 2.306 seconds
> [111315.447865] OOM killer enabled.
>
> I'm surprised that this works at all. For some reason the graphics queue
> works, but the compute queues fail to resume.
>
> @Alex what do we do about that?
> We could return an error when not all rings come up again after resume,
> but that will probably result in a number of complaints.

Maybe return an error if all of the rings of a particular type fail, but
if only some do, we should be able to deal with that. We currently set up
8 compute rings. We probably don't need that many. Maybe just two (high
and low priority).

Alex

>
> Regards,
> Christian.
>
> >
> > best regards,
> > Philipp
>
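
For illustration only, the policy suggested above (fail the resume only when every ring of a given type fails its test) could be sketched roughly as follows. This is a standalone user-space sketch, not amdgpu code: `ring_type`, `ring_status` and `resume_verdict` are hypothetical names invented for this example, and the real driver's data structures and resume path differ. The -110 seen in the log corresponds to -ETIMEDOUT on Linux.

```c
#include <stddef.h>

/* Hypothetical ring classes; NOT the amdgpu driver's own enum. */
enum ring_type { RING_GFX, RING_COMPUTE, RING_TYPE_COUNT };

/* Hypothetical per-ring outcome of the post-resume ring test. */
struct ring_status {
    enum ring_type type;
    int test_result; /* 0 on success, negative errno on failure (e.g. -110) */
};

/*
 * Return 0 if every ring type that has at least one ring still has a
 * working ring after resume. Otherwise return the first error recorded
 * for the first ring type whose rings all failed.
 */
int resume_verdict(const struct ring_status *rings, size_t n)
{
    size_t total[RING_TYPE_COUNT] = {0};
    size_t failed[RING_TYPE_COUNT] = {0};
    int first_err[RING_TYPE_COUNT] = {0};

    for (size_t i = 0; i < n; i++) {
        unsigned int t = (unsigned int)rings[i].type;

        if (t >= RING_TYPE_COUNT)
            continue; /* ignore unknown ring types in this sketch */

        total[t]++;
        if (rings[i].test_result != 0) {
            if (failed[t] == 0)
                first_err[t] = rings[i].test_result;
            failed[t]++;
        }
    }

    for (unsigned int t = 0; t < RING_TYPE_COUNT; t++) {
        if (total[t] > 0 && failed[t] == total[t])
            return first_err[t];
    }
    return 0;
}
```

With this rule, a device whose graphics ring passes and where only some compute rings time out would still resume, while a device with no working compute ring at all would report the error.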