The fix removes the evict/restore calls to MES when the device is an iGPU. Added queues are still removed normally when the program closes. (A rough, illustrative sketch of the idea is included at the end of this mail, after the diffstat.)

An easy way to trigger the problem is to build the ML/AI support for the gfx1103 M780 iGPU with the rocm sdk builder and then run the test application below in a loop. Most of the testing has been done on 6.13 devel and 6.12 final kernels, but the same problem can also be triggered at least with the 6.8 and 6.11 kernels.

Adding delays either to the test application between calls (tested with 1 second) or to the loop inside the kernel that removes the queues (tested with mdelay(10)) did not help to avoid the crash. After applying the kernel fix, I and others have executed the test loop thousands of times without seeing the error happen again.

On multi-GPU devices, the correct gfx1103 needs to be forced into use by exporting the environment variable HIP_VISIBLE_DEVICES=<gpu-index>.

The original bug report and test case were made by jrl290 in rocm sdk builder issue 141. Test app to trigger the problem:

import torch
import numpy as np
from onnx import load
from onnx2pytorch import ConvertModel
import time

if __name__ == "__main__":
    ii = 0
    while True:
        ii = ii + 1
        print("Loop Start")

        model_path = "model.onnx"
        device = 'cuda'
        model_run = ConvertModel(load(model_path))
        model_run.to(device).eval()

        # This code causes the crash. Comment out to remove the crash.
        random = np.random.rand(1, 4, 3072, 256)
        tensor = torch.tensor(random, dtype=torch.float32, device=device)

        # This code doesn't cause a crash.
        tensor = torch.randn(1, 4, 3072, 256, dtype=torch.float32, device=device)

        print("[" + str(ii) + "], the crash happens here:")
        time.sleep(0.5)
        result = model_run(tensor).numpy(force=True)
        print(result.shape)

Mika Laitio (1):
  amdgpu fix for gfx1103 queue evict/restore crash

 .../drm/amd/amdkfd/kfd_device_queue_manager.c | 24 ++++++++++++-------
 1 file changed, 16 insertions(+), 8 deletions(-)

--
2.43.0
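
For reference, here is a compressed illustration of what the fix does conceptually: on the evict/restore path the per-queue MES calls are simply skipped when the device is an integrated GPU, while queue teardown at process exit is left untouched. This is only a sketch of the idea, not the patch itself; every name in it (struct queue_ctx, is_igpu, mes_remove_queue, mes_add_queue) is a hypothetical stand-in rather than a real amdkfd symbol, and the actual logic lives in kfd_device_queue_manager.c.

/*
 * Illustrative sketch only -- not the real amdkfd code. All names here
 * are hypothetical stand-ins for the driver's structures and helpers.
 */
#include <stdbool.h>

struct queue_ctx {
	bool is_igpu;      /* device is an integrated GPU (APU)    */
	bool mes_enabled;  /* queues are scheduled through MES FW  */
};

/* Hypothetical stand-ins for the MES remove/add queue calls. */
static int mes_remove_queue(struct queue_ctx *ctx) { (void)ctx; return 0; }
static int mes_add_queue(struct queue_ctx *ctx)    { (void)ctx; return 0; }

/* Evict path: do not talk to MES at all on an iGPU. */
static int evict_queue(struct queue_ctx *ctx)
{
	if (ctx->mes_enabled && !ctx->is_igpu)
		return mes_remove_queue(ctx);
	return 0;  /* iGPU: skip the call that triggers the crash */
}

/* Restore path mirrors the evict path. */
static int restore_queue(struct queue_ctx *ctx)
{
	if (ctx->mes_enabled && !ctx->is_igpu)
		return mes_add_queue(ctx);
	return 0;
}

Normal queue destruction at process exit does not go through these two paths, which matches the note above that added queues are still removed normally when the program closes.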