[AMD Official Use Only]

As a quick workaround, I agree with this solution. That said, it does not address the root cause: the list is still corrupted.

Can we make ras_list a global variable shared across all cards, and add a list empty check (or a flag indicating whether the ras block is already registered) before the list add, to avoid redundant registration?

Regards,
Tao

> -----Original Message-----
> From: Chai, Thomas <YiPeng.Chai@xxxxxxx>
> Sent: Saturday, January 29, 2022 11:53 AM
> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Cc: Chai, Thomas <YiPeng.Chai@xxxxxxx>; Zhang, Hawking
> <Hawking.Zhang@xxxxxxx>; Zhou1, Tao <Tao.Zhou1@xxxxxxx>; Clements,
> John <John.Clements@xxxxxxx>; Chai, Thomas <YiPeng.Chai@xxxxxxx>
> Subject: [PATCH] drm/amdgpu: Add judgement to avoid infinite loop
>
> 1. The infinite loop causing soft lock occurs on multiple amdgpu cards
>    supporting ras feature.
> 2. This is a workaround patch. It is valid for multiple amdgpu cards of the
>    same type.
> 3. The root cause is that each GPU card device has a separate .ras_list
>    link header, but the instance and linked list node of each ras block
>    are unique. When each device is initialized, each ras instance will
>    repeatedly add its link node to the device every time. In this way, only
>    the .ras_list of the last initialized device is completely correct.
>    The .ras_list->prev and .ras_list->next of the devices initialized
>    before can still point to the correct ras instance, but the prev
>    pointer and next pointer of the pointed ras instance both point to
>    the last initialized device's .ras_list instead of the beginning
>    .ras_list. When list_for_each_entry_safe searches for
>    non-existent ras nodes on devices other than the last device, the
>    last ras instance's next pointer can never be equal to the
>    beginning .ras_list, so the loop cannot be terminated and the
>    program enters an infinite loop.
>    BTW: Since the data and initialization process of each card are the same,
>    the link list between ras instances will not be destroyed every time
>    the device is initialized.
> 4. The soft locked logs are as follows:
> [ 262.165690] CPU: 93 PID: 758 Comm: kworker/93:1 Tainted: G OE 5.13.0-27-generic #29~20.04.1-Ubuntu
> [ 262.165695] Hardware name: Supermicro AS -4124GS-TNR/H12DSG-O-CPU, BIOS T20200717143848 07/17/2020
> [ 262.165698] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
> [ 262.165980] RIP: 0010:amdgpu_ras_get_ras_block+0x86/0xd0 [amdgpu]
> [ 262.166239] Code: 68 d8 4c 8d 71 d8 48 39 c3 74 54 49 8b 45 38 48 85 c0 74 32 44 89 fa 44 89 e6 4c 89 ef e8 82 e4 9b dc 85 c0 74 3c 49 8b 46 28 <49> 8d 56 28 4d 89 f5 48 83 e8 28 48 39 d3 74 25 49 89 c6 49 8b 45
> [ 262.166243] RSP: 0018:ffffac908fa87d80 EFLAGS: 00000202
> [ 262.166247] RAX: ffffffffc1394248 RBX: ffff91e4ab8d6e20 RCX: ffffffffc1394248
> [ 262.166249] RDX: ffff91e4aa356e20 RSI: 000000000000000e RDI: ffff91e4ab8c0000
> [ 262.166252] RBP: ffffac908fa87da8 R08: 0000000000000007 R09: 0000000000000001
> [ 262.166254] R10: ffff91e4930b64ec R11: 0000000000000000 R12: 000000000000000e
> [ 262.166256] R13: ffff91e4aa356df8 R14: ffffffffc1394320 R15: 0000000000000003
> [ 262.166258] FS: 0000000000000000(0000) GS:ffff92238fb40000(0000) knlGS:0000000000000000
> [ 262.166261] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 262.166264] CR2: 00000001004865d0 CR3: 000000406d796000 CR4: 0000000000350ee0
> [ 262.166267] Call Trace:
> [ 262.166272] amdgpu_ras_do_recovery+0x130/0x290 [amdgpu]
> [ 262.166529] ? psi_task_switch+0xd2/0x250
> [ 262.166537] ? __switch_to+0x11d/0x460
> [ 262.166542] ? __switch_to_asm+0x36/0x70
> [ 262.166549] process_one_work+0x220/0x3c0
> [ 262.166556] worker_thread+0x4d/0x3f0
> [ 262.166560] ? process_one_work+0x3c0/0x3c0
> [ 262.166563] kthread+0x12b/0x150
> [ 262.166568] ? set_kthread_struct+0x40/0x40
> [ 262.166571] ret_from_fork+0x22/0x30
>
> Signed-off-by: yipechai <YiPeng.Chai@xxxxxxx>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index d4e07d0acb66..3d533ef0783d 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -884,6 +884,7 @@ static int amdgpu_ras_block_match_default(struct amdgpu_ras_block_object *block_
>  static struct amdgpu_ras_block_object *amdgpu_ras_get_ras_block(struct amdgpu_device *adev,
>  					enum amdgpu_ras_block block, uint32_t sub_block_index)
>  {
> +	int loop_cnt = 0;
>  	struct amdgpu_ras_block_object *obj, *tmp;
>
>  	if (block >= AMDGPU_RAS_BLOCK__LAST)
> @@ -900,6 +901,9 @@ static struct amdgpu_ras_block_object *amdgpu_ras_get_ras_block(struct amdgpu_de
>  			if (amdgpu_ras_block_match_default(obj, block) == 0)
>  				return obj;
>  		}
> +
> +		if (++loop_cnt >= AMDGPU_RAS_BLOCK__LAST)
> +			break;
>  	}
>
>  	return NULL;
> --
> 2.25.1
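
For illustration only, here is a minimal userspace sketch of the root cause described in item 3 and of the guarded registration suggested at the top of this mail. It is not the driver code: the list helpers are simplified re-implementations of the kernel's list_add_tail() pointer updates, and struct ras_block, the registered flag, register_block(), walk_terminates() and the two demo functions are hypothetical names made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct list_head (assumption). */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Same pointer updates as the kernel's list_add_tail(). */
static void list_add_tail(struct list_head *node, struct list_head *head)
{
	struct list_head *prev = head->prev;

	node->next = head;
	node->prev = prev;
	prev->next = node;
	head->prev = node;
}

/* Hypothetical ras block: one instance shared by every device. */
struct ras_block {
	struct list_head node;
	bool registered;	/* hypothetical "already on the list" flag */
};

/* Walk from head; report whether we return to head within max_steps. */
static bool walk_terminates(const struct list_head *head, int max_steps)
{
	const struct list_head *pos = head->next;
	int steps = 0;

	while (pos != head) {
		if (++steps > max_steps)
			return false;	/* never got back to head */
		pos = pos->next;
	}
	return true;
}

/* Guarded registration: each block is added to the list only once. */
static void register_block(struct ras_block *blk, struct list_head *list)
{
	if (blk->registered)
		return;
	list_add_tail(&blk->node, list);
	blk->registered = true;
}

/* Item 3: the same two nodes are added to two per-device heads. */
bool per_device_heads_terminate(void)
{
	struct list_head dev0, dev1;
	struct ras_block a = {0}, b = {0};

	init_list_head(&dev0);
	init_list_head(&dev1);
	list_add_tail(&a.node, &dev0);
	list_add_tail(&b.node, &dev0);
	/* Second device init re-adds the same nodes and rewires them. */
	list_add_tail(&a.node, &dev1);
	list_add_tail(&b.node, &dev1);

	/* b.next now points at dev1, so a walk from dev0 never ends. */
	return walk_terminates(&dev0, 16);
}

/* Suggested direction: one global list plus a registered flag. */
bool global_list_terminates(void)
{
	struct list_head global;
	struct ras_block a = {0}, b = {0};

	init_list_head(&global);
	for (int dev = 0; dev < 2; dev++) {	/* two device inits */
		register_block(&a, &global);
		register_block(&b, &global);
	}
	return walk_terminates(&global, 16);
}
```

In this sketch per_device_heads_terminate() returns false, because the walk from the first head keeps cycling through the nodes and the second head, while global_list_terminates() returns true. The step bound in walk_terminates() plays the same role as loop_cnt in the patch: it stops the walk but leaves the list itself inconsistent.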