[AMD Official Use Only] > -----Original Message----- > From: Kuehling, Felix <Felix.Kuehling@xxxxxxx> > Sent: Tuesday, March 15, 2022 2:41 AM > To: Zhou1, Tao <Tao.Zhou1@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Zhang, > Hawking <Hawking.Zhang@xxxxxxx>; Yang, Stanley > <Stanley.Yang@xxxxxxx>; Chai, Thomas <YiPeng.Chai@xxxxxxx> > Subject: Re: [PATCH 3/3] drm/amdkfd: add RAS poison consumption support for > utcl2 > > I'm not sure I understand this change. It looks like you will check the RAS poison > status on every UTCL2 VM fault? Is that because there is no dedicated interrupt > source or client ID to distinguish UTCL2 poison consumption from VM faults? > > Why is kfd_signal_poison_consumed_event not done for UTCL2 poison > consumption? The comment says, because it's signaled to user mode as a VM > fault. But a VM fault and a poison consumption event are different. > You're basically sending the wrong event to user mode. Maybe it doesn't make a > big difference in practice at the moment. But that could change in the future. [Tao] yes, no dedicated interrupt source, so I need to check FED bit in VM_L2_PROTECTION_FAULT_STATUS to distinguish it from general VM fault. I inject error to PDE1 to trigger UTCL2 poison consumption, both kfd_signal_poison_consumed_event and kfd_signal_vm_fault_event can be handled by ROCr in my experiment (bus error vs. coredump and APP can be killed). I think we can take UTCL2 consumption as a special VM fault, so I select kfd_signal_vm_fault_event. Anyway, this can be discussed furtherly. > > Two more comments inline ... > > > Am 2022-03-14 um 03:03 schrieb Tao Zhou: > > Do RAS page retirement and use gpu reset as fallback in utcl2 fault > > handler. > > > > Signed-off-by: Tao Zhou <tao.zhou1@xxxxxxx> > > --- > > .../gpu/drm/amd/amdkfd/kfd_int_process_v9.c | 23 ++++++++++++++++--- > > 1 file changed, 20 insertions(+), 3 deletions(-) > > > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c > > b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c > > index f7def0bf0730..3991f71d865b 100644 > > --- a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c > > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c > > @@ -93,11 +93,12 @@ enum SQ_INTERRUPT_ERROR_TYPE { > > static void event_interrupt_poison_consumption(struct kfd_dev *dev, > > const uint32_t *ih_ring_entry) > > { > > - uint16_t source_id, pasid; > > + uint16_t source_id, client_id, pasid; > > int ret = -EINVAL; > > struct kfd_process *p; > > > > source_id = SOC15_SOURCE_ID_FROM_IH_ENTRY(ih_ring_entry); > > + client_id = SOC15_CLIENT_ID_FROM_IH_ENTRY(ih_ring_entry); > > pasid = SOC15_PASID_FROM_IH_ENTRY(ih_ring_entry); > > > > p = kfd_lookup_process_by_pasid(pasid); > > @@ -110,6 +111,7 @@ static void > event_interrupt_poison_consumption(struct kfd_dev *dev, > > return; > > } > > > > + pr_debug("RAS poison consumption handling\n"); > > atomic_set(&p->poison, 1); > > kfd_unref_process(p); > > > > @@ -119,10 +121,14 @@ static void > event_interrupt_poison_consumption(struct kfd_dev *dev, > > break; > > case SOC15_INTSRC_SDMA_ECC: > > default: > > + if (client_id == SOC15_IH_CLIENTID_UTCL2) > > + ret = kfd_dqm_evict_pasid(dev->dqm, pasid); > > break; > > } > > > > - kfd_signal_poison_consumed_event(dev, pasid); > > + /* utcl2 page fault has its own vm fault event */ > > + if (client_id != SOC15_IH_CLIENTID_UTCL2) > > + kfd_signal_poison_consumed_event(dev, pasid); > > > > /* resetting queue passes, do page retirement without gpu reset > > * resetting queue fails, fallback to gpu reset solution @@ -314,7 > > +320,18 @@ static void event_interrupt_wq_v9(struct kfd_dev *dev, > > info.prot_write = ring_id & 0x20; > > > > kfd_smi_event_update_vmfault(dev, pasid); > > - kfd_dqm_evict_pasid(dev->dqm, pasid); > > + > > + if (client_id == SOC15_IH_CLIENTID_UTCL2 && > > + dev->kfd2kgd->is_ras_utcl2_poison && > > + dev->kfd2kgd->is_ras_utcl2_poison(dev->adev, client_id)) { > > + event_interrupt_poison_consumption(dev, > ih_ring_entry); > > + > > + if (dev->kfd2kgd->utcl2_fault_clear) > > + dev->kfd2kgd->utcl2_fault_clear(dev->adev); > > + } > > + else > > Else should be on the same line as }. Please run checkpatch.pl, it would catch this. > It's also good practice to use {}-braces around the else-branch if you have > braces around the if-branch. [Tao] accepted > > > > + kfd_dqm_evict_pasid(dev->dqm, pasid); > > + > > kfd_signal_vm_fault_event(dev, pasid, &info); > > If you move kfd_signal_vm_fault_event into the else-branch, you can > unconditionally send the correct poison consumption event to user mode in > event_interrupt_poison_consumption. > > Regards, > Felix > > > > } > > }