Re: [PATCH 3/3] drm/amdkfd: add RAS poison consumption support for utcl2

Felix Kuehling <felix.kuehling@xxxxxxx> · Mon, 14 Mar 2022 14:41:24 -0400

I'm not sure I understand this change. It looks like you will check the 
RAS poison status on every UTCL2 VM fault? Is that because there is no 
dedicated interrupt source or client ID to distinguish UTCL2 poison 
consumption from VM faults?

Why is kfd_signal_poison_consumed_event not done for UTCL2 poison 
consumption? The comment says, because it's signaled to user mode as a 
VM fault. But a VM fault and a poison consumption event are different. 
You're basically sending the wrong event to user mode. Maybe it doesn't 
make a big difference in practice at the moment. But that could change 
in the future.

Two more comments inline ...


Am 2022-03-14 um 03:03 schrieb Tao Zhou:
Do RAS page retirement and use gpu reset as fallback in utcl2
fault handler.

Signed-off-by: Tao Zhou <tao.zhou1@xxxxxxx>
---
  .../gpu/drm/amd/amdkfd/kfd_int_process_v9.c   | 23 ++++++++++++++++---
  1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
index f7def0bf0730..3991f71d865b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
@@ -93,11 +93,12 @@ enum SQ_INTERRUPT_ERROR_TYPE {
  static void event_interrupt_poison_consumption(struct kfd_dev *dev,
  				const uint32_t *ih_ring_entry)
  {
-	uint16_t source_id, pasid;
+	uint16_t source_id, client_id, pasid;
  	int ret = -EINVAL;
  	struct kfd_process *p;
  
  	source_id = SOC15_SOURCE_ID_FROM_IH_ENTRY(ih_ring_entry);
+	client_id = SOC15_CLIENT_ID_FROM_IH_ENTRY(ih_ring_entry);
  	pasid = SOC15_PASID_FROM_IH_ENTRY(ih_ring_entry);
  
  	p = kfd_lookup_process_by_pasid(pasid);
@@ -110,6 +111,7 @@ static void event_interrupt_poison_consumption(struct kfd_dev *dev,
  		return;
  	}
  
+	pr_debug("RAS poison consumption handling\n");
  	atomic_set(&p->poison, 1);
  	kfd_unref_process(p);
  
@@ -119,10 +121,14 @@ static void event_interrupt_poison_consumption(struct kfd_dev *dev,
  		break;
  	case SOC15_INTSRC_SDMA_ECC:
  	default:
+		if (client_id == SOC15_IH_CLIENTID_UTCL2)
+			ret = kfd_dqm_evict_pasid(dev->dqm, pasid);
  		break;
  	}
  
-	kfd_signal_poison_consumed_event(dev, pasid);
+	/* utcl2 page fault has its own vm fault event */
+	if (client_id != SOC15_IH_CLIENTID_UTCL2)
+		kfd_signal_poison_consumed_event(dev, pasid);
  
  	/* resetting queue passes, do page retirement without gpu reset
  	 * resetting queue fails, fallback to gpu reset solution
@@ -314,7 +320,18 @@ static void event_interrupt_wq_v9(struct kfd_dev *dev,
  		info.prot_write = ring_id & 0x20;
  
  		kfd_smi_event_update_vmfault(dev, pasid);
-		kfd_dqm_evict_pasid(dev->dqm, pasid);
+
+		if (client_id == SOC15_IH_CLIENTID_UTCL2 &&
+		    dev->kfd2kgd->is_ras_utcl2_poison &&
+		    dev->kfd2kgd->is_ras_utcl2_poison(dev->adev, client_id)) {
+			event_interrupt_poison_consumption(dev, ih_ring_entry);
+
+			if (dev->kfd2kgd->utcl2_fault_clear)
+				dev->kfd2kgd->utcl2_fault_clear(dev->adev);
+		}
+		else

Else should be on the same line as }. Please run checkpatch.pl, it would 
catch this. It's also good practice to use {}-braces around the 
else-branch if you have braces around the if-branch.


+			kfd_dqm_evict_pasid(dev->dqm, pasid);
+
  		kfd_signal_vm_fault_event(dev, pasid, &info);

If you move kfd_signal_vm_fault_event into the else-branch, you can 
unconditionally send the correct poison consumption event to user mode 
in event_interrupt_poison_consumption.

Regards,
  Felix


  	}
  }