Re: [PATCH 3/3] drm/amdkfd: add RAS poison consumption support for utcl2

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



I'm not sure I understand this change. It looks like you will check the RAS poison status on every UTCL2 VM fault? Is that because there is no dedicated interrupt source or client ID to distinguish UTCL2 poison consumption from VM faults?

Why is kfd_signal_poison_consumed_event not done for UTCL2 poison consumption? The comment says, because it's signaled to user mode as a VM fault. But a VM fault and a poison consumption event are different. You're basically sending the wrong event to user mode. Maybe it doesn't make a big difference in practice at the moment. But that could change in the future.

Two more comments inline ...


Am 2022-03-14 um 03:03 schrieb Tao Zhou:
Do RAS page retirement and use gpu reset as fallback in utcl2
fault handler.

Signed-off-by: Tao Zhou <tao.zhou1@xxxxxxx>
---
  .../gpu/drm/amd/amdkfd/kfd_int_process_v9.c   | 23 ++++++++++++++++---
  1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
index f7def0bf0730..3991f71d865b 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c
@@ -93,11 +93,12 @@ enum SQ_INTERRUPT_ERROR_TYPE {
  static void event_interrupt_poison_consumption(struct kfd_dev *dev,
  				const uint32_t *ih_ring_entry)
  {
-	uint16_t source_id, pasid;
+	uint16_t source_id, client_id, pasid;
  	int ret = -EINVAL;
  	struct kfd_process *p;
source_id = SOC15_SOURCE_ID_FROM_IH_ENTRY(ih_ring_entry);
+	client_id = SOC15_CLIENT_ID_FROM_IH_ENTRY(ih_ring_entry);
  	pasid = SOC15_PASID_FROM_IH_ENTRY(ih_ring_entry);
p = kfd_lookup_process_by_pasid(pasid);
@@ -110,6 +111,7 @@ static void event_interrupt_poison_consumption(struct kfd_dev *dev,
  		return;
  	}
+ pr_debug("RAS poison consumption handling\n");
  	atomic_set(&p->poison, 1);
  	kfd_unref_process(p);
@@ -119,10 +121,14 @@ static void event_interrupt_poison_consumption(struct kfd_dev *dev,
  		break;
  	case SOC15_INTSRC_SDMA_ECC:
  	default:
+		if (client_id == SOC15_IH_CLIENTID_UTCL2)
+			ret = kfd_dqm_evict_pasid(dev->dqm, pasid);
  		break;
  	}
- kfd_signal_poison_consumed_event(dev, pasid);
+	/* utcl2 page fault has its own vm fault event */
+	if (client_id != SOC15_IH_CLIENTID_UTCL2)
+		kfd_signal_poison_consumed_event(dev, pasid);
/* resetting queue passes, do page retirement without gpu reset
  	 * resetting queue fails, fallback to gpu reset solution
@@ -314,7 +320,18 @@ static void event_interrupt_wq_v9(struct kfd_dev *dev,
  		info.prot_write = ring_id & 0x20;
kfd_smi_event_update_vmfault(dev, pasid);
-		kfd_dqm_evict_pasid(dev->dqm, pasid);
+
+		if (client_id == SOC15_IH_CLIENTID_UTCL2 &&
+		    dev->kfd2kgd->is_ras_utcl2_poison &&
+		    dev->kfd2kgd->is_ras_utcl2_poison(dev->adev, client_id)) {
+			event_interrupt_poison_consumption(dev, ih_ring_entry);
+
+			if (dev->kfd2kgd->utcl2_fault_clear)
+				dev->kfd2kgd->utcl2_fault_clear(dev->adev);
+		}
+		else

Else should be on the same line as }. Please run checkpatch.pl, it would catch this. It's also good practice to use {}-braces around the else-branch if you have braces around the if-branch.


+			kfd_dqm_evict_pasid(dev->dqm, pasid);
+
  		kfd_signal_vm_fault_event(dev, pasid, &info);

If you move kfd_signal_vm_fault_event into the else-branch, you can unconditionally send the correct poison consumption event to user mode in event_interrupt_poison_consumption.

Regards,
  Felix


  	}
  }



[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux