Re: [PATCH] drm/amdgpu: add support to SMU debug option

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 2021-12-01 8:11 a.m., Christian König wrote:
Adding Andrey as well.

Am 01.12.21 um 12:37 schrieb Yu, Lang:
[SNIP]
+ BUG_ON(unlikely(smu->smu_debug_mode) && res);
BUG_ON() really crashes the kernel and is only allowed if we
prevent further data corruption with that.

Most of the time WARN_ON() is more appropriate, but I can't fully
judge here since I don't know the SMU code well enough.
This is what SMU FW guys want. They want "user-visible (potentially
fatal)
errors", then a hang.
They want to keep system state since the error occurred.
Well that is rather problematic.

First of all we need to really justify that, crashing the kernel is
not something easily done.

Then this isn't really effective here. What happens is that you crash
the kernel thread of the currently executing process, but it is
perfectly possible that another thread still tries to send messages
to the SMU. You need to have the BUG_ON() before dropping the lock to
make sure that this really gets the driver stuck in the current state.
Thanks. I got it. I just thought it is a kenel panic.
Could we use a panic() here?
Potentially, but that might reboot the system automatically which is probably not
what you want either.

How does the SMU firmware team gather the necessary information when a
problem occurs?
As far as I know, they usually use a HDT to collect information.
And they request a hang when error occurred in ticket.
"Suggested error responses include pop-up windows (by x86 driver, if this is possible) or simply hanging after logging the error."

In that case I suggest to set the "don't_touch_the_hardware_any_more" procedure we also use in case of PCIe hotplug.

Andrey has the details but essentially it stops the driver from touching the hardware any more, signals all fences and unblocks everything.

It should then be trivial to inspect the hardware state and see what's going on, but the system will keep stable at least for SSH access.

Might be a good idea to have that mode for other fault cases like page faults and hardware crashes.

Regards,
Christian.


There is no one specific function that does all of that, what I think can be done is to bring the device to kind of halt state where no one touches it - as following -

1) Follow amdpgu_pci_remove -

    drm_dev_unplug to make device inaccessible to user space (IOCTLs e.t.c.) and clears MMIO mappings to device and disallows remappings through page faults

    No need to call all of amdgpu_driver_unload_kms but, within it call amdgpu_irq_disable_all and amdgpu_fence_driver_hw_fini toi disable interrupts and force signall all HW fences.

    pci_disable_device and pci_wait_for_pending_transaction to flush any in flight DMA operations from device

2) set adev->no_hw_access so that most of places we access HW (all subsequent registers reads/writes and SMU/PSP message sending is skipped, but some race will be with those already in progress so maybe adding some wait)

Andrey




Regards,
Lang





[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux