On Tue, Feb 9, 2021 at 02:43, Alex Deucher <alexdeucher@xxxxxxxxx> wrote:
>
> On Mon, Feb 8, 2021 at 1:34 AM Alexander E. Patrakov <patrakov@xxxxxxxxx> wrote:
> >
> > On Mon, Feb 8, 2021 at 08:32, Alexander E. Patrakov <patrakov@xxxxxxxxx> wrote:
> > >
> > > On Thu, Feb 4, 2021 at 09:31, Alex Deucher <alexdeucher@xxxxxxxxx> wrote:
> > > >
> > > > On Wed, Feb 3, 2021 at 7:56 PM Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
> > > > >
> > > > > Alex Deucher <alexdeucher@xxxxxxxxx> writes:
> > > > >
> > > > > > On Wed, Feb 3, 2021 at 3:36 AM Dave Young <dyoung@xxxxxxxxxx> wrote:
> > > > > >>
> > > > > >> Hi Baoquan,
> > > > > >>
> > > > > >> Thanks for ccing.
> > > > > >> On 01/28/21 at 01:29pm, Baoquan He wrote:
> > > > > >> > On 01/11/21 at 01:17pm, Alexander E. Patrakov wrote:
> > > > > >> > > Hello,
> > > > > >> > >
> > > > > >> > > I was trying out kexec on my new laptop, which is an HP EliteBook 735
> > > > > >> > > G6. The problem is, amdgpu does not have hardware acceleration after
> > > > > >> > > kexec. Also, strangely, the lines about Bluetooth are missing from
> > > > > >> > > dmesg after kexec, but I have not tried to use Bluetooth on this
> > > > > >> > > laptop yet. I don't know how to debug this; the relevant amdgpu lines
> > > > > >> > > in dmesg are:
> > > > > >> > >
> > > > > >> > > amdgpu 0000:04:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB
> > > > > >> > > test failed on gfx (-110).
> > > > > >> > > [drm:process_one_work] *ERROR* ib ring test failed (-110).
> > > > > >> > >
> > > > > >> > > The good and bad dmesg files are attached. Is it a kexec problem (and
> > > > > >> > > amdgpu is only a victim), or should I take it to amdgpu lists? Do I
> > > > > >> > > need to provide some extra kernel arguments for debugging?
> > > > >
> > > > > The best debugging step I can think of: can you arrange to have the amdgpu
> > > > > modules removed before the final kexec -e?
> > > > >
> > > > > That would tell us if the code to shut down the GPU exists in the rmmod
> > > > > path (aka the .remove method) and is simply missing in the kexec path (aka
> > > > > the .shutdown method).
> > > > >
> > > > >
> > > > > >> > I am not familiar with the graphics components. Adding Dave to CC to see if
> > > > > >> > he has some comments. It would be great if an amdgpu expert could have a look.
> > > > > >>
> > > > > >> It needs amdgpu driver people to help. Since kexec bypasses
> > > > > >> BIOS/UEFI initialization, we require drivers to implement the .shutdown
> > > > > >> method and test it so that the second kernel works correctly.
> > > > > >
> > > > > > kexec is tricky to make work properly on our GPUs. The problem is
> > > > > > that there are some engines on the GPU that cannot be re-initialized
> > > > > > once they have been initialized without an intervening device reset.
> > > > > > APUs are even trickier because they share a lot of hardware state with
> > > > > > the CPU. Doing lots of extra resets adds latency. The driver has
> > > > > > code to try to detect if certain engines are running at driver load
> > > > > > time and do a reset before initialization to make this work, but it
> > > > > > apparently is not working properly on your system.
> > > > >
> > > > > There are two cases that I think sometimes get mixed up.
> > > > >
> > > > > There is kexec-on-panic, in which case all of the work needs to happen in
> > > > > the driver initialization.
> > > > >
> > > > > There is also a simple kexec, in which case some of the work can happen
> > > > > in the kernel that is being shut down, and sometimes that is easier.
> > > > >
> > > > > Does it make sense to reset your device unconditionally on driver removal?
> > > >
> > > > I think we tried that at some point in the past but users complained
> > > > that it added latency or artifacts on the display at shutdown or
> > > > reboot time.
> > > >
> > > > > Would it make sense to reset your device unconditionally on driver add?
> > > >
> > > > Pretty much the same issue there. It adds latency and you get
> > > > artifacts on the display when the reset happens.
> > > >
> > > > >
> > > > > How can someone debug the smart logic of reset on driver load?
> > > >
> > > > See this block of code in amdgpu_device.c:
> > > >         /* check if we need to reset the asic
> > > >          *  E.g., driver was not cleanly unloaded previously, etc.
> > > >          */
> > > >         if (!amdgpu_sriov_vf(adev) && amdgpu_asic_need_reset_on_init(adev)) {
> > > >                 r = amdgpu_asic_reset(adev);
> > > >                 if (r) {
> > > >                         dev_err(adev->dev, "asic reset on init failed\n");
> > > >                         goto failed;
> > > >                 }
> > > >         }
> > > >
> > > > You'll want to see if amdgpu_asic_need_reset_on_init() was able to
> > > > determine that the asic needs a reset. If it does,
> > > > amdgpu_asic_reset() gets called to reset it.
> > > > The tricky thing is that some reset methods require a fair amount of
> > > > driver state, so they are only possible when the driver is up and
> > > > running. Those methods are not necessarily available at driver load
> > > > time because we need to reset the GPU before we can initialize it and
> > > > determine that state, so we end up in a kind of catch-22.
> > > > Unfortunately, generic PCI resets don't necessarily work on many of
> > > > our GPUs, so that's not an option either.
> > > >
> > > > Alex
> > >
> > > Sorry for the delay with the reply, I was distracted.
> > >
> > > Anyway, I managed to unload the amdgpu module successfully, using this
> > > script (as /usr/lib/systemd/system-shutdown/debug.sh):
> > >
> > > #!/bin/sh
> > > mount -o remount,rw /
> > > echo 0 > /sys/class/vtconsole/vtcon1/bind
> > > rmmod amdgpu && echo '<4>==== Succeeded removing amdgpu module ====' > /dev/kmsg
> > > dmesg > /var/log/shutdown-log-$(date +%Y%m%d-%H%M%S)
> > > mount -o remount,ro /
> > >
> > > At the end of a non-kexec boot, it logs this:
> > >
> > > [ 116.512621] Console: switching to colour dummy device 80x25
> > > [ 116.518591] amdgpu 0000:04:00.0: amdgpu: amdgpu: finishing device.
> > > [ 116.644899] [drm:dal_irq_service_dummy_ack [amdgpu]] *ERROR*
> > > dal_irq_service_dummy_ack: called for non-implemented irq source
> > > [ 116.645168] [drm:dal_irq_service_dummy_set [amdgpu]] *ERROR*
> > > dal_irq_service_dummy_set: called for non-implemented irq source
> > > [ 116.658515] [drm] free PSP TMR buffer
> > > [ 116.706265] [TTM] Zone kernel: Used memory at exit: 0 KiB
> > > [ 116.706276] [TTM] Zone dma32: Used memory at exit: 0 KiB
> > > [ 116.706280] [drm] amdgpu: ttm finalized
> > > [ 116.740460] ==== Succeeded removing amdgpu module ====
> > >
> > > However, the next kexec-based boot still lacks hardware acceleration.
> >
> > Regarding the reset considerations.
> >
> > The amdgpu driver contains some logic to reset the card on init if
> > needed. However, for all APU chipsets, it says that reset on init is
> > not needed. So I tried to force this. In amdgpu_device_init(), I
> > changed:
>
> Ah, right. On dGPUs we check whether the SMU and PSP are running, but
> on APUs they are always running since they are shared with the CPU,
> so it doesn't make sense to check them.
>
> >
> >         if (!amdgpu_sriov_vf(adev) && (1 ||
> > amdgpu_asic_need_reset_on_init(adev))) {
> >                 ...
> >         }
> >
> >         pci_enable_pcie_error_reporting(adev->ddev.pdev);
> >
> >         /* Post card if necessary */
> >         if (1 || amdgpu_device_need_post(adev)) {
> >                 ...
> >         }
> >
> > Then it tried to reset the card using the MODE2 method, which failed:
> >
> > [ 1.467192] amdgpu 0000:04:00.0: amdgpu: MODE2 reset
> > [ 1.467194] amdgpu 0000:04:00.0: amdgpu: asic reset on init failed
> > [ 1.467197] amdgpu 0000:04:00.0: amdgpu: Fatal error during GPU init
> >
> > The only reset method which doesn't fail is BACO
> > (amdgpu.reset_method=4), but unfortunately it doesn't help either. The
> > dmesg after kexec is attached. The old workaround that removed the
> > amdgpu module on reboot (and thus before kexec) is still active.
>
> mode2 reset is the only reset available on APUs. The others are not valid.
>
> Did this ever work in the past on this platform?

I don't think so. This laptop's GPU has been supported since linux-5.7
(before that, there were BIOS fetching problems), and the only kernels
where I tested kexec were 5.10 and 5.11-rc{6,7}.

--
Alexander E. Patrakov
CV: http://u.pc.cd/wT8otalK
_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
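
For anyone repeating the good/bad dmesg comparison from the original report, a small helper along these lines might save some scrolling. It is only a sketch: the function name amdgpu_accel_status and the log path in the usage note are made up for illustration, and the only strings it looks for are the two IB test messages quoted earlier in this thread. It does not prove that acceleration works, only that those messages were or were not logged.

```shell
#!/bin/sh
# Sketch only: classify a saved dmesg log by whether the amdgpu IB ring
# test failure quoted in this thread appears in it.
# amdgpu_accel_status is a hypothetical name, not part of any tool.
amdgpu_accel_status() {
    log="$1"
    if [ ! -r "$log" ]; then
        echo "unreadable"
        return 2
    fi
    # The report shows both messages together after kexec; matching
    # either one is treated as a failed IB test here.
    if grep -q -e 'IB test failed on' -e 'ib ring test failed' "$log"; then
        echo "ib-test-failed"
        return 1
    fi
    echo "no-ib-failure-logged"
    return 0
}
```

Possible usage after a kexec boot, with a hypothetical path: run `dmesg > /tmp/after-kexec.log` and then `amdgpu_accel_status /tmp/after-kexec.log`. Running the same check on the log from a normal boot gives the baseline to compare against.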