Hi Arthur,
Apart from blacklisting amdgpu, I generally advise SSHing into the
affected system from another computer when you have a problem like this.
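In addition to what Evan said, I suggest that you enable lockdep in
your kernel configuration (CONFIG_PROVE_LOCKING; CONFIG_LOCKDEP_SUPPORT
itself is selected by the architecture and only indicates that lockdep
is available). This will yield warnings in your system log in case of
deadlocks or when something is accidentally left locked.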
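As a rough sketch of how to check this on an installed kernel: the helper
name lockdep_status and the config file path in the usage comment are
assumptions for illustration, not something taken from this thread.

```shell
# Report which lock debugging options are enabled in a kernel config file.
# lockdep_status is a hypothetical helper name.
lockdep_status() {
    config_file="$1"
    for opt in CONFIG_LOCKDEP_SUPPORT CONFIG_PROVE_LOCKING CONFIG_DEBUG_LOCK_ALLOC; do
        if grep -q "^${opt}=y" "$config_file"; then
            echo "$opt=y"
        else
            echo "$opt is not set"
        fi
    done
}

# Example (path is an assumption; many distributions install the config here):
#   lockdep_status "/boot/config-$(uname -r)"
```

With lockdep enabled, the warnings can then be watched from the second
machine over SSH, for example by running "dmesg --follow" in the remote
session while loading the module.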
Regards,
Christian.
On 01.04.22 at 10:49, Arthur Marsh wrote:
Hi Evan, this is what was logged (filtering for drm and amdgpu) when I
blacklisted amdgpu and then manually ran:
modprobe amdgpu si_support=1 gpu_recovery=1
Apr 1 18:31:14 am64 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.17.0+ root=UUID=39706f53-7c27-4310-b22a-36c7b042d1a1 ro amdgpu.audio=1 amdgpu.si_support=1 radeon.si_support=0 page_owner=on amdgpu.gpu_recovery=1 udev.log-priority=info rd.udev.log-priority=info
Apr 1 18:31:14 am64 kernel: [ 0.059624] Kernel command line: BOOT_IMAGE=/vmlinuz-5.17.0+ root=UUID=39706f53-7c27-4310-b22a-36c7b042d1a1 ro amdgpu.audio=1 amdgpu.si_support=1 radeon.si_support=0 page_owner=on amdgpu.gpu_recovery=1 udev.log-priority=info rd.udev.log-priority=info
Apr 1 18:33:43 am64 kernel: [ 245.724485] ACPI: bus type drm_connector registered
Apr 1 18:33:44 am64 kernel: [ 245.945020] [drm] amdgpu kernel modesetting enabled.
Apr 1 18:33:44 am64 kernel: [ 245.945140] amdgpu 0000:01:00.0: vgaarb: deactivate vga console
Apr 1 18:33:44 am64 kernel: [ 245.946413] [drm] initializing kernel modesetting (VERDE 0x1002:0x682B 0x1458:0x22CA 0x87).
Apr 1 18:33:44 am64 kernel: [ 245.946423] amdgpu 0000:01:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
Apr 1 18:33:44 am64 kernel: [ 245.946448] [drm] register mmio base: 0xFE8C0000
Apr 1 18:33:44 am64 kernel: [ 245.946451] [drm] register mmio size: 262144
Apr 1 18:33:44 am64 kernel: [ 245.946642] [drm] add ip block number 0 <si_common>
Apr 1 18:33:44 am64 kernel: [ 245.946657] [drm] add ip block number 1 <gmc_v6_0>
Apr 1 18:33:44 am64 kernel: [ 245.946660] [drm] add ip block number 2 <si_ih>
Apr 1 18:33:44 am64 kernel: [ 245.946663] [drm] add ip block number 3 <gfx_v6_0>
Apr 1 18:33:44 am64 kernel: [ 245.946666] [drm] add ip block number 4 <si_dma>
Apr 1 18:33:44 am64 kernel: [ 245.946668] [drm] add ip block number 5 <si_dpm>
Apr 1 18:33:44 am64 kernel: [ 245.946671] [drm] add ip block number 6 <dce_v6_0>
Apr 1 18:33:44 am64 kernel: [ 245.946674] [drm] add ip block number 7 <uvd_v3_1>
Apr 1 18:33:44 am64 kernel: [ 245.990113] [drm] BIOS signature incorrect 20 7
Apr 1 18:33:44 am64 kernel: [ 245.990146] amdgpu 0000:01:00.0: No more image in the PCI ROM
Apr 1 18:33:44 am64 kernel: [ 245.991510] amdgpu 0000:01:00.0: amdgpu: Fetched VBIOS from ROM BAR
Apr 1 18:33:44 am64 kernel: [ 245.991516] amdgpu: ATOM BIOS: xxx-xxx-xxx
Apr 1 18:33:44 am64 kernel: [ 245.991539] amdgpu 0000:01:00.0: amdgpu: PCIE atomic ops is not supported
Apr 1 18:33:44 am64 kernel: [ 245.991841] [drm] vm size is 64 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
Apr 1 18:33:44 am64 kernel: [ 246.045705] amdgpu 0000:01:00.0: amdgpu: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
Apr 1 18:33:44 am64 kernel: [ 246.045719] amdgpu 0000:01:00.0: amdgpu: GART: 1024M 0x000000FF00000000 - 0x000000FF3FFFFFFF
Apr 1 18:33:44 am64 kernel: [ 246.045736] [drm] Detected VRAM RAM=2048M, BAR=256M
Apr 1 18:33:44 am64 kernel: [ 246.045739] [drm] RAM width 128bits DDR3
Apr 1 18:33:44 am64 kernel: [ 246.045825] [drm] amdgpu: 2048M of VRAM memory ready
Apr 1 18:33:44 am64 kernel: [ 246.045829] [drm] amdgpu: 3072M of GTT memory ready.
Apr 1 18:33:44 am64 kernel: [ 246.045854] [drm] GART: num cpu pages 262144, num gpu pages 262144
Apr 1 18:33:44 am64 kernel: [ 246.046180] amdgpu 0000:01:00.0: amdgpu: PCIE GART of 1024M enabled (table at 0x000000F400900000).
Apr 1 18:33:44 am64 kernel: [ 246.084159] [drm] Internal thermal controller with fan control
Apr 1 18:33:44 am64 kernel: [ 246.084180] [drm] amdgpu: dpm initialized
Apr 1 18:33:44 am64 kernel: [ 246.084264] [drm] AMDGPU Display Connectors
Apr 1 18:33:44 am64 kernel: [ 246.084268] [drm] Connector 0:
Apr 1 18:33:44 am64 kernel: [ 246.084270] [drm] HDMI-A-1
Apr 1 18:33:44 am64 kernel: [ 246.084272] [drm] HPD1
Apr 1 18:33:44 am64 kernel: [ 246.084274] [drm] DDC: 0x194c 0x194c 0x194d 0x194d 0x194e 0x194e 0x194f 0x194f
Apr 1 18:33:44 am64 kernel: [ 246.084279] [drm] Encoders:
Apr 1 18:33:44 am64 kernel: [ 246.084281] [drm] DFP1: INTERNAL_UNIPHY
Apr 1 18:33:44 am64 kernel: [ 246.084283] [drm] Connector 1:
Apr 1 18:33:44 am64 kernel: [ 246.084285] [drm] DVI-D-1
Apr 1 18:33:44 am64 kernel: [ 246.084287] [drm] HPD2
Apr 1 18:33:44 am64 kernel: [ 246.084289] [drm] DDC: 0x1950 0x1950 0x1951 0x1951 0x1952 0x1952 0x1953 0x1953
Apr 1 18:33:44 am64 kernel: [ 246.084293] [drm] Encoders:
Apr 1 18:33:44 am64 kernel: [ 246.084295] [drm] DFP2: INTERNAL_UNIPHY
Apr 1 18:33:44 am64 kernel: [ 246.084297] [drm] Connector 2:
Apr 1 18:33:44 am64 kernel: [ 246.084299] [drm] VGA-1
Apr 1 18:33:44 am64 kernel: [ 246.084301] [drm] DDC: 0x1970 0x1970 0x1971 0x1971 0x1972 0x1972 0x1973 0x1973
Apr 1 18:33:44 am64 kernel: [ 246.084305] [drm] Encoders:
Apr 1 18:33:44 am64 kernel: [ 246.084307] [drm] CRT1: INTERNAL_KLDSCP_DAC1
Apr 1 18:33:44 am64 kernel: [ 246.135615] [drm] Found UVD firmware Version: 64.0 Family ID: 13
Apr 1 18:33:44 am64 kernel: [ 246.137371] [drm] PCIE gen 2 link speeds already enabled
Apr 1 18:33:44 am64 kernel: [ 246.674277] [drm] UVD initialized successfully.
Apr 1 18:33:44 am64 kernel: [ 246.674849] amdgpu 0000:01:00.0: amdgpu: SE 1, SH per SE 2, CU per SH 5, active_cu_number 8
Apr 1 18:33:45 am64 kernel: [ 247.008964] [drm] Initialized amdgpu 3.46.0 20150101 for 0000:01:00.0 on minor 0
Apr 1 18:33:45 am64 kernel: [ 247.068412] fbcon: amdgpudrmfb (fb0) is primary device
The monitor still went blank, but the magic SysRq sync and reboot worked,
allowing capture of the above log but nothing after its last line.
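For reference, filtering like the above can be sketched as follows. The
function name gpu_log_filter and the log path in the usage comment are
only assumptions for illustration.

```shell
# Keep only log lines that mention drm or amdgpu.
# Reads from stdin, so it works on a saved log or on a live stream.
gpu_log_filter() {
    grep -E 'drm|amdgpu'
}

# Example (path is an assumption; it varies by distribution):
#   gpu_log_filter < /var/log/kern.log
```

The kernel command line entries also pass this filter because they
contain the amdgpu.* parameters.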
Regards,
Arthur Marsh.