Re: [PATCH] drm/amdgpu: fix reset domain xgmi hive info reference leak

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 




On 2022-08-12 14:38, Kim, Jonathan wrote:
[Public]

Hi Andrey,

Here's the load/unload stack trace.  This is a 2 GPU xGMI system.  I put dbg_xgmi_hive_get/put refcount print post kobj get/put.
It's stuck at 2 on unload.  If it's an 8 GPU system, it's stuck at 8.

e.g. of sysfs leak after driver unload:
atitest@atitest:/sys/devices/pci0000:80/0000:80:02.0/0000:81:00.0/0000:82:00.0/0000:83:00.0$ ls xgmi_hive_info/
xgmi_hive_id

Thanks,

Jon


I see the leak, but how is it related to amdgpu_reset_domain ? How you think that he causing this ?

Andrey




Driver load (get ref happens on both device add to hive and init per device):
[   61.975900] amdkcl: loading out-of-tree module taints kernel.
[   61.975973] amdkcl: module verification failed: signature and/or required key missing - tainting kernel
[   62.065546] amdkcl: Warning: fail to get symbol cancel_work, replace it with kcl stub
[   62.081920] AMD-Vi: AMD IOMMUv2 functionality not available on this system - This is not a bug.
[   62.491119] [drm] amdgpu kernel modesetting enabled.
[   62.491122] [drm] amdgpu version: 5.18.2
[   62.491124] [drm] OS DRM version: 5.15.0
[   62.491337] amdgpu: CRAT table not found
[   62.491341] amdgpu: Virtual CRAT table created for CPU
[   62.491360] amdgpu: Topology: Add CPU node
[   62.603556] amdgpu: PeerDirect support was initialized successfully
[   62.603847] amdgpu 0000:83:00.0: enabling device (0100 -> 0102)
[   62.603987] [drm] initializing kernel modesetting (VEGA20 0x1002:0x66A1 0x1002:0x0834 0x00).
[   62.604023] [drm] register mmio base: 0xFBD00000
[   62.604026] [drm] register mmio size: 524288
[   62.604171] [drm] add ip block number 0 <soc15_common>
[   62.604175] [drm] add ip block number 1 <gmc_v9_0>
[   62.604177] [drm] add ip block number 2 <vega20_ih>
[   62.604180] [drm] add ip block number 3 <psp>
[   62.604182] [drm] add ip block number 4 <powerplay>
[   62.604185] [drm] add ip block number 5 <dm>
[   62.604187] [drm] add ip block number 6 <gfx_v9_0>
[   62.604190] [drm] add ip block number 7 <sdma_v4_0>
[   62.604192] [drm] add ip block number 8 <uvd_v7_0>
[   62.604194] [drm] add ip block number 9 <vce_v4_0>
[   62.641771] amdgpu 0000:83:00.0: amdgpu: Fetched VBIOS from ROM BAR
[   62.641777] amdgpu: ATOM BIOS: 113-D1630200-112
[   62.713418] [drm] UVD(0) is enabled in VM mode
[   62.713423] [drm] UVD(1) is enabled in VM mode
[   62.713426] [drm] UVD(0) ENC is enabled in VM mode
[   62.713428] [drm] UVD(1) ENC is enabled in VM mode
[   62.713430] [drm] VCE enabled in VM mode
[   62.713433] amdgpu 0000:83:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[   62.713472] [drm] GPU posting now...
[   62.713993] amdgpu 0000:83:00.0: amdgpu: MEM ECC is active.
[   62.713995] amdgpu 0000:83:00.0: amdgpu: SRAM ECC is active.
[   62.714006] amdgpu 0000:83:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[7fff] ras_mask[7fff]
[   62.714018] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[   62.714026] amdgpu 0000:83:00.0: amdgpu: VRAM: 32752M 0x0000008000000000 - 0x00000087FEFFFFFF (32752M used)
[   62.714029] amdgpu 0000:83:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[   62.714032] amdgpu 0000:83:00.0: amdgpu: AGP: 267845632M 0x0000009000000000 - 0x0000FFFFFFFFFFFF
[   62.714043] [drm] Detected VRAM RAM=32752M, BAR=32768M
[   62.714044] [drm] RAM width 4096bits HBM
[   62.714050] debugfs: Directory 'ttm' with parent '/' already present!
[   62.714146] [drm] amdgpu: 32752M of VRAM memory ready
[   62.714149] [drm] amdgpu: 40203M of GTT memory ready.
[   62.714170] [drm] GART: num cpu pages 131072, num gpu pages 131072
[   62.714266] [drm] PCIE GART of 512M enabled.
[   62.714267] [drm] PTB located at 0x0000008000000000
[   62.731067] amdgpu 0000:83:00.0: amdgpu: PSP runtime database doesn't exist
[   62.731075] amdgpu 0000:83:00.0: amdgpu: PSP runtime database doesn't exist
[   62.731449] amdgpu: [powerplay] hwmgr_sw_init smu backed is vega20_smu
[   62.743177] [drm] Found UVD firmware ENC: 1.2 DEC: .43 Family ID: 19
[   62.743244] [drm] PSP loading UVD firmware
[   62.744525] [drm] Found VCE firmware Version: 57.6 Binary ID: 4
[   62.744689] [drm] PSP loading VCE firmware
[   62.896804] [drm] reserve 0x400000 from 0x87fec00000 for PSP TMR
[   62.979421] amdgpu 0000:83:00.0: amdgpu: HDCP: optional hdcp ta ucode is not available
[   62.979427] amdgpu 0000:83:00.0: amdgpu: DTM: optional dtm ta ucode is not available
[   62.979430] amdgpu 0000:83:00.0: amdgpu: RAP: optional rap ta ucode is not available
[   62.979432] amdgpu 0000:83:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[   62.982386] [drm] Display Core initialized with v3.2.196!
[   62.984514] [drm] kiq ring mec 2 pipe 1 q 0
[   63.026846] [drm] UVD and UVD ENC initialized successfully.
[   63.225760] [drm] VCE initialized successfully.
[   63.244442] amdgpu: [dbg_xgmi_hive_get] ref_count 2
[   63.244448] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
[   63.244454] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
[   63.244457] Workqueue: events work_for_cpu_fn
[   63.244471] Call Trace:
[   63.244474]  <TASK>
[   63.244479]  dump_stack_lvl+0x4a/0x63
[   63.244493]  dump_stack+0x10/0x16
[   63.244501]  amdgpu_get_xgmi_hive+0x217/0x2a0 [amdgpu]
[   63.245047]  amdgpu_xgmi_add_device+0xcc/0x450 [amdgpu]
[   63.245463]  ? amdgpu_ras_recovery_init+0x13d/0x2e0 [amdgpu]
[   63.245879]  ? vce_v4_0_hw_init.cold+0xc/0x13 [amdgpu]
[   63.246466]  amdgpu_device_init.cold+0x15bd/0x1fe3 [amdgpu]
[   63.247055]  ? pci_bus_read_config_word+0x4a/0x70
[   63.247064]  ? do_pci_enable_device+0xdb/0x110
[   63.247070]  amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
[   63.247463]  amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
[   63.247868]  local_pci_probe+0x4b/0x90
[   63.247876]  work_for_cpu_fn+0x1a/0x30
[   63.247881]  process_one_work+0x22b/0x3d0
[   63.247887]  worker_thread+0x21d/0x3f0
[   63.247893]  ? process_one_work+0x3d0/0x3d0
[   63.247898]  kthread+0x12a/0x150
[   63.247905]  ? set_kthread_struct+0x50/0x50
[   63.247910]  ret_from_fork+0x22/0x30
[   63.247922]  </TASK>
[   63.248563] amdgpu 0000:83:00.0: amdgpu: XGMI: Add node 0, hive 0x25bbae7e3fd04cf4.
[   63.248569] amdgpu: [dbg_xgmi_hive_get] ref_count 3
[   63.248572] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
[   63.248578] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
[   63.248580] Workqueue: events work_for_cpu_fn
[   63.248587] Call Trace:
[   63.248588]  <TASK>
[   63.248590]  dump_stack_lvl+0x4a/0x63
[   63.248598]  dump_stack+0x10/0x16
[   63.248604]  amdgpu_get_xgmi_hive+0x285/0x2a0 [amdgpu]
[   63.249033]  amdgpu_device_init.cold+0x15cd/0x1fe3 [amdgpu]
[   63.249621]  ? pci_bus_read_config_word+0x4a/0x70
[   63.249627]  ? do_pci_enable_device+0xdb/0x110
[   63.249632]  amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
[   63.250022]  amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
[   63.250410]  local_pci_probe+0x4b/0x90
[   63.250416]  work_for_cpu_fn+0x1a/0x30
[   63.250421]  process_one_work+0x22b/0x3d0
[   63.250428]  worker_thread+0x21d/0x3f0
[   63.250434]  ? process_one_work+0x3d0/0x3d0
[   63.250440]  kthread+0x12a/0x150
[   63.250445]  ? set_kthread_struct+0x50/0x50
[   63.250450]  ret_from_fork+0x22/0x30
[   63.250458]  </TASK>
[   63.268869] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[   63.269180] amdgpu: sdma_bitmap: ffff
[   63.605188] memmap_init_zone_device initialised 8388608 pages in 132ms
[   63.605203] amdgpu: HMM registered 32752MB device memory
[   63.605244] amdgpu: [powerplay] [MemMclks]: memclk dpm not enabled!

[   63.605263] amdgpu: Virtual CRAT table created for GPU
[   63.605651] amdgpu: [powerplay] [MemMclks]: memclk dpm not enabled!

[   63.605659] amdgpu: Topology: Add dGPU node [0x66a1:0x1002]
[   63.605670] kfd kfd: amdgpu: added device 1002:66a1
[   63.626300] amdgpu 0000:83:00.0: amdgpu: SE 4, SH per SE 1, CU per SH 16, active_cu_number 64
[   63.626517] amdgpu 0000:83:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
[   63.626522] amdgpu 0000:83:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[   63.626525] amdgpu 0000:83:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[   63.626529] amdgpu 0000:83:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[   63.626531] amdgpu 0000:83:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[   63.626534] amdgpu 0000:83:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[   63.626537] amdgpu 0000:83:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[   63.626540] amdgpu 0000:83:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[   63.626543] amdgpu 0000:83:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[   63.626546] amdgpu 0000:83:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[   63.626549] amdgpu 0000:83:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
[   63.626552] amdgpu 0000:83:00.0: amdgpu: ring page0 uses VM inv eng 1 on hub 1
[   63.626555] amdgpu 0000:83:00.0: amdgpu: ring sdma1 uses VM inv eng 4 on hub 1
[   63.626558] amdgpu 0000:83:00.0: amdgpu: ring page1 uses VM inv eng 5 on hub 1
[   63.626561] amdgpu 0000:83:00.0: amdgpu: ring uvd_0 uses VM inv eng 6 on hub 1
[   63.626563] amdgpu 0000:83:00.0: amdgpu: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1
[   63.626566] amdgpu 0000:83:00.0: amdgpu: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1
[   63.626569] amdgpu 0000:83:00.0: amdgpu: ring uvd_1 uses VM inv eng 9 on hub 1
[   63.626572] amdgpu 0000:83:00.0: amdgpu: ring uvd_enc_1.0 uses VM inv eng 10 on hub 1
[   63.626575] amdgpu 0000:83:00.0: amdgpu: ring uvd_enc_1.1 uses VM inv eng 11 on hub 1
[   63.626577] amdgpu 0000:83:00.0: amdgpu: ring vce0 uses VM inv eng 12 on hub 1
[   63.626580] amdgpu 0000:83:00.0: amdgpu: ring vce1 uses VM inv eng 13 on hub 1
[   63.626583] amdgpu 0000:83:00.0: amdgpu: ring vce2 uses VM inv eng 14 on hub 1
[   63.636996] amdgpu: Detected AMDGPU DF Counters. # of Counters = 8.
[   63.637046] amdgpu: Detected AMDGPU 2 Perf Events.
[   63.637428] [drm] Initialized amdgpu 3.48.0 20150101 for 0000:83:00.0 on minor 1
[   63.637937] amdgpu 0000:86:00.0: enabling device (0100 -> 0102)
[   63.638043] [drm] initializing kernel modesetting (VEGA20 0x1002:0x66A1 0x1002:0x0834 0x00).
[   63.638090] [drm] register mmio base: 0xFBB00000
[   63.638092] [drm] register mmio size: 524288
[   63.638261] [drm] add ip block number 0 <soc15_common>
[   63.638263] [drm] add ip block number 1 <gmc_v9_0>
[   63.638265] [drm] add ip block number 2 <vega20_ih>
[   63.638266] [drm] add ip block number 3 <psp>
[   63.638267] [drm] add ip block number 4 <powerplay>
[   63.638269] [drm] add ip block number 5 <dm>
[   63.638271] [drm] add ip block number 6 <gfx_v9_0>
[   63.638272] [drm] add ip block number 7 <sdma_v4_0>
[   63.638273] [drm] add ip block number 8 <uvd_v7_0>
[   63.638275] [drm] add ip block number 9 <vce_v4_0>
[   63.675838] amdgpu 0000:86:00.0: amdgpu: Fetched VBIOS from ROM BAR
[   63.675842] amdgpu: ATOM BIOS: 113-D1630200-112
[   63.675867] [drm] UVD(0) is enabled in VM mode
[   63.675868] [drm] UVD(1) is enabled in VM mode
[   63.675869] [drm] UVD(0) ENC is enabled in VM mode
[   63.675870] [drm] UVD(1) ENC is enabled in VM mode
[   63.675871] [drm] VCE enabled in VM mode
[   63.675873] amdgpu 0000:86:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[   63.675899] [drm] GPU posting now...
[   63.676276] amdgpu 0000:86:00.0: amdgpu: MEM ECC is active.
[   63.676277] amdgpu 0000:86:00.0: amdgpu: SRAM ECC is active.
[   63.676286] amdgpu 0000:86:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[7fff] ras_mask[7fff]
[   63.676297] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[   63.676304] amdgpu 0000:86:00.0: amdgpu: VRAM: 32752M 0x0000008800000000 - 0x0000008FFEFFFFFF (32752M used)
[   63.676307] amdgpu 0000:86:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[   63.676310] amdgpu 0000:86:00.0: amdgpu: AGP: 267845632M 0x0000009000000000 - 0x0000FFFFFFFFFFFF
[   63.676321] [drm] Detected VRAM RAM=32752M, BAR=32768M
[   63.676322] [drm] RAM width 4096bits HBM
[   63.676363] [drm] amdgpu: 32752M of VRAM memory ready
[   63.676365] [drm] amdgpu: 40203M of GTT memory ready.
[   63.676388] [drm] GART: num cpu pages 131072, num gpu pages 131072
[   63.676481] [drm] PCIE GART of 512M enabled.
[   63.676482] [drm] PTB located at 0x0000008800000000
[   63.676730] amdgpu 0000:86:00.0: amdgpu: PSP runtime database doesn't exist
[   63.676733] amdgpu 0000:86:00.0: amdgpu: PSP runtime database doesn't exist
[   63.677088] amdgpu: [powerplay] hwmgr_sw_init smu backed is vega20_smu
[   63.678862] [drm] Found UVD firmware ENC: 1.2 DEC: .43 Family ID: 19
[   63.678918] [drm] PSP loading UVD firmware
[   63.679487] [drm] Found VCE firmware Version: 57.6 Binary ID: 4
[   63.679619] [drm] PSP loading VCE firmware
[   63.831730] [drm] reserve 0x400000 from 0x8ffec00000 for PSP TMR
[   63.914508] amdgpu 0000:86:00.0: amdgpu: HDCP: optional hdcp ta ucode is not available
[   63.914513] amdgpu 0000:86:00.0: amdgpu: DTM: optional dtm ta ucode is not available
[   63.914516] amdgpu 0000:86:00.0: amdgpu: RAP: optional rap ta ucode is not available
[   63.914518] amdgpu 0000:86:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[   63.917458] [drm] Display Core initialized with v3.2.196!
[   63.919616] [drm] kiq ring mec 2 pipe 1 q 0
[   63.961950] [drm] UVD and UVD ENC initialized successfully.
[   64.160863] [drm] VCE initialized successfully.
[   64.179285] amdgpu: [dbg_xgmi_hive_get] ref_count 4
[   64.179291] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
[   64.179297] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
[   64.179299] Workqueue: events work_for_cpu_fn
[   64.179311] Call Trace:
[   64.179315]  <TASK>
[   64.179320]  dump_stack_lvl+0x4a/0x63
[   64.179331]  dump_stack+0x10/0x16
[   64.179340]  amdgpu_get_xgmi_hive+0x217/0x2a0 [amdgpu]
[   64.179904]  amdgpu_xgmi_add_device+0xcc/0x450 [amdgpu]
[   64.180318]  ? amdgpu_ras_recovery_init+0x13d/0x2e0 [amdgpu]
[   64.180733]  ? vce_v4_0_hw_init.cold+0xc/0x13 [amdgpu]
[   64.181321]  amdgpu_device_init.cold+0x15bd/0x1fe3 [amdgpu]
[   64.181909]  ? pci_bus_read_config_word+0x4a/0x70
[   64.181917]  ? do_pci_enable_device+0xdb/0x110
[   64.181923]  amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
[   64.182315]  amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
[   64.182703]  local_pci_probe+0x4b/0x90
[   64.182710]  work_for_cpu_fn+0x1a/0x30
[   64.182715]  process_one_work+0x22b/0x3d0
[   64.182722]  worker_thread+0x21d/0x3f0
[   64.182728]  ? process_one_work+0x3d0/0x3d0
[   64.182734]  kthread+0x12a/0x150
[   64.182740]  ? set_kthread_struct+0x50/0x50
[   64.182745]  ret_from_fork+0x22/0x30
[   64.182756]  </TASK>
[   64.184561] amdgpu 0000:86:00.0: amdgpu: XGMI: Add node 1, hive 0x25bbae7e3fd04cf4.
[   64.184568] amdgpu: [dbg_xgmi_hive_get] ref_count 5
[   64.184571] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
[   64.184576] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
[   64.184578] Workqueue: events work_for_cpu_fn
[   64.184585] Call Trace:
[   64.184587]  <TASK>
[   64.184589]  dump_stack_lvl+0x4a/0x63
[   64.184596]  dump_stack+0x10/0x16
[   64.184602]  amdgpu_get_xgmi_hive+0x285/0x2a0 [amdgpu]
[   64.185041]  amdgpu_device_init.cold+0x15cd/0x1fe3 [amdgpu]
[   64.185624]  ? pci_bus_read_config_word+0x4a/0x70
[   64.185631]  ? do_pci_enable_device+0xdb/0x110
[   64.185636]  amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
[   64.186027]  amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
[   64.186416]  local_pci_probe+0x4b/0x90
[   64.186422]  work_for_cpu_fn+0x1a/0x30
[   64.186428]  process_one_work+0x22b/0x3d0
[   64.186434]  worker_thread+0x21d/0x3f0
[   64.186439]  ? process_one_work+0x3d0/0x3d0
[   64.186445]  kthread+0x12a/0x150
[   64.186450]  ? set_kthread_struct+0x50/0x50
[   64.186455]  ret_from_fork+0x22/0x30
[   64.186464]  </TASK>
[   64.206119] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[   64.206433] amdgpu: sdma_bitmap: ffff
[   64.552064] memmap_init_zone_device initialised 8388608 pages in 132ms
[   64.552080] amdgpu: HMM registered 32752MB device memory
[   64.552116] amdgpu: [powerplay] [MemMclks]: memclk dpm not enabled!

[   64.552138] amdgpu: Virtual CRAT table created for GPU
[   64.552978] amdgpu: [powerplay] [MemMclks]: memclk dpm not enabled!

[   64.552988] amdgpu: Topology: Add dGPU node [0x66a1:0x1002]
[   64.552999] kfd kfd: amdgpu: added device 1002:66a1
[   64.570314] amdgpu 0000:86:00.0: amdgpu: SE 4, SH per SE 1, CU per SH 16, active_cu_number 64
[   64.570527] amdgpu 0000:86:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
[   64.570531] amdgpu 0000:86:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[   64.570535] amdgpu 0000:86:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[   64.570538] amdgpu 0000:86:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[   64.570541] amdgpu 0000:86:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[   64.570544] amdgpu 0000:86:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[   64.570547] amdgpu 0000:86:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[   64.570550] amdgpu 0000:86:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[   64.570552] amdgpu 0000:86:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[   64.570556] amdgpu 0000:86:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[   64.570559] amdgpu 0000:86:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
[   64.570562] amdgpu 0000:86:00.0: amdgpu: ring page0 uses VM inv eng 1 on hub 1
[   64.570565] amdgpu 0000:86:00.0: amdgpu: ring sdma1 uses VM inv eng 4 on hub 1
[   64.570567] amdgpu 0000:86:00.0: amdgpu: ring page1 uses VM inv eng 5 on hub 1
[   64.570570] amdgpu 0000:86:00.0: amdgpu: ring uvd_0 uses VM inv eng 6 on hub 1
[   64.570573] amdgpu 0000:86:00.0: amdgpu: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1
[   64.570576] amdgpu 0000:86:00.0: amdgpu: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1
[   64.570579] amdgpu 0000:86:00.0: amdgpu: ring uvd_1 uses VM inv eng 9 on hub 1
[   64.570581] amdgpu 0000:86:00.0: amdgpu: ring uvd_enc_1.0 uses VM inv eng 10 on hub 1
[   64.570584] amdgpu 0000:86:00.0: amdgpu: ring uvd_enc_1.1 uses VM inv eng 11 on hub 1
[   64.570587] amdgpu 0000:86:00.0: amdgpu: ring vce0 uses VM inv eng 12 on hub 1
[   64.570589] amdgpu 0000:86:00.0: amdgpu: ring vce1 uses VM inv eng 13 on hub 1
[   64.570592] amdgpu 0000:86:00.0: amdgpu: ring vce2 uses VM inv eng 14 on hub 1
[   64.581070] amdgpu: [dbg_xgmi_hive_get] ref_count 6
[   64.581075] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
[   64.581079] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
[   64.581081] Workqueue: events work_for_cpu_fn
[   64.581089] Call Trace:
[   64.581091]  <TASK>
[   64.581094]  dump_stack_lvl+0x4a/0x63
[   64.581103]  dump_stack+0x10/0x16
[   64.581109]  amdgpu_get_xgmi_hive+0x285/0x2a0 [amdgpu]
[   64.581489]  amdgpu_xgmi_set_pstate+0xe/0x30 [amdgpu]
[   64.581723]  amdgpu_device_ip_late_init+0x2dc/0x380 [amdgpu]
[   64.581943]  amdgpu_device_init.cold+0x1805/0x1fe3 [amdgpu]
[   64.582288]  ? pci_bus_read_config_word+0x4a/0x70
[   64.582295]  ? do_pci_enable_device+0xdb/0x110
[   64.582298]  amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
[   64.582520]  amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
[   64.582738]  local_pci_probe+0x4b/0x90
[   64.582743]  work_for_cpu_fn+0x1a/0x30
[   64.582746]  process_one_work+0x22b/0x3d0
[   64.582750]  worker_thread+0x21d/0x3f0
[   64.582753]  ? process_one_work+0x3d0/0x3d0
[   64.582756]  kthread+0x12a/0x150
[   64.582761]  ? set_kthread_struct+0x50/0x50
[   64.582764]  ret_from_fork+0x22/0x30
[   64.582772]  </TASK>
[   64.582774] amdgpu: [dbg_xgmi_hive_put] ref_count 5
[   64.582775] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
[   64.582778] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
[   64.582779] Workqueue: events work_for_cpu_fn
[   64.582782] Call Trace:
[   64.582783]  <TASK>
[   64.582784]  dump_stack_lvl+0x4a/0x63
[   64.582789]  dump_stack+0x10/0x16
[   64.582792]  amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu]
[   64.583028]  amdgpu_xgmi_set_pstate+0x1b/0x30 [amdgpu]
[   64.583262]  amdgpu_device_ip_late_init+0x2dc/0x380 [amdgpu]
[   64.583482]  amdgpu_device_init.cold+0x1805/0x1fe3 [amdgpu]
[   64.583833]  ? pci_bus_read_config_word+0x4a/0x70
[   64.583836]  ? do_pci_enable_device+0xdb/0x110
[   64.583840]  amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
[   64.584072]  amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
[   64.584304]  local_pci_probe+0x4b/0x90
[   64.584307]  work_for_cpu_fn+0x1a/0x30
[   64.584311]  process_one_work+0x22b/0x3d0
[   64.584314]  worker_thread+0x21d/0x3f0
[   64.584318]  ? process_one_work+0x3d0/0x3d0
[   64.584321]  kthread+0x12a/0x150
[   64.584324]  ? set_kthread_struct+0x50/0x50
[   64.584327]  ret_from_fork+0x22/0x30
[   64.584333]  </TASK>
[   64.584342] amdgpu: [dbg_xgmi_hive_get] ref_count 6
[   64.584344] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
[   64.584347] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
[   64.584348] Workqueue: events work_for_cpu_fn
[   64.584352] Call Trace:
[   64.584353]  <TASK>
[   64.584354]  dump_stack_lvl+0x4a/0x63
[   64.584358]  dump_stack+0x10/0x16
[   64.584362]  amdgpu_get_xgmi_hive+0x285/0x2a0 [amdgpu]
[   64.584610]  amdgpu_xgmi_set_pstate+0xe/0x30 [amdgpu]
[   64.584856]  amdgpu_device_ip_late_init+0x2dc/0x380 [amdgpu]
[   64.585086]  amdgpu_device_init.cold+0x1805/0x1fe3 [amdgpu]
[   64.585437]  ? pci_bus_read_config_word+0x4a/0x70
[   64.585440]  ? do_pci_enable_device+0xdb/0x110
[   64.585443]  amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
[   64.585679]  amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
[   64.585922]  local_pci_probe+0x4b/0x90
[   64.585926]  work_for_cpu_fn+0x1a/0x30
[   64.585929]  process_one_work+0x22b/0x3d0
[   64.585932]  worker_thread+0x21d/0x3f0
[   64.585936]  ? process_one_work+0x3d0/0x3d0
[   64.585939]  kthread+0x12a/0x150
[   64.585942]  ? set_kthread_struct+0x50/0x50
[   64.585945]  ret_from_fork+0x22/0x30
[   64.585950]  </TASK>
[   64.585951] amdgpu: [dbg_xgmi_hive_put] ref_count 5
[   64.585953] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
[   64.585956] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
[   64.585957] Workqueue: events work_for_cpu_fn
[   64.585960] Call Trace:
[   64.585961]  <TASK>
[   64.585963]  dump_stack_lvl+0x4a/0x63
[   64.585967]  dump_stack+0x10/0x16
[   64.585970]  amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu]
[   64.586213]  amdgpu_xgmi_set_pstate+0x1b/0x30 [amdgpu]
[   64.586458]  amdgpu_device_ip_late_init+0x2dc/0x380 [amdgpu]
[   64.586688]  amdgpu_device_init.cold+0x1805/0x1fe3 [amdgpu]
[   64.587037]  ? pci_bus_read_config_word+0x4a/0x70
[   64.587040]  ? do_pci_enable_device+0xdb/0x110
[   64.587043]  amdgpu_driver_load_kms+0x1a/0x120 [amdgpu]
[   64.587277]  amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
[   64.587509]  local_pci_probe+0x4b/0x90
[   64.587512]  work_for_cpu_fn+0x1a/0x30
[   64.587515]  process_one_work+0x22b/0x3d0
[   64.587519]  worker_thread+0x21d/0x3f0
[   64.587523]  ? process_one_work+0x3d0/0x3d0
[   64.587526]  kthread+0x12a/0x150
[   64.587529]  ? set_kthread_struct+0x50/0x50
[   64.587532]  ret_from_fork+0x22/0x30
[   64.587537]  </TASK>
[   64.587619] amdgpu: Detected AMDGPU DF Counters. # of Counters = 8.
[   64.587663] amdgpu: Detected AMDGPU 2 Perf Events.
[   64.588081] [drm] Initialized amdgpu 3.48.0 20150101 for 0000:86:00.0 on minor 2

Then driver unload (reference stuck at 2):
[  110.117018] amdgpu 0000:86:00.0: amdgpu: amdgpu: finishing device.
[  110.131638] [drm] free PSP TMR buffer
[  110.420529] amdgpu: [dbg_xgmi_hive_put] ref_count 4
[  110.420537] CPU: 27 PID: 1748 Comm: modprobe Tainted: G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
[  110.420545] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
[  110.420548] Call Trace:
[  110.420551]  <TASK>
[  110.420556]  dump_stack_lvl+0x4a/0x63
[  110.420569]  dump_stack+0x10/0x16
[  110.420578]  amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu]
[  110.421001]  amdgpu_xgmi_remove_device+0x11d/0x1c0 [amdgpu]
[  110.421380]  amdgpu_device_fini_sw+0x63/0x4c0 [amdgpu]
[  110.421724]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[  110.422070]  drm_dev_release+0x28/0x50 [drm]
[  110.422145]  devm_drm_dev_init_release+0x38/0x60 [drm]
[  110.422190]  devm_action_release+0x15/0x20
[  110.422198]  release_nodes+0x40/0xb0
[  110.422205]  devres_release_all+0x9e/0xe0
[  110.422212]  device_release_driver_internal+0x117/0x1f0
[  110.422218]  driver_detach+0x4c/0xa0
[  110.422222]  bus_remove_driver+0x6c/0xf0
[  110.422227]  driver_unregister+0x31/0x50
[  110.422231]  pci_unregister_driver+0x40/0x90
[  110.422238]  amdgpu_exit+0x15/0x446 [amdgpu]
[  110.422791]  __x64_sys_delete_module+0x14e/0x260
[  110.422801]  ? do_syscall_64+0x69/0xc0
[  110.422809]  ? __x64_sys_read+0x1a/0x20
[  110.422817]  ? do_syscall_64+0x69/0xc0
[  110.422821]  ? ksys_read+0x67/0xf0
[  110.422825]  do_syscall_64+0x5c/0xc0
[  110.422830]  ? __x64_sys_read+0x1a/0x20
[  110.422834]  ? do_syscall_64+0x69/0xc0
[  110.422839]  ? syscall_exit_to_user_mode+0x27/0x50
[  110.422846]  ? __x64_sys_openat+0x20/0x30
[  110.422853]  ? do_syscall_64+0x69/0xc0
[  110.422857]  ? do_syscall_64+0x69/0xc0
[  110.422862]  ? irqentry_exit+0x1d/0x30
[  110.422868]  ? exc_page_fault+0x89/0x170
[  110.422874]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[  110.422885] RIP: 0033:0x7f1576682a6b
[  110.422892] Code: 73 01 c3 48 8b 0d 25 c4 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f5 c3 0c 00 f7 d8 64 89 01 48
[  110.422897] RSP: 002b:00007ffcb96e0bf8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[  110.422904] RAX: ffffffffffffffda RBX: 000056347ba57550 RCX: 00007f1576682a6b
[  110.422908] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000056347ba575b8
[  110.422911] RBP: 000056347ba57550 R08: 0000000000000000 R09: 0000000000000000
[  110.422913] R10: 00007f15766feac0 R11: 0000000000000206 R12: 000056347ba575b8
[  110.422916] R13: 0000000000000000 R14: 000056347ba575b8 R15: 000056347ba57550
[  110.422921]  </TASK>
[  110.425941] [drm] amdgpu: ttm finalized
[  110.489186] amdgpu 0000:83:00.0: amdgpu: amdgpu: finishing device.
[  110.504025] [drm] free PSP TMR buffer
[  110.762272] amdgpu: [dbg_xgmi_hive_put] ref_count 3
[  110.762280] CPU: 27 PID: 1748 Comm: modprobe Tainted: G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
[  110.762288] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
[  110.762290] Call Trace:
[  110.762294]  <TASK>
[  110.762298]  dump_stack_lvl+0x4a/0x63
[  110.762313]  dump_stack+0x10/0x16
[  110.762319]  amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu]
[  110.762663]  amdgpu_xgmi_remove_device+0x11d/0x1c0 [amdgpu]
[  110.762965]  amdgpu_device_fini_sw+0x63/0x4c0 [amdgpu]
[  110.763231]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[  110.763519]  drm_dev_release+0x28/0x50 [drm]
[  110.763569]  devm_drm_dev_init_release+0x38/0x60 [drm]
[  110.763609]  devm_action_release+0x15/0x20
[  110.763617]  release_nodes+0x40/0xb0
[  110.763624]  devres_release_all+0x9e/0xe0
[  110.763631]  device_release_driver_internal+0x117/0x1f0
[  110.763636]  driver_detach+0x4c/0xa0
[  110.763640]  bus_remove_driver+0x6c/0xf0
[  110.763646]  driver_unregister+0x31/0x50
[  110.763650]  pci_unregister_driver+0x40/0x90
[  110.763657]  amdgpu_exit+0x15/0x446 [amdgpu]
[  110.764153]  __x64_sys_delete_module+0x14e/0x260
[  110.764164]  ? do_syscall_64+0x69/0xc0
[  110.764172]  ? __x64_sys_read+0x1a/0x20
[  110.764180]  ? do_syscall_64+0x69/0xc0
[  110.764184]  ? ksys_read+0x67/0xf0
[  110.764189]  do_syscall_64+0x5c/0xc0
[  110.764193]  ? __x64_sys_read+0x1a/0x20
[  110.764197]  ? do_syscall_64+0x69/0xc0
[  110.764202]  ? syscall_exit_to_user_mode+0x27/0x50
[  110.764209]  ? __x64_sys_openat+0x20/0x30
[  110.764217]  ? do_syscall_64+0x69/0xc0
[  110.764221]  ? do_syscall_64+0x69/0xc0
[  110.764226]  ? irqentry_exit+0x1d/0x30
[  110.764232]  ? exc_page_fault+0x89/0x170
[  110.764238]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[  110.764248] RIP: 0033:0x7f1576682a6b
[  110.764255] Code: 73 01 c3 48 8b 0d 25 c4 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f5 c3 0c 00 f7 d8 64 89 01 48
[  110.764260] RSP: 002b:00007ffcb96e0bf8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[  110.764267] RAX: ffffffffffffffda RBX: 000056347ba57550 RCX: 00007f1576682a6b
[  110.764270] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000056347ba575b8
[  110.764273] RBP: 000056347ba57550 R08: 0000000000000000 R09: 0000000000000000
[  110.764275] R10: 00007f15766feac0 R11: 0000000000000206 R12: 000056347ba575b8
[  110.764278] R13: 0000000000000000 R14: 000056347ba575b8 R15: 000056347ba57550
[  110.764283]  </TASK>
[  110.764326] amdgpu: [dbg_xgmi_hive_put] ref_count 2
[  110.764329] CPU: 27 PID: 1748 Comm: modprobe Tainted: G           OE     5.15.0-46-generic #49~20.04.1-Ubuntu
[  110.764334] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018
[  110.764336] Call Trace:
[  110.764337]  <TASK>
[  110.764339]  dump_stack_lvl+0x4a/0x63
[  110.764347]  dump_stack+0x10/0x16
[  110.764354]  amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu]
[  110.764624]  amdgpu_xgmi_remove_device+0x1ad/0x1c0 [amdgpu]
[  110.764791]  amdgpu_device_fini_sw+0x63/0x4c0 [amdgpu]
[  110.764937]  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
[  110.765085]  drm_dev_release+0x28/0x50 [drm]
[  110.765108]  devm_drm_dev_init_release+0x38/0x60 [drm]
[  110.765130]  devm_action_release+0x15/0x20
[  110.765134]  release_nodes+0x40/0xb0
[  110.765137]  devres_release_all+0x9e/0xe0
[  110.765141]  device_release_driver_internal+0x117/0x1f0
[  110.765144]  driver_detach+0x4c/0xa0
[  110.765146]  bus_remove_driver+0x6c/0xf0
[  110.765148]  driver_unregister+0x31/0x50
[  110.765150]  pci_unregister_driver+0x40/0x90
[  110.765154]  amdgpu_exit+0x15/0x446 [amdgpu]
[  110.765434]  __x64_sys_delete_module+0x14e/0x260
[  110.765438]  ? do_syscall_64+0x69/0xc0
[  110.765441]  ? __x64_sys_read+0x1a/0x20
[  110.765444]  ? do_syscall_64+0x69/0xc0
[  110.765446]  ? ksys_read+0x67/0xf0
[  110.765449]  do_syscall_64+0x5c/0xc0
[  110.765451]  ? __x64_sys_read+0x1a/0x20
[  110.765454]  ? do_syscall_64+0x69/0xc0
[  110.765456]  ? syscall_exit_to_user_mode+0x27/0x50
[  110.765460]  ? __x64_sys_openat+0x20/0x30
[  110.765464]  ? do_syscall_64+0x69/0xc0
[  110.765466]  ? do_syscall_64+0x69/0xc0
[  110.765469]  ? irqentry_exit+0x1d/0x30
[  110.765472]  ? exc_page_fault+0x89/0x170
[  110.765476]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[  110.765480] RIP: 0033:0x7f1576682a6b
[  110.765482] Code: 73 01 c3 48 8b 0d 25 c4 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f5 c3 0c 00 f7 d8 64 89 01 48
[  110.765485] RSP: 002b:00007ffcb96e0bf8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[  110.765488] RAX: ffffffffffffffda RBX: 000056347ba57550 RCX: 00007f1576682a6b
[  110.765489] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000056347ba575b8
[  110.765491] RBP: 000056347ba57550 R08: 0000000000000000 R09: 0000000000000000
[  110.765492] R10: 00007f15766feac0 R11: 0000000000000206 R12: 000056347ba575b8
[  110.765494] R13: 0000000000000000 R14: 000056347ba575b8 R15: 000056347ba57550
[  110.765496]  </TASK>
[  110.768091] [drm] amdgpu: ttm finalized

-----Original Message-----
From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
Sent: August 11, 2022 12:43 PM
To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>; Kuehling, Felix
<Felix.Kuehling@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
Subject: Re: [PATCH] drm/amdgpu: fix reset domain xgmi hive info reference
leak


On 2022-08-11 11:34, Kim, Jonathan wrote:
[Public]

-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
Sent: August 11, 2022 11:19 AM
To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Kim, Jonathan
<Jonathan.Kim@xxxxxxx>
Subject: Re: [PATCH] drm/amdgpu: fix reset domain xgmi hive info reference
leak

Am 2022-08-11 um 09:42 schrieb Jonathan Kim:
When an xgmi node is added to the hive, it takes another hive
reference for its reset domain.

This extra reference was not dropped on device removal from the
hive so drop it.

Signed-off-by: Jonathan Kim <jonathan.kim@xxxxxxx>
---
    drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 3 +++
    1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
index 1b108d03e785..560bf1c98f08 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
@@ -731,6 +731,9 @@ int amdgpu_xgmi_remove_device(struct
amdgpu_device *adev)
      mutex_unlock(&hive->hive_lock);

      amdgpu_put_xgmi_hive(hive);
+   /* device is removed from the hive so remove its reset domain
reference */
+   if (adev->reset_domain && adev->reset_domain == hive-
reset_domain)
+           amdgpu_put_xgmi_hive(hive);
This is some messed up reference counting. If you need an extra
reference from the reset_domain to the hive, that should be owned by the
reset_domain and dropped when the reset_domain is destroyed. And it's
only one reference for the reset_domain, not one reference per adev in
the reset_domain.
Cc'ing Andrey.

What you're saying seems to make more sense to me, but what I got from an
offline conversation with Andrey
was that the reset domain reference per device was intentional.
Maybe Andrey can comment here.

What you're doing here looks like every adev that's in a reset_domain of
its hive has two references to the hive. And if you're dropping the
extra reference here, it still leaves the reset_domain with a dangling
pointer to a hive that may no longer exist. So this extra reference is
kind of pointless.

reset_domain doesn't have any references to the hive, the hive has a
reference to reset_domain


Yes.  Currently one reference is fetched from the device's lifetime on the hive
and the other is from the
per-device reset domain.

Snippet from amdgpu_device_ip_init:
          /**
           * In case of XGMI grab extra reference for reset domain for this device
           */
          if (adev->gmc.xgmi.num_physical_nodes > 1) {
                  if (amdgpu_xgmi_add_device(adev) == 0) { <- [JK] reference is
fetched here


amdgpu_xgmi_add_device calls  amdgpu_get_xgmi_hive and only on the first
time amdgpu_get_xgmi_hive is called and hive is actually allocated and
initialized  will we proceed
to creating the reset domain either from scratch (first creation of the
hive) or by taking reference from adev (see [1])



[1] -
https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/a
mdgpu_xgmi.c#L394

                          struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev);
<- [JK] then here again


So here I don't see how an extra reference to reset_domain is taken if
amdgpu_get_xgmi_hive returns early since the hive already created and
exists in the global hive container ?

Johantan - can u please show the exact flow how recount leak on
reset_domain is happening ?

Andrey


                          if (!hive->reset_domain ||
                              !amdgpu_reset_get_reset_domain(hive->reset_domain)) {
                                  r = -ENOENT;
                                  goto init_failed;
                          }

                          /* Drop the early temporary reset domain we created for device
*/
                          amdgpu_reset_put_reset_domain(adev->reset_domain);
                          adev->reset_domain = hive->reset_domain;
                  }
          }

One of these never gets dropped so a leak happens.
So either the extra reference has to be dropped on device removal from the
hive or from what you've mentioned,
the reset_domain reference fetch should be fixed to grab at the
hive/reset_domain level.
Thanks,

Jon

Regards,
     Felix


      adev->hive = NULL;

      if (atomic_dec_return(&hive->number_devices) == 0) {



[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux