[Public]

Hi Andrey,

Here's the load/unload stack trace. This is a 2 GPU xGMI system.
I added dbg_xgmi_hive_get/put prints of the refcount right after the kobj get/put calls.
The refcount is stuck at 2 on unload. On an 8 GPU system, it's stuck at 8.

Example of the sysfs leak after driver unload:
atitest@atitest:/sys/devices/pci0000:80/0000:80:02.0/0000:81:00.0/0000:82:00.0/0000:83:00.0$ ls xgmi_hive_info/
xgmi_hive_id

Thanks,

Jon

Driver load (a reference is taken both when the device is added to the hive and at per-device init):
[ 61.975900] amdkcl: loading out-of-tree module taints kernel.
[ 61.975973] amdkcl: module verification failed: signature and/or required key missing - tainting kernel
[ 62.065546] amdkcl: Warning: fail to get symbol cancel_work, replace it with kcl stub
[ 62.081920] AMD-Vi: AMD IOMMUv2 functionality not available on this system - This is not a bug.
[ 62.491119] [drm] amdgpu kernel modesetting enabled.
[ 62.491122] [drm] amdgpu version: 5.18.2
[ 62.491124] [drm] OS DRM version: 5.15.0
[ 62.491337] amdgpu: CRAT table not found
[ 62.491341] amdgpu: Virtual CRAT table created for CPU
[ 62.491360] amdgpu: Topology: Add CPU node
[ 62.603556] amdgpu: PeerDirect support was initialized successfully
[ 62.603847] amdgpu 0000:83:00.0: enabling device (0100 -> 0102)
[ 62.603987] [drm] initializing kernel modesetting (VEGA20 0x1002:0x66A1 0x1002:0x0834 0x00).
[ 62.604023] [drm] register mmio base: 0xFBD00000 [ 62.604026] [drm] register mmio size: 524288 [ 62.604171] [drm] add ip block number 0 <soc15_common> [ 62.604175] [drm] add ip block number 1 <gmc_v9_0> [ 62.604177] [drm] add ip block number 2 <vega20_ih> [ 62.604180] [drm] add ip block number 3 <psp> [ 62.604182] [drm] add ip block number 4 <powerplay> [ 62.604185] [drm] add ip block number 5 <dm> [ 62.604187] [drm] add ip block number 6 <gfx_v9_0> [ 62.604190] [drm] add ip block number 7 <sdma_v4_0> [ 62.604192] [drm] add ip block number 8 <uvd_v7_0> [ 62.604194] [drm] add ip block number 9 <vce_v4_0> [ 62.641771] amdgpu 0000:83:00.0: amdgpu: Fetched VBIOS from ROM BAR [ 62.641777] amdgpu: ATOM BIOS: 113-D1630200-112 [ 62.713418] [drm] UVD(0) is enabled in VM mode [ 62.713423] [drm] UVD(1) is enabled in VM mode [ 62.713426] [drm] UVD(0) ENC is enabled in VM mode [ 62.713428] [drm] UVD(1) ENC is enabled in VM mode [ 62.713430] [drm] VCE enabled in VM mode [ 62.713433] amdgpu 0000:83:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported [ 62.713472] [drm] GPU posting now... [ 62.713993] amdgpu 0000:83:00.0: amdgpu: MEM ECC is active. [ 62.713995] amdgpu 0000:83:00.0: amdgpu: SRAM ECC is active. [ 62.714006] amdgpu 0000:83:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[7fff] ras_mask[7fff] [ 62.714018] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit [ 62.714026] amdgpu 0000:83:00.0: amdgpu: VRAM: 32752M 0x0000008000000000 - 0x00000087FEFFFFFF (32752M used) [ 62.714029] amdgpu 0000:83:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF [ 62.714032] amdgpu 0000:83:00.0: amdgpu: AGP: 267845632M 0x0000009000000000 - 0x0000FFFFFFFFFFFF [ 62.714043] [drm] Detected VRAM RAM=32752M, BAR=32768M [ 62.714044] [drm] RAM width 4096bits HBM [ 62.714050] debugfs: Directory 'ttm' with parent '/' already present! 
[ 62.714146] [drm] amdgpu: 32752M of VRAM memory ready [ 62.714149] [drm] amdgpu: 40203M of GTT memory ready. [ 62.714170] [drm] GART: num cpu pages 131072, num gpu pages 131072 [ 62.714266] [drm] PCIE GART of 512M enabled. [ 62.714267] [drm] PTB located at 0x0000008000000000 [ 62.731067] amdgpu 0000:83:00.0: amdgpu: PSP runtime database doesn't exist [ 62.731075] amdgpu 0000:83:00.0: amdgpu: PSP runtime database doesn't exist [ 62.731449] amdgpu: [powerplay] hwmgr_sw_init smu backed is vega20_smu [ 62.743177] [drm] Found UVD firmware ENC: 1.2 DEC: .43 Family ID: 19 [ 62.743244] [drm] PSP loading UVD firmware [ 62.744525] [drm] Found VCE firmware Version: 57.6 Binary ID: 4 [ 62.744689] [drm] PSP loading VCE firmware [ 62.896804] [drm] reserve 0x400000 from 0x87fec00000 for PSP TMR [ 62.979421] amdgpu 0000:83:00.0: amdgpu: HDCP: optional hdcp ta ucode is not available [ 62.979427] amdgpu 0000:83:00.0: amdgpu: DTM: optional dtm ta ucode is not available [ 62.979430] amdgpu 0000:83:00.0: amdgpu: RAP: optional rap ta ucode is not available [ 62.979432] amdgpu 0000:83:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available [ 62.982386] [drm] Display Core initialized with v3.2.196! [ 62.984514] [drm] kiq ring mec 2 pipe 1 q 0 [ 63.026846] [drm] UVD and UVD ENC initialized successfully. [ 63.225760] [drm] VCE initialized successfully. [ 63.244442] amdgpu: [dbg_xgmi_hive_get] ref_count 2 [ 63.244448] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G OE 5.15.0-46-generic #49~20.04.1-Ubuntu [ 63.244454] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018 [ 63.244457] Workqueue: events work_for_cpu_fn [ 63.244471] Call Trace: [ 63.244474] <TASK> [ 63.244479] dump_stack_lvl+0x4a/0x63 [ 63.244493] dump_stack+0x10/0x16 [ 63.244501] amdgpu_get_xgmi_hive+0x217/0x2a0 [amdgpu] [ 63.245047] amdgpu_xgmi_add_device+0xcc/0x450 [amdgpu] [ 63.245463] ? amdgpu_ras_recovery_init+0x13d/0x2e0 [amdgpu] [ 63.245879] ? 
vce_v4_0_hw_init.cold+0xc/0x13 [amdgpu] [ 63.246466] amdgpu_device_init.cold+0x15bd/0x1fe3 [amdgpu] [ 63.247055] ? pci_bus_read_config_word+0x4a/0x70 [ 63.247064] ? do_pci_enable_device+0xdb/0x110 [ 63.247070] amdgpu_driver_load_kms+0x1a/0x120 [amdgpu] [ 63.247463] amdgpu_pci_probe+0x18d/0x3a0 [amdgpu] [ 63.247868] local_pci_probe+0x4b/0x90 [ 63.247876] work_for_cpu_fn+0x1a/0x30 [ 63.247881] process_one_work+0x22b/0x3d0 [ 63.247887] worker_thread+0x21d/0x3f0 [ 63.247893] ? process_one_work+0x3d0/0x3d0 [ 63.247898] kthread+0x12a/0x150 [ 63.247905] ? set_kthread_struct+0x50/0x50 [ 63.247910] ret_from_fork+0x22/0x30 [ 63.247922] </TASK> [ 63.248563] amdgpu 0000:83:00.0: amdgpu: XGMI: Add node 0, hive 0x25bbae7e3fd04cf4. [ 63.248569] amdgpu: [dbg_xgmi_hive_get] ref_count 3 [ 63.248572] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G OE 5.15.0-46-generic #49~20.04.1-Ubuntu [ 63.248578] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018 [ 63.248580] Workqueue: events work_for_cpu_fn [ 63.248587] Call Trace: [ 63.248588] <TASK> [ 63.248590] dump_stack_lvl+0x4a/0x63 [ 63.248598] dump_stack+0x10/0x16 [ 63.248604] amdgpu_get_xgmi_hive+0x285/0x2a0 [amdgpu] [ 63.249033] amdgpu_device_init.cold+0x15cd/0x1fe3 [amdgpu] [ 63.249621] ? pci_bus_read_config_word+0x4a/0x70 [ 63.249627] ? do_pci_enable_device+0xdb/0x110 [ 63.249632] amdgpu_driver_load_kms+0x1a/0x120 [amdgpu] [ 63.250022] amdgpu_pci_probe+0x18d/0x3a0 [amdgpu] [ 63.250410] local_pci_probe+0x4b/0x90 [ 63.250416] work_for_cpu_fn+0x1a/0x30 [ 63.250421] process_one_work+0x22b/0x3d0 [ 63.250428] worker_thread+0x21d/0x3f0 [ 63.250434] ? process_one_work+0x3d0/0x3d0 [ 63.250440] kthread+0x12a/0x150 [ 63.250445] ? 
set_kthread_struct+0x50/0x50 [ 63.250450] ret_from_fork+0x22/0x30 [ 63.250458] </TASK> [ 63.268869] kfd kfd: amdgpu: Allocated 3969056 bytes on gart [ 63.269180] amdgpu: sdma_bitmap: ffff [ 63.605188] memmap_init_zone_device initialised 8388608 pages in 132ms [ 63.605203] amdgpu: HMM registered 32752MB device memory [ 63.605244] amdgpu: [powerplay] [MemMclks]: memclk dpm not enabled! [ 63.605263] amdgpu: Virtual CRAT table created for GPU [ 63.605651] amdgpu: [powerplay] [MemMclks]: memclk dpm not enabled! [ 63.605659] amdgpu: Topology: Add dGPU node [0x66a1:0x1002] [ 63.605670] kfd kfd: amdgpu: added device 1002:66a1 [ 63.626300] amdgpu 0000:83:00.0: amdgpu: SE 4, SH per SE 1, CU per SH 16, active_cu_number 64 [ 63.626517] amdgpu 0000:83:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0 [ 63.626522] amdgpu 0000:83:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0 [ 63.626525] amdgpu 0000:83:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0 [ 63.626529] amdgpu 0000:83:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0 [ 63.626531] amdgpu 0000:83:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0 [ 63.626534] amdgpu 0000:83:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0 [ 63.626537] amdgpu 0000:83:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0 [ 63.626540] amdgpu 0000:83:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0 [ 63.626543] amdgpu 0000:83:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0 [ 63.626546] amdgpu 0000:83:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0 [ 63.626549] amdgpu 0000:83:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1 [ 63.626552] amdgpu 0000:83:00.0: amdgpu: ring page0 uses VM inv eng 1 on hub 1 [ 63.626555] amdgpu 0000:83:00.0: amdgpu: ring sdma1 uses VM inv eng 4 on hub 1 [ 63.626558] amdgpu 0000:83:00.0: amdgpu: ring page1 uses VM inv eng 5 on hub 1 [ 63.626561] amdgpu 0000:83:00.0: amdgpu: ring uvd_0 uses VM inv eng 6 on hub 1 [ 63.626563] amdgpu 0000:83:00.0: 
amdgpu: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1 [ 63.626566] amdgpu 0000:83:00.0: amdgpu: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1 [ 63.626569] amdgpu 0000:83:00.0: amdgpu: ring uvd_1 uses VM inv eng 9 on hub 1 [ 63.626572] amdgpu 0000:83:00.0: amdgpu: ring uvd_enc_1.0 uses VM inv eng 10 on hub 1 [ 63.626575] amdgpu 0000:83:00.0: amdgpu: ring uvd_enc_1.1 uses VM inv eng 11 on hub 1 [ 63.626577] amdgpu 0000:83:00.0: amdgpu: ring vce0 uses VM inv eng 12 on hub 1 [ 63.626580] amdgpu 0000:83:00.0: amdgpu: ring vce1 uses VM inv eng 13 on hub 1 [ 63.626583] amdgpu 0000:83:00.0: amdgpu: ring vce2 uses VM inv eng 14 on hub 1 [ 63.636996] amdgpu: Detected AMDGPU DF Counters. # of Counters = 8. [ 63.637046] amdgpu: Detected AMDGPU 2 Perf Events. [ 63.637428] [drm] Initialized amdgpu 3.48.0 20150101 for 0000:83:00.0 on minor 1 [ 63.637937] amdgpu 0000:86:00.0: enabling device (0100 -> 0102) [ 63.638043] [drm] initializing kernel modesetting (VEGA20 0x1002:0x66A1 0x1002:0x0834 0x00). [ 63.638090] [drm] register mmio base: 0xFBB00000 [ 63.638092] [drm] register mmio size: 524288 [ 63.638261] [drm] add ip block number 0 <soc15_common> [ 63.638263] [drm] add ip block number 1 <gmc_v9_0> [ 63.638265] [drm] add ip block number 2 <vega20_ih> [ 63.638266] [drm] add ip block number 3 <psp> [ 63.638267] [drm] add ip block number 4 <powerplay> [ 63.638269] [drm] add ip block number 5 <dm> [ 63.638271] [drm] add ip block number 6 <gfx_v9_0> [ 63.638272] [drm] add ip block number 7 <sdma_v4_0> [ 63.638273] [drm] add ip block number 8 <uvd_v7_0> [ 63.638275] [drm] add ip block number 9 <vce_v4_0> [ 63.675838] amdgpu 0000:86:00.0: amdgpu: Fetched VBIOS from ROM BAR [ 63.675842] amdgpu: ATOM BIOS: 113-D1630200-112 [ 63.675867] [drm] UVD(0) is enabled in VM mode [ 63.675868] [drm] UVD(1) is enabled in VM mode [ 63.675869] [drm] UVD(0) ENC is enabled in VM mode [ 63.675870] [drm] UVD(1) ENC is enabled in VM mode [ 63.675871] [drm] VCE enabled in VM mode [ 63.675873] amdgpu 0000:86:00.0: 
amdgpu: Trusted Memory Zone (TMZ) feature not supported [ 63.675899] [drm] GPU posting now... [ 63.676276] amdgpu 0000:86:00.0: amdgpu: MEM ECC is active. [ 63.676277] amdgpu 0000:86:00.0: amdgpu: SRAM ECC is active. [ 63.676286] amdgpu 0000:86:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[7fff] ras_mask[7fff] [ 63.676297] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit [ 63.676304] amdgpu 0000:86:00.0: amdgpu: VRAM: 32752M 0x0000008800000000 - 0x0000008FFEFFFFFF (32752M used) [ 63.676307] amdgpu 0000:86:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF [ 63.676310] amdgpu 0000:86:00.0: amdgpu: AGP: 267845632M 0x0000009000000000 - 0x0000FFFFFFFFFFFF [ 63.676321] [drm] Detected VRAM RAM=32752M, BAR=32768M [ 63.676322] [drm] RAM width 4096bits HBM [ 63.676363] [drm] amdgpu: 32752M of VRAM memory ready [ 63.676365] [drm] amdgpu: 40203M of GTT memory ready. [ 63.676388] [drm] GART: num cpu pages 131072, num gpu pages 131072 [ 63.676481] [drm] PCIE GART of 512M enabled. 
[ 63.676482] [drm] PTB located at 0x0000008800000000 [ 63.676730] amdgpu 0000:86:00.0: amdgpu: PSP runtime database doesn't exist [ 63.676733] amdgpu 0000:86:00.0: amdgpu: PSP runtime database doesn't exist [ 63.677088] amdgpu: [powerplay] hwmgr_sw_init smu backed is vega20_smu [ 63.678862] [drm] Found UVD firmware ENC: 1.2 DEC: .43 Family ID: 19 [ 63.678918] [drm] PSP loading UVD firmware [ 63.679487] [drm] Found VCE firmware Version: 57.6 Binary ID: 4 [ 63.679619] [drm] PSP loading VCE firmware [ 63.831730] [drm] reserve 0x400000 from 0x8ffec00000 for PSP TMR [ 63.914508] amdgpu 0000:86:00.0: amdgpu: HDCP: optional hdcp ta ucode is not available [ 63.914513] amdgpu 0000:86:00.0: amdgpu: DTM: optional dtm ta ucode is not available [ 63.914516] amdgpu 0000:86:00.0: amdgpu: RAP: optional rap ta ucode is not available [ 63.914518] amdgpu 0000:86:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available [ 63.917458] [drm] Display Core initialized with v3.2.196! [ 63.919616] [drm] kiq ring mec 2 pipe 1 q 0 [ 63.961950] [drm] UVD and UVD ENC initialized successfully. [ 64.160863] [drm] VCE initialized successfully. [ 64.179285] amdgpu: [dbg_xgmi_hive_get] ref_count 4 [ 64.179291] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G OE 5.15.0-46-generic #49~20.04.1-Ubuntu [ 64.179297] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018 [ 64.179299] Workqueue: events work_for_cpu_fn [ 64.179311] Call Trace: [ 64.179315] <TASK> [ 64.179320] dump_stack_lvl+0x4a/0x63 [ 64.179331] dump_stack+0x10/0x16 [ 64.179340] amdgpu_get_xgmi_hive+0x217/0x2a0 [amdgpu] [ 64.179904] amdgpu_xgmi_add_device+0xcc/0x450 [amdgpu] [ 64.180318] ? amdgpu_ras_recovery_init+0x13d/0x2e0 [amdgpu] [ 64.180733] ? vce_v4_0_hw_init.cold+0xc/0x13 [amdgpu] [ 64.181321] amdgpu_device_init.cold+0x15bd/0x1fe3 [amdgpu] [ 64.181909] ? pci_bus_read_config_word+0x4a/0x70 [ 64.181917] ? 
do_pci_enable_device+0xdb/0x110 [ 64.181923] amdgpu_driver_load_kms+0x1a/0x120 [amdgpu] [ 64.182315] amdgpu_pci_probe+0x18d/0x3a0 [amdgpu] [ 64.182703] local_pci_probe+0x4b/0x90 [ 64.182710] work_for_cpu_fn+0x1a/0x30 [ 64.182715] process_one_work+0x22b/0x3d0 [ 64.182722] worker_thread+0x21d/0x3f0 [ 64.182728] ? process_one_work+0x3d0/0x3d0 [ 64.182734] kthread+0x12a/0x150 [ 64.182740] ? set_kthread_struct+0x50/0x50 [ 64.182745] ret_from_fork+0x22/0x30 [ 64.182756] </TASK> [ 64.184561] amdgpu 0000:86:00.0: amdgpu: XGMI: Add node 1, hive 0x25bbae7e3fd04cf4. [ 64.184568] amdgpu: [dbg_xgmi_hive_get] ref_count 5 [ 64.184571] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G OE 5.15.0-46-generic #49~20.04.1-Ubuntu [ 64.184576] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018 [ 64.184578] Workqueue: events work_for_cpu_fn [ 64.184585] Call Trace: [ 64.184587] <TASK> [ 64.184589] dump_stack_lvl+0x4a/0x63 [ 64.184596] dump_stack+0x10/0x16 [ 64.184602] amdgpu_get_xgmi_hive+0x285/0x2a0 [amdgpu] [ 64.185041] amdgpu_device_init.cold+0x15cd/0x1fe3 [amdgpu] [ 64.185624] ? pci_bus_read_config_word+0x4a/0x70 [ 64.185631] ? do_pci_enable_device+0xdb/0x110 [ 64.185636] amdgpu_driver_load_kms+0x1a/0x120 [amdgpu] [ 64.186027] amdgpu_pci_probe+0x18d/0x3a0 [amdgpu] [ 64.186416] local_pci_probe+0x4b/0x90 [ 64.186422] work_for_cpu_fn+0x1a/0x30 [ 64.186428] process_one_work+0x22b/0x3d0 [ 64.186434] worker_thread+0x21d/0x3f0 [ 64.186439] ? process_one_work+0x3d0/0x3d0 [ 64.186445] kthread+0x12a/0x150 [ 64.186450] ? set_kthread_struct+0x50/0x50 [ 64.186455] ret_from_fork+0x22/0x30 [ 64.186464] </TASK> [ 64.206119] kfd kfd: amdgpu: Allocated 3969056 bytes on gart [ 64.206433] amdgpu: sdma_bitmap: ffff [ 64.552064] memmap_init_zone_device initialised 8388608 pages in 132ms [ 64.552080] amdgpu: HMM registered 32752MB device memory [ 64.552116] amdgpu: [powerplay] [MemMclks]: memclk dpm not enabled! 
[ 64.552138] amdgpu: Virtual CRAT table created for GPU [ 64.552978] amdgpu: [powerplay] [MemMclks]: memclk dpm not enabled! [ 64.552988] amdgpu: Topology: Add dGPU node [0x66a1:0x1002] [ 64.552999] kfd kfd: amdgpu: added device 1002:66a1 [ 64.570314] amdgpu 0000:86:00.0: amdgpu: SE 4, SH per SE 1, CU per SH 16, active_cu_number 64 [ 64.570527] amdgpu 0000:86:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0 [ 64.570531] amdgpu 0000:86:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0 [ 64.570535] amdgpu 0000:86:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0 [ 64.570538] amdgpu 0000:86:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0 [ 64.570541] amdgpu 0000:86:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0 [ 64.570544] amdgpu 0000:86:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0 [ 64.570547] amdgpu 0000:86:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0 [ 64.570550] amdgpu 0000:86:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0 [ 64.570552] amdgpu 0000:86:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0 [ 64.570556] amdgpu 0000:86:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0 [ 64.570559] amdgpu 0000:86:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1 [ 64.570562] amdgpu 0000:86:00.0: amdgpu: ring page0 uses VM inv eng 1 on hub 1 [ 64.570565] amdgpu 0000:86:00.0: amdgpu: ring sdma1 uses VM inv eng 4 on hub 1 [ 64.570567] amdgpu 0000:86:00.0: amdgpu: ring page1 uses VM inv eng 5 on hub 1 [ 64.570570] amdgpu 0000:86:00.0: amdgpu: ring uvd_0 uses VM inv eng 6 on hub 1 [ 64.570573] amdgpu 0000:86:00.0: amdgpu: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1 [ 64.570576] amdgpu 0000:86:00.0: amdgpu: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1 [ 64.570579] amdgpu 0000:86:00.0: amdgpu: ring uvd_1 uses VM inv eng 9 on hub 1 [ 64.570581] amdgpu 0000:86:00.0: amdgpu: ring uvd_enc_1.0 uses VM inv eng 10 on hub 1 [ 64.570584] amdgpu 0000:86:00.0: amdgpu: ring uvd_enc_1.1 uses VM inv eng 11 on 
hub 1 [ 64.570587] amdgpu 0000:86:00.0: amdgpu: ring vce0 uses VM inv eng 12 on hub 1 [ 64.570589] amdgpu 0000:86:00.0: amdgpu: ring vce1 uses VM inv eng 13 on hub 1 [ 64.570592] amdgpu 0000:86:00.0: amdgpu: ring vce2 uses VM inv eng 14 on hub 1 [ 64.581070] amdgpu: [dbg_xgmi_hive_get] ref_count 6 [ 64.581075] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G OE 5.15.0-46-generic #49~20.04.1-Ubuntu [ 64.581079] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018 [ 64.581081] Workqueue: events work_for_cpu_fn [ 64.581089] Call Trace: [ 64.581091] <TASK> [ 64.581094] dump_stack_lvl+0x4a/0x63 [ 64.581103] dump_stack+0x10/0x16 [ 64.581109] amdgpu_get_xgmi_hive+0x285/0x2a0 [amdgpu] [ 64.581489] amdgpu_xgmi_set_pstate+0xe/0x30 [amdgpu] [ 64.581723] amdgpu_device_ip_late_init+0x2dc/0x380 [amdgpu] [ 64.581943] amdgpu_device_init.cold+0x1805/0x1fe3 [amdgpu] [ 64.582288] ? pci_bus_read_config_word+0x4a/0x70 [ 64.582295] ? do_pci_enable_device+0xdb/0x110 [ 64.582298] amdgpu_driver_load_kms+0x1a/0x120 [amdgpu] [ 64.582520] amdgpu_pci_probe+0x18d/0x3a0 [amdgpu] [ 64.582738] local_pci_probe+0x4b/0x90 [ 64.582743] work_for_cpu_fn+0x1a/0x30 [ 64.582746] process_one_work+0x22b/0x3d0 [ 64.582750] worker_thread+0x21d/0x3f0 [ 64.582753] ? process_one_work+0x3d0/0x3d0 [ 64.582756] kthread+0x12a/0x150 [ 64.582761] ? 
set_kthread_struct+0x50/0x50 [ 64.582764] ret_from_fork+0x22/0x30 [ 64.582772] </TASK> [ 64.582774] amdgpu: [dbg_xgmi_hive_put] ref_count 5 [ 64.582775] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G OE 5.15.0-46-generic #49~20.04.1-Ubuntu [ 64.582778] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018 [ 64.582779] Workqueue: events work_for_cpu_fn [ 64.582782] Call Trace: [ 64.582783] <TASK> [ 64.582784] dump_stack_lvl+0x4a/0x63 [ 64.582789] dump_stack+0x10/0x16 [ 64.582792] amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu] [ 64.583028] amdgpu_xgmi_set_pstate+0x1b/0x30 [amdgpu] [ 64.583262] amdgpu_device_ip_late_init+0x2dc/0x380 [amdgpu] [ 64.583482] amdgpu_device_init.cold+0x1805/0x1fe3 [amdgpu] [ 64.583833] ? pci_bus_read_config_word+0x4a/0x70 [ 64.583836] ? do_pci_enable_device+0xdb/0x110 [ 64.583840] amdgpu_driver_load_kms+0x1a/0x120 [amdgpu] [ 64.584072] amdgpu_pci_probe+0x18d/0x3a0 [amdgpu] [ 64.584304] local_pci_probe+0x4b/0x90 [ 64.584307] work_for_cpu_fn+0x1a/0x30 [ 64.584311] process_one_work+0x22b/0x3d0 [ 64.584314] worker_thread+0x21d/0x3f0 [ 64.584318] ? process_one_work+0x3d0/0x3d0 [ 64.584321] kthread+0x12a/0x150 [ 64.584324] ? set_kthread_struct+0x50/0x50 [ 64.584327] ret_from_fork+0x22/0x30 [ 64.584333] </TASK> [ 64.584342] amdgpu: [dbg_xgmi_hive_get] ref_count 6 [ 64.584344] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G OE 5.15.0-46-generic #49~20.04.1-Ubuntu [ 64.584347] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018 [ 64.584348] Workqueue: events work_for_cpu_fn [ 64.584352] Call Trace: [ 64.584353] <TASK> [ 64.584354] dump_stack_lvl+0x4a/0x63 [ 64.584358] dump_stack+0x10/0x16 [ 64.584362] amdgpu_get_xgmi_hive+0x285/0x2a0 [amdgpu] [ 64.584610] amdgpu_xgmi_set_pstate+0xe/0x30 [amdgpu] [ 64.584856] amdgpu_device_ip_late_init+0x2dc/0x380 [amdgpu] [ 64.585086] amdgpu_device_init.cold+0x1805/0x1fe3 [amdgpu] [ 64.585437] ? pci_bus_read_config_word+0x4a/0x70 [ 64.585440] ? 
do_pci_enable_device+0xdb/0x110 [ 64.585443] amdgpu_driver_load_kms+0x1a/0x120 [amdgpu] [ 64.585679] amdgpu_pci_probe+0x18d/0x3a0 [amdgpu] [ 64.585922] local_pci_probe+0x4b/0x90 [ 64.585926] work_for_cpu_fn+0x1a/0x30 [ 64.585929] process_one_work+0x22b/0x3d0 [ 64.585932] worker_thread+0x21d/0x3f0 [ 64.585936] ? process_one_work+0x3d0/0x3d0 [ 64.585939] kthread+0x12a/0x150 [ 64.585942] ? set_kthread_struct+0x50/0x50 [ 64.585945] ret_from_fork+0x22/0x30 [ 64.585950] </TASK> [ 64.585951] amdgpu: [dbg_xgmi_hive_put] ref_count 5 [ 64.585953] CPU: 10 PID: 397 Comm: kworker/10:2 Tainted: G OE 5.15.0-46-generic #49~20.04.1-Ubuntu [ 64.585956] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018 [ 64.585957] Workqueue: events work_for_cpu_fn [ 64.585960] Call Trace: [ 64.585961] <TASK> [ 64.585963] dump_stack_lvl+0x4a/0x63 [ 64.585967] dump_stack+0x10/0x16 [ 64.585970] amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu] [ 64.586213] amdgpu_xgmi_set_pstate+0x1b/0x30 [amdgpu] [ 64.586458] amdgpu_device_ip_late_init+0x2dc/0x380 [amdgpu] [ 64.586688] amdgpu_device_init.cold+0x1805/0x1fe3 [amdgpu] [ 64.587037] ? pci_bus_read_config_word+0x4a/0x70 [ 64.587040] ? do_pci_enable_device+0xdb/0x110 [ 64.587043] amdgpu_driver_load_kms+0x1a/0x120 [amdgpu] [ 64.587277] amdgpu_pci_probe+0x18d/0x3a0 [amdgpu] [ 64.587509] local_pci_probe+0x4b/0x90 [ 64.587512] work_for_cpu_fn+0x1a/0x30 [ 64.587515] process_one_work+0x22b/0x3d0 [ 64.587519] worker_thread+0x21d/0x3f0 [ 64.587523] ? process_one_work+0x3d0/0x3d0 [ 64.587526] kthread+0x12a/0x150 [ 64.587529] ? set_kthread_struct+0x50/0x50 [ 64.587532] ret_from_fork+0x22/0x30 [ 64.587537] </TASK> [ 64.587619] amdgpu: Detected AMDGPU DF Counters. # of Counters = 8. [ 64.587663] amdgpu: Detected AMDGPU 2 Perf Events. [ 64.588081] [drm] Initialized amdgpu 3.48.0 20150101 for 0000:86:00.0 on minor 2 Then driver unload (reference stuck at 2): [ 110.117018] amdgpu 0000:86:00.0: amdgpu: amdgpu: finishing device. 
[ 110.131638] [drm] free PSP TMR buffer [ 110.420529] amdgpu: [dbg_xgmi_hive_put] ref_count 4 [ 110.420537] CPU: 27 PID: 1748 Comm: modprobe Tainted: G OE 5.15.0-46-generic #49~20.04.1-Ubuntu [ 110.420545] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018 [ 110.420548] Call Trace: [ 110.420551] <TASK> [ 110.420556] dump_stack_lvl+0x4a/0x63 [ 110.420569] dump_stack+0x10/0x16 [ 110.420578] amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu] [ 110.421001] amdgpu_xgmi_remove_device+0x11d/0x1c0 [amdgpu] [ 110.421380] amdgpu_device_fini_sw+0x63/0x4c0 [amdgpu] [ 110.421724] amdgpu_driver_release_kms+0x16/0x30 [amdgpu] [ 110.422070] drm_dev_release+0x28/0x50 [drm] [ 110.422145] devm_drm_dev_init_release+0x38/0x60 [drm] [ 110.422190] devm_action_release+0x15/0x20 [ 110.422198] release_nodes+0x40/0xb0 [ 110.422205] devres_release_all+0x9e/0xe0 [ 110.422212] device_release_driver_internal+0x117/0x1f0 [ 110.422218] driver_detach+0x4c/0xa0 [ 110.422222] bus_remove_driver+0x6c/0xf0 [ 110.422227] driver_unregister+0x31/0x50 [ 110.422231] pci_unregister_driver+0x40/0x90 [ 110.422238] amdgpu_exit+0x15/0x446 [amdgpu] [ 110.422791] __x64_sys_delete_module+0x14e/0x260 [ 110.422801] ? do_syscall_64+0x69/0xc0 [ 110.422809] ? __x64_sys_read+0x1a/0x20 [ 110.422817] ? do_syscall_64+0x69/0xc0 [ 110.422821] ? ksys_read+0x67/0xf0 [ 110.422825] do_syscall_64+0x5c/0xc0 [ 110.422830] ? __x64_sys_read+0x1a/0x20 [ 110.422834] ? do_syscall_64+0x69/0xc0 [ 110.422839] ? syscall_exit_to_user_mode+0x27/0x50 [ 110.422846] ? __x64_sys_openat+0x20/0x30 [ 110.422853] ? do_syscall_64+0x69/0xc0 [ 110.422857] ? do_syscall_64+0x69/0xc0 [ 110.422862] ? irqentry_exit+0x1d/0x30 [ 110.422868] ? 
exc_page_fault+0x89/0x170 [ 110.422874] entry_SYSCALL_64_after_hwframe+0x61/0xcb [ 110.422885] RIP: 0033:0x7f1576682a6b [ 110.422892] Code: 73 01 c3 48 8b 0d 25 c4 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f5 c3 0c 00 f7 d8 64 89 01 48 [ 110.422897] RSP: 002b:00007ffcb96e0bf8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 [ 110.422904] RAX: ffffffffffffffda RBX: 000056347ba57550 RCX: 00007f1576682a6b [ 110.422908] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000056347ba575b8 [ 110.422911] RBP: 000056347ba57550 R08: 0000000000000000 R09: 0000000000000000 [ 110.422913] R10: 00007f15766feac0 R11: 0000000000000206 R12: 000056347ba575b8 [ 110.422916] R13: 0000000000000000 R14: 000056347ba575b8 R15: 000056347ba57550 [ 110.422921] </TASK> [ 110.425941] [drm] amdgpu: ttm finalized [ 110.489186] amdgpu 0000:83:00.0: amdgpu: amdgpu: finishing device. [ 110.504025] [drm] free PSP TMR buffer [ 110.762272] amdgpu: [dbg_xgmi_hive_put] ref_count 3 [ 110.762280] CPU: 27 PID: 1748 Comm: modprobe Tainted: G OE 5.15.0-46-generic #49~20.04.1-Ubuntu [ 110.762288] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018 [ 110.762290] Call Trace: [ 110.762294] <TASK> [ 110.762298] dump_stack_lvl+0x4a/0x63 [ 110.762313] dump_stack+0x10/0x16 [ 110.762319] amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu] [ 110.762663] amdgpu_xgmi_remove_device+0x11d/0x1c0 [amdgpu] [ 110.762965] amdgpu_device_fini_sw+0x63/0x4c0 [amdgpu] [ 110.763231] amdgpu_driver_release_kms+0x16/0x30 [amdgpu] [ 110.763519] drm_dev_release+0x28/0x50 [drm] [ 110.763569] devm_drm_dev_init_release+0x38/0x60 [drm] [ 110.763609] devm_action_release+0x15/0x20 [ 110.763617] release_nodes+0x40/0xb0 [ 110.763624] devres_release_all+0x9e/0xe0 [ 110.763631] device_release_driver_internal+0x117/0x1f0 [ 110.763636] driver_detach+0x4c/0xa0 [ 110.763640] bus_remove_driver+0x6c/0xf0 [ 110.763646] driver_unregister+0x31/0x50 [ 110.763650] 
pci_unregister_driver+0x40/0x90 [ 110.763657] amdgpu_exit+0x15/0x446 [amdgpu] [ 110.764153] __x64_sys_delete_module+0x14e/0x260 [ 110.764164] ? do_syscall_64+0x69/0xc0 [ 110.764172] ? __x64_sys_read+0x1a/0x20 [ 110.764180] ? do_syscall_64+0x69/0xc0 [ 110.764184] ? ksys_read+0x67/0xf0 [ 110.764189] do_syscall_64+0x5c/0xc0 [ 110.764193] ? __x64_sys_read+0x1a/0x20 [ 110.764197] ? do_syscall_64+0x69/0xc0 [ 110.764202] ? syscall_exit_to_user_mode+0x27/0x50 [ 110.764209] ? __x64_sys_openat+0x20/0x30 [ 110.764217] ? do_syscall_64+0x69/0xc0 [ 110.764221] ? do_syscall_64+0x69/0xc0 [ 110.764226] ? irqentry_exit+0x1d/0x30 [ 110.764232] ? exc_page_fault+0x89/0x170 [ 110.764238] entry_SYSCALL_64_after_hwframe+0x61/0xcb [ 110.764248] RIP: 0033:0x7f1576682a6b [ 110.764255] Code: 73 01 c3 48 8b 0d 25 c4 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f5 c3 0c 00 f7 d8 64 89 01 48 [ 110.764260] RSP: 002b:00007ffcb96e0bf8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 [ 110.764267] RAX: ffffffffffffffda RBX: 000056347ba57550 RCX: 00007f1576682a6b [ 110.764270] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000056347ba575b8 [ 110.764273] RBP: 000056347ba57550 R08: 0000000000000000 R09: 0000000000000000 [ 110.764275] R10: 00007f15766feac0 R11: 0000000000000206 R12: 000056347ba575b8 [ 110.764278] R13: 0000000000000000 R14: 000056347ba575b8 R15: 000056347ba57550 [ 110.764283] </TASK> [ 110.764326] amdgpu: [dbg_xgmi_hive_put] ref_count 2 [ 110.764329] CPU: 27 PID: 1748 Comm: modprobe Tainted: G OE 5.15.0-46-generic #49~20.04.1-Ubuntu [ 110.764334] Hardware name: Supermicro X10DRi/X10DRi-T, BIOS 3.1 09/14/2018 [ 110.764336] Call Trace: [ 110.764337] <TASK> [ 110.764339] dump_stack_lvl+0x4a/0x63 [ 110.764347] dump_stack+0x10/0x16 [ 110.764354] amdgpu_put_xgmi_hive.part.0+0x26/0x30 [amdgpu] [ 110.764624] amdgpu_xgmi_remove_device+0x1ad/0x1c0 [amdgpu] [ 110.764791] amdgpu_device_fini_sw+0x63/0x4c0 
[amdgpu] [ 110.764937] amdgpu_driver_release_kms+0x16/0x30 [amdgpu] [ 110.765085] drm_dev_release+0x28/0x50 [drm] [ 110.765108] devm_drm_dev_init_release+0x38/0x60 [drm] [ 110.765130] devm_action_release+0x15/0x20 [ 110.765134] release_nodes+0x40/0xb0 [ 110.765137] devres_release_all+0x9e/0xe0 [ 110.765141] device_release_driver_internal+0x117/0x1f0 [ 110.765144] driver_detach+0x4c/0xa0 [ 110.765146] bus_remove_driver+0x6c/0xf0 [ 110.765148] driver_unregister+0x31/0x50 [ 110.765150] pci_unregister_driver+0x40/0x90 [ 110.765154] amdgpu_exit+0x15/0x446 [amdgpu] [ 110.765434] __x64_sys_delete_module+0x14e/0x260 [ 110.765438] ? do_syscall_64+0x69/0xc0 [ 110.765441] ? __x64_sys_read+0x1a/0x20 [ 110.765444] ? do_syscall_64+0x69/0xc0 [ 110.765446] ? ksys_read+0x67/0xf0 [ 110.765449] do_syscall_64+0x5c/0xc0 [ 110.765451] ? __x64_sys_read+0x1a/0x20 [ 110.765454] ? do_syscall_64+0x69/0xc0 [ 110.765456] ? syscall_exit_to_user_mode+0x27/0x50 [ 110.765460] ? __x64_sys_openat+0x20/0x30 [ 110.765464] ? do_syscall_64+0x69/0xc0 [ 110.765466] ? do_syscall_64+0x69/0xc0 [ 110.765469] ? irqentry_exit+0x1d/0x30 [ 110.765472] ? 
exc_page_fault+0x89/0x170
[ 110.765476] entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 110.765480] RIP: 0033:0x7f1576682a6b
[ 110.765482] Code: 73 01 c3 48 8b 0d 25 c4 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 b0 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f5 c3 0c 00 f7 d8 64 89 01 48
[ 110.765485] RSP: 002b:00007ffcb96e0bf8 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
[ 110.765488] RAX: ffffffffffffffda RBX: 000056347ba57550 RCX: 00007f1576682a6b
[ 110.765489] RDX: 0000000000000000 RSI: 0000000000000800 RDI: 000056347ba575b8
[ 110.765491] RBP: 000056347ba57550 R08: 0000000000000000 R09: 0000000000000000
[ 110.765492] R10: 00007f15766feac0 R11: 0000000000000206 R12: 000056347ba575b8
[ 110.765494] R13: 0000000000000000 R14: 000056347ba575b8 R15: 000056347ba57550
[ 110.765496] </TASK>
[ 110.768091] [drm] amdgpu: ttm finalized

> -----Original Message-----
> From: Grodzovsky, Andrey <Andrey.Grodzovsky@xxxxxxx>
> Sent: August 11, 2022 12:43 PM
> To: Kim, Jonathan <Jonathan.Kim@xxxxxxx>; Kuehling, Felix <Felix.Kuehling@xxxxxxx>; amd-gfx@xxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [PATCH] drm/amdgpu: fix reset domain xgmi hive info reference leak
>
>
> On 2022-08-11 11:34, Kim, Jonathan wrote:
> > [Public]
> >
> >> -----Original Message-----
> >> From: Kuehling, Felix <Felix.Kuehling@xxxxxxx>
> >> Sent: August 11, 2022 11:19 AM
> >> To: amd-gfx@xxxxxxxxxxxxxxxxxxxxx; Kim, Jonathan <Jonathan.Kim@xxxxxxx>
> >> Subject: Re: [PATCH] drm/amdgpu: fix reset domain xgmi hive info reference leak
> >>
> >> On 2022-08-11 at 09:42, Jonathan Kim wrote:
> >>> When an xgmi node is added to the hive, it takes another hive
> >>> reference for its reset domain.
> >>>
> >>> This extra reference was not dropped on device removal from the
> >>> hive so drop it.
> >>>
> >>> Signed-off-by: Jonathan Kim <jonathan.kim@xxxxxxx>
> >>> ---
> >>>   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c | 3 +++
> >>>   1 file changed, 3 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> >>> index 1b108d03e785..560bf1c98f08 100644
> >>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> >>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c
> >>> @@ -731,6 +731,9 @@ int amdgpu_xgmi_remove_device(struct amdgpu_device *adev)
> >>>     mutex_unlock(&hive->hive_lock);
> >>>
> >>>     amdgpu_put_xgmi_hive(hive);
> >>> +   /* device is removed from the hive so remove its reset domain reference */
> >>> +   if (adev->reset_domain && adev->reset_domain == hive->reset_domain)
> >>> +           amdgpu_put_xgmi_hive(hive);
> >> This is some messed up reference counting. If you need an extra
> >> reference from the reset_domain to the hive, that should be owned by the
> >> reset_domain and dropped when the reset_domain is destroyed. And it's
> >> only one reference for the reset_domain, not one reference per adev in
> >> the reset_domain.
> > Cc'ing Andrey.
> >
> > What you're saying seems to make more sense to me, but what I got from an offline conversation with Andrey
> > was that the reset domain reference per device was intentional.
> > Maybe Andrey can comment here.
> >
> >> What you're doing here looks like every adev that's in a reset_domain of
> >> its hive has two references to the hive. And if you're dropping the
> >> extra reference here, it still leaves the reset_domain with a dangling
> >> pointer to a hive that may no longer exist. So this extra reference is
> >> kind of pointless.
>
>
> reset_domain doesn't have any references to the hive, the hive has a
> reference to reset_domain
>
>
> > Yes. Currently one reference is fetched from the device's lifetime on the hive and the other is from the
> > per-device reset domain.
> >
> > Snippet from amdgpu_device_ip_init:
> >       /**
> >        * In case of XGMI grab extra reference for reset domain for this device
> >        */
> >       if (adev->gmc.xgmi.num_physical_nodes > 1) {
> >               if (amdgpu_xgmi_add_device(adev) == 0) { <- [JK] reference is fetched here
>
>
> amdgpu_xgmi_add_device calls amdgpu_get_xgmi_hive, and only on the first
> call to amdgpu_get_xgmi_hive, when the hive is actually allocated and
> initialized, do we proceed
> to creating the reset domain, either from scratch (first creation of the
> hive) or by taking a reference from adev (see [1])
>
> [1] - https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c#L394
>
>
> >                       struct amdgpu_hive_info *hive = amdgpu_get_xgmi_hive(adev); <- [JK] then here again
>
>
> So here I don't see how an extra reference to reset_domain is taken if
> amdgpu_get_xgmi_hive returns early since the hive is already created and
> exists in the global hive container?
>
> Jonathan - can you please show the exact flow by which the refcount leak on
> reset_domain is happening?
>
> Andrey
>
>
> >
> >                       if (!hive->reset_domain ||
> >                           !amdgpu_reset_get_reset_domain(hive->reset_domain)) {
> >                               r = -ENOENT;
> >                               goto init_failed;
> >                       }
> >
> >                       /* Drop the early temporary reset domain we created for device */
> >                       amdgpu_reset_put_reset_domain(adev->reset_domain);
> >                       adev->reset_domain = hive->reset_domain;
> >               }
> >       }
> >
> > One of these never gets dropped so a leak happens.
> > So either the extra reference has to be dropped on device removal from the hive or, from what you've mentioned,
> > the reset_domain reference fetch should be fixed to grab at the hive/reset_domain level.
> >
> > Thanks,
> >
> > Jon
> >
> >> Regards,
> >>     Felix
> >>
> >>
> >>>     adev->hive = NULL;
> >>>
> >>>     if (atomic_dec_return(&hive->number_devices) == 0) {
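
The get/put accounting visible in the trace can be sketched as a small userspace C model. This is only an illustration under stated assumptions, not driver code: struct hive_model, model_load, model_unload and leaked_refs are hypothetical names that do not exist in amdgpu, the hive refcount is assumed to start at 1 on creation (consistent with the first get in the log printing 2), and the transient get/put pairs from amdgpu_xgmi_set_pstate are left out because they cancel.

```c
#include <assert.h>

/* Hypothetical userspace model; none of these names exist in amdgpu. */
struct hive_model {
	int refcount;	/* stands in for the hive kobject refcount */
	int devices;	/* stands in for hive->number_devices */
};

/* Driver load: each device takes two hive references. */
static void model_load(struct hive_model *h, int num_gpus)
{
	h->refcount = 1;	/* assumed initial reference from hive creation */
	h->devices = 0;
	for (int i = 0; i < num_gpus; i++) {
		h->refcount++;	/* get via amdgpu_xgmi_add_device */
		h->refcount++;	/* get in the reset-domain block at device init */
		h->devices++;
	}
}

/* Driver unload: one put per device, plus one when the hive empties. */
static void model_unload(struct hive_model *h, int drop_reset_domain_ref)
{
	int n = h->devices;

	for (int i = 0; i < n; i++) {
		h->refcount--;		/* put in amdgpu_xgmi_remove_device */
		if (drop_reset_domain_ref)
			h->refcount--;	/* extra put proposed in the patch */
		if (--h->devices == 0)
			h->refcount--;	/* final put once the last device is gone */
	}
}

/* Returns the hive references still held after a load/unload cycle. */
int leaked_refs(int num_gpus, int drop_reset_domain_ref)
{
	struct hive_model h;

	assert(num_gpus > 0);
	model_load(&h, num_gpus);
	model_unload(&h, drop_reset_domain_ref);
	return h.refcount;
}
```

Under these assumptions leaked_refs(2, 0) evaluates to 2 and leaked_refs(8, 0) to 8, matching the counts reported above, while leaked_refs(2, 1) evaluates to 0. The model says nothing about which object should own the extra reference, which is the open question in the thread.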