[PATCH 22/22] drm/amdgpu: fix reservation obj shared count bug

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



I know in theoretically this issue should be equal for sriov or not
But since the gpu performance under SRIOV is only one quarter compared with bare-metal (assume 4VF)
And the CPU's performance is also limited, so there are chance that the fence is signaled not as quick as bare-metal
And lead to the scenario that four different context fence are living in resv,

If you do it on bare-metal maybe you need some tricks, at least on my sriov platform I can duplicate this BUG()
With 4 repeating loop on VK_EXAMPLE every time 

/Monk

-----Original Message-----
From: Koenig, Christian 
Sent: 2018å¹´2æ??27æ?¥ 16:34
To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org; Chris Wilson <chris at chris-wilson.co.uk>
Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared count bug

Am 27.02.2018 um 04:36 schrieb Liu, Monk:
>> Well then there is something else broken and I actually think that this is the root cause here.
> What if this VM submits four different jobs on four different rings, 
> e.g. gfx/compute1/compute2/vce So the fences for those jobs are from 
> different context, and now you get a mapping on this VM So the shared 
> count is now 4

Yeah, but each VM submission reserves a shared slot before adding the fence. So even if you have 4 submissions we still have a guaranteed free slot.

> I ran VK_example and it always can hit this reservation BUG(), just 
> repeat running it 4 loops and The BUG() hit, but I did it under SRIOV 
> case so the CPU&GPU performance is different with bare-metal
>
> Anyway do you think above scenario exist ?

Well you somehow run into this problem, I just don't understand how.

It shouldn't matter if you run on bare-metal or SRIOV, the VM handling is the same as far as I know.

Going to give it a try today to create some unit tests exercising the problem.

Christian.

>
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: 2018å¹´2æ??26æ?¥ 19:19
> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org; Chris 
> Wilson <chris at chris-wilson.co.uk>
> Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared 
> count bug
>
>> What if all those context are not equal ? and accumulated finally to 4 ?
> Well then there is something else broken and I actually think that this is the root cause here.
>
>> You can try VK_EXAMPLE test yourself, see if you can hit that BUG() 
>> in reservation.c file,
> Could you generate a simpler unit test to trigger the problem?
>
>
> Thanks,
> Christian.
>
> Am 26.02.2018 um 11:57 schrieb Liu, Monk:
>>> So we call reserve_shared only once and then call reservation_object_add_shared_fence() multiple times, always with the same context.
>>> should replace the fence when it has the same context as a previously added fence.
>> What if all those context are not equal ? and accumulated finally to 4 ?
>> You can try VK_EXAMPLE test yourself, see if you can hit that BUG() 
>> in reservation.c file, I hit it and it is fixed by this patch, I don't think there is other reason.
>>
>> -----Original Message-----
>> From: amd-gfx [mailto:amd-gfx-bounces at lists.freedesktop.org] On 
>> Behalf Of Christian K?nig
>> Sent: 2018å¹´2æ??26æ?¥ 18:45
>> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org; 
>> Chris Wilson <chris at chris-wilson.co.uk>
>> Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared 
>> count bug
>>
>>> We must call reserve_shared before amdgpu_bo_fence....
>> Actually that's not correct. See 
>> reservation_object_add_shared_fence()
>> should replace the fence when it has the same context as a previously added fence.
>>
>> So we call reserve_shared only once and then call
>> reservation_object_add_shared_fence() multiple times, always with the same context.
>>
>> Writing this, Chris actually had patches which may have broken that assumption, but as far as I know those aren't merged yet.
>>
>> Regards,
>> Christian.
>>
>> Am 26.02.2018 um 11:36 schrieb Liu, Monk:
>>> amdgpu_vm_get_pd_bo() should already reserves a shared slot for the fence:
>>>>            entry->tv.shared = true;
>>> That's not related with the bug I fixed! The root cause is there is 
>>> some sequence that When code path comes to amdgpu_bo_fence of 
>>> vm_update_directories, no one The shared_cnt is by coincidence 
>>> already the max value of resv obj, so the BUG() in Reservation.c 
>>> file will hit,
>>>
>>> We must call reserve_shared before amdgpu_bo_fence....
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:ckoenig.leichtzumerken at gmail.com]
>>> Sent: 2018å¹´2æ??26æ?¥ 17:48
>>> To: Liu, Monk <Monk.Liu at amd.com>; amd-gfx at lists.freedesktop.org
>>> Subject: Re: [PATCH 22/22] drm/amdgpu: fix reservation obj shared 
>>> count bug
>>>
>>> That fix is incorrect.
>>>
>>> amdgpu_vm_get_pd_bo() should already reserves a shared slot for the fence:
>>>>            entry->tv.shared = true;
>>> So you must run into this issue because of something else.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 26.02.2018 um 06:35 schrieb Monk Liu:
>>>> should call reservation_object_reserve_shared before 
>>>> amdgpu_bo_fence(), otherwise there are odds kernel hit BUG in 
>>>> reversation.c
>>>>
>>>> bug:
>>>> [12622.076435] ------------[ cut here ]------------ [12622.076438] 
>>>> kernel BUG at drivers/dma-buf/reservation.c:233!
>>>> [12622.078046] invalid opcode: 0000 [#1] SMP KASAN [12622.078345] 
>>>> Modules linked in: amdgpu(E) chash(E) gpu_sched(E) ttm(E) drm_kms_helper(E) drm(E) i2c_algo_bit(E) fb_sys_fops(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) amdkfd(E) amd_iommu_v2(E) snd_hda_codec_generic(E) snd_hda_intel(E) snd_hda_codec(E) snd_hda_core(E) snd_hwdep(E) snd_pcm(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) pcbc(E) snd_seq_midi(E) snd_seq_midi_event(E) snd_rawmidi(E) snd_seq(E) aesni_intel(E) snd_seq_device(E) aes_x86_64(E) crypto_simd(E) glue_helper(E) cryptd(E) snd_timer(E) serio_raw(E) snd(E) soundcore(E) i2c_piix4(E) mac_hid(E) binfmt_misc(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) parport_pc(E) ppdev(E) lp(E) parport(E) autofs4(E) 8139too(E) psmouse(E) floppy(E) 8139cp(E) mii(E) pata_acpi(E)
>>>> [12622.082664] CPU: 2 PID: 6075 Comm: ShaderImageLoad Tainted: G            E   4.13.0-feb15-2018-guest-12 #8
>>>> [12622.083278] Hardware name: QEMU Standard PC (i440FX + PIIX, 
>>>> 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 [12622.083880] task:
>>>> ffff8801314f8000 task.stack: ffff8801e6df0000 [12622.084267] RIP:
>>>> 0010:reservation_object_add_shared_fence+0x2bc/0x2c0
>>>> [12622.084680] RSP: 0018:ffff8801e6df7ae8 EFLAGS: 00010246 
>>>> [12622.085016] RAX: 0000000000000004 RBX: ffff8801f2542a80 RCX:
>>>> ffff8801f2542b58 [12622.085471] RDX: ffff880044791428 RSI:
>>>> ffff8801e2657d20 RDI: ffff880044791428 [12622.085995] RBP:
>>>> ffff8801e6df7b28 R08: 0000000000000000 R09: 0000000000000000 
>>>> [12622.086451] R10: 000000009fe2fe07 R11: 0000000000000002 R12:
>>>> ffff880044791200 [12622.086905] R13: ffff8801e1da6850 R14:
>>>> 0000000000000000 R15: ffff8801e9303000 [12622.087359] FS:
>>>> 00007f70109ff700(0000) GS:ffff8801f7500000(0000)
>>>> knlGS:0000000000000000 [12622.087874] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [12622.088242] CR2: 00007f6fe1a49000 CR3: 00000001e8377000 CR4: 00000000001406e0 [12622.088698] Call Trace:
>>>> [12622.088934]  amdgpu_bo_fence+0x25/0x30 [amdgpu] [12622.089271]
>>>> amdgpu_vm_update_directories+0x3d7/0x450 [amdgpu] [12622.089721]  ?
>>>> amdgpu_vm_free_mapping.isra.20+0x30/0x30 [amdgpu] [12622.090184]
>>>> amdgpu_gem_va_ioctl+0x409/0x4f0 [amdgpu] [12622.090569]  ?
>>>> amdgpu_gem_metadata_ioctl+0x1b0/0x1b0 [amdgpu] [12622.090959]
>>>> drm_ioctl_kernel+0x6a/0xb0 [drm] [12622.091247]  ?
>>>> kasan_check_write+0x14/0x20 [12622.091517]  drm_ioctl+0x2ea/0x3b0 
>>>> [drm] [12622.091805]  ? amdgpu_gem_metadata_ioctl+0x1b0/0x1b0
>>>> [amdgpu] [12622.092212]  amdgpu_drm_ioctl+0x4b/0x80 [amdgpu] 
>>>> [12622.092519]
>>>> do_vfs_ioctl+0x96/0x5b0 [12622.092754]  ?
>>>> kvm_sched_clock_read+0x1e/0x30 [12622.093036]  ?
>>>> kvm_sched_clock_read+0x1e/0x30 [12622.093322]  ?
>>>> sched_clock+0x9/0x10 [12622.093550]  SyS_ioctl+0x79/0x90 [12622.093812]  ?
>>>> vtime_user_exit+0x29/0x70 [12622.094108]  do_syscall_64+0x6e/0x160 
>>>> [12622.094350]  entry_SYSCALL64_slow_path+0x25/0x25
>>>> [12622.094651] RIP: 0033:0x7f7012ebff07 [12622.094884] RSP:
>>>> 002b:00007f70109fdae8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 
>>>> [12622.095369] RAX: ffffffffffffffda RBX: 00007f6fe100fe30 RCX:
>>>> 00007f7012ebff07 [12622.095827] RDX: 00007f70109fdbb0 RSI:
>>>> 00000000c0286448 RDI: 0000000000000004 [12622.096283] RBP:
>>>> 00007f70109fdb20 R08: 0000000332000000 R09: 000000000000000e 
>>>> [12622.096739] R10: 0000000002e2e628 R11: 0000000000000206 R12:
>>>> 0000000002ebb8f0 [12622.097195] R13: 00007f6fe100fe30 R14:
>>>> 0000000002fe2bf0 R15: 0000000000000000 [12622.097676] Code: 2b e6 
>>>> ff ff 41 8b 4f 10 48 8b 75 c8 48 8b 55 d0 e9 06 ff ff ff 49 8d 44 
>>>> cf 10
>>>> 45 31 ed 48 89 70 08 41 83 47 10 01 e9 12 ff ff ff <0f> 0b 90 90 0f 
>>>> 1f
>>>> 44 00 00 55 31 c0 48 81 7f 08 00 df a9 ac 48 [12622.098919] RIP:
>>>> reservation_object_add_shared_fence+0x2bc/0x2c0 RSP:
>>>> ffff8801e6df7ae8 [12622.099527] ---[ end trace 8235addd8e22e444 
>>>> ]---
>>>>
>>>> Change-Id: I1de84f60c75a0d78b0cad244b8e2bff02aced9fa
>>>> Signed-off-by: Monk Liu <Monk.Liu at amd.com>
>>>> ---
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 5 +++++
>>>>      1 file changed, 5 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> index 0b237e0..e6f159f 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>>> @@ -989,6 +989,11 @@ int amdgpu_vm_update_directories(struct amdgpu_device *adev,
>>>>      		amdgpu_ring_pad_ib(ring, params.ib);
>>>>      		amdgpu_sync_resv(adev, &job->sync, root->tbo.resv,
>>>>      				 AMDGPU_FENCE_OWNER_VM, false);
>>>> +
>>>> +		r = reservation_object_reserve_shared(root->tbo.resv);
>>>> +		if (r)
>>>> +			goto error;
>>>> +
>>>>      		WARN_ON(params.ib->length_dw > ndw);
>>>>      		r = amdgpu_job_submit(job, ring, &vm->entity,
>>>>      				      AMDGPU_FENCE_OWNER_VM, &fence);
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx at lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx



[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux