Re: [PATCH] drm/amdgpu: add the checking to avoid NULL pointer dereference

"Sharma, Deepak" <Deepak.Sharma@xxxxxxx> · Sat, 1 Dec 2018 03:42:02 +0000

Did find way to reproduce issue constantly. After applying David's patch "0001-drm-amdgpu-fix-signaled-fence-isn-t-handled" with minor change

-static struct dma_fence *drm_syncobj_get_stub_fence(void)
+struct dma_fence *drm_syncobj_get_stub_fence(void)

was able to avoid kernel panic due to NULL pointer dereference. So seems it is resolving the problem which I was trying to address. 
Can you please put revision of this patch?

Though only one time I got reboot with dmesg..

[  213.714324] RIP: 0010:reservation_object_add_shared_fence+0x22e/0x245
[  213.714331] RSP: 0018:ffffa03900b5ba80 EFLAGS: 00010246
[  213.714338] RAX: 0000000000000004 RBX: ffffa03900b5bba8 RCX: ffff969421139d00
[  213.714343] RDX: 0000000000000000 RSI: ffff9693639662e0 RDI: ffff9694203db230
[  213.714349] RBP: ffffa03900b5bad0 R08: 000000000000002a R09: ffff969363966280
[  213.714354] R10: 0000000180800079 R11: ffffffffa3637f66 R12: ffff96941ea5cf00
[  213.714360] R13: 0000000000000000 R14: ffff9693639662e0 R15: ffff9694203db230
[  213.714366] FS:  0000786009e5f700(0000) GS:ffff96942ec00000(0000) knlGS:0000000000000000
[  213.714372] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  213.714377] CR2: 00007860023a4000 CR3: 000000011ea8a000 CR4: 00000000001406f0
[  213.714382] Call Trace:
[  213.714397]  ttm_eu_fence_buffer_objects+0x53/0x89
[  213.714407]  amdgpu_cs_ioctl+0x194f/0x196c
[  213.714420]  ? amdgpu_cs_report_moved_bytes+0x5f/0x5f
[  213.714427]  drm_ioctl_kernel+0x98/0xac
[  213.714435]  drm_ioctl+0x235/0x357
[  213.714443]  ? amdgpu_cs_report_moved_bytes+0x5f/0x5f
[  213.714450]  ? __switch_to_asm+0x34/0x70
[  213.714456]  ? __switch_to_asm+0x40/0x70
[  213.714462]  ? __switch_to_asm+0x34/0x70
[  213.714468]  ? __switch_to_asm+0x40/0x70
[  213.714476]  amdgpu_drm_ioctl+0x46/0x78
[  213.714500]  vfs_ioctl+0x1b/0x30
[  213.714507]  do_vfs_ioctl+0x492/0x6c1
[  213.714530]  SyS_ioctl+0x52/0x77
[  213.714537]  do_syscall_64+0x6d/0x81
[  213.714544]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[  213.714551] RIP: 0033:0x78600c9c6967
[  213.714556] RSP: 002b:0000786009e5ec08 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  213.714563] RAX: ffffffffffffffda RBX: 00000000c0186444 RCX: 000078600c9c6967
[  213.714568] RDX: 0000786009e5ec60 RSI: 00000000c0186444 RDI: 000000000000000d
[  213.714573] RBP: 0000786009e5ec40 R08: 0000786009e5ed00 R09: 0000786009e5ec50
[  213.714578] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000d
[  213.714583] R13: 00003ba2573ff128 R14: 0000000000000000 R15: 0000786009e5ec60

Its worth to mention that I am not using latest kernel .relevant git log 

71a0556255de125b7e3fc0dc6171fb31fab2b9ad drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set

4034318d2afac8f7f43457a4b480b877a94ec651  drm/syncobj:
 Stop reusing the same struct file for all syncobj -> fd

So have not taken any recent changes  from https://patchwork.kernel.org/patch/10575275/ series. only
backported  attached two patches . I need to check if something is missing that can cause issue.

Thanks,
Deepak

From: Sharma, Deepak

Sent: Tuesday, November 27, 2018 3:12 PM

To: Zhou, David(ChunMing); Koenig, Christian; amd-gfx@xxxxxxxxxxxxxxxxxxxxx

Subject: Re: [PATCH] drm/amdgpu: add the checking to avoid NULL pointer dereference

Thanks David , 
I did backport the drm patch to my kernel  after that I am not getting -ENOMEM from amdgpu_cs_ioctl while running tests  so have not been able to test patch to handle signaled fence. As this issue
 is hard to reproduce , will give some more try. 
But yes the problem is there and need to handle when fence is null that your patch seems to handle it correctly.

-Deepak

From: Zhou, David(ChunMing)

Sent: Monday, November 26, 2018 6:40 PM

To: Sharma, Deepak; Zhou, David(ChunMing); Koenig, Christian; amd-gfx@xxxxxxxxxxxxxxxxxxxxx

Subject: Re: [PATCH] drm/amdgpu: add the checking to avoid NULL pointer dereference

Yeah, you need another drm patch as well when you apply my patch. Attached.

-David

On 2018年11月27日 08:40, Sharma, Deepak wrote:

>

> On 11/26/18 1:57 AM, Zhou, David(ChunMing) wrote:

>>

>>> -----Original Message-----

>>> From: Christian König <ckoenig.leichtzumerken@xxxxxxxxx>

>>> Sent: Monday, November 26, 2018 5:23 PM

>>> To: Sharma, Deepak <Deepak.Sharma@xxxxxxx>; Zhou, David(ChunMing)

>>> <David1.Zhou@xxxxxxx>; Koenig, Christian <Christian.Koenig@xxxxxxx>;

>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx

>>> Subject: Re: [PATCH] drm/amdgpu: add the checking to avoid NULL pointer

>>> dereference

>>>

>>> Am 26.11.18 um 02:59 schrieb Sharma, Deepak:

>>>> 在 2018/11/24 2:10, Koenig, Christian 写道:

>>>>> Am 23.11.18 um 15:10 schrieb Zhou, David(ChunMing):

>>>>>> 在 2018/11/23 21:30, Koenig, Christian 写道:

>>>>>>> Am 23.11.18 um 14:27 schrieb Zhou, David(ChunMing):

>>>>>>>> 在 2018/11/22 19:25, Christian König 写道:

>>>>>>>>> Am 22.11.18 um 07:56 schrieb Sharma, Deepak:

>>>>>>>>>> when returned fence is not valid mostly due to userspace ignored

>>>>>>>>>> previous error causes NULL pointer dereference.

>>>>>>>>> Again, this is clearly incorrect. The my other mails on the

>>>>>>>>> earlier patch.

>>>>>>>> Sorry for I didn't get your history, but looks from the patch

>>>>>>>> itself, it is still a valid patch, isn't it?

>>>>>>> No, the semantic of amdgpu_ctx_get_fence() is that we return NULL

>>>>>>> when the fence is already signaled.

>>>>>>>

>>>>>>> So this patch could totally break userspace because it changes the

>>>>>>> behavior when we try to sync to an already signaled fence.

>>>>>> Ah, I got your meaning, how about attached patch?

>>>>> Yeah something like this, but I would just give the

>>>>> DRM_SYNCOBJ_CREATE_SIGNALED instead.

>>>>>

>>>>> I mean that's what this flag is good for isn't it?

>>>> Yeah, I give a flag initally when creating patch, but as you know, there is a

>>> swich case not be able to use that flag:

>>>>         case AMDGPU_FENCE_TO_HANDLE_GET_SYNC_FILE_FD:

>>>>             fd = get_unused_fd_flags(O_CLOEXEC);

>>>>             if (fd < 0) {

>>>>                 dma_fence_put(fence);

>>>>                 return fd;

>>>>             }

>>>>

>>>>             sync_file = sync_file_create(fence);

>>>>             dma_fence_put(fence);

>>>>             if (!sync_file) {

>>>>                 put_unused_fd(fd);

>>>>                 return -ENOMEM;

>>>>             }

>>>>

>>>>             fd_install(fd, sync_file->file);

>>>>             info->out.handle = fd;

>>>>             return 0;

>>>>

>>>> So I change to stub fence instead.

>>> Yeah, I've missed that case. Not sure if the sync_file can deal with a NULL

>>> fence.

>>>

>>> We should then probably move the stub fence function into

>>> dma_fence_stub.c under drivers/dma-buf to keep the stuff together.

>> Yes, you wrap it to review first with your stub fence, we can do it separately first.

>>

>> -David

>>>> -David

>>>>

>>>> I have not applied this patch.

>>>> The issue was trying to address is when amdgpu_cs_ioctl() failed due to

>>> low memory (ENOMEM) but userspace chose to proceed and called

>>> amdgpu_cs_fence_to_handle_ioctl().

>>>> In amdgpu_cs_fence_to_handle_ioctl(), fence is null and later causing

>>>> NULL pointer dereference, this patch was to avoid that and system panic

>>> But I understand now that its a valid case retuning NULL if fence was already

>>> signaled but need to handle case so avoid kernel panic. Seems David patch

>>> should fix this, I will test it tomorrow.

>>>

>>> Mhm, but don't we bail out with an error if we ask for a failed command

>>> submission? If not that sounds like a bug as well.

>>>

>>> Christian.

>>>

> Where do we do that?

> I see error

> [drm:amdgpu_cs_ioctl] *ERROR* amdgpu_cs_list_validate(validated) failed.

> [drm:amdgpu_cs_ioctl] *ERROR* Not enough memory for command submission!

> BUG: unable to handle kernel NULL pointer dereference at 0000000000000008

> Did some more debugging,dma_fence_is_array() is causing NULL pointer

> dereference call through sync_file_ioctl.

>

> Also I think changes in David patch cant be applied on

> amd-staging-drm-next, which all patches I should take to get it correctly?

>

>>>> -Deepak

>>>>> Christian.

>>>>>

>>>>>> -David

>>>>>>> If that patch was applied then please revert it immediately.

>>>>>>>

>>>>>>> Christian.

>>>>>>>

>>>>>>>> -David

>>>>>>>>> If you have already pushed the patch then please revert.

>>>>>>>>>

>>>>>>>>> Christian.

>>>>>>>>>

>>>>>>>>>> Signed-off-by: Deepak Sharma <Deepak.Sharma@xxxxxxx>

>>>>>>>>>> ---

>>>>>>>>>>          drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 ++

>>>>>>>>>>          1 file changed, 2 insertions(+)

>>>>>>>>>>

>>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c

>>>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c

>>>>>>>>>> index 024dfbd87f11..14166cd8a12f 100644

>>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c

>>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c

>>>>>>>>>> @@ -1403,6 +1403,8 @@ static struct dma_fence

>>>>>>>>>> *amdgpu_cs_get_fence(struct amdgpu_device *adev,

>>>>>>>>>>                fence = amdgpu_ctx_get_fence(ctx, entity, user->seq_no);

>>>>>>>>>>              amdgpu_ctx_put(ctx);

>>>>>>>>>> +    if(!fence)

>>>>>>>>>> +        return ERR_PTR(-EINVAL);

>>>>>>>>>>                return fence;

>>>>>>>>>>          }

>>>> _______________________________________________

>>>> amd-gfx mailing list

>>>> amd-gfx@xxxxxxxxxxxxxxxxxxxxx

>>>> 
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx