Re: [PATCH libdrm] amdgpu: add a faster BO list API

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On Thu, Jan 10, 2019, 6:51 AM Christian König <ckoenig.leichtzumerken@xxxxxxxxx wrote:
Am 10.01.19 um 12:41 schrieb Marek Olšák:


On Thu, Jan 10, 2019, 4:15 AM Koenig, Christian <Christian.Koenig@xxxxxxx wrote:
Am 10.01.19 um 00:39 schrieb Marek Olšák:
On Wed, Jan 9, 2019 at 1:41 PM Christian König <ckoenig.leichtzumerken@xxxxxxxxx> wrote:
Am 09.01.19 um 17:14 schrieb Marek Olšák:
On Wed, Jan 9, 2019 at 8:09 AM Christian König <ckoenig.leichtzumerken@xxxxxxxxx> wrote:
Am 09.01.19 um 13:36 schrieb Marek Olšák:


On Wed, Jan 9, 2019, 5:28 AM Christian König <ckoenig.leichtzumerken@xxxxxxxxx wrote:
Looks good, but I'm wondering what's the actual improvement?

No malloc calls and 1 less for loop copying the bo list.

Yeah, but didn't we want to get completely rid of the bo list?

If we have multiple IBs (e.g. gfx + compute) that share a BO list, I think it's faster to send the BO list to the kernel only once.

That's not really faster.

The only thing we safe us is a single loop over all BOs to lockup the handle into a pointer and that is only a tiny fraction of the overhead.

The majority of the overhead is locking the BOs and reserving space for the submission.

What could really help here is to submit gfx+comput together in just one CS IOCTL. This way we would need the locking and space reservation only once.

It's a bit of work in the kernel side, but certainly doable.

OK. Any objections to this patch?

In general I'm wondering if we couldn't avoid adding so much new interface.

There are Vulkan drivers that still use the bo_list interface.


For example we can avoid the malloc() when we just cache the last freed bo_list structure in the device. We would just need an atomic pointer exchange operation for that.

This way we even don't need to change mesa at all.

There is still the for loop that we need to get rid of.

Yeah, but that I'm fine to handle with a amdgpu_bo_list_create_raw which only takes the handles and still returns the amdgpu_bo_list structure we are used to.

See what I'm mostly concerned about is having another CS function to maintain.

There is no maintenance cost. It's just a wrapper. Eventually all drivers will switch to it.

Marek




Regarding optimization, this chunk can be replaced by a cast on 64bit:
+	chunk_array = alloca(sizeof(uint64_t) * num_chunks);
+	for (i = 0; i < num_chunks; i++)
+		chunk_array[i] = (uint64_t)(uintptr_t)&chunks[i];
It can't. The input is an array of structures. The ioctl takes an array of pointers.

Ah! Haven't seen this, sorry for the noise.

Christian.


Marek


Regards,
Christian.


Thanks,
Marek


_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux