On Tue, Jul 25, 2023 at 5:48 PM Danilo Krummrich <dakr@xxxxxxxxxx> wrote:
On 7/25/23 18:43, Danilo Krummrich wrote:
> On 7/25/23 18:16, Faith Ekstrand wrote:
>> Thanks for the detailed write-up! That would definitely explain it. If
>> I remember, I'll try to do a single-threaded run or two. If your
>> theory is correct, there should be no real perf difference when
>> running single-threaded. Those runs will take a long time, though, so
>> I'll have to run them over night. I'll let you know in a few days once
>> I have the results.
>
> I can also push a separate branch where I just print out a warning
> whenever we run into such a condition, including the time we spent waiting
> for things to complete. I can probably push something later today.
https://gitlab.freedesktop.org/nouvelles/kernel/-/tree/new-uapi-drm-next-track-stalls
It prints out the duration of every wait as well as the total wait time
since boot.
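
A minimal sketch of how one of those waits could be timed (this is not
necessarily what the branch above actually does; the fence variable and the
message text are made up):

    #include <linux/dma-fence.h>
    #include <linux/ktime.h>
    #include <linux/printk.h>

    /* Wait for the pending VM_BIND jobs and warn about the stall length. */
    static void wait_and_report_stall(struct dma_fence *fence)
    {
            ktime_t start = ktime_get();

            /* Block (uninterruptibly) until the fence signals. */
            dma_fence_wait(fence, false);

            pr_warn("nouveau: stalled %lld us waiting for VM_BIND jobs\n",
                    ktime_us_delta(ktime_get(), start));
    }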
- Danilo
>
>>
>> If this theory holds, then I'm not concerned about the performance of
>> the API itself. It would still be good to see if we can find a way to
>> reduce the cross-process drag in the implementation but that's a perf
>> optimization we can do later.
>
> From the kernel side I think the only thing we could really do is to
> temporarily run a second drm_gpu_scheduler instance, one for VM_BINDs
> and one for EXECs, until we get the new page table handling in place.
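
Just to sketch the idea (the scheduler and entity fields and the is_vm_bind
check are hypothetical, not actual nouveau code): keep two drm_gpu_scheduler
instances, give each client one entity per scheduler, and pick the entity by
job type so EXECs no longer queue up behind VM_BINDs:

    /* Both schedulers would be set up via drm_sched_init() elsewhere. */
    struct drm_gpu_scheduler *exec_list[]    = { &drm->exec_sched };
    struct drm_gpu_scheduler *vm_bind_list[] = { &drm->vm_bind_sched };

    drm_sched_entity_init(&cli->exec_entity, DRM_SCHED_PRIORITY_NORMAL,
                          exec_list, 1, NULL);
    drm_sched_entity_init(&cli->vm_bind_entity, DRM_SCHED_PRIORITY_NORMAL,
                          vm_bind_list, 1, NULL);

    /* At submit time, route the job to the matching entity. */
    entity = is_vm_bind ? &cli->vm_bind_entity : &cli->exec_entity;

    drm_sched_job_init(&job->base, entity, cli);
    drm_sched_job_arm(&job->base);
    drm_sched_entity_push_job(&job->base);

The point is simply that the two job streams stop sharing a single run queue.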
>
> However, the UMD could avoid such conditions more effectively, since it
> controls the address space. Namely, avoid re-using the same region of
> the address space right away in certain cases. For instance, instead of
> replacing a sparse region A[0x0, 0x4000000] with a larger sparse region
> B[0x0, 0x8000000], replace it with B'[0x4000000, 0xC000000] if possible.
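
A purely hypothetical userspace-side sketch of that heuristic (none of this
is actual Mesa/NVK code): when growing a sparse region, prefer placing the
replacement after the old range so the new map never overlaps the
still-pending unmap:

    #include <stdint.h>

    /*
     * Hypothetical policy: when replacing sparse region
     * A[old_start, old_start + old_size) with a larger one of new_size,
     * try to place the new region right after A (e.g.
     * B'[0x4000000, 0xC000000]) instead of on top of it
     * (B[0x0, 0x8000000]), so the kernel does not have to wait for A's
     * unmap before mapping the new region.
     */
    static uint64_t
    pick_new_sparse_va(uint64_t old_start, uint64_t old_size,
                       uint64_t new_size, uint64_t va_limit)
    {
            uint64_t after_old = old_start + old_size;

            if (after_old + new_size <= va_limit)
                    return after_old;

            /* No room above: fall back to reuse and accept a possible stall. */
            return old_start;
    }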
>
> However, I'm just mentioning this for completeness. The UMD surely
> shouldn't have to work around such a kernel limitation, even temporarily.
>
> Anyway, before doing any of those, let's see if the theory holds and
> we're actually running into such cases.
>
>>
>> Does it actually matter? Yes, it kinda does. No, it probably doesn't
>> matter for games because you're typically only running one game at a
>> time. From a development PoV, however, if it makes CI take longer then
>> that slows down development and that's not good for the users, either.
I've run a bunch of tests over the weekend and it's starting to look like the added CTS time might be coming from the newly enabled synchronization tests themselves. We enabled timeline semaphores as well as semaphore and fence sharing along with the new uAPI, and I didn't expect those tests to be nearly that heavy-weight, so I didn't think of them as a likely factor. I'm doing a couple more runs to confirm, but what I'm seeing right now indicates that the new API with the old feature set has about the same run time now that the submit overhead issue has been fixed.
I think this means we can proceed under the assumption that there are no serious perf issues left in the new API. I'm happy to RB/ACK the next version of the API and implementation patches, as long as they include the new NO_SHARE BO create flag and properly reject exports of NO_SHARE BOs, even if not all of the dma_resv plumbing is fully baked.
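
As a rough sketch of what rejecting such exports could look like (the struct
and flag storage here are made up; only the drm_gem_object_funcs .export hook
and drm_gem_prime_export() are real DRM interfaces):

    #include <linux/err.h>
    #include <drm/drm_gem.h>
    #include <drm/drm_prime.h>

    /* Hypothetical driver BO; no_share would be set from the NO_SHARE
     * flag at GEM create time. */
    struct my_bo {
            struct drm_gem_object base;
            bool no_share;
    };

    static struct dma_buf *
    my_gem_prime_export(struct drm_gem_object *obj, int flags)
    {
            struct my_bo *bo = container_of(obj, struct my_bo, base);

            /* VM-private BOs must never be exported to other processes. */
            if (bo->no_share)
                    return ERR_PTR(-EPERM);

            return drm_gem_prime_export(obj, flags);
    }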
~Faith