On Mon, Aug 21, 2023 at 1:13 PM Christian König <christian.koenig@xxxxxxx> wrote:
Am 21.08.23 um 20:01 schrieb Danilo Krummrich:
> On 8/21/23 16:07, Christian König wrote:
>> Am 18.08.23 um 13:58 schrieb Danilo Krummrich:
>>> [SNIP]
>>>> I only see two possible outcomes:
>>>> 1. You return -EBUSY (or similar) error code indicating the the hw
>>>> can't receive more commands.
>>>> 2. Wait on previously pushed commands to be executed.
>>>> (3. Your driver crash because you accidentally overwrite stuff in
>>>> the ring buffer which is still executed. I just assume that's
>>>> prevented).
>>>>
>>>> Resolution #1 with -EBUSY is actually something the UAPI should not
>>>> do, because your UAPI then depends on the specific timing of
>>>> submissions which is a really bad idea.
>>>>
>>>> Resolution #2 is usually bad because it forces the hw to run dry
>>>> between submission and so degrade performance.
>>>
>>> I agree, that is a good reason for at least limiting the maximum job
>>> size to half of the ring size.
>>>
>>> However, there could still be cases where two subsequent jobs are
>>> submitted with just a single IB, which as is would still block
>>> subsequent jobs to be pushed to the ring although there is still
>>> plenty of space. Depending on the (CPU) scheduler latency, such a
>>> case can let the HW run dry as well.
>>
>> Yeah, that was intentionally not done as well. The crux here is that
>> the more you push to the hw the worse the scheduling granularity
>> becomes. It's just that neither Xe nor Nouveau relies that much on
>> the scheduling granularity at all (because of hw queues).
>>
>> But Xe doesn't seem to need that feature and I would still try to
>> avoid it because the more you have pushed to the hw the harder it is
>> to get going again after a reset.
>>
>>>
>>> Surely, we could just continue decrease the maximum job size even
>>> further, but this would result in further overhead on user and
>>> kernel for larger IB counts. Tracking the actual job size seems to
>>> be the better solution for drivers where the job size can vary over
>>> a rather huge range.
>>
>> I strongly disagree on that. A larger ring buffer is trivial to allocate
>
> That sounds like a workaround to me. The problem, in the case above,
> isn't that the ring buffer does not have enough space, the problem is
> that we account for the maximum job size although the actual job size
> is much smaller. And enabling the scheduler to track the actual job
> size is trivial as well.
That's what I agree on, so far I just didn't see the reason for doing it
but at least a few reason for not doing it.
>
>> and if userspace submissions are so small that the scheduler can't
>> keep up submitting them then your ring buffer size is your smallest
>> problem.
>>
>> In other words the submission overhead will completely kill your
>> performance and you should probably consider stuffing more into a
>> single submission.
>
> I fully agree and that is also the reason why I want to keep the
> maximum job size as large as possible.
>
> However, afaik with Vulkan it's the applications themselves deciding
> when and with how many command buffers a queue is submitted (@Faith:
> please correct me if I'm wrong). Hence, why not optimize for this case
> as well? It's not that it would make another case worse, right?
As I said it does make both the scheduling granularity as well as the
reset behavior worse.
In general I think we should try to push just enough work to the
hardware to keep it busy and not as much as possible.
So as long as nobody from userspace comes and says we absolutely need to
optimize this use case I would rather not do it.
This is a place where nouveau's needs are legitimately different from AMD or Intel, I think. NVIDIA's command streamer model is very different from AMD and Intel. On AMD and Intel, each EXEC turns into a single small packet (on the order of 16B) which kicks off a command buffer. There may be a bit of cache management or something around it but that's it. From there, it's userspace's job to make one command buffer chain to another until it's finally done and then do a "return", whatever that looks like.
NVIDIA's model is much more static. Each packet in the HW/FW ring is an address and a size and that much data is processed and then it grabs the next packet and processes. The result is that, if we use multiple buffers of commands, there's no way to chain them together. We just have to pass the whole list of buffers to the kernel. A single EXEC ioctl / job may have 500 such addr+size packets depending on how big the command buffer is. It gets worse on pre-Turing hardware where we have to split the batch for every single DrawIndirect or DispatchIndirect.
Lest you think NVIDIA is just crazy here, it's a perfectly reasonable model if you assume that userspace is feeding the firmware. When that's happening, you just have a userspace thread that sits there and feeds the ringbuffer with whatever is next and you can marshal as much data through as you want. Sure, it'd be nice to have a 2nd level batch thing that gets launched from the FW ring and has all the individual launch commands but it's not at all necessary.
What does that mean from a gpu_scheduler PoV? Basically, it means a variable packet size.
What does this mean for implementation? IDK. One option would be to teach the scheduler about actual job sizes. Another would be to virtualize it and have another layer underneath the scheduler that does the actual feeding of the ring. Another would be to decrease the job size somewhat and then have the front-end submit as many jobs as it needs to service userspace and only put the out-fences on the last job. All the options kinda suck.
~Faith