On Thu, Aug 17, 2023 at 01:13:31PM +0200, Danilo Krummrich wrote:
> On 8/17/23 07:33, Christian König wrote:
> > On 16.08.23 at 18:33, Danilo Krummrich wrote:
> > > On 8/16/23 16:59, Christian König wrote:
> > > > On 16.08.23 at 14:30, Danilo Krummrich wrote:
> > > > > On 8/16/23 16:05, Christian König wrote:
> > > > > > On 16.08.23 at 13:30, Danilo Krummrich wrote:
> > > > > > > Hi Matt,
> > > > > > >
> > > > > > > On 8/11/23 04:31, Matthew Brost wrote:
> > > > > > > > In XE, the new Intel GPU driver, a choice has been made to have a 1 to 1 mapping between a drm_gpu_scheduler and drm_sched_entity. At first this seems a bit odd, but let us explain the reasoning below.
> > > > > > > >
> > > > > > > > 1. In XE the submission order from multiple drm_sched_entity is not guaranteed to be the same as the completion order, even if targeting the same hardware engine. This is because in XE we have a firmware scheduler, the GuC, which is allowed to reorder, timeslice, and preempt submissions. If using a shared drm_gpu_scheduler across multiple drm_sched_entity, the TDR falls apart, as the TDR expects submission order == completion order. Using a dedicated drm_gpu_scheduler per drm_sched_entity solves this problem.
> > > > > > > >
> > > > > > > > 2. In XE submissions are done via programming a ring buffer (circular buffer). A drm_gpu_scheduler provides a limit on the number of jobs; if the limit on the number of jobs is set to RING_SIZE / MAX_SIZE_PER_JOB, we get flow control on the ring for free.
> > > > > > >
> > > > > > > In XE, where does the limitation of MAX_SIZE_PER_JOB come from?
> > > > > > >
> > > > > > > In Nouveau we currently do have such a limitation as well, but it is derived from the RING_SIZE, hence RING_SIZE / MAX_SIZE_PER_JOB would always be 1. However, I think most jobs won't actually utilize the whole ring.
> > > > > >
> > > > > > Well, that should probably rather be RING_SIZE / MAX_SIZE_PER_JOB = hw_submission_limit (or even hw_submission_limit - 1 when the hw can't distinguish a full from an empty ring buffer).
> > > > >
> > > > > Not sure if I get you right; let me try to clarify what I was trying to say: I wanted to say that in Nouveau MAX_SIZE_PER_JOB isn't really limited by anything other than the RING_SIZE and hence we'd never allow more than 1 active job.
> > > >
> > > > But that lets the hw run dry between submissions. That is usually a pretty horrible idea for performance.
> > >
> > > Correct, that's the reason why I said it seems to be more efficient to base ring flow control on the actual size of each incoming job rather than the maximum size of a job.
> > >
> > > > > However, it seems to be more efficient to base ring flow control on the actual size of each incoming job rather than the worst case, namely the maximum size of a job.
> > > >
> > > > That doesn't sound like a good idea to me. See, we don't limit the number of submitted jobs based on the ring size, but rather we calculate the ring size based on the number of submitted jobs.
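
To make the RING_SIZE / MAX_SIZE_PER_JOB flow control discussed above concrete, here is a minimal sketch of the math; the macro names and values are hypothetical and are not taken from the actual Xe or Nouveau code:

/*
 * Minimal sketch of fixed, worst-case ring flow control. All names and
 * values here are hypothetical, for illustration only.
 */
#define RING_SIZE		(16 * 1024)	/* ring buffer space in bytes */
#define MAX_SIZE_PER_JOB	(1 * 1024)	/* worst-case space one job may need */

/*
 * With a worst-case job size, ring flow control reduces to capping the
 * number of in-flight jobs, which the DRM scheduler already does via its
 * hw submission limit.
 */
#define HW_SUBMISSION_LIMIT	(RING_SIZE / MAX_SIZE_PER_JOB)	/* 16 jobs here */

/*
 * The Nouveau case described above is the degenerate one: with
 * MAX_SIZE_PER_JOB == RING_SIZE the limit becomes 1, so the hardware can
 * run dry between submissions.
 */
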
> > > >
> > >
> > > My point isn't really about whether we derive the ring size from the job limit or the other way around. It's more about the job size (or its maximum size) being arbitrary.
> > >
> > > As mentioned in my reply to Matt:
> > >
> > > "In Nouveau, userspace can submit an arbitrary number of addresses of indirect buffers containing the ring instructions. The ring on the kernel side takes the addresses of the indirect buffers rather than the instructions themselves. Hence, technically there isn't really a limit on the number of IBs submitted by a job except for the ring size."
> > >
> > > So, my point is that I don't really want to limit the job size artificially just to be able to fit multiple jobs into the ring even if they're submitted at their "artificial" maximum size, but rather track how much of the ring the submitted job actually occupies.
> > >
> > > > In other words the hw_submission_limit defines the ring size, not the other way around. And you usually want the hw_submission_limit as low as possible for good scheduler granularity and to avoid extra overhead.
> > >
> > > I don't think you really mean "as low as possible", do you?
> >
> > No, I do mean as low as possible, or in other words as few as possible.
> >
> > Ideally the scheduler would submit only the minimum amount of work to the hardware to keep the hardware busy.
> >
> > The hardware seems to work mostly the same for all vendors, but you somehow seem to think that filling the ring is somehow beneficial, which is really not the case as far as I can see.
>
> I think that's a misunderstanding. I'm not trying to say that it is *always* beneficial to fill up the ring as much as possible. But I think it is under certain circumstances, exactly those circumstances I described for Nouveau.
>
> As mentioned, in Nouveau the size of a job is only really limited by the ring size, which means that one job can (but does not necessarily) fill up the whole ring. We both agree that this is inefficient, because it potentially results in the HW running dry due to hw_submission_limit == 1.
>
> I recognize you said that one should define the hw_submission_limit and adjust the other parts of the equation accordingly; the options I see are:
>
> (1) Increase the ring size while keeping the maximum job size.
> (2) Decrease the maximum job size while keeping the ring size.
> (3) Let the scheduler track the actual job size rather than the maximum job size.
>
> (1) results in potentially wasted ring memory, because we're not always reaching the maximum job size, but the scheduler assumes we do.
>
> (2) results in more IOCTLs from userspace for the same number of IBs, and more jobs result in more memory allocations and more work being submitted to the workqueue (with Matt's patches).
>
> (3) doesn't seem to have any of those drawbacks.
>
> What would be your take on that?
>
> Actually, if none of the other drivers is interested in a more precise way of keeping track of ring utilization, I'd be totally fine doing it in a driver-specific way. However, unfortunately I don't see how this would be possible.
>
> My proposal would be to just keep the hw_submission_limit (maybe rename it to submission_unit_limit) and add a submission_units field to struct drm_sched_job. By default, a job's submission_units field would be 1 and the scheduler would behave the exact same way as it does now.
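
As a rough illustration of the submission_units accounting proposed here, a minimal sketch of the admission check is shown below; the structure and function names are hypothetical, and this is not an actual patch against drm/scheduler:

#include <linux/atomic.h>
#include <linux/types.h>

/* Hypothetical, trimmed-down stand-ins for the real scheduler structs. */
struct sched_job_sketch {
	/* ... existing drm_sched_job members ... */
	u32 submission_units;	/* ring space this job occupies, 1 by default */
};

struct sched_sketch {
	/* ... existing drm_gpu_scheduler members ... */
	u32 submission_unit_limit;	/* total ring budget, was hw_submission_limit */
	atomic_t units_in_flight;	/* units currently occupying the ring */
};

/* Push a job to the hardware only if its units still fit on the ring. */
static bool sched_can_submit(struct sched_sketch *sched,
			     struct sched_job_sketch *job)
{
	u32 in_flight = atomic_read(&sched->units_in_flight);

	return in_flight + job->submission_units <= sched->submission_unit_limit;
}

With submission_units == 1 for every job this degenerates to today's job counting, while a driver like Nouveau could set it to the amount of ring space an individual job actually needs.
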
>
> Accordingly, jobs with submission_units > 1 would contribute more than one unit to the submission_unit_limit.
>
> What do you think about that?
>

This seems reasonable to me and a very minimal change to the scheduler.

Matt

> Besides all that, you said that filling up the ring just enough to not let the HW run dry, rather than filling it up entirely, is desirable. Why do you think so? I tend to think that in most cases it shouldn't make a difference.
>
> - Danilo
>
> >
> > Regards,
> > Christian.
> >
> > > Because one really is the minimum if you want to do work at all, but as you mentioned above a job limit of one can let the ring run dry.
> > >
> > > In the end my proposal comes down to tracking the actual size of a job rather than just assuming a pre-defined maximum job size, and hence a dynamic job limit.
> > >
> > > I don't think this would hurt the scheduler granularity. In fact, it should serve the goal of not letting the ring run dry even better. Especially for sequences of small jobs, where the current implementation might wrongly assume the ring is already full although there would actually still be enough space left.
> > >
> > > > Christian.
> > > >
> > > > > > Otherwise your scheduler might just overwrite the ring buffer by pushing things too fast.
> > > > > >
> > > > > > Christian.
> > > > > >
> > > > > > > Given that, it seems like it would be better to let the scheduler keep track of empty ring "slots" instead, such that the scheduler can decide whether a subsequent job will still fit on the ring and, if not, re-evaluate once a previous job finished. Of course each submitted job would be required to carry the number of slots it requires on the ring.
> > > > > > >
> > > > > > > What do you think of implementing this as an alternative flow control mechanism? Implementation-wise this could be a union with the existing hw_submission_limit.
> > > > > > >
> > > > > > > - Danilo
> > > > > > >
> > > > > > > > A problem with this design is that currently a drm_gpu_scheduler uses a kthread for submission / job cleanup. This doesn't scale if a large number of drm_gpu_schedulers are used. To work around the scaling issue, use a worker rather than a kthread for submission / job cleanup.
> > > > > > > >
> > > > > > > > v2:
> > > > > > > >   - (Rob Clark) Fix msm build
> > > > > > > >   - Pass in run work queue
> > > > > > > > v3:
> > > > > > > >   - (Boris) don't have loop in worker
> > > > > > > > v4:
> > > > > > > >   - (Tvrtko) break out submit ready, stop, start helpers into own patch
> > > > > > > >
> > > > > > > > Signed-off-by: Matthew Brost <matthew.brost@xxxxxxxxx>
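
To round out the sketch above, the completion side of the same accounting could look roughly like the following; again, all names are hypothetical (the structs from the previous sketch are restated so the snippet is self-contained), and the work item simply stands in for the worker-based submission path introduced by this patch rather than the actual implementation:

#include <linux/atomic.h>
#include <linux/types.h>
#include <linux/workqueue.h>

/* Hypothetical scheduler stand-in, extended with the submit work item. */
struct sched_sketch {
	u32 submission_unit_limit;		/* total ring budget in units */
	atomic_t units_in_flight;		/* units currently on the ring */
	struct workqueue_struct *submit_wq;	/* run-job workqueue (passed in, per v2) */
	struct work_struct submit_work;		/* replaces the submission kthread */
};

struct sched_job_sketch {
	u32 submission_units;	/* ring space this job occupied */
};

/* Called once the hardware signals that a job has completed. */
static void sched_job_done_sketch(struct sched_sketch *sched,
				  struct sched_job_sketch *job)
{
	/* Return the job's ring space ... */
	atomic_sub(job->submission_units, &sched->units_in_flight);

	/*
	 * ... and requeue the submission work item so that queued jobs
	 * which now fit within submission_unit_limit can be pushed to
	 * the hardware.
	 */
	queue_work(sched->submit_wq, &sched->submit_work);
}
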