Re: [PATCH v3 3/3] drm/doc/rfc: VM_BIND uapi definition

"Zeng, Oak" <oak.zeng@xxxxxxxxx> · Mon, 27 Jun 2022 18:58:42 +0000

Thanks,
Oak

> -----Original Message-----
> From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx>
> Sent: June 27, 2022 4:30 AM
> To: Zeng, Oak <oak.zeng@xxxxxxxxx>; Landwerlin, Lionel G
> <lionel.g.landwerlin@xxxxxxxxx>; Vishwanathapura, Niranjana
> <niranjana.vishwanathapura@xxxxxxxxx>
> Cc: Zanoni, Paulo R <paulo.r.zanoni@xxxxxxxxx>; intel-
> gfx@xxxxxxxxxxxxxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx; Hellstrom,
> Thomas <thomas.hellstrom@xxxxxxxxx>; Wilson, Chris P
> <chris.p.wilson@xxxxxxxxx>; Vetter, Daniel <daniel.vetter@xxxxxxxxx>;
> christian.koenig@xxxxxxx; Auld, Matthew <matthew.auld@xxxxxxxxx>
> Subject: Re:  [PATCH v3 3/3] drm/doc/rfc: VM_BIND uapi definition
> 
> 
> On 24/06/2022 21:23, Zeng, Oak wrote:
> > Let's compare "tlb invalidate at vm unbind" vs "tlb invalidate at backing
> storage":
> >
> > Correctness:
> > consider this sequence of:
> > 1. unbind va1 from pa1,
> > 2. then bind va1 to pa2. //user space has the freedom to do this as it
> > manages virtual address space 3. Submit shader code using va1, 4. Then
> > retire pa1.
> >
> > If you don't perform tlb invalidate at step #1, in step #3, shader will use
> stale entries in tlb and pa1 will be used for the shader. User want to use pa2.
> So I don't think invalidate tlb at step #4 make correctness.
> 
> Define step 3. Is it a new execbuf? If so then there will be a TLB flush there.
> Unless the plan is to stop doing that with eb3 but I haven't picked up on that
> anywhere so far.

In Niranjana's latest patch series, he removed the TLB flushing from vm_unbind. He also said explicitly TLB invalidation will be performed at job submission and backing storage releasing time, which is the existing behavior of the current i915 driver.

I think if we invalidate TLB on each vm_unbind, then we don't need to invalidate at submission and backing storage releasing. It doesn't make a lot of sense to me to perform a tlb invalidation at execbuf time. Maybe it is a behavior for the old implicit binding programming model. For vm_bind and eb3, we separate the binding and job submission into two APIs. It is more natural the TLB invalidation be coupled with the vm bind/unbind, not job submission. So in my opinion we should remove tlb invalidation from submission and backing storage releasing and add it to vm unbind. This is method is cleaner to me.

Regarding performance, we don't have data. In my opinion, we should make things work in a most straight forward way as the first step. Then consider performance improvement if necessary. Consider some delayed tlb invalidation at submission and backing release time without performance data support wasn't a good decision.

Regards,
Oak

> 
> > Performance:
> > It is straight forward to invalidate tlb at step 1. If platform support range
> based tlb invalidation, we can perform range based invalidation easily at
> step1.
> 
> If the platform supports range base yes. If it doesn't _and_ the flush at
> unbind is not needed for 99% of use cases then it is simply a waste.
> 
> > If you do it at step 4, you either need to perform a whole gt tlb invalidation
> (worse performance), or you need to record all the VAs that this pa has been
> bound to and invalidate all the VA ranges - ugly program.
> 
> Someone can setup some benchmarking? :)
> 
> Regards,
> 
> Tvrtko
> 
> >
> >
> > Thanks,
> > Oak
> >
> >> -----Original Message-----
> >> From: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx>
> >> Sent: June 24, 2022 4:32 AM
> >> To: Zeng, Oak <oak.zeng@xxxxxxxxx>; Landwerlin, Lionel G
> >> <lionel.g.landwerlin@xxxxxxxxx>; Vishwanathapura, Niranjana
> >> <niranjana.vishwanathapura@xxxxxxxxx>
> >> Cc: Zanoni, Paulo R <paulo.r.zanoni@xxxxxxxxx>; intel-
> >> gfx@xxxxxxxxxxxxxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx;
> >> Hellstrom, Thomas <thomas.hellstrom@xxxxxxxxx>; Wilson, Chris P
> >> <chris.p.wilson@xxxxxxxxx>; Vetter, Daniel <daniel.vetter@xxxxxxxxx>;
> >> christian.koenig@xxxxxxx; Auld, Matthew <matthew.auld@xxxxxxxxx>
> >> Subject: Re:  [PATCH v3 3/3] drm/doc/rfc: VM_BIND uapi
> >> definition
> >>
> >>
> >> On 23/06/2022 22:05, Zeng, Oak wrote:
> >>>> -----Original Message-----
> >>>> From: Intel-gfx <intel-gfx-bounces@xxxxxxxxxxxxxxxxxxxxx> On Behalf
> >>>> Of Tvrtko Ursulin
> >>>> Sent: June 23, 2022 7:06 AM
> >>>> To: Landwerlin, Lionel G <lionel.g.landwerlin@xxxxxxxxx>;
> >>>> Vishwanathapura, Niranjana <niranjana.vishwanathapura@xxxxxxxxx>
> >>>> Cc: Zanoni, Paulo R <paulo.r.zanoni@xxxxxxxxx>;
> >>>> intel-gfx@xxxxxxxxxxxxxxxxxxxxx; dri-devel@xxxxxxxxxxxxxxxxxxxxx;
> >>>> Hellstrom, Thomas <thomas.hellstrom@xxxxxxxxx>; Wilson, Chris P
> >>>> <chris.p.wilson@xxxxxxxxx>; Vetter, Daniel
> >>>> <daniel.vetter@xxxxxxxxx>; christian.koenig@xxxxxxx; Auld,
> Matthew
> >>>> <matthew.auld@xxxxxxxxx>
> >>>> Subject: Re:  [PATCH v3 3/3] drm/doc/rfc: VM_BIND uapi
> >>>> definition
> >>>>
> >>>>
> >>>> On 23/06/2022 09:57, Lionel Landwerlin wrote:
> >>>>> On 23/06/2022 11:27, Tvrtko Ursulin wrote:
> >>>>>>>
> >>>>>>> After a vm_unbind, UMD can re-bind to same VA range against an
> >>>>>>> active VM.
> >>>>>>> Though I am not sue with Mesa usecase if that new mapping is
> >>>>>>> required for running GPU job or it will be for the next
> >>>>>>> submission. But ensuring the tlb flush upon unbind, KMD can
> >>>>>>> ensure correctness.
> >>>>>>
> >>>>>> Isn't that their problem? If they re-bind for submitting _new_
> >>>>>> work then they get the flush as part of batch buffer pre-amble.
> >>>>>
> >>>>> In the non sparse case, if a VA range is unbound, it is invalid to
> >>>>> use that range for anything until it has been rebound by something
> else.
> >>>>>
> >>>>> We'll take the fence provided by vm_bind and put it as a wait
> >>>>> fence on the next execbuffer.
> >>>>>
> >>>>> It might be safer in case of memory over fetching?
> >>>>>
> >>>>>
> >>>>> TLB flush will have to happen at some point right?
> >>>>>
> >>>>> What's the alternative to do it in unbind?
> >>>>
> >>>> Currently TLB flush happens from the ring before every BB_START and
> >>>> also when i915 returns the backing store pages to the system.
> >>>
> >>>
> >>> Can you explain more why tlb flush when i915 retire the backing
> >>> storage? I
> >> never figured that out when I looked at the codes. As I understand
> >> it, tlb caches the gpu page tables which map a va to a pa. So it is
> >> straight forward to me that we perform a tlb flush when we change the
> >> page table (either at vm bind time or unbind time. Better at unbind time
> for performance reason).
> >>
> >> I don't know what performs better - someone can measure the two
> >> approaches? Certainly on platforms where we only have global TLB
> >> flushing the cost is quite high so my thinking was to allow i915 to
> >> control when it will be done and not guarantee it in the uapi if it isn't
> needed for security reasons.
> >>
> >>> But it is rather tricky to me to flush tlb when we retire a backing
> >>> storage. I
> >> don't see how backing storage can be connected to page table. Let's
> >> say user unbind va1 from pa1, then bind va1 to pa2. Then retire pa1.
> >> Submit shader code using va1. If we don't tlb flush after unbind va1,
> >> the new shader code which is supposed to use pa2 will still use pa1
> >> due to the stale entries in tlb, right? The point is, tlb cached is
> >> tagged with virtual address, not physical address. so after we unbind
> >> va1 from pa1, regardless we retire pa1 or not,
> >> va1 can be bound to another pa2.
> >>
> >> When you say "retire pa1" I will assume you meant release backing
> >> storage for pa1. At this point i915 currently does do the TLB flush
> >> and that ensures no PTE can point to pa1.
> >>
> >> This approach deals with security of the system as a whole. Client
> >> may still cause rendering corruption or a GPU hang for itself but
> >> that should be completely isolated. (This is the part where you say
> >> "regardless if we retire
> >> pa1 or not" I think.)
> >>
> >> But I think those are advanced use cases where userspace wants to
> >> manipulate PTEs while something is running on the GPU in parallel.
> >> AFAIK limited to compute "infinite batch" so my thinking is to avoid
> >> adding a performance penalty to the common case. Especially on
> >> platforms which only have global flush.
> >>
> >> But.. to circle back on the measuring angle. Until someone invests
> >> time and effort to benchmark the two approaches (flush on unbind vs
> >> flush on backing store release) we don't really know. All I know is
> >> the perf hit with the current solution was significant, AFAIR up to
> >> teen digits on some games. And considering the flushes were driven
> >> only by the shrinker activity, my thinking was they would be less
> >> frequent than the unbinds, therefore have the potential for a smaller perf
> hit.
> >>
> >> Regards,
> >>
> >> Tvrtko