Re: [PATCH] drm/i915/gem: Don't try to map and fence large scanout buffers (v3)

"Kasireddy, Vivek" <vivek.kasireddy@xxxxxxxxx> · Mon, 1 Nov 2021 20:52:59 +0000

Hi Tvrtko,

> On 29/10/2021 08:43, Vivek Kasireddy wrote:
> > On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or
> > more framebuffers/scanout buffers results in only one that is mappable/
> > fenceable. Therefore, pageflipping between these 2 FBs where only one
> > is mappable/fenceable creates latencies large enough to miss alternate
> > vblanks thereby producing less optimal framerate.
> >
> > This mainly happens because when i915_gem_object_pin_to_display_plane()
> > is called to pin one of the FB objs, the associated vma is identified
> > as misplaced and therefore i915_vma_unbind() is called which unbinds and
> > evicts it. This misplaced vma gets subseqently pinned only when
> > i915_gem_object_ggtt_pin_ww() is called without the mappable flag. This
> > results in a latency of ~10ms and happens every other vblank/repaint cycle.
> 
> This happens because of PIN_MAPPABLE being tried first just in case? Ie.:
[Kasireddy, Vivek] Yes, that is exactly why this issue happens. I should have
probably described how the patch fixes the problem but the goal with this patch
is to make sure that we return early if we try to pin with PIN_MAPPABLE for 
8K FBs.

> 
> 	vma = ERR_PTR(-ENOSPC);
> 	if ((flags & PIN_MAPPABLE) == 0 &&
> 	    (!view || view->type == I915_GGTT_VIEW_NORMAL))
> 		vma = i915_gem_object_ggtt_pin_ww(obj, ww, view, 0, alignment,
> 						  flags | PIN_MAPPABLE |
> 						  PIN_NONBLOCK);
> 	if (IS_ERR(vma) && vma != ERR_PTR(-EDEADLK))
> 		vma = i915_gem_object_ggtt_pin_ww(obj, ww, view, 0,
> 						  alignment, flags);
> 
> Comment justifying this behaviour talks about ABI:
> 
> 	/*
> 	 * As the user may map the buffer once pinned in the display plane
> 	 * (e.g. libkms for the bootup splash), we have to ensure that we
> 	 * always use map_and_fenceable for all scanout buffers. However,
> 	 * it may simply be too big to fit into mappable, in which case
> 	 * put it anyway and hope that userspace can cope (but always first
> 	 * try to preserve the existing ABI).
> 	 */
> 
> I have no clue how important it is today. So if I ask could we just dropped the "optimistic"
> PIN_MAPPABLE on modern platforms would that be a stupid question?
[Kasireddy, Vivek] Not sure if we could just drop it as we don't know what is out there that
relies on this. I took a quick look at Plymouth and it does seem to still use single buffered
dirtyfb method (or mmap() depending on the plugin) for updating the FB. 

 Because even
> with this patch the change would effectively already happen for 8K boot splash, no? (Since
> code can not know if this is single buffered boot splash client or double buffered
> compositor and the criteria for PIN_MAPPABLE is the same.)
[Kasireddy, Vivek] Maybe it would make sense to limit 8K modes only to a compositor
and not expose them to something like bootup splash/Plymouth? Even then, Gnome with X can
still use single buffering/dirtyfb method to update FB. As I mentioned in the patch description,
we are able to map and fence one 8K FB but not more than that. I think we could let this stand
and only prevent the unbind -- which is the cause of the latency. But this might fall apart if the
userspace tries to mmap both FBs -- which could happen if we enable remote desktop streaming
with Weston. Because there are all these special cases that we cannot know beforehand, I
thought marking all the 8K FBs as non-mappable/non-fenceable would automatically
prevent (or fail) such attempts.

> 
> >
> > Testcase:
> > Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform
> > with a 8K@60 mode results in only ~40 FPS. Since upstream Weston submits
> > a frame ~7ms before the next vblank, the latencies seen between atomic
> > commit and flip event are 7, 24 (7 + 16.66), 7, 24..... suggesting that
> > it misses the vblank every other frame.
> >
> > Here is the ftrace snippet that shows the source of the ~10ms latency:
> >                i915_gem_object_pin_to_display_plane() {
> > 0.102 us   |    i915_gem_object_set_cache_level();
> >                  i915_gem_object_ggtt_pin_ww() {
> > 0.390 us   |      i915_vma_instance();
> > 0.178 us   |      i915_vma_misplaced();
> >                    i915_vma_unbind() {
> >                    __i915_active_wait() {
> > 0.082 us   |        i915_active_acquire_if_busy();
> > 0.475 us   |      }
> >                    intel_runtime_pm_get() {
> > 0.087 us   |        intel_runtime_pm_acquire();
> > 0.259 us   |      }
> >                    __i915_active_wait() {
> > 0.085 us   |        i915_active_acquire_if_busy();
> > 0.240 us   |      }
> >                    __i915_vma_evict() {
> >                      ggtt_unbind_vma() {
> >                        gen8_ggtt_clear_range() {
> > 10507.255 us |        }
> > 10507.689 us |      }
> > 10508.516 us |   }
> >
> > v2: Instead of using bigjoiner checks, determine whether a scanout
> >      buffer is too big by checking to see if it is possible to map
> >      two of them into the ggtt.
> >
> > v3 (Ville):
> > - Count how many fb objects can be fit into the available holes
> >    instead of checking for a hole twice the object size.
> > - Take alignment constraints into account.
> > - Limit this large scanout buffer check to >= Gen 12 platforms.
> >
> > Cc: Ville Syrjälä <ville.syrjala@xxxxxxxxxxxxxxx>
> > Cc: Maarten Lankhorst <maarten.lankhorst@xxxxxxxxxxxxxxx>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx>
> > Cc: Manasi Navare <manasi.d.navare@xxxxxxxxx>
> > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@xxxxxxxxx>
> > ---
> >   drivers/gpu/drm/i915/i915_gem.c | 65 ++++++++++++++++++++++++++++-----
> >   drivers/gpu/drm/i915/i915_vma.c |  2 +-
> >   2 files changed, 57 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> > index 981e383d1a5d..761dc385fbfc 100644
> > --- a/drivers/gpu/drm/i915/i915_gem.c
> > +++ b/drivers/gpu/drm/i915/i915_gem.c
> > @@ -866,6 +866,61 @@ static void discard_ggtt_vma(struct i915_vma *vma)
> >   	spin_unlock(&obj->vma.lock);
> >   }
> >
> > +static bool i915_gem_obj_too_big(struct drm_i915_gem_object *obj,
> > +				 u64 alignment)
> > +{
> > +	struct drm_i915_private *i915 = to_i915(obj->base.dev);
> > +	struct i915_ggtt *ggtt = &i915->ggtt;
> > +	struct drm_mm_node *hole;
> > +	u64 hole_start, hole_end, start, end;
> > +	u64 fence_size, fence_alignment;
> > +	unsigned int count = 0;
> > +
> > +	/*
> > +	 * If the required space is larger than the available
> > +	 * aperture, we will not able to find a slot for the
> > +	 * object and unbinding the object now will be in
> > +	 * vain. Worse, doing so may cause us to ping-pong
> > +	 * the object in and out of the Global GTT and
> > +	 * waste a lot of cycles under the mutex.
> > +	 */
> > +	if (obj->base.size > ggtt->mappable_end)
> > +		return true;
> > +
> > +	if (HAS_GMCH(i915) || DISPLAY_VER(i915) < 11 ||
> > +	    !i915_gem_object_is_framebuffer(obj))
> > +		return false;
> 
> Not sure about these checks - are you sure they are preserving like-for-like?
[Kasireddy, Vivek] I think so; platforms with has_gmch must have PIN_MAPPABLE
and 8K is not available for DISPLAY_VER(i915) < 11. The last check is just
for sanity.

> 
> > +
> > +	fence_size = i915_gem_fence_size(i915, obj->base.size,
> > +					 i915_gem_object_get_tiling(obj),
> > +					 i915_gem_object_get_stride(obj));
> > +	fence_alignment = i915_gem_fence_alignment(i915, obj->base.size,
> > +					 i915_gem_object_get_tiling(obj),
> > +					 i915_gem_object_get_stride(obj));
> > +	alignment = max_t(u64, alignment, fence_alignment);
> > +
> > +	/*
> > +	 * Assuming this object is a large scanout buffer, we try to find
> > +	 * out if there is room to map at-least two of them. There could
> > +	 * be space available to map one but to be consistent, we try to
> > +	 * avoid mapping/fencing any of them.
> > +	 */
> > +	drm_mm_for_each_hole(hole, &ggtt->vm.mm, hole_start, hole_end) {
> > +		do {
> > +			start = round_up(hole_start, alignment);
> > +			end = min_t(u64, hole_end, ggtt->mappable_end);
> > +
> > +			if (range_overflows(start, fence_size, end))
> > +				break;
> > +
> > +			count++;
> 
> If this solution remains, you could break out when count reaches two.
[Kasireddy, Vivek] Yes, I was going to do that in v4. 

Thanks,
Vivek
> 
> Regards,
> 
> Tvrtko
> 
> > +			hole_start = start + fence_size;
> > +		} while (1);
> > +	}
> > +
> > +	return count < 2;
> > +}
> > +
> >   struct i915_vma *
> >   i915_gem_object_ggtt_pin_ww(struct drm_i915_gem_object *obj,
> >   			    struct i915_gem_ww_ctx *ww,
> > @@ -879,15 +934,7 @@ i915_gem_object_ggtt_pin_ww(struct drm_i915_gem_object
> *obj,
> >
> >   	if (flags & PIN_MAPPABLE &&
> >   	    (!view || view->type == I915_GGTT_VIEW_NORMAL)) {
> > -		/*
> > -		 * If the required space is larger than the available
> > -		 * aperture, we will not able to find a slot for the
> > -		 * object and unbinding the object now will be in
> > -		 * vain. Worse, doing so may cause us to ping-pong
> > -		 * the object in and out of the Global GTT and
> > -		 * waste a lot of cycles under the mutex.
> > -		 */
> > -		if (obj->base.size > ggtt->mappable_end)
> > +		if (i915_gem_obj_too_big(obj, alignment))
> >   			return ERR_PTR(-E2BIG);
> >
> >   		/*
> > diff --git a/drivers/gpu/drm/i915/i915_vma.c b/drivers/gpu/drm/i915/i915_vma.c
> > index 90546fa58fc1..551644dbfa8a 100644
> > --- a/drivers/gpu/drm/i915/i915_vma.c
> > +++ b/drivers/gpu/drm/i915/i915_vma.c
> > @@ -977,7 +977,7 @@ int i915_vma_pin_ww(struct i915_vma *vma, struct
> i915_gem_ww_ctx *ww,
> >   		if (err)
> >   			goto err_active;
> >
> > -		if (i915_is_ggtt(vma->vm))
> > +		if (i915_is_ggtt(vma->vm) && flags & PIN_MAPPABLE)
> >   			__i915_vma_set_map_and_fenceable(vma);
> >   	}
> >
> >