Hi Tvrtko, > > On 16/03/2022 07:37, Kasireddy, Vivek wrote: > > Hi Tvrtko, > > > >> > >> On 15/03/2022 07:28, Kasireddy, Vivek wrote: > >>> Hi Tvrtko, Daniel, > >>> > >>>> > >>>> On 11/03/2022 09:39, Daniel Vetter wrote: > >>>>> On Mon, 7 Mar 2022 at 21:38, Vivek Kasireddy <vivek.kasireddy@xxxxxxxxx> > wrote: > >>>>>> > >>>>>> On platforms capable of allowing 8K (7680 x 4320) modes, pinning 2 or > >>>>>> more framebuffers/scanout buffers results in only one that is mappable/ > >>>>>> fenceable. Therefore, pageflipping between these 2 FBs where only one > >>>>>> is mappable/fenceable creates latencies large enough to miss alternate > >>>>>> vblanks thereby producing less optimal framerate. > >>>>>> > >>>>>> This mainly happens because when i915_gem_object_pin_to_display_plane() > >>>>>> is called to pin one of the FB objs, the associated vma is identified > >>>>>> as misplaced and therefore i915_vma_unbind() is called which unbinds and > >>>>>> evicts it. This misplaced vma gets subseqently pinned only when > >>>>>> i915_gem_object_ggtt_pin_ww() is called without PIN_MAPPABLE. This > >>>>>> results in a latency of ~10ms and happens every other vblank/repaint cycle. > >>>>>> Therefore, to fix this issue, we try to see if there is space to map > >>>>>> at-least two objects of a given size and return early if there isn't. This > >>>>>> would ensure that we do not try with PIN_MAPPABLE for any objects that > >>>>>> are too big to map thereby preventing unncessary unbind. > >>>>>> > >>>>>> Testcase: > >>>>>> Running Weston and weston-simple-egl on an Alderlake_S (ADLS) platform > >>>>>> with a 8K@60 mode results in only ~40 FPS. Since upstream Weston submits > >>>>>> a frame ~7ms before the next vblank, the latencies seen between atomic > >>>>>> commit and flip event are 7, 24 (7 + 16.66), 7, 24..... suggesting that > >>>>>> it misses the vblank every other frame. > >>>>>> > >>>>>> Here is the ftrace snippet that shows the source of the ~10ms latency: > >>>>>> i915_gem_object_pin_to_display_plane() { > >>>>>> 0.102 us | i915_gem_object_set_cache_level(); > >>>>>> i915_gem_object_ggtt_pin_ww() { > >>>>>> 0.390 us | i915_vma_instance(); > >>>>>> 0.178 us | i915_vma_misplaced(); > >>>>>> i915_vma_unbind() { > >>>>>> __i915_active_wait() { > >>>>>> 0.082 us | i915_active_acquire_if_busy(); > >>>>>> 0.475 us | } > >>>>>> intel_runtime_pm_get() { > >>>>>> 0.087 us | intel_runtime_pm_acquire(); > >>>>>> 0.259 us | } > >>>>>> __i915_active_wait() { > >>>>>> 0.085 us | i915_active_acquire_if_busy(); > >>>>>> 0.240 us | } > >>>>>> __i915_vma_evict() { > >>>>>> ggtt_unbind_vma() { > >>>>>> gen8_ggtt_clear_range() { > >>>>>> 10507.255 us | } > >>>>>> 10507.689 us | } > >>>>>> 10508.516 us | } > >>>>>> > >>>>>> v2: Instead of using bigjoiner checks, determine whether a scanout > >>>>>> buffer is too big by checking to see if it is possible to map > >>>>>> two of them into the ggtt. > >>>>>> > >>>>>> v3 (Ville): > >>>>>> - Count how many fb objects can be fit into the available holes > >>>>>> instead of checking for a hole twice the object size. > >>>>>> - Take alignment constraints into account. > >>>>>> - Limit this large scanout buffer check to >= Gen 11 platforms. > >>>>>> > >>>>>> v4: > >>>>>> - Remove existing heuristic that checks just for size. (Ville) > >>>>>> - Return early if we find space to map at-least two objects. (Tvrtko) > >>>>>> - Slightly update the commit message. > >>>>>> > >>>>>> v5: (Tvrtko) > >>>>>> - Rename the function to indicate that the object may be too big to > >>>>>> map into the aperture. > >>>>>> - Account for guard pages while calculating the total size required > >>>>>> for the object. > >>>>>> - Do not subject all objects to the heuristic check and instead > >>>>>> consider objects only of a certain size. > >>>>>> - Do the hole walk using the rbtree. > >>>>>> - Preserve the existing PIN_NONBLOCK logic. > >>>>>> - Drop the PIN_MAPPABLE check while pinning the VMA. > >>>>>> > >>>>>> v6: (Tvrtko) > >>>>>> - Return 0 on success and the specific error code on failure to > >>>>>> preserve the existing behavior. > >>>>>> > >>>>>> v7: (Ville) > >>>>>> - Drop the HAS_GMCH(i915), DISPLAY_VER(i915) < 11 and > >>>>>> size < ggtt->mappable_end / 4 checks. > >>>>>> - Drop the redundant check that is based on previous heuristic. > >>>>>> > >>>>>> v8: > >>>>>> - Make sure that we are holding the mutex associated with ggtt vm > >>>>>> as we traverse the hole nodes. > >>>>>> > >>>>>> v9: (Tvrtko) > >>>>>> - Use mutex_lock_interruptible_nested() instead of mutex_lock(). > >>>>>> > >>>>>> Cc: Ville Syrjälä <ville.syrjala@xxxxxxxxxxxxxxx> > >>>>>> Cc: Maarten Lankhorst <maarten.lankhorst@xxxxxxxxxxxxxxx> > >>>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxxxxxxxx> > >>>>>> Cc: Manasi Navare <manasi.d.navare@xxxxxxxxx> > >>>>>> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@xxxxxxxxx> > >>>>>> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@xxxxxxxxx> > >>>>>> --- > >>>>>> drivers/gpu/drm/i915/i915_gem.c | 128 +++++++++++++++++++++++-------- > - > >>>>>> 1 file changed, 94 insertions(+), 34 deletions(-) > >>>>>> > >>>>>> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c > >>>>>> index 9747924cc57b..e0d731b3f215 100644 > >>>>>> --- a/drivers/gpu/drm/i915/i915_gem.c > >>>>>> +++ b/drivers/gpu/drm/i915/i915_gem.c > >>>>>> @@ -49,6 +49,7 @@ > >>>>>> #include "gem/i915_gem_pm.h" > >>>>>> #include "gem/i915_gem_region.h" > >>>>>> #include "gem/i915_gem_userptr.h" > >>>>>> +#include "gem/i915_gem_tiling.h" > >>>>>> #include "gt/intel_engine_user.h" > >>>>>> #include "gt/intel_gt.h" > >>>>>> #include "gt/intel_gt_pm.h" > >>>>>> @@ -882,6 +883,96 @@ static void discard_ggtt_vma(struct i915_vma *vma) > >>>>>> spin_unlock(&obj->vma.lock); > >>>>>> } > >>>>>> > >>>>>> +static int > >>>>>> +i915_gem_object_fits_in_aperture(struct drm_i915_gem_object *obj, > >>>>>> + u64 alignment, u64 flags) > >>>>> > >>>>> Tvrtko asked me to ack the first patch, but then I looked at this and > >>>>> started wondering. > >>>>> > >>>>> Conceptually this doesn't pass the smell test. What if we have > >>>>> multiple per-crtc buffers? Multiple planes on the same crtc? What if > >>>>> the app does triple buffer? You'll be forever busy tuning this > >>>>> heuristics, which can't fundamentally be fixed I think. The old "half > >>>>> of mappable" heuristic isn't really better, but at least it was dead > >>>>> simple. > >>>>> > >>>>> Imo what we need here is a change in approach: > >>>>> 1. Check whether the useable view for scanout exists already. If yes, > >>>>> use that. This should avoid the constant unbinding stalls. > >>>>> 2. Try to in buffer to mappabley, but without evicting anything (so > >>>>> not the non-blocking thing) > >>>>> 3. Pin the buffer with the most lenient approach > >>>>> > >>>>> Even the non-blocking interim stage is dangerous, since it'll just > >>>>> result in other buffers (e.g. when triple-buffering) getting unbound > >>>>> and we're back to the same stall. Note that this could have an impact > >>>>> on cpu rendering compositors, where we might end up relying a lot more > >>>>> partial views. But as long as we are a tad more aggressive (i.e. the > >>>>> non-blocking binding) in the mmap path that should work out to keep > >>>>> everything balanced, since usually you render first before you display > >>>>> anything. And so the buffer should end up in the ideal place. > >>>>> > >>>>> I'd try to first skip the 2. step since I think it'll require a bit of > >>>>> work, and frankly I don't think we care about the potential fallout. > >>>> > >>>> To be sure I understand, you propose to stop trying to pin mappable by default. Ie. > stop > >>>> respecting this comment from i915_gem_object_pin_to_display_plane: > >>>> > >>>> /* > >>>> * As the user may map the buffer once pinned in the display plane > >>>> * (e.g. libkms for the bootup splash), we have to ensure that we > >>>> * always use map_and_fenceable for all scanout buffers. However, > >>>> * it may simply be too big to fit into mappable, in which case > >>>> * put it anyway and hope that userspace can cope (but always first > >>>> * try to preserve the existing ABI). > >>>> */ > >>> [Kasireddy, Vivek] Digging further, this is what the commit message that added > >>> the above comment says: > >>> commit 2efb813d5388e18255c54afac77bd91acd586908 > >>> Author: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> > >>> Date: Thu Aug 18 17:17:06 2016 +0100 > >>> > >>> drm/i915: Fallback to using unmappable memory for scanout > >>> > >>> The existing ABI says that scanouts are pinned into the mappable region > >>> so that legacy clients (e.g. old Xorg or plymouthd) can write directly > >>> into the scanout through a GTT mapping. However if the surface does not > >>> fit into the mappable region, we are better off just trying to fit it > >>> anywhere and hoping for the best. (Any userspace that is capable of > >>> using ginormous scanouts is also likely not to rely on pure GTT > >>> updates.) With the partial vma fault support, we are no longer > >>> restricted to only using scanouts that we can pin (though it is still > >>> preferred for performance reasons and for powersaving features like > >>> FBC). > >>> > >>>> > >>>> By a quick look, for this case it appears we would end up creating partial views for > >> CPU > >>>> access (since the normal mapping would be busy/unpinnable). Worst case for this is > to > >>>> create a bunch of 1MiB VMAs so something to check would be how long those > persist > >> in > >>>> memory before they get released. Or perhaps the bootup splash use case is not > common > >>>> these days? > >>> [Kasireddy, Vivek] AFAIK, Plymouth is still the default bootup splash service on > Fedora, > >>> Ubuntu and most other distributions. And, I took a quick look at it and IIUC, it > >> (Plymouth's > >>> drm plugin) seems to create a dumb FB, mmap and update it via the dirty_fb ioctl. > This > >>> would not to be a problem on ADL-S where there is space in mappable for one 8K FB. > >>> > >> > >> FBC is a good point - correct me if I am wrong, but if we dropped trying > >> to map in aperture by default it looks like we would lose it and that > >> would be a significant power regression. In which case it doesn't seem > >> like that would be an option. > > [Kasireddy, Vivek] Ok, makes sense. > > > >> > >> Which I think leaves us with _some_ heuristics in any case. > >> > >> 1) N-holes heuristics. > >> > >> 2) Don't ever try PIN_MAPPABLE for framebuffers larger than some > >> percentage of aperture. > >> > >> Could this solve the 8k issue, most of the time, maybe? Could the > >> current "aperture / 2" test be expressed generically in some terms? Like > >> "(aperture - 10% (or some absolute value)) / 2" to account for non-fb > >> objects? I forgot what you said the relationship between aperture size > >> and 8k fb size was. > >> > >> 3) Don't evict for PIN_MAPPABLE mismatches when > >> i915_gem_object_ggtt_pin_ww->i915_vma_misplaced is called on behalf of > >> i915_gem_object_pin_to_display_plane. Assumption being if we ended up > >> with a non-mappable fb to start with, we must not try to re-bind it or > >> we risk ping-pong latencies. > >> > >> The last would I guess need to distinguish between PIN_MAPPABLE passed > >> in versus opportunistically added by i915_gem_object_pin_to_display_plane. > >> > >> How intrusive would it be to implement this option I am not sure without > >> trying myself. > > [Kasireddy, Vivek] I suspect I might be missing something, but could it not be > > as simple as below: > > @@ -940,7 +940,8 @@ i915_gem_object_ggtt_pin_ww(struct drm_i915_gem_object > *obj, > > return ERR_PTR(-ENOSPC); > > > > if (flags & PIN_MAPPABLE && > > - vma->fence_size > ggtt->mappable_end / 2) > > + (vma->fence_size > ggtt->mappable_end / 2 || > > + !i915_vma_is_map_and_fenceable(vma))) > > return ERR_PTR(-ENOSPC); > > } > > Looks like this would work... > > >> > >>> Given this, do you think it would work if we just preserve the existing behavior and > >>> tweak the heuristic introduced in this patch to look for space in aperture for only > >>> one FB instead of two? Or, is there no good option for solving this issue other than > >>> to create 1MB VMAs? > >> > >> I did not get how having one hole would solve the issue. Wouldn't it > >> still hit the re-bind ping-pong? Or there isn't even a single hole for > >> 8k fb typically? > > [Kasireddy, Vivek] IIUC, Mesa gives Weston a max of 4 backbuffers but it > > almost always uses only 2 except when it needs to share the FB -- with a plugin > > such as "remoting" for desktop streaming. > > Given the common use-case, lets assume there are two 8K FBs: FB1 and FB2 > > FB1 is mappable/fenceable and therefore not misplaced. > > FB2 is NOT mappable and hence identified as misplaced > > (because it fails the check > > (flags & PIN_MAPPABLE && !i915_vma_is_map_and_fenceable(vma)) > > > > As you suggest in 3) above the goal is to ensure that FB2 does not get evicted > > when we try to pin with PIN_MAPABBLE -- after it gets identified as misplaced. > > Or, alternatively, when we pin with PIN_MAPABBLE, we could just check to > > see if there is space in aperture for only FB2 (N = 1) and return early -- before > > even getting to i915_vma_misplaced(). As you can see, we avoid the ping-pong > > issue in both these cases. > > ... got it, yes, it seems both approaches works for this use case. > > Not sure that I have a preference between the two approaches at this point. > > Both would be behind a "PIN_MAPPABLE && PIN_NONBLOCK" check, so both > would only apply to opportunistic PIN_MAPPABLE attempts. That is, any > caller who only passes PIN_MAPPABLE would be unaffected which is what we > want. > > The extra i915_vma_is_map_and_fenceable check I guess is simpler and > self-contained. I assume you have a test setup and can try it out to > check it really works? [Kasireddy, Vivek] Yes, it works; my testcase just involves running Weston with a mode of 8K@60 on ADL-S and checking the FPS of the sample client weston-simple-egl. With the fix included, the perf improves to 59 FPS from 40 FPS. I'll send out a new patch for review soon. Oh, btw, do you think it is now pointless to merge the drm/mm patch that adds the iterator given that we'd no longer have the i915 patch that uses it anymore? Thanks, Vivek > > > The current version of this patch -- when running Weston -- puts both FB1 > > and FB2 (N = 2) outside of aperture although there may be space for FB1. > > I don't think this makes sense anymore given Plymouth's single-buffer > > use-case that uses dirtyfb ioctl. > > Yes agreed, it sounds preferable to preserve the current behaviour there. > > Regards, > > Tvrtko