Re: Ã¢ Fi.CI.BAT: failure for drm/i915: (stolen) memory region related fixes

Ville Syrjälä <ville.syrjala@xxxxxxxxxxxxxxx> · Fri, 15 Dec 2023 10:23:39 +0200

On Fri, Dec 15, 2023 at 01:01:39AM +0200, Ville Syrjälä wrote:
> On Thu, Dec 14, 2023 at 12:06:58PM +0100, Andrzej Hajda wrote:
> > 
> > 
> > On 13.12.2023 17:13, Andrzej Hajda wrote:
> > > On 13.12.2023 02:37, Patchwork wrote:
> > >> *Patch Details*
> > >> *Series:*    drm/i915: (stolen) memory region related fixes
> > >> *URL:*    https://patchwork.freedesktop.org/series/127721/ 
> > >> <https://patchwork.freedesktop.org/series/127721/>
> > >> *State:*    failure
> > >> *Details:* 
> > >> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_127721v1/index.html 
> > >> <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_127721v1/index.html>
> > >>
> > >>
> > >>   CI Bug Log - changes from CI_DRM_14010 -> Patchwork_127721v1
> > >>
> > >>
> > >>     Summary
> > >>
> > >> *FAILURE*
> > >>
> > >> Serious unknown changes coming with Patchwork_127721v1 absolutely need 
> > >> to be
> > >> verified manually.
> > >>
> > >> If you think the reported changes have nothing to do with the changes
> > >> introduced in Patchwork_127721v1, please notify your bug team 
> > >> (I915-ci-infra@xxxxxxxxxxxxxxxxxxxxx) to allow them
> > >> to document this new failure mode, which will reduce false positives 
> > >> in CI.
> > >>
> > >> External URL: 
> > >> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_127721v1/index.html
> > >>
> > >>
> > >>     Participating hosts (31 -> 33)
> > >>
> > >> Additional (4): bat-dg2-8 bat-dg2-9 bat-mtlp-8 fi-pnv-d510
> > >> Missing (2): bat-adlp-11 fi-snb-2520m
> > >>
> > >>
> > >>     Possible new issues
> > >>
> > >> Here are the unknown changes that may have been introduced in 
> > >> Patchwork_127721v1:
> > >>
> > >>
> > >>       IGT changes
> > >>
> > >>
> > >>         Possible regressions
> > >>
> > >>   *
> > >>
> > >>     igt@i915_module_load@load:
> > >>
> > >>       o bat-mtlp-8: NOTRUN -> INCOMPLETE
> > >>         
> > >> <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_127721v1/bat-mtlp-8/igt@i915_module_load@xxxxxxxxx>
> > > 
> > > 
> > > This is due to overlapping initial fb's vma with GuC reserved area on 
> > > MTL, see ggtt_reserve_guc_top.
> > > 
> > > My dirty quick_fix proposed:
> > > @@ -143,6 +143,9 @@ initial_plane_vma(struct drm_i915_private *i915,
> > >          if (IS_ERR(vma))
> > >                  goto err_obj;
> > > 
> > > +       if (base + size > GUC_GGTT_TOP)
> > > +               base = min(base, GUC_GGTT_TOP) - size;
> > > +
> > >          pinctl = PIN_GLOBAL | PIN_OFFSET_FIXED | base;
> > >          if (HAS_GMCH(i915))
> > >                  pinctl |= PIN_MAPPABLE;
> > 
> > 
> > Copy/Paste Ville response for this proposition from another thread:
> > 
> > On 13.12.2023 17:03, Ville Syrjälä wrote:
> >  >
> >  > This is not a solution. The correct fix is either:
> >  > 1. fix the guc code to not insist on a fixed chunk of the ggtt
> >  >    and instead allocate it normally like a good boy. It could
> >  >    still try to allocate as high as possible to ideally
> >  >    land in the GUC_GGTT_TOP range
> > 
> > This would be the best solution from initial plane PoV for sure, I am 
> > not sure about GuC PoV.
> 
> I was going to submit a patch to simply change this to insert(PIN_HIGH)
> as that is all that is needed from the GuC FW POV. And whether the
> GOP framebuffer or the GuC FW lives there doesn't matter one bit,
> both fit the bill of not needing to be accessed by the GuC.
> 
> But I suspect this function might be serving a double duty, despite
> not saying so anywhere. Basically I think it's perhaps used as
> a quick and dirty way to restrict everything below GUC_GGTT_TOP,
> presumably because we don't otherwise restrict allocations
> below that even if the GuC needs to access them.
> 
> So the insert(PIN_HIGH) approach could result in some troubles
> iff the GOP framebuffer is preventing the uc_fw to be placed at
> GUC_GGTT_TOP and we later free the GOP framebuffer (happens if
> we find it unsuitable due to one reason or another). This opens
> up a hole above GUC_GGTT_TOP again and something else could
> get bound there.
> 
> Is my assumption about ggtt_reserve_guc_top() correct? And
> if so do we know what are all the things the GuC needs to access
> so that we could properly restrict those below GUC_GGTT_TOP
> and get rid of this weird way of achieving that same result?
> 
> > 
> >  >
> >  > 2. relocate the display vma to a different part of the ggtt
> > 
> > I think this point is what I have proposed.
> > 
> >  >
> >  >
> >  > 1. should be far simpler by the looks of it, as 2. would involve:
> >  > a) pinning the same object into two places in the ggtt simultanously
> >  >    I don't think we have support for that except for special ggtt views,
> >  >    and xe doesn't have even that (should we need this trick there as 
> > well)
> > 
> > AFAIU the fb is not yet pinned in terms of kernel structures, so it 
> > doesn't seems to be an issue.
> 
> Hmm. Yeah, I suppose we don't really care where in the ggtt the thing
> is bound as we rewrite the PTEs anyway when we bind the vma. So
> whether we bind it into its current place or not doesn't really
> matter. Well, assuming we move the thing to a non-overlapping section
> of the ggtt. Any overlap with the current positon would clobber the
> currently used PTEs and cause display corruption.
> 
> If we did try this we'd also have to think where to move it. Due
> the requirement of no overlaps I suppose offset 0 might be the most
> obvious choice. That would also avoid permanently fragmentating
> the ggtt.

OK, so I've gone and implemnted this, and what I did was to just
pin it low. That part is sort of OK, but then I noticed that
something had already allocated two pages at offset 0. And to my
surprise that was some hdcp stuff. So I then changed the hdcp code
to pin high instead (since there's no point in stuffing what is a
normal shmem object into the precious mappable portion of ggtt).
But then I realized that would now interfere with the guc thing
again. I then proceeded to move the guc thing into the pre-display
ggtt_init_hw() call, but then that's going to interfere with my
BIOS framebuffer takeover again since I temporarily reserve the
original range of the framebuffer so that we don't accidentally
bind the the new range over the original range.

So now I need to either get rid of that guc thing for real,
or I need to keep ggtt_reserve_guc_top() where is is right
now and figure out how to defer that hdcp stuff to a much
later point in the driver init.

Additonally ggtt->error_capture would prefer to get pinned to
offet 0 on platforms with mappable. I don't think that's going
to happen because the BIOS framebuffer is likely there. I guess
one option would be to attempt to move the BIOS fb to the end
of mappable. But currenlty PIN_MAPPABLE implies DRM_MM_INSERT_LOW
so I'd need to redesign the pin flags. So what likely happens
now is the BIOS of gets offset 0, and error_capture gets pinned
right after it. But that's also quite terrible in cases where
we later get rid of the BIOS fb (likely to happen when booting
with multiple different sized displays), as then we've managed
to permanently fragment the mappable region into two chunks.
So I think the easiest solution for that is to move error_capture
to the end of mappable, assuming we failed to bind it to offset 0.

Who knew changing one lightbulb could be this complicated...

> 
> Though I suppose one has to wonder why the GOP doesn't just use
> offset 0 anyway. Is there perhaps some other weird requirement
> that needs the bottom end of ggtt to be available for something
> else?
> 
> > Of course there is problem with PLANE_SURF still pointing to old VA, it 
> > should be replaced with new VA before ggtt will be populated with new stuff.
> > 
> >  >
> >  > b) flip the plane(s) over to the relocated vma
> >  > c) wait for vblank(s)
> >  > d) discard the original vma
> >  > e) all of that would need to happen pretty early so we may not have
> >  >    a lot of the required normal machinery fully up and running yet
> > 
> > Async update to PLANE_SURF shouldn't be enough?
> 
> Async flips:
> - Can't be assume to be universally available
> - Don't really help anyway as they don't happen instaneously.
>   Ie. you still need to wait for them to have happened
>   before clobbering old the PTEs.
> 
> -- 
> Ville Syrjälä
> Intel

-- 
Ville Syrjälä
Intel