Re: [PATCH v2] drm/xe: improve hibernation on igpu

Matthew Brost <matthew.brost@xxxxxxxxx> · Tue, 5 Nov 2024 11:26:50 -0800

On Tue, Nov 05, 2024 at 01:18:27PM -0600, Lucas De Marchi wrote:
> On Tue, Nov 05, 2024 at 10:12:24AM -0800, Matthew Brost wrote:
> > On Tue, Nov 05, 2024 at 11:32:37AM -0600, Lucas De Marchi wrote:
> > > On Fri, Nov 01, 2024 at 12:16:19PM -0700, Matthew Brost wrote:
> > > > On Fri, Nov 01, 2024 at 12:38:19PM -0500, Lucas De Marchi wrote:
> > > > > On Fri, Nov 01, 2024 at 05:01:57PM +0000, Matthew Auld wrote:
> > > > > > The GGTT looks to be stored inside stolen memory on igpu which is not
> > > > > > treated as normal RAM.  The core kernel skips this memory range when
> > > > > > creating the hibernation image, therefore when coming back from
> > > > >
> > > > > can you add the log for e820 mapping to confirm?
> > > > >
> > > > > > hibernation the GGTT programming is lost. This seems to cause issues
> > > > > > with broken resume where GuC FW fails to load:
> > > > > >
> > > > > > [drm] *ERROR* GT0: load failed: status = 0x400000A0, time = 10ms, freq = 1250MHz (req 1300MHz), done = -1
> > > > > > [drm] *ERROR* GT0: load failed: status: Reset = 0, BootROM = 0x50, UKernel = 0x00, MIA = 0x00, Auth = 0x01
> > > > > > [drm] *ERROR* GT0: firmware signature verification failed
> > > > > > [drm] *ERROR* CRITICAL: Xe has declared device 0000:00:02.0 as wedged.
> > > > >
> > > > > it seems the message above is cut short. Just above these lines don't
> > > > > you have a log with __xe_guc_upload? Which means: we actually upload the
> > > > > firmware again to stolen and it doesn't matter that we lost it when
> > > > > hibernating.
> > > > >
> > > >
> > > > The image is always uploaded. The upload logic uses a GGTT address to
> > > > find firmware image in SRAM...
> > > >
> > > > See snippet from uc_fw_xfer:
> > > >
> > > > 	/* Set the source address for the uCode */
> > > > 	src_offset = uc_fw_ggtt_offset(uc_fw) + uc_fw->css_offset;
> > > > 	xe_mmio_write32(mmio, DMA_ADDR_0_LOW, lower_32_bits(src_offset));
> > > > 	xe_mmio_write32(mmio, DMA_ADDR_0_HIGH,
> > > > 			upper_32_bits(src_offset) | DMA_ADDRESS_SPACE_GGTT);
> > > >
> > > > If the GGTT mappings are in stolen and not restored we will not be
> > > > uploading the correct data for the image.
> > > >
> > > > See the gitlab issue, this has been confirmed to fix a real problem from
> > > > a customer.
> > > 
> > > I don't doubt it fixes it, but the justification here is not making much
> > > sense.  AFAICS it doesn't really correspond to what the patch is doing.
> > > 
> > > >
> > > > Matt
> > > >
> > > > > It'd be good to know the size of the rsa key in the failing scenarios.
> > > > >
> > > > > Also it seems this is also reproduced in DG2 and I wonder if it's the
> > > > > same issue or something different:
> > > > >
> > > > > 	[drm:__xe_guc_upload.isra.0 [xe]] GT0: load still in progress, timeouts = 0, freq = 1700MHz (req 2050MHz), status = 0x00000064 [0x32/00]
> > > > > 	[drm:__xe_guc_upload.isra.0 [xe]] GT0: load still in progress, timeouts = 0, freq = 1700MHz (req 2050MHz), status = 0x00000072 [0x39/00]
> > > > > 	[drm:__xe_guc_upload.isra.0 [xe]] GT0: load still in progress, timeouts = 0, freq = 1700MHz (req 2050MHz), status = 0x00000086 [0x43/00]
> > > > > 	[drm] *ERROR* GT0: load failed: status = 0x400000A0, time = 5ms, freq = 1700MHz (req 2050MHz), done = -1
> > > > > 	[drm] *ERROR* GT0: load failed: status: Reset = 0, BootROM = 0x50, UKernel = 0x00, MIA = 0x00, Auth = 0x01
> > > > > 	[drm] *ERROR* GT0: firmware signature verification failed
> > > > >
> > > > > Cc Ulisses.
> > > > >
> > > > > >
> > > > > > Current GGTT users are kernel internal and tracked as pinned, so it
> > > > > > should be possible to hook into the existing save/restore logic that we
> > > > > > use for dgpu, where the actual evict is skipped but on restore we
> > > > > > importantly restore the GGTT programming.  This has been confirmed to
> > > > > > fix hibernation on at least ADL and MTL, though likely all igpu
> > > > > > platforms are affected.
> > > > > >
> > > > > > This also means we have a hole in our testing, where the existing s4
> > > > > > tests only really test the driver hooks, and don't go as far as actually
> > > > > > rebooting and restoring from the hibernation image and in turn powering
> > > > > > down RAM (and therefore losing the contents of stolen).
> > > > >
> > > > > yeah, the problem is that when we enable it to go through the entire
> > > > > sequence we reproduce all kinds of issues in other parts of the kernel and
> > > > > userspace env, leading to flaky tests that are usually red in CI. The most
> > > > > annoying one is the network not coming back, so we mark the test as a
> > > > > failure (actually an abort, since we stop running everything).
> > > > >
> > > > >
> > > > > >
> > > > > > v2 (Brost)
> > > > > > - Remove extra newline and drop unnecessary parentheses.
> > > > > >
> > > > > > Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
> > > > > > Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3275
> > > > > > Signed-off-by: Matthew Auld <matthew.auld@xxxxxxxxx>
> > > > > > Cc: Matthew Brost <matthew.brost@xxxxxxxxx>
> > > > > > Cc: <stable@xxxxxxxxxxxxxxx> # v6.8+
> > > > > > Reviewed-by: Matthew Brost <matthew.brost@xxxxxxxxx>
> > > > > > ---
> > > > > > drivers/gpu/drm/xe/xe_bo.c       | 37 ++++++++++++++------------------
> > > > > > drivers/gpu/drm/xe/xe_bo_evict.c |  6 ------
> > > > > > 2 files changed, 16 insertions(+), 27 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > index 8286cbc23721..549866da5cd1 100644
> > > > > > --- a/drivers/gpu/drm/xe/xe_bo.c
> > > > > > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > > > > > @@ -952,7 +952,10 @@ int xe_bo_restore_pinned(struct xe_bo *bo)
> > > > > > 	if (WARN_ON(!xe_bo_is_pinned(bo)))
> > > > > > 		return -EINVAL;
> > > > > >
> > > > > > -	if (WARN_ON(xe_bo_is_vram(bo) || !bo->ttm.ttm))
> > > > > > +	if (WARN_ON(xe_bo_is_vram(bo)))
> > > > > > +		return -EINVAL;
> > > > > > +
> > > > > > +	if (WARN_ON(!bo->ttm.ttm && !xe_bo_is_stolen(bo)))
> > > > > > 		return -EINVAL;
> > > > > >
> > > > > > 	if (!mem_type_is_vram(place->mem_type))
> > > > > > @@ -1774,6 +1777,7 @@ int xe_bo_pin_external(struct xe_bo *bo)
> > > > > >
> > > > > > int xe_bo_pin(struct xe_bo *bo)
> > > > > > {
> > > > > > +	struct ttm_place *place = &bo->placements[0];
> > > > > > 	struct xe_device *xe = xe_bo_device(bo);
> > > > > > 	int err;
> > > > > >
> > > > > > @@ -1804,8 +1808,6 @@ int xe_bo_pin(struct xe_bo *bo)
> > > > > > 	 */
> > > > > > 	if (IS_DGFX(xe) && !(IS_ENABLED(CONFIG_DRM_XE_DEBUG) &&
> > > > > > 	    bo->flags & XE_BO_FLAG_INTERNAL_TEST)) {
> > > > > > -		struct ttm_place *place = &(bo->placements[0]);
> > > > > > -
> > > > > > 		if (mem_type_is_vram(place->mem_type)) {
> > > > > > 			xe_assert(xe, place->flags & TTM_PL_FLAG_CONTIGUOUS);
> > > > > >
> > > > > > @@ -1813,13 +1815,12 @@ int xe_bo_pin(struct xe_bo *bo)
> > > > > > 				       vram_region_gpu_offset(bo->ttm.resource)) >> PAGE_SHIFT;
> > > > > > 			place->lpfn = place->fpfn + (bo->size >> PAGE_SHIFT);
> > > > > > 		}
> > > > > > +	}
> > > > > >
> > > > > > -		if (mem_type_is_vram(place->mem_type) ||
> > > > > > -		    bo->flags & XE_BO_FLAG_GGTT) {
> > > > > > -			spin_lock(&xe->pinned.lock);
> > > > > > -			list_add_tail(&bo->pinned_link, &xe->pinned.kernel_bo_present);
> > > > > > -			spin_unlock(&xe->pinned.lock);
> > > > > > -		}
> > > > > > +	if (mem_type_is_vram(place->mem_type) || bo->flags & XE_BO_FLAG_GGTT) {
> > > 
> > > 
> > > again... why do you say we are restoring the GGTT itself? this seems
> > > rather to allow pinning and then restoring anything that has
> > > the XE_BO_FLAG_GGTT - that's any BO that uses the GGTT, not the GGTT.
> > > 
> > 
> > I think what you are saying is right - the patch restores every BO's
> > GGTT mappings rather than restoring the entire contents of the GGTT.
> > 
> > This might be a larger problem then, as I think the scratch GGTT entries
> > will not be restored - this is a problem for both igpu and dgfx devices.
> > 
> > This patch should help but is not complete.
> > 
> > I think we need a follow up to either...
> > 
> > 1. Set up all scratch pages in the GGTT prior to calling
> > xe_bo_restore_kernel and use this flow to restore individual BOs' GGTT
> > mappings.
> 
> yes, but for BOs already in system memory we don't need this flow - we
> only need them to be mapped again.
> 

Right. xe_bo_restore_pinned short-circuits when a BO is not in VRAM. We could
move that check out into xe_bo_restore_kernel, though, to avoid grabbing a system
BO's dma-resv lock. In either the VRAM or the system case, xe_ggtt_map_bo is
called.
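
To make the idea concrete, here is a rough userspace sketch of the hoisted
check. It is a toy model, not xe code: every toy_* name is made up for
illustration, locks_taken merely stands in for taking the dma-resv lock, and
it assumes the GGTT remap itself does not need that lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names exist in the xe driver. */
enum toy_placement { TOY_SYSTEM, TOY_VRAM };

struct toy_bo {
	enum toy_placement placement;
	bool needs_ggtt;	/* stands in for XE_BO_FLAG_GGTT */
	bool ggtt_mapped;	/* GGTT programming, lost across hibernation */
	bool contents_valid;	/* VRAM contents must be copied back */
	int locks_taken;	/* counts toy "dma-resv" acquisitions */
};

/* Models the per-BO restore path: takes the lock, copies contents back. */
static void toy_restore_contents(struct toy_bo *bo)
{
	bo->locks_taken++;
	bo->contents_valid = true;
}

/* Models re-programming the GGTT entries for one BO. */
static void toy_ggtt_map(struct toy_bo *bo)
{
	bo->ggtt_mapped = true;
}

/*
 * The placement check lives in the caller, so a system-memory BO never
 * enters the locked restore path; it is only mapped into the GGTT again.
 */
static void toy_restore_kernel(struct toy_bo *bos, size_t n)
{
	assert(bos != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		struct toy_bo *bo = &bos[i];

		if (bo->placement == TOY_VRAM)
			toy_restore_contents(bo);

		if (bo->needs_ggtt)
			toy_ggtt_map(bo);
	}
}
```

Calling toy_restore_kernel() on a list with one system BO and one VRAM BO
leaves both mapped in the toy GGTT, while only the VRAM BO goes through the
locked copy-back path.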

Matt 

> > 
> > 2. Drop restoring of individual BOs' GGTT mappings entirely and save /
> > restore the GGTT's contents.
> 
> ... if we don't risk adding entries to discarded BOs. As long as the
> save happens after invalidating the entries, I think it could work.
> 
> > 
> > Does this make sense?
> 
> yep, thanks.
> 
> Lucas De Marchi