Re: [PATCH v3 2/3] drm/i915: Update error capture code to avoid using the current vma state

Thomas Hellström <thomas.hellstrom@xxxxxxxxxxxxxxx> · Fri, 29 Oct 2021 08:31:26 +0200

On 10/29/21 00:55, Matthew Brost wrote:
On Thu, Oct 28, 2021 at 02:01:27PM +0200, Thomas Hellström wrote:
With asynchronous migrations, the vma state may be several migrations
ahead of the state that matches the request we're capturing.
Address that by introducing an i915_vma_snapshot structure that
can be used to snapshot relevant state at request submission.
In order to make sure we access the correct memory, the snapshots take
references on relevant sg-tables and memory regions.
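
To illustrate the snapshot idea, here is a minimal userspace sketch. It is not the i915 code: struct backing, struct vma, struct vma_snapshot and all helper names are hypothetical stand-ins, and a plain integer refcount is assumed in place of the real sg-table and memory-region references. It only shows that a snapshot holding its own reference keeps seeing the state from submission time, even after the vma has migrated several times.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a refcounted sg-table; not the i915 type. */
struct backing {
	int refcount;
	int generation;	/* which migration produced these pages */
};

/* Hypothetical stand-in for a vma whose backing store can migrate. */
struct vma {
	struct backing *pages;
};

/* Hypothetical snapshot: owns one reference on the backing store. */
struct vma_snapshot {
	struct backing *pages;
};

static struct backing *backing_create(int generation)
{
	struct backing *b = malloc(sizeof(*b));

	if (!b)
		return NULL;
	b->refcount = 1;
	b->generation = generation;
	return b;
}

static struct backing *backing_get(struct backing *b)
{
	b->refcount++;
	return b;
}

static void backing_put(struct backing *b)
{
	if (--b->refcount == 0)
		free(b);
}

/* Taken at request submission: pin down the state the request will use. */
static void vma_snapshot_init(struct vma_snapshot *snap, const struct vma *vma)
{
	snap->pages = backing_get(vma->pages);
}

static void vma_snapshot_fini(struct vma_snapshot *snap)
{
	backing_put(snap->pages);
	snap->pages = NULL;
}

/* A migration replaces the vma's backing store and drops the vma's old ref. */
static int vma_migrate(struct vma *vma)
{
	struct backing *next = backing_create(vma->pages->generation + 1);

	if (!next)
		return -1;
	backing_put(vma->pages);
	vma->pages = next;
	return 0;
}
```

In this model, a capture that reads snap.pages after two migrations still sees generation 0, while vma.pages has moved on to generation 2.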

Also move the capture list allocation out of the fence signaling
critical path and use the CONFIG_DRM_I915_CAPTURE_ERROR define to
avoid compiling in members and functions used for error capture
when they're not used.

Finally, correct lockdep annotation would reveal that error capture is
typically done in the fence signalling critical path. Alter the
error capture memory allocation mode accordingly.

I've seen this as well:
https://patchwork.freedesktop.org/patch/451415/?series=93704&rev=5

John Harrison and Daniele's feeling was that if a NOWAIT memory allocation
context is used and the system is under any amount of memory pressure,
the error capture is likely to fail due to the size of the objects being
allocated. Daniel Vetter has proposed another solution, basically
allocating a page at the NOWAIT context, which is a larger rework.

We have a Jira for this. I'll dig it up and send it over off the list if
you want to join that discussion.

Matt

Please do. I basically agree with John and Daniele that error capture may
fail under memory pressure, but I couldn't see how we could avoid that
short of exposing ourselves to dma-fence deadlocks.

I figure we'd basically have to pin all vmas, reset, retire the request
and *then* do the allocating parts of the capture.
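
As a rough userspace sketch of that ordering (not i915 code): the toy_* types, the in_signalling_section flag and blocking_alloc() are all hypothetical, and the flag merely models "we are inside the fence signalling critical path, so a blocking allocation is not allowed". The reset step is represented only by a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy types; they model only what the ordering needs. */
struct toy_request {
	bool retired;
};

struct toy_vma {
	int pin_count;
	const char *contents;
};

/* Models being inside the fence signalling critical path. */
static bool in_signalling_section;

/* Stand-in for a blocking allocation: refused in the critical path. */
static void *blocking_alloc(size_t size)
{
	if (in_signalling_section)
		return NULL;
	return malloc(size);
}

/*
 * Pin everything and retire the request without allocating, then do the
 * allocating part of the capture once the critical path has been left.
 * Returns a malloc'ed dump the caller must free, or NULL on failure.
 */
static char *capture_after_retire(struct toy_request *rq,
				  struct toy_vma *vmas, size_t count)
{
	size_t i, total = 0;
	char *dump;

	in_signalling_section = true;
	for (i = 0; i < count; i++)
		vmas[i].pin_count++;	/* pin: no allocation needed */
	/* The engine reset would happen here. */
	rq->retired = true;		/* the request's fence is signalled */
	in_signalling_section = false;

	/* Only now is it safe to allocate and copy the pinned contents. */
	for (i = 0; i < count; i++)
		total += strlen(vmas[i].contents);
	dump = blocking_alloc(total + 1);
	if (dump) {
		dump[0] = '\0';
		for (i = 0; i < count; i++)
			strcat(dump, vmas[i].contents);
	}

	for (i = 0; i < count; i++)
		vmas[i].pin_count--;	/* unpin once the copy is done */
	return dump;
}
```

The point of the ordering is that the only allocation happens after rq->retired is set, so in this model it can never block fence signalling.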

Meanwhile, I'll ping Daniel about the best course of action for the above
series.

/Thomas