Re: [PATCH v8 3/7] drm/i915: Define and use GuC and CTB TLB invalidation routines


On 09/10/2023 20:14, John Harrison wrote:
On 10/9/2023 01:56, Tvrtko Ursulin wrote:
On 06/10/2023 19:20, Jonathan Cavitt wrote:
From: Prathap Kumar Valsan <prathap.kumar.valsan@xxxxxxxxx>

The GuC firmware defines an interface for Translation Look-Aside
Buffer (TLB) invalidation.  Use this interface when invalidating the
engine and GuC TLBs.
Add additional functionality to intel_gt_invalidate_tlb, invalidating
the GuC TLBs and falling back to GT invalidation when the GuC is
disabled.
The invalidation is done by sending an invalidation request directly to
the GuC; the pending request is tracked in the tlb_lookup table.  The
invalidation is submitted as a wait request and is completed in the CT
event handler, which means this TLB invalidation path cannot be used
when the CT is not enabled.
If the request is not acknowledged within two seconds, we treat that as
an invalidation error, as it indicates either a lost request or a
severe GuC overload.

With this new invalidation routine, we can perform GuC-based GGTT
invalidations.  GuC-based GGTT invalidation is incompatible with
MMIO invalidation so we should not perform MMIO invalidation when
GuC-based GGTT invalidation is expected.

Purpose of xarray:
The tlb_lookup table is allocated as an xarray because the set of
pending TLB invalidations may have no upper bound.  The consequence of
this is that all actions interfacing with this table need to use the
xarray functions, such as xa_alloc_cyclic_irq for array insertion.

Purpose of must_wait_woken:
Our wait for the G2H ack for the completion of a TLB invalidation is
mandatory; we must wait for the HW to confirm that the physical
addresses are no longer accessible before we return those to the system.

On switching to the wait_woken() convenience routine, we introduced an
issue where wait_woken() may complete early when called from a kthread
that is being stopped.  Since we send a TLB invalidation when we try to
release pages from the shrinker, we can be called from any process
context, including kthreads.

Using wait_woken() from any process context causes another issue: the
use of is_kthread_should_stop() assumes that any task with PF_KTHREAD
set was created by kthread_create() and has called set_kthread_struct().
This is not true for the raw kernel_thread().

This explanation misses the main point of my ask, which is to explain why a simpler scheme isn't sufficient. Simpler scheme as in not needing the xarray or any flavour of wait_woken().

In other words it is obvious we have to wait for the invalidation ack, but not obvious why we need a complicated scheme.
The alternative being to simply serialise all TLB invalidation requests? Thus, no complex tracking required, as there is only one in flight at a time? That seems inefficient and a potential performance hit if a bunch of invalidations are required back to back. But given that the current scheme is global invalidation only (no support for range / per-page invalidation yet), is it possible to get multiple back-to-back requests?

It is possible to get a stream of invalidation requests, but all that come with put_pages() are serialised on gt->tlb.invalidate_lock anyway. So for them, the only benefit of the complicated approach versus the dumb single wait queue is avoiding wake-up storms.

The second source of TLB invalidations is ggtt->invalidate(). I am not sure whether those are frequent enough to warrant parallelism; they definitely shouldn't be for things like context images and ring buffers. So I was asking whether framebuffers might be, but I don't know.

Regards,

Tvrtko


