Re: [PATCH 43/46] drm/i915: Allocate a status page for each timeline

On 1/15/2019 10:43, Chris Wilson wrote:
Quoting John Harrison (2019-01-15 18:17:21)
On 1/15/2019 01:50, Chris Wilson wrote:
Quoting John Harrison (2019-01-15 00:56:13)
On 1/7/2019 03:55, Chris Wilson wrote:
+static int alloc_hwsp(struct i915_timeline *timeline)
+{
+     struct drm_i915_private *i915 = timeline->i915;
+     struct i915_vma *vma;
+     int offset;
+
+     mutex_lock(&i915->gt.timeline_lock);
+
+restart:
+     offset = find_first_cacheline(i915);
+     if (offset == NBITS && i915->gt.timeline_hwsp) {
+             i915_vma_put(i915->gt.timeline_hwsp);
+             i915->gt.timeline_hwsp = NULL;
+     }
+
+     vma = i915->gt.timeline_hwsp;
+     if (!vma) {
+             struct drm_i915_gem_object *bo;
+
+             /* Drop the lock before allocations */
+             mutex_unlock(&i915->gt.timeline_lock);
+
+             BUILD_BUG_ON(NBITS * CACHELINE_BYTES > PAGE_SIZE);
+             bo = i915_gem_object_create_internal(i915, PAGE_SIZE);
+             if (IS_ERR(bo))
+                     return PTR_ERR(bo);
+
+             i915_gem_object_set_cache_level(bo, I915_CACHE_LLC);
+
+             vma = i915_vma_instance(bo, &i915->ggtt.vm, NULL);
+             if (IS_ERR(vma))
+                     return PTR_ERR(vma);
+
+             mutex_lock(&i915->gt.timeline_lock);
+             if (i915->gt.timeline_hwsp) {
+                     i915_gem_object_put(bo);
+                     goto restart;
+             }
+
+             i915->gt.timeline_hwsp = vma;
+             i915->gt.timeline_free = ~0ull;
+             offset = 0;
+     }
+
+     i915->gt.timeline_free &= ~BIT_ULL(offset);
+
+     timeline->hwsp_ggtt = i915_vma_get(vma);
+     timeline->hwsp_offset = offset * CACHELINE_BYTES;
+
+     mutex_unlock(&i915->gt.timeline_lock);
+
+     return 0;
+}
If I'm reading this correctly then gt.timeline_hwsp/free is a cached
copy of the most recently allocated but not yet filled bank of seqno
locations. When it gets full, the i915->gt reference gets dropped and a
new page is allocated and used up cacheline by cacheline. Meanwhile, each
timeline has its own private reference to the page, so dropping the
i915->gt reference is safe. And once the last timeline using a given page
is freed, the last reference to that page will be dropped and the page
itself will also be freed. If a timeline is freed before the currently
cached page is filled, then that timeline's slot will be released and
re-used by the next timeline to be created.
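(As a sanity check of my reading, the bitmap bookkeeping described above
can be modeled in plain userspace C — hypothetical helper names, not the
actual i915 code:)

```c
/* Userspace model of the per-page cacheline bitmap behind alloc_hwsp().
 * 64 cachelines of 64 bytes fill one 4096-byte page; a u64 mask tracks
 * which cachelines are free (bit set == free, matching the
 * timeline_free = ~0ull initialisation in the patch). */
#include <stdint.h>

#define CACHELINE_BYTES 64
#define NBITS 64 /* PAGE_SIZE / CACHELINE_BYTES */

/* Return the first free cacheline index, or NBITS if the page is full. */
static int find_first_cacheline(uint64_t free_mask)
{
	return free_mask ? __builtin_ctzll(free_mask) : NBITS;
}

/* Claim a cacheline: clear its bit, return its byte offset in the page. */
static int claim_cacheline(uint64_t *free_mask, int idx)
{
	*free_mask &= ~(1ull << idx);
	return idx * CACHELINE_BYTES;
}

/* Release a cacheline so a later timeline can reuse the slot. */
static void release_cacheline(uint64_t *free_mask, int idx)
{
	*free_mask |= 1ull << idx;
}
```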

But what about the scenario of a long running system with a small but
growing number of persistent tasks interspersed with many short lived
tasks? In that case, you would end up with many sparsely populated pages
whose free slots will not get re-used. You could have a linked list
of cached pages. When a page is filled, move it to a 'full' list. When a
timeline is freed, if its page is on the 'full' list, clear the slot
and move the page back to the 'available' list.
Yes. My thinking was a plain slab cache was a quick-and-dirty
improvement over a page-per-timeline. And a freelist would be the next
step.

Or is the idea that a worst case of a single page vma allocation per
timeline is the least of our worries if there is an ever growing number
of timelines/contexts/users in the system?
Nah, it was just an attempt to quickly reduce the number of allocations,
where the worst case of one page+vma per timeline was the starting
point.

We should break this patch down into 1) one-page-per-timeline, 2) slab
cache, 3) free list 4) profit.

At other times we have been wanting to be able to suballocate pages,
something to keep in mind would be extending this to arbitrary cacheline
allocations.
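(For concreteness, the free-list stage of that plan could look roughly
like the following userspace sketch — hypothetical names and a plain
malloc standing in for the page/vma machinery, not the eventual i915
implementation:)

```c
/* Sketch of the "free list" stage: keep partially used hwsp pages on a
 * list, allocate from the first page with a free cacheline, and let a
 * freed slot make its page allocatable again. */
#include <stdint.h>
#include <stdlib.h>

struct hwsp_page {
	struct hwsp_page *next;
	uint64_t free_bits;	/* bit set == cacheline free */
};

struct hwsp_cache {
	struct hwsp_page *pages;
};

/* Allocate a cacheline, creating a fresh page only when all are full. */
static int hwsp_alloc(struct hwsp_cache *c, struct hwsp_page **out)
{
	struct hwsp_page *p;

	for (p = c->pages; p; p = p->next)
		if (p->free_bits)
			break;

	if (!p) {
		p = calloc(1, sizeof(*p));
		if (!p)
			return -1;
		p->free_bits = ~0ull;
		p->next = c->pages;
		c->pages = p;
	}

	int idx = __builtin_ctzll(p->free_bits);
	p->free_bits &= ~(1ull << idx);
	*out = p;
	return idx;
}

/* Return a cacheline; the page becomes reusable even if it was full. */
static void hwsp_free(struct hwsp_page *p, int idx)
{
	p->free_bits |= 1ull << idx;
}
```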
The multi-stage approach sounds good. Keep things simple in this patch
and then improve the situation later. One thing to be careful of with a
cacheline allocator would be to make sure whatever is being converted
wasn't using full pages for security reasons. I.e. a page can be private
to a process, whereas a cacheline will be shared by many. I guess that
would only really apply to allocations being passed to userland, as the
kernel is considered secure? Or can a user batch buffer write to
arbitrary locations within the ppHWSP and thereby splat someone else's
seqno?
ppHWSP, yes. But for internal allocations, only accessible via the ring
+ GGTT, should be no problem. I agree that we definitely don't want to
expose subpage sharing across the userspace boundary (all isolation
controls are only on pages and above).

If userspace wants suballocations, it can (and does) do them for itself
and should regulate its own sharing.

I'm a little confused. Are you saying that a rogue batch buffer could splat some other context's ppHWSP seqno, or that it can't? It would be bad if one dodgy user could trigger hangchecks in another user's batch by splatting their seqnos.


+     if (global_hwsp) {
+             timeline->hwsp_ggtt = i915_vma_get(global_hwsp);
+             timeline->hwsp_offset = I915_GEM_HWS_SEQNO_ADDR;
+     } else {
+             err = alloc_hwsp(timeline);
+             if (err)
+                     return err;
+     }
+
+     vaddr = i915_gem_object_pin_map(timeline->hwsp_ggtt->obj, I915_MAP_WB);
+     if (IS_ERR(vaddr)) { /* leak the cacheline, but will clean up later */
Can you explain this comment more? Where/when is the later?
On failure here, the cacheline is still marked as allocated in the slab,
but the reference to the page is released. So the backing page will be
released when everyone else finally drops their reference.

Just laziness, since we have the ability to return the cacheline later
on...
Meaning the actual leak is the bit in 'i915->gt.timeline_free' that says
this cacheline can or can't be used for the next allocation? Presumably
you could do the bit map munging in the case that 'global_hwsp' is null,
but the code would certainly be messier for not a lot of gain.
Having been pointed out that I was being lazy, a bit of refactoring
later showed how lazy I was.
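(For the record, the "bit map munging" being discussed — returning the
cacheline bit on the pin_map error path instead of leaking it — might
look something like this userspace model; map_fails stands in for
i915_gem_object_pin_map() failing, and all names are hypothetical:)

```c
/* Model of the non-leaky error path: if the step after claiming a
 * cacheline fails, give the bit back rather than leaving it allocated
 * until the page's last reference drops. */
#include <stdint.h>

static int map_fails;	/* stand-in for a pin_map failure */

static int timeline_init(uint64_t *free_mask)
{
	if (!*free_mask)
		return -1;	/* page full */

	int idx = __builtin_ctzll(*free_mask);

	*free_mask &= ~(1ull << idx);	/* claim the cacheline bit */

	if (map_fails) {
		/* Return the bit instead of leaking the cacheline. */
		*free_mask |= 1ull << idx;
		return -1;
	}
	return idx;
}
```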
Does that mean you are going to re-work this patch or follow it up with a subsequent one?


@@ -2616,7 +2628,7 @@ static int execlists_context_deferred_alloc(struct i915_gem_context *ctx,
                goto error_deref_obj;
        }
-     timeline = i915_timeline_create(ctx->i915, ctx->name);
+     timeline = i915_timeline_create(ctx->i915, ctx->name, NULL);
Why does this use the global HWSP rather than a per context one?
.global_hwsp = NULL => it allocates its own HWSP.

Were you thinking of intel_engine_setup_common() which is still using
the global HWSP at this point in time?
Doh. Brain fart. Presumably the engine one will disappear completely? Or
is it still needed for legacy mode?
It (the timeline embedded inside the engine) is killed later, once
the internal clients (perf/pmu, hangcheck and idling at the last count)
are ready for the lack of globally ordered execution queue. The single
ringbuffer + timeline persists for legacy. (Multiple timelines for gen7,
coming later!)
-Chris
Sounds good :).

John.

_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx



