Re: [igt-dev] [PATCH i-g-t] i915/gem_exec_schedule: Try to spot unfairness

Chris Wilson <chris@xxxxxxxxxxxxxxxxxx> · Wed, 25 Nov 2020 12:23:20 +0000

Quoting Tvrtko Ursulin (2020-11-25 11:25:02)
> 
> On 24/11/2020 23:39, Chris Wilson wrote:
> > +static void fair_child(int i915, uint32_t ctx,
> > +                    const struct intel_execution_engine2 *e,
> > +                    uint64_t frame_ns,
> > +                    int timeline,
> > +                    uint32_t common,
> > +                    unsigned int flags,
> > +                    unsigned long *ctl,
> > +                    unsigned long *out)
> > +#define F_SYNC               (1 << 0)
> > +#define F_PACE               (1 << 1)
> > +#define F_FLOW               (1 << 2)
> > +#define F_HALF               (1 << 3)
> > +#define F_SOLO               (1 << 4)
> > +#define F_SPARE              (1 << 5)
> > +#define F_NEXT               (1 << 6)
> > +#define F_VIP                (1 << 7)
> > +#define F_RRUL               (1 << 8)
> > +#define F_SHARE              (1 << 9)
> > +#define F_PING               (1 << 10)
> > +#define F_THROTTLE   (1 << 11)
> > +#define F_ISOLATE    (1 << 12)
> > +{
> > +     const int batches_per_frame = flags & F_SOLO ? 1 : 3;
> > +     struct drm_i915_gem_exec_object2 obj[4] = {
> > +             {},
> > +             {
> > +                     .handle = common ?: gem_create(i915, 4096),
> > +             },
> > +             delay_create(i915, ctx, e, frame_ns / batches_per_frame),
> > +             delay_create(i915, ctx, e, frame_ns / batches_per_frame),
> > +     };
> > +     struct intel_execution_engine2 ping = *e;
> > +     int p_fence = -1, n_fence = -1;
> > +     unsigned long count = 0;
> > +     int n;
> > +
> > +     srandom(getpid());
> > +     if (flags & F_PING)
> > +             ping = pick_random_engine(i915, e);
> > +     obj[0] = tslog_create(i915, ctx, &ping);
> > +
> > +     while (!READ_ONCE(*ctl)) {
> > +             struct drm_i915_gem_execbuffer2 execbuf = {
> > +                     .buffers_ptr = to_user_pointer(obj),
> > +                     .buffer_count = 4,
> > +                     .rsvd1 = ctx,
> > +                     .rsvd2 = -1,
> > +                     .flags = e->flags,
> > +             };
> > +
> > +             if (flags & F_FLOW) {
> > +                     unsigned int seq;
> > +
> > +                     seq = count;
> > +                     if (flags & F_NEXT)
> > +                             seq++;
> > +
> > +                     execbuf.rsvd2 =
> > +                             sw_sync_timeline_create_fence(timeline, seq);
> > +                     execbuf.flags |= I915_EXEC_FENCE_IN;
> > +             }
> > +
> > +             execbuf.flags |= I915_EXEC_FENCE_OUT;
> > +             gem_execbuf_wr(i915, &execbuf);
> > +             n_fence = execbuf.rsvd2 >> 32;
> > +             execbuf.flags &= ~(I915_EXEC_FENCE_OUT | I915_EXEC_FENCE_IN);
> > +             for (n = 1; n < batches_per_frame; n++)
> > +                     gem_execbuf(i915, &execbuf);
> > +             close(execbuf.rsvd2);
> > +
> > +             execbuf.buffer_count = 1;
> > +             execbuf.batch_start_offset = 2048;
> > +             execbuf.flags = ping.flags | I915_EXEC_FENCE_IN;
> > +             execbuf.rsvd2 = n_fence;
> > +             gem_execbuf(i915, &execbuf);
> > +
> > +             if (flags & F_PACE && p_fence != -1) {
> > +                     struct pollfd pfd = {
> > +                             .fd = p_fence,
> > +                             .events = POLLIN,
> > +                     };
> > +                     poll(&pfd, 1, -1);
> > +             }
> > +             close(p_fence);
> > +
> > +             if (flags & F_SYNC) {
> > +                     struct pollfd pfd = {
> > +                             .fd = n_fence,
> > +                             .events = POLLIN,
> > +                     };
> > +                     poll(&pfd, 1, -1);
> > +             }
> > +
> > +             if (flags & F_THROTTLE)
> > +                     igt_ioctl(i915, DRM_IOCTL_I915_GEM_THROTTLE, 0);
> > +
> > +             igt_swap(obj[2], obj[3]);
> > +             igt_swap(p_fence, n_fence);
> 
> What are the sync fences simulating and how come they are always used? I 
> mean no children which submit batched up load?

The sync fences are created for each submission for simplicity. We only
use them for synchronisation in emulating some of the clients (such as
mesa synchronising to previous SwapBuffers, and compositors
synchronising to vblanks). Creating/destroying an unused fence should
not be disruptive...

> > +             count++;
> > +     }
> > +     close(p_fence);
> > +
> > +     gem_close(i915, obj[3].handle);
> > +     gem_close(i915, obj[2].handle);
> > +     if (obj[1].handle != common)
> > +             gem_close(i915, obj[1].handle);
> > +
> > +     gem_sync(i915, obj[0].handle);
> > +     if (out) {
> > +             uint32_t *map;
> > +
> > +             map = gem_mmap__device_coherent(i915, obj[0].handle,
> > +                                             0, 4096, PROT_WRITE);
> > +             for (n = 1; n < min(count, 512); n++) {
> > +                     igt_assert(map[n]);
> > +                     map[n - 1] = map[n] - map[n - 1];
> > +             }
> > +             qsort(map, --n, sizeof(*map), cmp_u32);
> > +             *out = ticks_to_ns(i915, map[n / 2]);
> 
> What is returned? Could you explain the ts journal part a bit?

The median interval between timestamps. Each frame records the
CS_TIMESTAMP (global reference clock) it completed at. We then compute
the interval between each pair of frames and sort that to find the
median. I went with median to err on the side of stability, we want the
tests to be reliable. Checking the distribution within each client is
also interesting, but overwhelming.

> 
> > +             munmap(map, 4096);
> > +     }
> > +     gem_close(i915, obj[0].handle);
> > +}
> > +
> > +static int cmp_ul(const void *A, const void *B)
> > +{
> > +     const unsigned long *a = A, *b = B;
> > +
> > +     if (*a < *b)
> > +             return -1;
> > +     else if (*a > *b)
> > +             return 1;
> > +     else
> > +             return 0;
> > +}
> > +
> > +static uint64_t d_cpu_time(const struct rusage *a, const struct rusage *b)
> > +{
> > +     uint64_t cpu_time = 0;
> > +
> > +     cpu_time += (a->ru_utime.tv_sec - b->ru_utime.tv_sec) * NSEC_PER_SEC;
> > +     cpu_time += (a->ru_utime.tv_usec - b->ru_utime.tv_usec) * 1000;
> > +
> > +     cpu_time += (a->ru_stime.tv_sec - b->ru_stime.tv_sec) * NSEC_PER_SEC;
> > +     cpu_time += (a->ru_stime.tv_usec - b->ru_stime.tv_usec) * 1000;
> > +
> > +     return cpu_time;
> > +}
> > +
> > +static void timeline_advance(int timeline, int delay_ns)
> > +{
> > +     struct timespec tv = { .tv_nsec = delay_ns };
> > +     nanosleep(&tv, NULL);
> > +     sw_sync_timeline_inc(timeline, 1);
> > +}
> > +
> > +static void fairness(int i915,
> > +                  const struct intel_execution_engine2 *e,
> > +                  int timeout, unsigned int flags)
> > +{
> > +     const int frame_ns = 16666 * 1000;
> > +     const int fence_ns = flags & F_HALF ? 2 * frame_ns : frame_ns;
> > +     unsigned long *result;
> > +     uint32_t common = 0;
> > +
> > +     igt_require(has_ctx_timestamp(i915, e));
> > +     igt_require(gem_class_has_mutable_submission(i915, e->class));
> > +
> > +     if (flags & F_SHARE)
> > +             common = gem_create(i915, 4095);
> > +
> > +     result = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED | MAP_ANON, -1, 0);
> > +
> > +     for (int n = 2; n <= 64; n <<= 1) { /* 32 == 500us per client */
> > +             int timeline = sw_sync_timeline_create();
> > +             int nfences = timeout * NSEC_PER_SEC / fence_ns + 1;
> > +             const int nchild = n - 1; /* odd for easy medians */
> > +             const int child_ns = frame_ns / (nchild + !!(flags & F_SPARE));
> > +             const int lo = nchild / 4;
> > +             const int hi = (3 * nchild + 3) / 4 - 1;
> > +             struct rusage old_usage, usage;
> > +             uint64_t cpu_time, d_time;
> > +             unsigned long vip = -1;
> > +             struct timespec tv;
> > +             struct igt_mean m;
> > +
> > +             if (flags & F_PING) {
> > +                     struct intel_execution_engine2 *ping;
> > +
> > +                     __for_each_physical_engine(i915, ping) {
> > +                             if (ping->flags == e->flags)
> > +                                     continue;
> > +
> > +                             igt_fork(child, 1) {
> > +                                     uint32_t ctx = gem_context_clone_with_engines(i915, 0);
> > +
> > +                                     fair_child(i915, ctx, ping,
> > +                                                child_ns / 8,
> > +                                                -1, common,
> > +                                                F_SOLO | F_PACE | F_SHARE,
> > +                                                &result[nchild],
> > +                                                NULL);
> > +
> > +                                     gem_context_destroy(i915, ctx);
> > +                             }
> > +                     }
> > +             }
> > +
> > +             memset(result, 0, (nchild + 1) * sizeof(result[0]));
> 
> Children probably can't write into it before, but still would probably 
> be better moved before the first fork (which passes the results array to 
> children).
> 
> > +             getrusage(RUSAGE_CHILDREN, &old_usage);
> > +             igt_nsec_elapsed(memset(&tv, 0, sizeof(tv)));
> > +             igt_fork(child, nchild) {
> > +                     uint32_t ctx;
> > +
> > +                     if (flags & F_ISOLATE) {
> > +                             int clone, dmabuf = -1;
> > +
> > +                             if (common)
> > +                                     dmabuf = prime_handle_to_fd(i915, common);
> > +
> > +                             clone = gem_reopen_driver(i915);
> > +                             gem_context_copy_engines(i915, 0, clone, 0);
> > +                             i915 = clone;
> > +
> > +                             if (dmabuf != -1)
> > +                                     common = prime_fd_to_handle(i915, dmabuf);
> > +                     }
> > +
> > +                     ctx = gem_context_clone_with_engines(i915, 0);
> > +
> > +                     if (flags & F_VIP && child == 0) {
> > +                             gem_context_set_priority(i915, ctx, MAX_PRIO);
> > +                             flags |= F_FLOW;
> > +                     }
> > +                     if (flags & F_RRUL && child == 0)
> > +                             flags |= F_SOLO | F_FLOW | F_SYNC;
> > +
> > +                     fair_child(i915, ctx, e, child_ns,
> > +                                timeline, common, flags,
> > +                                &result[nchild],
> > +                                &result[child]);
> > +
> > +                     gem_context_destroy(i915, ctx);
> > +             }
> > +
> > +             while (nfences--)
> > +                     timeline_advance(timeline, fence_ns);
> > +
> > +             result[nchild] = 1;
> > +             for (int child = 0; child < nchild; child++) {
> > +                     while (!READ_ONCE(result[child]))
> > +                             timeline_advance(timeline, fence_ns);
> > +             }
> > +
> > +             igt_waitchildren();
> > +             close(timeline);
> > +
> > +             /* Are we running out of CPU time, and fail to submit frames? */
> > +             d_time = igt_nsec_elapsed(&tv);
> > +             getrusage(RUSAGE_CHILDREN, &usage);
> > +             cpu_time = d_cpu_time(&usage, &old_usage);
> > +             if (10 * cpu_time > 9 * d_time) {
> > +                     if (nchild > 7)
> > +                             break;
> > +
> > +                     igt_skip_on_f(10 * cpu_time > 9 * d_time,
> > +                                   "%.0f%% CPU usage, presuming capacity exceeded\n",
> > +                                   100. * cpu_time / d_time);
> 
> Aren't children mostly sleeping waiting on fences and like? And if so 
> how/when the test ends up using a lot of CPU time?

lockdep. kasan. And some really slow devices. E.g. CI struggles to hit
31 clients, but on non-debug builds we can sustain 127 clients (the
context switch duration is the ultimate limiting step).

I needed to rule out the impact of the CPU scheduler when evaluating the
GPU. A simple metric being that if we saturate a core, then we are
likely to be experiencing extra latency in submission due to the CPU
scheduler.

> > +             igt_mean_init(&m);
> > +             for (int child = 0; child < nchild; child++)
> > +                     igt_mean_add(&m, result[child]);
> > +
> > +             if (flags & (F_VIP | F_RRUL))
> > +                     vip = result[0];
> > +
> > +             qsort(result, nchild, sizeof(*result), cmp_ul);
> > +             igt_info("%2d clients, range: [%.1f, %.1f], iqr: [%.1f, %.1f], median: %.1f, mean: %.1f ± %.2f ms\n",
> > +                      nchild,
> > +                      1e-6 * result[0],  1e-6 * result[nchild - 1],
> > +                      1e-6 * result[lo], 1e-6 * result[hi],
> > +                      1e-6 * result[nchild / 2],
> > +                      1e-6 * igt_mean_get(&m),
> > +                      1e-6 * sqrt(igt_mean_get_variance(&m)));
> > +
> > +             if (vip != -1) {
> > +                     igt_info("VIP interval %.2f ms\n", 1e-6 * vip);
> > +                     igt_assert(4 * vip > 3 * fence_ns &&
> > +                                3 * vip < 4 * fence_ns);
> > +             }
> > +
> > +             /* May be slowed due to sheer volume of context switches */
> > +             igt_assert(4 * igt_mean_get(&m) > 3 * fence_ns &&
> > +                            igt_mean_get(&m) < 3 * fence_ns);
> > +
> > +             igt_assert(4 * igt_mean_get(&m) > 3 * result[nchild / 2] &&
> > +                        3 * igt_mean_get(&m) < 4 * result[nchild / 2]);
> > +
> > +             igt_assert(2 * (result[hi] - result[lo]) < result[nchild / 2]);
> 
> Put some human readable text above the asserts explaining the criteria 
> please.
> 
> VIP child takes part in the mean and does not affect the result?

The VIP is also running at the same target fps; it hasn't yet been
an issue. When context switch is slow (rcs), the VIP is faster than the
rest so falls outside of the iqr anyway.

But it does cause an oddity in the range.

> > +static void fairslice(int i915,
> > +                   const struct intel_execution_engine2 *e,
> > +                   unsigned long flags)
> > +{
> > +     igt_spin_t *spin = NULL;
> > +     uint32_t ctx[3];
> > +     uint32_t ts[3];
> > +
> > +     for (int i = 0; i < ARRAY_SIZE(ctx); i++) {
> > +             ctx[i] = gem_context_clone_with_engines(i915, 0);
> > +             if (spin == NULL) {
> > +                     spin = __igt_spin_new(i915,
> > +                                           .ctx = ctx[i],
> > +                                           .engine = e->flags,
> > +                                           .flags = flags);
> > +             } else {
> > +                     struct drm_i915_gem_execbuffer2 eb = {
> > +                             .buffer_count = 1,
> > +                             .buffers_ptr = to_user_pointer(&spin->obj[IGT_SPIN_BATCH]),
> > +                             .flags = e->flags,
> > +                             .rsvd1 = ctx[i],
> > +                     };
> > +                     gem_execbuf(i915, &eb);
> > +             }
> > +     }
> > +
> > +     sleep(2); /* over the course of many timeslices */
> > +
> > +     igt_assert(gem_bo_busy(i915, spin->handle));
> > +     igt_spin_end(spin);
> > +     for (int i = 0; i < ARRAY_SIZE(ctx); i++)
> > +             ts[i] = read_ctx_timestamp(i915, ctx[i], e);
> > +
> > +     for (int i = 0; i < ARRAY_SIZE(ctx); i++)
> > +             gem_context_destroy(i915, ctx[i]);
> > +     igt_spin_free(i915, spin);
> > +
> > +     qsort(ts, 3, sizeof(*ts), cmp_u32);
> > +     igt_info("%s: [%.1f, %.1f] ms\n", e->name,
> > +              1e-6 * ticks_to_ns(i915, ts[0]),
> > +              1e-6 * ticks_to_ns(i915, ts[2]));
> 
> Log all three just as well?
> 
> > +
> > +     igt_assert(ts[0] && ts[2] > ts[0]);
>  > +    igt_assert(4 * ts[0] > 3 * ts[2]);
> 
> Three equal priority contexts - why would distribution be expected to be 
> unfair? Intuitively I'd expect a check that all three are within some 
> tolerance of each other, but okay, min and max is good enough, just 
> don't understand the asserts. Max can just as well be equal to min, no? 
> I mean and scheduler would still be considered fair. We should ignore 
> the submission order I think, if that was the point.

The first assert looks more like a leftover from when I used the wrong
compare fn (there was already one compare fn for longs :) and the issue
with cml+ returning 0.

The second assert is that the range is within 25%. Maybe 1/6 is more
interesting for 3, so something like

igt_assert((ts[2] - ts[0]) * 6 < ts[1]);

> Seem clean and logical on the high level and on the implementation 
> level. On the "medium" level I don't claim I tried to understand 
> everything but it's not completely important. With medium level I mean 
> all the different test scenarios, where the important thing is that as 
> long as all children are doing the same thing, which I think they are 
> (small open of VIP), it seems correct to test they will get equal amount 
> of GPU time.
> 
> All subtests pass with the fair scheduler patches?

Yes, although the *solo with 3 clients is borderline on tgl, solo being
targeted at a weak spot of the fair algorithm and initially was very
unfair.

I've tried to cover the userspace throttling algorithms used in practice
and issues found along the way, with a small amount of heterogeneity
(e.g. VIP representing a compositor with a bunch of individual client
windows). Though each client is itself a fixed workload which isn't very
representative, but makes measurements easy.

I think it's a reliable test that scales well across our gen for inter-
client fairness, but it certainly is not the complete picture. There will
always be surprises, and wsim is better suited to trying to replicate
real-world scenarios. One day we should define some regression tests for
wsim metrics.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@xxxxxxxxxxxxxxxxxxxxx
https://lists.freedesktop.org/mailman/listinfo/intel-gfx