On 8/30/2016 12:30 PM, Andy Lutomirski wrote:
> On Tue, Aug 30, 2016 at 8:32 AM, Chris Metcalf <cmetcalf@xxxxxxxxxxxx> wrote:
>> On 8/30/2016 3:58 AM, Peter Zijlstra wrote:
>>> On Mon, Aug 29, 2016 at 12:40:32PM -0400, Chris Metcalf wrote:
>>>> On 8/29/2016 12:33 PM, Peter Zijlstra wrote:
>>>>> On Tue, Aug 16, 2016 at 05:19:27PM -0400, Chris Metcalf wrote:
>>>>>> +	/*
>>>>>> +	 * Request rescheduling unless we are in full dynticks mode.
>>>>>> +	 * We would eventually get preempted without this, and if
>>>>>> +	 * there's another task waiting, it would run; but by
>>>>>> +	 * explicitly requesting the reschedule, we may reduce the
>>>>>> +	 * latency.  We could directly call schedule() here as well,
>>>>>> +	 * but since our caller is the standard place where schedule()
>>>>>> +	 * is called, we defer to the caller.
>>>>>> +	 *
>>>>>> +	 * A more substantive approach would be to use a struct
>>>>>> +	 * completion explicitly, and complete it when we shut
>>>>>> +	 * down dynticks, but since we presumably have nothing better
>>>>>> +	 * to do on this core anyway, just spinning seems plausible.
>>>>>> +	 */
>>>>>> +	if (!tick_nohz_tick_stopped())
>>>>>> +		set_tsk_need_resched(current);
>>>>> This is broken.. and it would be really good if you don't actually need
>>>>> to do this.
>>>> Can you elaborate?  We clearly do want to wait until we are in full
>>>> dynticks mode before we return to userspace.
>>>>
>>>> We could do it just in the prctl() syscall, but then we lose the
>>>> ability to implement the NOSIG mode, which can be a convenience.
>>> So this isn't spelled out anywhere. Why does this need to be in the
>>> return to user path?
>> I'm not sure where this should be spelled out, to be honest.  I guess
>> I can add some commentary to the commit message explaining this part.
>>
>> The basic idea is just that we don't want to be at risk from the
>> dyntick getting enabled.  Similarly, we don't want to be at risk of a
>> later global IPI due to lru_add_drain stuff, for example.  And we may
>> want to add additional things, like catching kernel TLB flushes and
>> deferring them when a remote core is in userspace.  To do all of this,
>> we need to run in the return-to-user path so we are late enough to
>> guarantee that no further kernel work will happen to perturb our
>> carefully arranged isolation state, which includes dyntick off,
>> per-cpu lru cache empty, etc.
> None of the above should need to *loop*, though, AFAIK.
Ordering is a problem, though.
We really want to run task isolation last, so we can guarantee that
all the isolation prerequisites are met (dynticks stopped, per-cpu lru
cache empty, etc.).  But achieving that state can require enabling
interrupts - most obviously if we have to schedule, e.g. for vmstat
clearing or whatnot (see the cond_resched() in refresh_cpu_vm_stats()),
or just while waiting for that last dyntick interrupt to occur.  I'm
also not sure that even something as simple as draining the per-cpu
lru cache can be done holding interrupts disabled throughout -
certainly there's a !SMP code path there that just re-enables
interrupts unconditionally, which gives me pause.
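To make that concrete, here is a minimal sketch (mine, not code from
the patch) of just the "wait for the last dyntick interrupt" step,
assuming we arrive with interrupts disabled; the point is that the
loop has to open an interrupt window, since the tick interrupt we are
waiting out can never be delivered otherwise:

#include <linux/tick.h>
#include <linux/irqflags.h>
#include <asm/processor.h>

/*
 * Illustrative sketch only: spin until the scheduler tick has been
 * stopped on this cpu.  We must briefly enable interrupts on each
 * iteration, since the final tick interrupt (after which the tick
 * can be stopped on irq exit) is exactly what we are waiting for.
 */
static void wait_for_tick_stop(void)
{
	while (!tick_nohz_tick_stopped()) {
		local_irq_enable();
		cpu_relax();		/* interrupt window: tick lands here */
		local_irq_disable();
	}
}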
At any rate, at that point you need to retest for signals, resched,
etc., all as usual, and then recheck the task isolation prerequisites
once more.
I may be missing something here, but it's really not obvious to me
that there's a way to do this without having task isolation integrated
into the usual return-to-userspace loop.
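To illustrate what I mean by "integrated into the loop", here is a
rough sketch modeled on the x86 exit_to_usermode_loop(); note that
task_isolation_ready(), task_isolation_enter(), and
_TIF_TASK_ISOLATION are illustrative names here, not a final API:

/*
 * Illustrative sketch of integrating task isolation into the
 * return-to-userspace loop (names are placeholders; uprobe and
 * notify-resume handling elided).  Entered and exited with
 * interrupts disabled, as on x86.
 */
static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
{
	while (true) {
		/* Usual exit work runs with interrupts enabled. */
		local_irq_enable();

		if (cached_flags & _TIF_NEED_RESCHED)
			schedule();

		if (cached_flags & _TIF_SIGPENDING)
			do_signal(regs);

		/*
		 * Run task isolation last, still with interrupts
		 * enabled, so it can drain the per-cpu lru cache,
		 * quiesce vmstat, and wait for the tick to stop.
		 */
		if ((cached_flags & _TIF_TASK_ISOLATION) &&
		    !task_isolation_ready())
			task_isolation_enter();

		/* Re-read the flags with interrupts off and re-test. */
		local_irq_disable();
		cached_flags = READ_ONCE(current_thread_info()->flags);

		/*
		 * Only leave the loop once no more work is pending
		 * *and* the isolation prerequisites still hold with
		 * interrupts disabled.
		 */
		if (!(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS) &&
		    (!(cached_flags & _TIF_TASK_ISOLATION) ||
		     task_isolation_ready()))
			break;
	}
}

The key structural point is the final test: both the ordinary work
flags and the isolation prerequisites have to be rechecked atomically
(interrupts off) before returning, since the isolation work itself can
raise new flags.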
--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com