On 05/29/2011 10:35 AM, Ingo Molnar wrote:
* Avi Kivity<avi@xxxxxxxxxx> wrote: > On 05/28/2011 09:32 PM, Ingo Molnar wrote: > >* Avi Kivity<avi@xxxxxxxxxx> wrote: > > > >> > So if you set a notification signal via fcntl(F_SETOWN) on the > >> > scheduler context switch event fd, the user-space RCU code will > >> > get a signal on every context switch. > >> > >> Context switches are completely uninteresting for userspace rcu: > >> > >> rcu_read_lock(); > >> ---> context switch > >> > >> have we learned anything from that? no. User code is always > >> preemptible and migratable. If rcu_read_lock() prevented migration > >> somehow, then we'd know that a context switch means we've started a > >> grace period for this thread. But it doesn't, so we don't. > > > >Well, in the next mail i mentioned that we can do migration events as > >well, which would be useful: instead of having to keep track of > >nr_tasks RCU grace periods we could simplify it down to nr_cpus. > > I don't see how a migration event helps. It is completely > transparent from the task's point of view. It's not transparent at all if you index RCU data structures by the current CPU index, which the kernel implementation does.
But that's completely broken for userspace. The "current cpu index" doesn't even exist, since you can't disable preemption.
Doing that has the advantage of being much more cache-compressed than the TID index,
If you have more tasks than cpus; which isn't a given.
and also having better worst-case grace period latency properties than a TID index.
> > But if we indexed by the TID then we wouldnt need any scheduler > > bindings at all - this is the simpler approach. > > Yes, and it maps 1:1 to the kernel implementation (cpu = task). No, the kernel indexes grace period tracking (and the write-completion queues) by CPU index.
Do a conceptual #define cpu task and it all works out.
> >> What's needed are explicit notifications about grace periods. For > >> the vcpu threads, calling KVM_VCPU_RUN seems like a good point. > >> For I/O threads, completion of processing of an event is also a > >> good point. > > > > Grace period notifications are needed too, obviously. > > I'd think they're sufficient, no? Is something else needed? I think you are missing the fact that in the kernel we index RCU data structures by CPU number: static void rcu_preempt_qs(int cpu) { struct rcu_data *rdp =&per_cpu(rcu_preempt_data, cpu); ...
s/per_cpu/__thread/
static void rcu_preempt_note_context_switch(int cpu) { struct task_struct *t = current; unsigned long flags; struct rcu_data *rdp; struct rcu_node *rnp; if (t->rcu_read_lock_nesting&& (t->rcu_read_unlock_special& RCU_READ_UNLOCK_BLOCKED) == 0) { /* Possibly blocking in an RCU read-side critical section. */ rdp = per_cpu_ptr(rcu_preempt_state.rda, cpu); ... Which could be changed over to be per task in user-space by treating the TID as a 'virtual CPU' equivalent. This probably lengthens worst-case rcu_sync() latencies rather significantly though - possibly turning urcu into a stop_machine_run() equivalent in the worst-case. (but i could be wrong about this last bit)
I believe you are. urcu does stress scaling, since it's much easier to add tasks than it is to add cpus, but it's conceptually the same problem.
-- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html