----- On Jan 5, 2016, at 4:47 PM, Paul E. McKenney paulmck@xxxxxxxxxxxxxxxxxx wrote:

> On Tue, Jan 05, 2016 at 05:40:18PM +0000, Russell King - ARM Linux wrote:
>> On Tue, Jan 05, 2016 at 05:31:45PM +0000, Mathieu Desnoyers wrote:
>>> For instance, an application could create a linked list or hash map
>>> of thread control structures, which could contain the current CPU
>>> number of each thread. A dispatch thread could then traverse or look
>>> up this structure to see on which CPU each thread is running and
>>> make work-queue dispatch or scheduling decisions accordingly.
>>
>> So, what happens if the linked list is walked from thread X, and we
>> discover that thread Y is allegedly running on CPU1? We decide that
>> we want to dispatch some work on that thread due to it being on CPU1,
>> so we send an event to thread Y.
>>
>> Thread Y becomes runnable, and the scheduler decides to schedule the
>> thread on CPU3 instead of CPU1.
>>
>> My point is that the above idea is inherently racy. The only case
>> where it isn't racy is when thread Y is bound to CPU1 and so can't
>> move - but then you'd already know that thread Y is on CPU1, and
>> there would be no need for the complexity suggested above.
>>
>> The behaviour I've seen from the scheduler on ARM (on a quad-CPU
>> platform, observing system activity with top reporting the last CPU
>> number used by each thread) is that threads often migrate between
>> CPUs - especially in the case of, e.g., one or two threads running
>> on a quad-CPU system.
>>
>> Given that, I'm really not sure what the use of reading and making
>> decisions on the current CPU number would be within a program -
>> unless the thread is bound to a particular CPU or group of CPUs,
>> you can't rely on still being on the reported CPU by the time the
>> system call returns.
>
> As I understand it, the idea is -not- to eliminate synchronization
> like we do with per-CPU variables in the kernel, but rather to
> reduce the average cost of synchronization. For example, there
> might be a separate data structure per CPU, each structure guarded
> by its own lock. A thread could sample the current running CPU,
> acquire that CPU's corresponding lock, and operate on that CPU's
> structure. This would work correctly even with an arbitrarily high
> number of preemptions/migrations, but would have improved
> performance (compared to a single global lock) in the common case
> where there are no preemptions/migrations.
>
> This approach can also be used in conjunction with Paul Turner's
> per-CPU atomics.
>
> Make sense, or am I missing your point?

Russell's point is more about accessing a given thread's cpu_cache
variable from other threads/cores, which is beyond what is needed for
restartable critical sections.

Independently of the usefulness of reading other threads' cpu_cache
values to see their current CPU, I would advocate checking the natural
alignment of cpu_cache and returning EINVAL if it is not aligned. Even
for thread-local reads, we care about ensuring there is no load tearing
when reading this variable. The kernel updating this variable while the
owning user-space thread reads it is very similar to having a variable
updated by a signal handler nested on top of the thread. Requiring
natural alignment keeps this simple and reduces the testing state
space.
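To make the tearing concern concrete, here is a rough sketch. The
helper and function names are mine and purely illustrative, and the
exact registration ABI is an assumption:

#include <errno.h>
#include <stdint.h>

/* Userspace side: the variable the kernel updates. An int32_t with
 * default alignment is already naturally aligned here; _Alignas just
 * documents the requirement explicitly. */
static __thread _Alignas(4) int32_t cpu_cache = -1;

/* Kernel-side sketch of the check being advocated: refuse to register
 * a misaligned address, so the kernel's update is always a single
 * aligned store that cannot tear. */
static int check_cpu_cache_alignment(unsigned long uaddr)
{
	if (uaddr & (sizeof(int32_t) - 1))
		return -EINVAL;
	return 0;
}

/* Thread-local read: a single aligned load. As with a variable
 * written by a nested signal handler, the reader then sees either the
 * old or the new CPU number in full, never a torn mix. */
static inline int32_t read_cpu_cache(void)
{
	return *(volatile int32_t *)&cpu_cache;
}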
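A minimal userspace sketch of the per-CPU-lock pattern Paul describes
above, with sched_getcpu() standing in for the cheap cpu_cache read
(names and sizes are illustrative):

/* Build: gcc -pthread percpu_lock.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

#define NR_BUCKETS 64	/* illustrative; size for the machine */

static struct bucket {
	pthread_mutex_t lock;
	long count;		/* data guarded by 'lock' */
} buckets[NR_BUCKETS] = {
	/* GCC range initializer */
	[0 ... NR_BUCKETS - 1] = { PTHREAD_MUTEX_INITIALIZER, 0 }
};

static void bucket_inc(void)
{
	/* The sampled CPU may be stale by the time the lock is taken;
	 * that only costs contention, never correctness, because each
	 * bucket is fully protected by its own lock. */
	int cpu = sched_getcpu();

	if (cpu < 0)
		cpu = 0;
	cpu %= NR_BUCKETS;

	pthread_mutex_lock(&buckets[cpu].lock);
	buckets[cpu].count++;
	pthread_mutex_unlock(&buckets[cpu].lock);
}

static void *worker(void *unused)
{
	(void)unused;
	for (int i = 0; i < 1000000; i++)
		bucket_inc();
	return NULL;
}

int main(void)
{
	pthread_t t[4];
	long total = 0;

	for (int i = 0; i < 4; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (int i = 0; i < 4; i++)
		pthread_join(t[i], NULL);

	for (int i = 0; i < NR_BUCKETS; i++)
		total += buckets[i].count;
	printf("total = %ld\n", total);	/* expect 4000000 */
	return 0;
}

The win over a single global lock comes entirely from the common
no-migration case; a migration between the CPU sample and the lock
acquisition only means briefly contending on another CPU's bucket.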
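And for reference, the cross-thread usage Russell questions would look
roughly like this (illustrative types and names; as he notes, the
observed CPU is only a placement hint that may be stale by the time it
is acted upon):

#include <pthread.h>
#include <stdint.h>

/* Illustrative per-thread control structure kept on a list the
 * dispatcher walks. 'last_cpu' would be kept up to date by each
 * thread, e.g. from its kernel-updated cpu_cache. */
struct thread_ctl {
	struct thread_ctl *next;
	pthread_t tid;
	int32_t last_cpu;	/* advisory: may be stale when read */
};

/* Dispatcher: pick a thread last seen on 'cpu'. The result is a
 * placement hint only; thread Y may have migrated to another CPU by
 * the time an event reaches it, which is Russell's point. */
struct thread_ctl *pick_thread_on_cpu(struct thread_ctl *head, int cpu)
{
	for (struct thread_ctl *t = head; t; t = t->next)
		if (*(volatile int32_t *)&t->last_cpu == cpu)
			return t;
	return NULL;
}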
Thoughts?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com