All the TP futex lock waiters are serialized in the kernel using a kernel mutex which acts like a wait queue. The order at which the waiters popped out from the wait queue will affect performance when exclusive (writer) and shared (reader) lock waiters are mixed in the queue. The worst case scenarios will be something like RWRWRW... in term of ordering where no parallelism in term of lock ownership can happen. To improve throughput, the readers are now grouped together as a single entity in the wait queue. The first reader that enters the mutex wait queue will become the leader of the group. The other readers will spin on the group leader via an OSQ lock. When the futex is put in the shared mode, either by the group leader or by an external reader the spinning readers in the reader group will then acquire the read lock successively. The spinning readers in the group will get disbanded when the group leader goes to sleep. In this case, all those readers will go into the mutex wait queue alone and wait for their turn to acquire the TP futex. On a 2-socket 36-core E5-2699 v3 system (HT off) running on a 4.10 based kernel, the performance of TP rwlock with 1:1 reader/writer ratio versus one based on the wait-wake futexes as well as the Glibc rwlock with a microbenchmark (1 worker thread per cpu core) running for 10s were as follows: WW futex TP futex Glibc -------- -------- ----- Total locking ops 35,707,234 58,645,434 10,930,422 Per-thread avg/sec 99,149 162,887 30,362 Per-thread min/sec 93,190 38,641 29,872 Per-thread max/sec 104,213 225,983 30,708 Write lock futex calls 11,161,534 14,094 - Write unlock futex calls 8,696,121 167 - Read lock futex calls 1,717,863 6,659 - Read unlock futex calls 4,316,418 323 - It can be seen that the throughput of the TP futex is close to 2X the WW futex and almost 6X the Glibc version in this particular case. The following table shows the CPU cores scaling for the average per-thread locking rates (ops/sec): WW futex TP futex Glibc -------- -------- ----- 9 threads (1 socket) 422,014 647,006 190,236 18 threads (2 sockets) 197,145 330,353 66,934 27 threads (2 sockets) 127,947 213,417 43,641 36 threads (2 sockets) 99,149 162,887 30,362 The following tables shows the average per-thread locking rates (36 threads) with different reader percentages: WW futex TP futex Glibc -------- -------- ----- 90% readers 124,426 159,657 60,208 95% readers 148,905 152,315 68,666 100% reader 210,029 191,759 84,316 The rwlocks based on the WW futexes and Glibc prefer readers by default. So it can be seen that the performance of the WW futex and Glibc rwlocks increased with higher reader percentages. The TP futex rwlock, however, prefers writers a bit more than readers. So the performance didn't increase as the reader percentage rises. The WW futexes and Glibc rwlocks also have a writer-preferring version. Their performance with the same tests are as follows: WW futex TP futex Glibc -------- -------- ----- 90% readers 87,866 159,657 26,057 95% readers 93,193 152,315 32,611 100% reader 193,267 191,759 88,440 With separate 18 reader and 18 writer threads, the the average per-thread reader and writer locking rates with different load latencies (L, default = 1) and reader-preferring rwlocks are: WW futex TP futex Glibc -------- -------- ----- Reader rate, L=1 381,411 74,059 164,184 Writer rate, L=1 0 240,841 0 Reader rate, L=5 330,732 57,361 150,691 Writer rate, L=5 0 175,400 0 Reader rate, L=50 304,505 17,355 97,066 Writer rate, L=50 0 114,504 0 The corresponding locking rates with writer-preferring rwlocks are: WW futex TP futex Glibc -------- -------- ----- Reader rate, L=1 138,805 74,059 54 Writer rate, L=1 31,113 240,841 56,424 Reader rate, L=5 114,414 57,361 24 Writer rate, L=5 28,062 175,400 52,039 Reader rate, L=50 88,619 51,483 5 Writer rate, L=50 21,005 98,005 49,885 With both the WW futex and Glibc rwlocks, lock starvation happened for the writers with reader-preferring rwlocks. For writer preferring rwlocks, the WW futex one fared better. The Glibc one, however, was close to starving the readers. The TP futex prefers writer in general, but the actual preference depends on the timing. Lock starvation should not happen on the TP futexes as long as the underlying kernel mutex is lock starvation free which is the case for 4.10 and later kernel. Signed-off-by: Waiman Long <longman@xxxxxxxxxx> --- kernel/futex.c | 135 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-- 1 file changed, 132 insertions(+), 3 deletions(-) diff --git a/kernel/futex.c b/kernel/futex.c index b690a79..5e69c44 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -239,6 +239,16 @@ struct futex_state { u32 handoff_pid; /* For TP futexes only */ int locksteal_disabled; /* For TP futexes only */ + /* + * To improve reader throughput in TP futexes, all the readers + * in the mutex queue are grouped together. The first reader in the + * queue will set first_reader, then the rest of the readers will + * spin on the first reader via the OSQ without actually entering + * the mutex queue. + */ + struct optimistic_spin_queue reader_osq; + struct task_struct *first_reader; + enum futex_type type; union futex_key key; }; @@ -3388,14 +3398,21 @@ void exit_robust_list(struct task_struct *curr) * 0 - steals the lock * 1 - top waiter (mutex owner) acquires the lock * 2 - handed off the lock - * 2) bits 08-15: reserved - * 3) bits 15-30: how many times the task has slept or yield to scheduler + * 2) bit 08: 1 if reader spins alone (shared lock only) + * bit 09: 1 if reader is a spin group leader (shared lock only) + * bits 10-16: reserved + * 3) bits 16-30: how many times the task has slept or yield to scheduler * in futex_spin_on_owner(). */ #define TP_LOCK_STOLEN 0 #define TP_LOCK_ACQUIRED 1 #define TP_LOCK_HANDOFF 2 + +#define TP_READER_ALONE 1 +#define TP_READER_GROUP 2 + #define TP_STATUS_SLEEP(val, sleep) ((val)|((sleep) << 16)) +#define TP_STATUS_ALONE(val, alone) ((val)|((alone) << 8)) /** * lookup_futex_state - Looking up the futex state structure. @@ -3438,6 +3455,7 @@ void exit_robust_list(struct task_struct *curr) state->type = TYPE_TP; state->key = *key; state->locksteal_disabled = false; + osq_lock_init(&state->reader_osq); list_add(&state->fs_list, &hb->fs_head); WARN_ON(atomic_read(&state->refcount) != 1); @@ -3895,6 +3913,98 @@ static int futex_spin_on_owner(u32 __user *uaddr, const u32 vpid, return (ret < 0) ? ret : TP_STATUS_SLEEP(ret, nsleep); } +/** + * futex_spin_on_reader - Optimistically spin on first reader + * @uaddr: futex address + * @pfirst: pointer to first reader + * @state: futex state object + * @timeout: hrtimer_sleeper structure + * + * Reader performance will depend on the placement of readers within the + * mutex queue. For a queue of 4 readers and 4 writers, for example, the + * optimal placement will be either RRRRWWWW or WWWWRRRR. A worse case will + * be RWRWRWRW. + * + * One way to avoid the worst case scenario is to gather all the readers + * together as a single unit and place it into the mutex queue. This is done + * by having the first reader puts its task structure into state->first_reader + * and the rest of the readers optimistically spin on it instead of entering + * the mutex queue. + * + * On exit from this function, the reader would have + * 1) acquired a read lock on the futex; + * 2) become the first reader and goes into the mutex queue; or + * 3) seen the first reader slept and needs to go into the mutex queue alone. + * + * Any fault on accessing the futex will cause it to return 0 and goes into + * the mutex queue. + * + * Return: > 0 - acquired the read lock + * <= 0 - need to goes into the mutex queue + */ +static int futex_spin_on_reader(u32 __user *uaddr, struct task_struct **pfirst, + struct futex_state *state, + struct hrtimer_sleeper *timeout) +{ + struct task_struct *first_reader; + int ret = 0; + + preempt_disable(); + +retry: + first_reader = READ_ONCE(state->first_reader); + if (!first_reader) + first_reader = cmpxchg(&state->first_reader, NULL, current); + if (!first_reader) + goto out; /* Became the first reader */ + + if (!osq_lock(&state->reader_osq)) + goto reschedule; + + for (;;) { + u32 uval; + + if (!state->locksteal_disabled) { + ret = futex_trylock_preempt_disabled(uaddr, + FUTEX_SHARED, &uval); + /* + * Return if lock acquired or an error happened + */ + if (ret) + break; + } + + /* + * Reread the first reader value again. + */ + first_reader = READ_ONCE(state->first_reader); + if (!first_reader) + first_reader = cmpxchg(&state->first_reader, NULL, + current); + if (!first_reader || !first_reader->on_cpu) + break; + + if (need_resched()) { + osq_unlock(&state->reader_osq); + goto reschedule; + } + + cpu_relax(); + } + osq_unlock(&state->reader_osq); +out: + *pfirst = first_reader; + preempt_enable(); + return ret; + +reschedule: + /* + * Yield the CPU and retry later. + */ + schedule_preempt_disabled(); + goto retry; +} + /* * Userspace tried a 0 -> TID atomic transition of the futex value * and failed. The kernel side here does the whole locking operation. @@ -3921,9 +4031,11 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags, { struct hrtimer_sleeper timeout, *to; struct futex_hash_bucket *hb; + struct task_struct *first_reader = NULL; union futex_key key = FUTEX_KEY_INIT; struct futex_state *state; u32 uval, vpid = shared ? FUTEX_SHARED : task_pid_vnr(current); + int alone = 0; int ret; /* @@ -3989,6 +4101,17 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags, hrtimer_start_expires(&to->timer, HRTIMER_MODE_ABS); /* + * Spin on the first reader and return if we acquired the read lock. + */ + if (shared) { + int rspin_ret = futex_spin_on_reader(uaddr, &first_reader, + state, to); + if (rspin_ret > 0) + goto out_put_state_key; + alone = first_reader ? TP_READER_ALONE : TP_READER_GROUP; + } + + /* * Acquiring the serialization mutex. * * If we got a signal or has some other error, we need to abort @@ -4016,6 +4139,12 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags, mutex_unlock(&state->mutex); out_put_state_key: + /* + * We will be the first reader if (first_reader == NULL). + */ + if (shared && !first_reader) + WRITE_ONCE(state->first_reader, NULL); + if (!put_futex_state_unlocked(state)) { /* * May need to free the futex state object and so must be @@ -4032,7 +4161,7 @@ static noinline int futex_lock(u32 __user *uaddr, unsigned int flags, hrtimer_cancel(&to->timer); destroy_hrtimer_on_stack(&to->timer); } - return ret; + return (ret < 0) ? ret : TP_STATUS_ALONE(ret, alone); } /* -- 1.8.3.1 -- To unsubscribe from this list: send the line "unsubscribe linux-doc" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html