[PATCH-tip v6 19/22] TP-futex, doc: Update TP futexes document on shared locking

Waiman Long <longman@xxxxxxxxxx> · Wed, 22 Mar 2017 13:38:55 -0400

The tp-futex.txt was updated to add description about shared locking
support.

Running the futex locking microbenchmark on a 2-socket 36-core E5-2699
v3 system (HT off) running on a 4.11 based kernel, the performance
of TP rwlock with 1:1 reader/writer ratio versus one based on the
wait-wake futexes as well as the Glibc rwlock with a microbenchmark
(1 worker thread per cpu core) running for 10s with a load-to-load
latency of 10 were as follows:

                          WW futex       TP futex         Glibc
                          --------       --------         -----
Total locking ops        30,992,772     49,786,118     14,856,400
Per-thread avg/sec           86,004        138,280         41,267
Per-thread min/sec           81,158        126,338         40,930
Per-thread max/sec           93,940        158,781         41,929
Write lock slowpaths      6,806,849      7,717,125          -
Write unlock slowpaths    8,996,339              5          -
Read lock slowpaths       2,317,716      3,883,809          -
Read unlock slowpaths     4,325,494              2          -

The throughput of TP futex as a read/write lock was about 1.6X of
the WW futex and was about 3.4X of glibc.

The following tables shows the average per-thread locking rates
(36 threads) with different reader percentages:

                          WW futex       TP futex       Glibc
                          --------       --------       -----
   90% readers            112,868        113,925        62,321
   95% readers            132,809        110,293        67,664
  100% reader             152,015        167,902        79,087

With separate 18 reader and 18 writer threads, the the average
per-thread reader and writer locking rates with different load
latencies (l) and reader-preferring rwlocks were:

                          WW futex       TP futex       Glibc
                          --------       --------       -----
 Reader rate, l=1         330,787        278,192       172,245
 Writer rate, l=1               0         13,804             0

 Reader rate, l=5         292,609        267,004       161,246
 Writer rate, l=5               0         14,995             0

 Reader rate, l=50        324,042        251,083       131,786
 Writer rate, l=50              0          4,658             0

The corresponding locking rates with writer-preferring rwlocks are:

                          WW futex       TP futex       Glibc
                          --------       --------       -----
 Reader rate, L=1         118,959        276,180            68
 Writer rate, L=1          24,966         24,739        75,969

 Reader rate, L=5         123,325        280,492            81
 Writer rate, L=5          28,328         21,874        82,331

 Reader rate, L=50         88,673        280,288             4
 Writer rate, L=50         20,442          5,816        78,522

With both the WW futex and Glibc rwlocks, lock starvation happened
for the writers with reader-preferring rwlocks. For writer preferring
rwlocks, the WW futex one fared better. The Glibc one, however,
was close to starving the readers.

Lock starvation should not happen on the TP futexes as long as the
underlying kernel mutex is lock starvation free which is the case
for 4.10 and later kernels.

On a 2-socket 10-core and 80-thread Power8 system, the benchmark
results were:
                          WW futex       TP futex         Glibc
                          --------       --------         -----
Total locking ops         8,713,156     18,897,676      4,707,780
Per-thread avg/sec           10,888         23,598          5,884
Per-thread min/sec            9,097         22,028          5,763
Per-thread max/sec           13,106         24,208          6,007
Write lock slowpaths      2,100,149      4,078,698          -
Write unlock slowpaths    2,324,093              6          -
Read lock slowpaths         500,372      2,160,265          -
Read unlock slowpaths     1,210,698              1          -

In this case, the TP futex is more than 2X the performance of the
WW futex.

Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
---
 Documentation/tp-futex.txt | 170 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 150 insertions(+), 20 deletions(-)

diff --git a/Documentation/tp-futex.txt b/Documentation/tp-futex.txt
index d040c18..5c781a5 100644
--- a/Documentation/tp-futex.txt
+++ b/Documentation/tp-futex.txt
@@ -80,22 +80,62 @@ Userspace locking can improve throughput by reducing lock hold time,
 but it also has the risk of lock starvation. So kernel locking should
 be used after a number of userspace locking failures.
 
-Inside the kernel, a kernel mutex is used for serialization among
-the futex waiters. Only the top lock waiter which is the owner of
-the serialization mutex is allowed to continuously spin and attempt
-to acquire the lock.  Other lock waiters will have one attempt to
-steal the lock before entering the mutex queues.
+A shared lock acquirer, on the other hand, has to do either one of
+the following 2 actions depends on the current value of the futex:
+
+ 1) If current futex value is 0, atomically put (FUTEX_SHARED + 1)
+    into the lower 30 bits of the futex.
+
+ 2) If the FUTEX_SHARED bit is set and the FUTEX_SHARED_UNLOCK bit
+    isn't, atomically increment the reader count in the lower 24 bits.
+    The other unused bits (24-27) can be used for other management purpose
+    and will be ignored by the kernel.
+
+Any failure to both of the above actions will lead to the following
+new FUTEX_LOCK_SHARED futex(2) syscall.
+
+  futex(uaddr, FUTEX_LOCK_SHARED, prefer_reader, timeout, NULL, 0);
+
+The special FUTEX_SHARED bit (bit 30) is used to distinguish between a
+shared lock and an exclusive lock. If the FUTEX_SHARED bit isn't set,
+the lower 29 bits contain the TID of the exclusive lock owner. If
+the FUTEX_SHARED bit is set, the lower 29 bits contain the number of
+tasks owning the shared lock.  Therefore, only 29 bits are allowed
+to be used by the TID.
+
+The FUTEX_SHARED_UNLOCK bit can be used by userspace to prevent racing
+and to indicate that a shared unlock is in progress as it will disable
+any shared trylock attempt in the kernel.
+
+The prefer_reader flag is used to control if the lock acquisition
+should be done in a prefer reader mode (if set) or a neutral mode
+where there is no preference to either reader or writer.
+
+Inside the kernel, a kernel mutex is used for serialization among the
+futex waiters. Only the top lock waiter which is the owner of the
+serialization mutex is allowed to continuously spin and attempt to
+acquire the lock.  Other lock waiters will have one or two attempts
+to steal the lock before entering the mutex queues.
 
 When the exclusive futex lock owner is no longer running, the top
 waiter will set the FUTEX_WAITERS bit before going to sleep. This is
 to make sure the futex owner will go into the kernel at unlock time
 to wake up the top waiter.
 
+When the futex is shared by multiple owners, there is no way to
+determine if all the shared owners are running or not. In this case,
+the top waiter will spin for a fixed amount of time if no reader
+activity is observed before giving up and going to sleep. Any change
+in the reader count will be considered reader activities and so will
+reset the timer. However, when the hand-off threshold has been reached,
+reader optimistic spinning timer reset will be disabled automatically.
+
 The return values of the above futex locking syscall, if non-negative,
-are status code that consists of 2 fields - the lock acquisition code
-(bits 0-7) and the number of sleeps (bits 8-30) in the optimistic
-spinning loop before acquiring the futex. A negative returned value
-means an error has happened.
+are status code that consists of 3 fields - the lock acquisition code
+(bits 0-7) and the number of sleeps (bits 16-30) in the optimistic
+spinning loop before acquiring the futex. Bits 8-15 are reserved
+for future extension. A negative returned value means an error has
+happened.
 
 The lock acquisition code can have the following values:
  a) 0 - lock stolen as non-top waiter
@@ -108,14 +148,21 @@ issue a FUTEX_UNLOCK futex(2) syscall to wake up the top waiter.
 
   futex(uaddr, FUTEX_UNLOCK, 0, NULL, NULL, 0);
 
-A return value of 1 from the FUTEX_UNLOCK futex(2) syscall indicates
-a task has been woken up. The syscall returns 0 if no sleeping task
-is woken. A negative value will be returned if an error happens.
+For a shared lock owner, unlock is done by atomically decrementing
+the reader count by one. If the resultant futex value has no reader
+remaining, the process has to atomically change the futex value from
+FUTEX_SHARED to 0. If that fails, it has to issue a FUTEX_UNLOCK_SHARED
+futex(2) syscall to wake up the top waiter.
+
+A return value of 1 from the FUTEX_UNLOCK or FUTEX_UNLOCK_SHARED
+futex(2) syscall indicates a task has been woken up. The syscall
+returns 0 if no sleeping task is woken. A negative value will be
+returned if an error happens.
 
-The error number returned by a FUTEX_UNLOCK syscall on an empty futex
-can be used to decide if the TP futex functionality is implemented
-in the kernel. If it is present, an EPERFM error will be returned.
-Otherwise it will return ENOSYS.
+The error number returned by a FUTEX_UNLOCK or FUTEX_UNLOCK_SHARED
+syscall on an empty futex can be used to decide if the TP futex
+functionality is implemented in the kernel. If it is present, an
+EPERFM error will be returned.  Otherwise it will return ENOSYS.
 
 TP futexes require the kernel to have SMP support as well as support
 for the cmpxchg functionality. For architectures that don't support
@@ -137,6 +184,13 @@ situation. It is possible for the kernel to handle this internally,
 but it will probably reduce performance. Userspace code can handle
 this more efficiently.
 
+Shared locking TP futexes, on the other hand, cannot be used with
+robust futexes. The unexpected death of a shared lock owner may
+leave the futex permanently in the shared state leading to deadlock
+of exclusive lock waiters. If necessary, some kind of timeout and
+userspace lock management system have to be used to resolve this
+deadlock problem.
+
 Usage Scenario
 --------------
 
@@ -148,6 +202,24 @@ used when the critical section is long and prone to sleeping. However,
 it may not have the performance gain when compared with a wait-wake
 futex in this case.
 
+A TP futex can also be used to implement a userspace reader-writer
+lock (rwlock) where the writers use the FUTEX_LOCK and FUTEX_UNLOCK
+futex(2) syscalls and the readers used the FUTEX_LOCK_SHARED and
+FUTEX_UNLOCK_SHARED futex(2) syscalls. Beside providing a better
+performance compared with wait-wake futexes, other advantages of
+using the the TP futexes for rwlocks are:
+
+ 1) The simplicity and ease of maintenance of the userspace locking
+    codes. There is only one version of rwlock using TP futex versus
+    a reader-preferring variant and/or a writer-preferring variant
+    when using the wait-wake futexes.
+
+ 2) TP futex is lock-starvation free. That may not be the case
+    with rwlocks implemented with wait-wake futexes especially the
+    reader-preferring variants where the writers can be starved of
+    the lock under certain circumstances. Some writer-preferring
+    rwlock variants may also starve readers of getting the lock.
+
 The wait-wake futexes are more versatile as they can also be used to
 implement other locking primitives like semaphores or conditional
 variables.  So the TP futex is not a direct replacement of the
@@ -157,12 +229,12 @@ futex is likely a better option than the wait-wake futex.
 Sample Code
 -----------
 
-The following are sample code to implement simple mutex lock and
-unlock functions.
+The following are sample code to implement simple mutex or writer
+lock and unlock functions.
 
 __thread int thread_id;
 
-void mutex_lock(int *faddr)
+void write_lock(int *faddr)
 {
 	if (cmpxchg(faddr, 0, thread_id) == 0)
 		return;
@@ -170,7 +242,7 @@ void mutex_lock(int *faddr)
 		;
 }
 
-void mutex_unlock(int *faddr)
+void write_unlock(int *faddr)
 {
 	int old, fval;
 
@@ -178,3 +250,61 @@ void mutex_unlock(int *faddr)
 		return;
 	futex(faddr, FUTEX_UNLOCK, 0, NULL, NULL, 0);
 }
+
+The following sample code can be used to implement shared or reader
+lock and unlock functions.
+
+void read_lock(int *faddr)
+{
+	int val = cmpxchg(faddr, 0, FUTEX_SHARED + 1);
+
+	if (!val)
+		return;
+	for (;;) {
+		int old = val, new;
+
+		if (!old)
+			new = FUTEX_SHARED + 1;
+		else if (!(old & FUTEX_SHARED) ||
+			 (old & ((~FUTEX_SHARED_TID_MASK)|
+				   FUTEX_SHARED_UNLOCK)))
+			break;
+		else
+			new = old + 1;
+		val = cmpxchg(faddr, old, new);
+		if (val == old)
+			return;
+	}
+	while (futex(faddr, FUTEX_LOCK_SHARED, ...) < 0)
+		;
+}
+
+void read_unlock(int *faddr)
+{
+	int val = xadd(faddr, -1) - 1;
+	int old;
+
+	for (;;) {
+		/*
+		 * Return if not the last reader, not in shared locking
+		 * mode or the unlock bit has been set.
+		 */
+		if (!(val & FUTEX_SHARED) ||
+		     (val & (FUTEX_SHARED_UNLOCK|FUTEX_SCNT_MASK)))
+			return;
+		if (val & ~FUTEX_SHARED_TID_MASK) {
+			old = cmpxchg(faddr, val, val|FUTEX_SHARED_UNLOCK);
+			if (old == val)
+				break;	/* Do FUTEX_UNLOCK_SHARED */
+		} else {
+			old = cmpxchg(faddr, val, 0);
+			if (old == val)
+				return;
+		}
+		val = old;
+	}
+	futex(faddr, FUTEX_UNLOCK_SHARED, ...);
+}
+
+More elaborate sample locking codes for the TP futexes are also
+available at the tools/perf/bench/futex-locks.c file.
-- 
1.8.3.1

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html