The patch titled
     Subject: ipc/sem.c: Fix complex_count vs. simple op race
has been added to the -mm tree.  Its filename is
     ipc-semc-fix-complex_count-vs-simple-op-race.patch

This patch should soon appear at
    http://ozlabs.org/~akpm/mmots/broken-out/ipc-semc-fix-complex_count-vs-simple-op-race.patch
and later at
    http://ozlabs.org/~akpm/mmotm/broken-out/ipc-semc-fix-complex_count-vs-simple-op-race.patch

Before you just go and hit "reply", please:
   a) Consider who else should be cc'ed
   b) Prefer to cc a suitable mailing list as well
   c) Ideally: find the original patch on the mailing list and do a
      reply-to-all to that, adding suitable additional cc's

*** Remember to use Documentation/SubmitChecklist when testing your code ***

The -mm tree is included into linux-next and is updated
there every 3-4 working days

------------------------------------------------------
From: Manfred Spraul <manfred@xxxxxxxxxxxxxxxx>
To: "H. Peter Anvin" <hpa@xxxxxxxxx>, Peter Zijlstra <peterz@xxxxxxxxxxxxx>,
    Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>, Davidlohr Bueso <dave@xxxxxxxxxxxx>
Cc: LKML <linux-kernel@xxxxxxxxxxxxxxx>, Thomas Gleixner <tglx@xxxxxxxxxxxxx>,
    Ingo Molnar <mingo@xxxxxxx>, 1vier1@xxxxxx, felixh@xxxxxxxxxxxxxxxxxxxxxxxx,
    Manfred Spraul <manfred@xxxxxxxxxxxxxxxx>, <stable@xxxxxxxxxxxxxxx>
Subject: ipc/sem.c: Fix complex_count vs. simple op race
Date: Sat, 25 Jun 2016 19:37:51 +0200
Message-Id: <1466876272-3824-2-git-send-email-manfred@xxxxxxxxxxxxxxxx>
X-Mailer: git-send-email 2.5.5
In-Reply-To: <1466876272-3824-1-git-send-email-manfred@xxxxxxxxxxxxxxxx>
References: <1466876272-3824-1-git-send-email-manfred@xxxxxxxxxxxxxxxx>

Commit 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()") introduced a race:

sem_lock() has a fast path that allows parallel simple operations.
There are two reasons why a simple operation cannot run in parallel:
- a non-simple operation is ongoing (sma->sem_perm.lock held)
- a complex operation is sleeping (sma->complex_count != 0)

As both facts are stored independently, a thread can bypass the current
checks by sleeping in the right positions.  See below for more details
(or kernel bugzilla 105651).

The patch fixes that by creating one variable (complex_mode) that tracks
both reasons why parallel operations are not possible.

The patch also updates stale documentation regarding the locking.

With regards to stable kernels:
The patch is required for all kernels that include commit 6d07b68ce16a
("ipc/sem.c: optimize sem_lock()") (3.10?).  The alternative is to
revert the patch that introduced the race.
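
To make the intended protocol easier to follow, here is a rough user-space
analogue of the scheme the patch implements below.  It uses C11 atomics and
pthread locks in place of the kernel's spinlocks, smp_mb() and
smp_store_release(); all identifiers (sem_array_model, NSEMS,
simple_op_trylock, ...) are made up for illustration and do not exist in
ipc/sem.c:

/*
 * Simplified user-space model of the locking protocol, illustration only.
 * complex_mode plays the role of sma->complex_mode; the pthread mutexes
 * stand in for sem_perm.lock and the per-semaphore spinlocks.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <pthread.h>

#define NSEMS 4

struct sem_array_model {
	pthread_mutex_t global_lock;		/* models sma->sem_perm.lock */
	pthread_mutex_t sem_lock[NSEMS];	/* models the per-semaphore locks */
	atomic_bool complex_mode;		/* "no parallel simple ops" */
};

/* Slow path: caller holds global_lock. */
static void complexmode_enter(struct sem_array_model *sma)
{
	if (atomic_load(&sma->complex_mode))
		return;				/* already in complex mode */

	atomic_store_explicit(&sma->complex_mode, true, memory_order_relaxed);

	/* full barrier (mirrors smp_mb()): the store above must be
	 * visible before we probe the per-semaphore locks */
	atomic_thread_fence(memory_order_seq_cst);

	/* wait until every simple op that is already running has finished */
	for (int i = 0; i < NSEMS; i++) {
		pthread_mutex_lock(&sma->sem_lock[i]);
		pthread_mutex_unlock(&sma->sem_lock[i]);
	}
}

/* Fast path for a simple op on semaphore semnum. */
static bool simple_op_trylock(struct sem_array_model *sma, int semnum)
{
	/* initial check is an optimization only, no ordering required */
	if (atomic_load_explicit(&sma->complex_mode, memory_order_relaxed))
		return false;

	pthread_mutex_lock(&sma->sem_lock[semnum]);

	/* full barrier: our lock must be visible before we re-check */
	atomic_thread_fence(memory_order_seq_cst);

	if (!atomic_load_explicit(&sma->complex_mode, memory_order_relaxed))
		return true;			/* fast path: keep sem_lock[semnum] */

	pthread_mutex_unlock(&sma->sem_lock[semnum]);
	return false;				/* fall back to the global lock */
}

/* Called while dropping the global lock. */
static void complexmode_tryleave(struct sem_array_model *sma, int complex_count)
{
	if (complex_count)
		return;				/* complex ops still sleeping */

	/* release: earlier writes must be visible before simple ops restart */
	atomic_store_explicit(&sma->complex_mode, false, memory_order_release);
}

The essential point is the pairing of full barriers: the slow path publishes
complex_mode before probing the per-semaphore locks, and the fast path
publishes its lock acquisition before re-reading complex_mode, so at least
one side always observes the other.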
Background:
Here is the race of the current implementation:

Thread A: (simple op)
- does the first "sma->complex_count == 0" test

Thread B: (complex op)
- does sem_lock(): This includes an array scan. But the scan can't
  find Thread A, because Thread A does not own sem->lock yet.
- the thread does the operation, increases complex_count,
  drops sem_lock, sleeps

Thread A:
- spin_lock(&sem->lock), spin_is_locked(sma->sem_perm.lock)
- sleeps before the complex_count test

Thread C: (complex op)
- does sem_lock (no array scan, complex_count==1)
- wakes up Thread B.
- decrements complex_count

Thread A:
- does the complex_count test

Bug:
Now both thread A and thread C operate on the same array, without
any synchronization.

Full memory barriers are required to synchronize changes of
complex_mode and the lock operations.

Fixes: 6d07b68ce16a ("ipc/sem.c: optimize sem_lock()")
Reported-by: felixh@xxxxxxxxxxxxxxxxxxxxxxxx
Signed-off-by: Manfred Spraul <manfred@xxxxxxxxxxxxxxxx>
Cc: <stable@xxxxxxxxxxxxxxx>
---

 include/linux/sem.h |   1 +
 ipc/sem.c           | 122 ++++++++++++++++++++++++++++++----------------------
 2 files changed, 71 insertions(+), 52 deletions(-)

diff --git a/include/linux/sem.h b/include/linux/sem.h
index 976ce3a..d0efd6e 100644
--- a/include/linux/sem.h
+++ b/include/linux/sem.h
@@ -21,6 +21,7 @@ struct sem_array {
 	struct list_head	list_id;	/* undo requests on this array */
 	int			sem_nsems;	/* no. of semaphores in array */
 	int			complex_count;	/* pending complex operations */
+	bool			complex_mode;	/* no parallel simple ops */
 };
 
 #ifdef CONFIG_SYSVIPC
diff --git a/ipc/sem.c b/ipc/sem.c
index ae72b3c..538f43a 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -162,14 +162,21 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it);
 
 /*
  * Locking:
+ * a) global sem_lock() for read/write
  *	sem_undo.id_next,
  *	sem_array.complex_count,
- *	sem_array.pending{_alter,_cont},
- *	sem_array.sem_undo: global sem_lock() for read/write
- *	sem_undo.proc_next: only "current" is allowed to read/write that field.
+ *	sem_array.complex_mode
+ *	sem_array.pending{_alter,_const},
+ *	sem_array.sem_undo
  *
+ * b) global or semaphore sem_lock() for read/write:
  *	sem_array.sem_base[i].pending_{const,alter}:
- *		global or semaphore sem_lock() for read/write
+ *	sem_array.complex_mode (for read)
+ *
+ * c) special:
+ *	sem_undo_list.list_proc:
+ *	* undo_list->lock for write
+ *	* rcu for read
  */
 
 #define sc_semmsl	sem_ctls[0]
@@ -260,23 +267,25 @@ static void sem_rcu_free(struct rcu_head *head)
 }
 
 /*
- * Wait until all currently ongoing simple ops have completed.
+ * Enter the mode suitable for non-simple operations:
  * Caller must own sem_perm.lock.
- * New simple ops cannot start, because simple ops first check
- *	that sem_perm.lock is free.
- *	that a) sem_perm.lock is free and b) complex_count is 0.
  */
-static void sem_wait_array(struct sem_array *sma)
+static void complexmode_enter(struct sem_array *sma)
 {
 	int i;
 	struct sem *sem;
 
-	if (sma->complex_count) {
-		/* The thread that increased sma->complex_count waited on
-		 * all sem->lock locks. Thus we don't need to wait again.
-		 */
+	if (sma->complex_mode) {
+		/* We are already in complex_mode. Nothing to do */
 		return;
 	}
+	WRITE_ONCE(sma->complex_mode, true);
+
+	/* We need a full barrier:
+	 * The write to complex_mode must be visible
+	 * before we read the first sem->lock spinlock state.
+	 */
+	smp_mb();
 
 	for (i = 0; i < sma->sem_nsems; i++) {
 		sem = sma->sem_base + i;
@@ -285,6 +294,27 @@ static void sem_wait_array(struct sem_array *sma)
 }
 
 /*
+ * Try to leave the mode that disallows simple operations:
+ * Caller must own sem_perm.lock.
+ */
+static void complexmode_tryleave(struct sem_array *sma)
+{
+	if (sma->complex_count) {
+		/* Complex ops are sleeping.
+		 * We must stay in complex mode
+		 */
+		return;
+	}
+	/*
+	 * Immediately after setting complex_mode to false,
+	 * a simple op can start. Thus: all memory writes
+	 * performed by the current operation must be visible
+	 * before we set complex_mode to false.
+	 */
+	smp_store_release(&sma->complex_mode, false);
+}
+
+/*
  * If the request contains only one semaphore operation, and there are
  * no complex transactions pending, lock only the semaphore involved.
  * Otherwise, lock the entire semaphore array, since we either have
@@ -300,56 +330,40 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 		/* Complex operation - acquire a full lock */
 		ipc_lock_object(&sma->sem_perm);
 
-		/* And wait until all simple ops that are processed
-		 * right now have dropped their locks.
-		 */
-		sem_wait_array(sma);
+		/* Prevent parallel simple ops */
+		complexmode_enter(sma);
 		return -1;
 	}
 
 	/*
 	 * Only one semaphore affected - try to optimize locking.
-	 * The rules are:
-	 * - optimized locking is possible if no complex operation
-	 *   is either enqueued or processed right now.
-	 * - The test for enqueued complex ops is simple:
-	 *      sma->complex_count != 0
-	 * - Testing for complex ops that are processed right now is
-	 *   a bit more difficult. Complex ops acquire the full lock
-	 *   and first wait that the running simple ops have completed.
-	 *   (see above)
-	 *   Thus: If we own a simple lock and the global lock is free
-	 *   and complex_count is now 0, then it will stay 0 and
-	 *   thus just locking sem->lock is sufficient.
+	 * Optimized locking is possible if no complex operation
+	 * is either enqueued or processed right now.
+	 *
+	 * Both facts are tracked by complex_mode.
 	 */
 	sem = sma->sem_base + sops->sem_num;
 
-	if (sma->complex_count == 0) {
+	/*
+	 * Initial check for complex_mode. Just an optimization,
+	 * no locking.
+	 */
+	if (!READ_ONCE(sma->complex_mode)) {
 		/*
 		 * It appears that no complex operation is around.
 		 * Acquire the per-semaphore lock.
 		 */
 		spin_lock(&sem->lock);
 
-		/* Then check that the global lock is free */
-		if (!spin_is_locked(&sma->sem_perm.lock)) {
-			/*
-			 * We need a memory barrier with acquire semantics,
-			 * otherwise we can race with another thread that does:
-			 *	complex_count++;
-			 *	spin_unlock(sem_perm.lock);
-			 */
-			smp_acquire__after_ctrl_dep();
+		/*
+		 * A full barrier is required: the write of sem->lock
+		 * must be visible before the read is executed
+		 */
+		smp_mb();
 
-			/*
-			 * Now repeat the test of complex_count:
-			 * It can't change anymore until we drop sem->lock.
-			 * Thus: if is now 0, then it will stay 0.
-			 */
-			if (sma->complex_count == 0) {
-				/* fast path successful! */
-				return sops->sem_num;
-			}
+		if (!READ_ONCE(sma->complex_mode)) {
+			/* fast path successful! */
+			return sops->sem_num;
 		}
 		spin_unlock(&sem->lock);
 	}
@@ -369,7 +383,7 @@ static inline int sem_lock(struct sem_array *sma, struct sembuf *sops,
 		/* Not a false alarm, thus complete the sequence for a
 		 * full lock.
 		 */
-		sem_wait_array(sma);
+		complexmode_enter(sma);
 		return -1;
 	}
 }
@@ -378,6 +392,7 @@ static inline void sem_unlock(struct sem_array *sma, int locknum)
 {
 	if (locknum == -1) {
 		unmerge_queues(sma);
+		complexmode_tryleave(sma);
 		ipc_unlock_object(&sma->sem_perm);
 	} else {
 		struct sem *sem = sma->sem_base + locknum;
@@ -529,6 +544,7 @@ static int newary(struct ipc_namespace *ns, struct ipc_params *params)
 	}
 
 	sma->complex_count = 0;
+	sma->complex_mode = true; /* dropped by sem_unlock below */
 	INIT_LIST_HEAD(&sma->pending_alter);
 	INIT_LIST_HEAD(&sma->pending_const);
 	INIT_LIST_HEAD(&sma->list_id);
@@ -2184,10 +2200,10 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it)
 	/*
 	 * The proc interface isn't aware of sem_lock(), it calls
 	 * ipc_lock_object() directly (in sysvipc_find_ipc).
-	 * In order to stay compatible with sem_lock(), we must wait until
-	 * all simple semop() calls have left their critical regions.
+	 * In order to stay compatible with sem_lock(), we must
+	 * enter / leave complex_mode.
 	 */
-	sem_wait_array(sma);
+	complexmode_enter(sma);
 
 	sem_otime = get_semotime(sma);
 
@@ -2204,6 +2220,8 @@ static int sysvipc_sem_proc_show(struct seq_file *s, void *it)
 		   sem_otime,
 		   sma->sem_ctime);
 
+	complexmode_tryleave(sma);
+
 	return 0;
 }
 #endif
-- 
2.5.5

Patches currently in -mm which might be from manfred@xxxxxxxxxxxxxxxx are

ipc-semc-fix-complex_count-vs-simple-op-race.patch
ipc-sem-sem_lock-with-hysteresis.patch

--
To unsubscribe from this list: send the line "unsubscribe mm-commits" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
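
A terminology note for readers who do not normally work in ipc/sem.c: a
"simple" operation is a semop() call that affects exactly one semaphore and
can take the per-semaphore fast path, while a "complex" operation affects
several semaphores in one call and must take the global lock.  A minimal
user-space example of both kinds (illustrative only, not the reproducer
from bugzilla 105651):

#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/sem.h>

int main(void)
{
	/* array of two semaphores, both initially 0 */
	int semid = semget(IPC_PRIVATE, 2, IPC_CREAT | 0600);
	if (semid < 0) {
		perror("semget");
		return EXIT_FAILURE;
	}

	/* simple op: one sembuf, increments semaphore 0 */
	struct sembuf simple = { .sem_num = 0, .sem_op = 1, .sem_flg = 0 };
	if (semop(semid, &simple, 1) < 0)
		perror("semop (simple)");

	/* complex op: two sembufs in one call, touches semaphores 0 and 1 */
	struct sembuf complex_op[2] = {
		{ .sem_num = 0, .sem_op = -1, .sem_flg = 0 },
		{ .sem_num = 1, .sem_op =  1, .sem_flg = 0 },
	};
	if (semop(semid, complex_op, 2) < 0)
		perror("semop (complex)");

	semctl(semid, 0, IPC_RMID);	/* remove the semaphore array */
	return 0;
}

Running many concurrent callers of the two forms against one semaphore
array exercises the sem_lock() fast-path/slow-path interaction that the
patch above fixes.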