On Thu, Jul 21, 2022 at 02:14:16PM -0400, Waiman Long wrote: > > On 7/21/22 03:17, yuxin.ye wrote: > > On Wed, Jul 20, 2022 at 10:25:17PM -0400, Waiman Long wrote: > > > On 7/20/22 03:28, yuxin.ye wrote: > > > > before rt_mutex_adjust_prio_chain(),unlock lock->wait_lock will cause > > > > BUG_ON at kernel/locking/rtmutex.c:1331: > > > The current upstream kernel/locking/rtmutex.c has no BUG_ON() call. Which > > > version of the kernel are you using? > > > > > > Cheers, > > > Longman > > > > > The Linux version is 5.10. > > The upstream has indeed removed the BUG_ON, But in rt_mutex_adjust_prio_chain() > > it is still possible to have a thread is blocked by two locks. Can this situation > > be ignored without BUG_ON? > > No. However, we don't remove the lock like what you do with your patch. It > will corrupt the data if multiple CPUs are allowed to run > rt_mutex_adjust_prio_chain() for the same rt_mutex simultaneously. You need > to find a way to fix the underlying problem. > > BTW, I still can't see a BUG_ON at line 1331 of rtmutex.c with a v5.10 > kernel. Does your source tree have some out-of-tree patches that modifies > rtmutex? > > Cheers, > Longman > Yes, I'm sorry I overlooked that earlier. We applied the RT patch,and the BUG_ON are also introduced by these patches. Back to the question, I think remove the wait_lock unlock before rt_mutex_adjust_prio_chain() is more likely to protect some data. The commont on task_blocks_on_rt_mutex() indicates that must be called with wait_lock held, but it unlock before call rt_mutex_adjust_prio_chain(). This may cause the owner thread to unlock the orig_lock and exit the thead. Finally, when calling put_task_struct(owner) in rt_mutex_adjust_prio_chain(), the thread is blocked by another lock that is deeply hidden. Actully, I'm not sure why rt_mutex_adjust_prio_chain() dosen't need wait_lock protection. Thanks again.
>From 785a93a55536c33b674ef5d794f545a9d333852e Mon Sep 17 00:00:00 2001 From: Thomas Gleixner <tglx@xxxxxxxxxxxxx> Date: Fri, 10 Jun 2011 11:04:15 +0200 Subject: [PATCH 162/304] locking/rtmutex: Handle the various new futex race conditions RT opens a few new interesting race conditions in the rtmutex/futex combo due to futex hash bucket lock being a 'sleeping' spinlock and therefor not disabling preemption. Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx> --- kernel/futex.c | 77 ++++++++++++++++++++++++++------- kernel/locking/rtmutex.c | 36 ++++++++++++--- kernel/locking/rtmutex_common.h | 2 + 3 files changed, 94 insertions(+), 21 deletions(-) diff --git a/kernel/futex.c b/kernel/futex.c index 98a6e1b80bfe..c2fe58c34409 100644 --- a/kernel/futex.c +++ b/kernel/futex.c @@ -2155,6 +2155,16 @@ static int futex_requeue(u32 __user *uaddr1, unsigned int flags, */ requeue_pi_wake_futex(this, &key2, hb2); continue; + } else if (ret == -EAGAIN) { + /* + * Waiter was woken by timeout or + * signal and has set pi_blocked_on to + * PI_WAKEUP_INPROGRESS before we + * tried to enqueue it on the rtmutex. + */ + this->pi_state = NULL; + put_pi_state(pi_state); + continue; } else if (ret) { /* * rt_mutex_start_proxy_lock() detected a @@ -3172,7 +3182,7 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, { struct hrtimer_sleeper timeout, *to; struct rt_mutex_waiter rt_waiter; - struct futex_hash_bucket *hb; + struct futex_hash_bucket *hb, *hb2; union futex_key key2 = FUTEX_KEY_INIT; struct futex_q q = futex_q_init; int res, ret; @@ -3224,20 +3234,55 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, /* Queue the futex_q, drop the hb lock, wait for wakeup. */ futex_wait_queue_me(hb, &q, to); - spin_lock(&hb->lock); - ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to); - spin_unlock(&hb->lock); - if (ret) - goto out; + /* + * On RT we must avoid races with requeue and trying to block + * on two mutexes (hb->lock and uaddr2's rtmutex) by + * serializing access to pi_blocked_on with pi_lock. + */ + raw_spin_lock_irq(¤t->pi_lock); + if (current->pi_blocked_on) { + /* + * We have been requeued or are in the process of + * being requeued. + */ + raw_spin_unlock_irq(¤t->pi_lock); + } else { + /* + * Setting pi_blocked_on to PI_WAKEUP_INPROGRESS + * prevents a concurrent requeue from moving us to the + * uaddr2 rtmutex. After that we can safely acquire + * (and possibly block on) hb->lock. + */ + current->pi_blocked_on = PI_WAKEUP_INPROGRESS; + raw_spin_unlock_irq(¤t->pi_lock); + + spin_lock(&hb->lock); + + /* + * Clean up pi_blocked_on. We might leak it otherwise + * when we succeeded with the hb->lock in the fast + * path. + */ + raw_spin_lock_irq(¤t->pi_lock); + current->pi_blocked_on = NULL; + raw_spin_unlock_irq(¤t->pi_lock); + + ret = handle_early_requeue_pi_wakeup(hb, &q, &key2, to); + spin_unlock(&hb->lock); + if (ret) + goto out; + } /* - * In order for us to be here, we know our q.key == key2, and since - * we took the hb->lock above, we also know that futex_requeue() has - * completed and we no longer have to concern ourselves with a wakeup - * race with the atomic proxy lock acquisition by the requeue code. The - * futex_requeue dropped our key1 reference and incremented our key2 - * reference count. + * In order to be here, we have either been requeued, are in + * the process of being requeued, or requeue successfully + * acquired uaddr2 on our behalf. If pi_blocked_on was + * non-null above, we may be racing with a requeue. Do not + * rely on q->lock_ptr to be hb2->lock until after blocking on + * hb->lock or hb2->lock. The futex_requeue dropped our key1 + * reference and incremented our key2 reference count. */ + hb2 = hash_futex(&key2); /* Check if the requeue code acquired the second futex for us. */ if (!q.rt_waiter) { @@ -3246,14 +3291,15 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, * did a lock-steal - fix up the PI-state in that case. */ if (q.pi_state && (q.pi_state->owner != current)) { - spin_lock(q.lock_ptr); + spin_lock(&hb2->lock); + BUG_ON(&hb2->lock != q.lock_ptr); ret = fixup_pi_state_owner(uaddr2, &q, current); /* * Drop the reference to the pi state which * the requeue_pi() code acquired for us. */ put_pi_state(q.pi_state); - spin_unlock(q.lock_ptr); + spin_unlock(&hb2->lock); /* * Adjust the return value. It's either -EFAULT or * success (1) but the caller expects 0 for success. @@ -3272,7 +3318,8 @@ static int futex_wait_requeue_pi(u32 __user *uaddr, unsigned int flags, pi_mutex = &q.pi_state->pi_mutex; ret = rt_mutex_wait_proxy_lock(pi_mutex, to, &rt_waiter); - spin_lock(q.lock_ptr); + spin_lock(&hb2->lock); + BUG_ON(&hb2->lock != q.lock_ptr); if (ret && !rt_mutex_cleanup_proxy_lock(pi_mutex, &rt_waiter)) ret = 0; diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c index 1d0e9bf0487a..97a5fb19119d 100644 --- a/kernel/locking/rtmutex.c +++ b/kernel/locking/rtmutex.c @@ -136,6 +136,11 @@ static void fixup_rt_mutex_waiters(struct rt_mutex *lock) WRITE_ONCE(*p, owner & ~RT_MUTEX_HAS_WAITERS); } +static int rt_mutex_real_waiter(struct rt_mutex_waiter *waiter) +{ + return waiter && waiter != PI_WAKEUP_INPROGRESS; +} + /* * We can speed up the acquire/release, if there's no debugging state to be * set up. @@ -378,7 +383,8 @@ int max_lock_depth = 1024; static inline struct rt_mutex *task_blocked_on_lock(struct task_struct *p) { - return p->pi_blocked_on ? p->pi_blocked_on->lock : NULL; + return rt_mutex_real_waiter(p->pi_blocked_on) ? + p->pi_blocked_on->lock : NULL; } /* @@ -514,7 +520,7 @@ static int rt_mutex_adjust_prio_chain(struct task_struct *task, * reached or the state of the chain has changed while we * dropped the locks. */ - if (!waiter) + if (!rt_mutex_real_waiter(waiter)) goto out_unlock_pi; /* @@ -947,6 +953,22 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock, return -EDEADLK; raw_spin_lock(&task->pi_lock); + /* + * In the case of futex requeue PI, this will be a proxy + * lock. The task will wake unaware that it is enqueueed on + * this lock. Avoid blocking on two locks and corrupting + * pi_blocked_on via the PI_WAKEUP_INPROGRESS + * flag. futex_wait_requeue_pi() sets this when it wakes up + * before requeue (due to a signal or timeout). Do not enqueue + * the task if PI_WAKEUP_INPROGRESS is set. + */ + if (task != current && task->pi_blocked_on == PI_WAKEUP_INPROGRESS) { + raw_spin_unlock(&task->pi_lock); + return -EAGAIN; + } + + BUG_ON(rt_mutex_real_waiter(task->pi_blocked_on)); + waiter->task = task; waiter->lock = lock; waiter->prio = task->prio; @@ -970,7 +992,7 @@ static int task_blocks_on_rt_mutex(struct rt_mutex *lock, rt_mutex_enqueue_pi(owner, waiter); rt_mutex_adjust_prio(owner); - if (owner->pi_blocked_on) + if (rt_mutex_real_waiter(owner->pi_blocked_on)) chain_walk = 1; } else if (rt_mutex_cond_detect_deadlock(waiter, chwalk)) { chain_walk = 1; @@ -1066,7 +1088,7 @@ static void remove_waiter(struct rt_mutex *lock, { bool is_top_waiter = (waiter == rt_mutex_top_waiter(lock)); struct task_struct *owner = rt_mutex_owner(lock); - struct rt_mutex *next_lock; + struct rt_mutex *next_lock = NULL; lockdep_assert_held(&lock->wait_lock); @@ -1092,7 +1114,8 @@ static void remove_waiter(struct rt_mutex *lock, rt_mutex_adjust_prio(owner); /* Store the lock on which owner is blocked or NULL */ - next_lock = task_blocked_on_lock(owner); + if (rt_mutex_real_waiter(owner->pi_blocked_on)) + next_lock = task_blocked_on_lock(owner); raw_spin_unlock(&owner->pi_lock); @@ -1128,7 +1151,8 @@ void rt_mutex_adjust_pi(struct task_struct *task) raw_spin_lock_irqsave(&task->pi_lock, flags); waiter = task->pi_blocked_on; - if (!waiter || rt_mutex_waiter_equal(waiter, task_to_waiter(task))) { + if (!rt_mutex_real_waiter(waiter) || + rt_mutex_waiter_equal(waiter, task_to_waiter(task))) { raw_spin_unlock_irqrestore(&task->pi_lock, flags); return; } diff --git a/kernel/locking/rtmutex_common.h b/kernel/locking/rtmutex_common.h index b1455dc2366f..096b16cfb096 100644 --- a/kernel/locking/rtmutex_common.h +++ b/kernel/locking/rtmutex_common.h @@ -125,6 +125,8 @@ enum rtmutex_chainwalk { /* * PI-futex support (proxy locking functions, etc.): */ +#define PI_WAKEUP_INPROGRESS ((struct rt_mutex_waiter *) 1) + extern struct task_struct *rt_mutex_next_owner(struct rt_mutex *lock); extern void rt_mutex_init_proxy_locked(struct rt_mutex *lock, struct task_struct *proxy_owner); -- 2.32.0