It seems that history of this locking problem is a trial and error. Commit b40df5743ee8aed8 ("[PATCH] bluetooth: fix socket locking in hci_sock_dev_event()") in 2.6.21-rc4 changed bh_lock_sock() to lock_sock() as an attempt to fix lockdep warning. Then, commit 4ce61d1c7a8ef4c1 ("[BLUETOOTH]: Fix locking in hci_sock_dev_event().") in 2.6.22-rc2 changed lock_sock() to local_bh_disable() + bh_lock_sock_nested() as an attempt to fix sleep in atomic context warning. Then, commit 4b5dd696f81b210c ("Bluetooth: Remove local_bh_disable() from hci_sock.c") in 3.3-rc1 removed local_bh_disable(). Then, commit e305509e678b3a4a ("Bluetooth: use correct lock to prevent UAF of hdev object") in 5.13-rc5 again changed bh_lock_sock_nested() to lock_sock() as an attempt to fix CVE-2021-3573. But unfortunately it is too difficult to convert rw spinlock into mutex; we need to live with current rw spinlock. And we have three choices that can live with current rw spinlock. Patches for these choices are show bottom. All tested by syzbot. (1) Introduce a global mutex dedicated for hci_sock_dev_event(), and block bt_sock_unlink() and concurrent hci_sock_dev_event() callers. This is simplest if it is guaranteed that total delay for lock_sock() on all sockets is short enough. But it is not clear how long lock_sock() might block, for e.g. hci_sock_bound_ioctl() which is called inside lock_sock() section is doing copy_from_user()/copy_to_user() which may result in blocking other lock_sock() waiters for many seconds. I think that POC.zip is demonstrating that this delay is controllable via userfaultfd. Of course, the robust fix will be not to call copy_from_user()/copy_to_user() inside lock_sock() section. But such big change is not suitable for a fix for commit e305509e678b3a4a ("Bluetooth: use correct lock to prevent UAF of hdev object"). (2) Introduce a global mutex for hci_sock_dev_event(), but don't block bt_sock_unlink() nor concurrent hci_sock_dev_event() callers (i.e. use this mutex like a spinlock). Since it will be safe to resume sk_for_each() as long as currently accessing socket remains on that list, we can track which socket is currently accessed, and let bt_sock_unlink() wait if that socket is currently accessed. It is possible that total delay becomes long enough for khungtaskd to complain. Commit 8d0caedb75968304 ("can: bcm/raw/isotp: use per module netdevice notifier") is an example for avoiding khungtaskd warning using this choice. Compared to that commit, this choice for hci_sock_dev_event() case will need to also touch "struct hci_pinfo" because we need to track concurrent hci_sock_dev_event() callers. (3) Don't introduce a global mutex for hci_sock_dev_event(), and don't block bt_sock_unlink() nor concurrent hci_sock_dev_event() callers. Since it will be safe to resume sk_for_each() as long as currently accessing socket remains on that list, take a refcount on currently accessing socket and check if currently accessing socket is still on the list. This choice needs to touch only hci_sock_dev_event(). Which choice do we want to go? Patch for choice (1): ---------------------------------------- net/bluetooth/hci_sock.c | 17 ++++++++++++++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c index b04a5a02ecf3..c860ec4ea7b8 100644 --- a/net/bluetooth/hci_sock.c +++ b/net/bluetooth/hci_sock.c @@ -150,6 +150,8 @@ static struct bt_sock_list hci_sk_list = { .lock = __RW_LOCK_UNLOCKED(hci_sk_list.lock) }; +static DEFINE_MUTEX(hci_sk_list_lock); + static bool is_filtered_packet(struct sock *sk, struct sk_buff *skb) { struct hci_filter *flt; @@ -758,10 +760,13 @@ void hci_sock_dev_event(struct hci_dev *hdev, int event) if (event == HCI_DEV_UNREG) { struct sock *sk; + int put_count = 0; /* Detach sockets from device */ + mutex_lock(&hci_sk_list_lock); read_lock(&hci_sk_list.lock); sk_for_each(sk, &hci_sk_list.head) { + read_unlock(&hci_sk_list.lock); lock_sock(sk); if (hci_pi(sk)->hdev == hdev) { hci_pi(sk)->hdev = NULL; @@ -769,11 +774,15 @@ void hci_sock_dev_event(struct hci_dev *hdev, int event) sk->sk_state = BT_OPEN; sk->sk_state_change(sk); - hci_dev_put(hdev); + put_count++; } release_sock(sk); + read_lock(&hci_sk_list.lock); } read_unlock(&hci_sk_list.lock); + mutex_unlock(&hci_sk_list_lock); + while (put_count--) + hci_dev_put(hdev); } } @@ -838,6 +847,10 @@ static int hci_sock_release(struct socket *sock) if (!sk) return 0; + mutex_lock(&hci_sk_list_lock); + bt_sock_unlink(&hci_sk_list, sk); + mutex_unlock(&hci_sk_list_lock); + lock_sock(sk); switch (hci_pi(sk)->channel) { @@ -859,8 +872,6 @@ static int hci_sock_release(struct socket *sock) break; } - bt_sock_unlink(&hci_sk_list, sk); - hdev = hci_pi(sk)->hdev; if (hdev) { if (hci_pi(sk)->channel == HCI_CHANNEL_USER) { ---------------------------------------- Patch for choice (2): ---------------------------------------- net/bluetooth/hci_sock.c | 39 ++++++++++++++++++++++++++++++++++++--- 1 file changed, 36 insertions(+), 3 deletions(-) diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c index b04a5a02ecf3..3e65fcc8c9af 100644 --- a/net/bluetooth/hci_sock.c +++ b/net/bluetooth/hci_sock.c @@ -43,6 +43,8 @@ static DEFINE_IDA(sock_cookie_ida); static atomic_t monitor_promisc = ATOMIC_INIT(0); +static DEFINE_MUTEX(dev_event_lock); + /* ----- HCI socket interface ----- */ /* Socket info */ @@ -57,6 +59,7 @@ struct hci_pinfo { unsigned long flags; __u32 cookie; char comm[TASK_COMM_LEN]; + unsigned int event_in_progress; }; void hci_sock_set_flag(struct sock *sk, int nr) @@ -758,10 +761,15 @@ void hci_sock_dev_event(struct hci_dev *hdev, int event) if (event == HCI_DEV_UNREG) { struct sock *sk; + int put_count = 0; /* Detach sockets from device */ + mutex_lock(&dev_event_lock); read_lock(&hci_sk_list.lock); sk_for_each(sk, &hci_sk_list.head) { + read_unlock(&hci_sk_list.lock); + hci_pi(sk)->event_in_progress++; + mutex_unlock(&dev_event_lock); lock_sock(sk); if (hci_pi(sk)->hdev == hdev) { hci_pi(sk)->hdev = NULL; @@ -769,11 +777,17 @@ void hci_sock_dev_event(struct hci_dev *hdev, int event) sk->sk_state = BT_OPEN; sk->sk_state_change(sk); - hci_dev_put(hdev); + put_count++; } release_sock(sk); + mutex_lock(&dev_event_lock); + hci_pi(sk)->event_in_progress--; + read_lock(&hci_sk_list.lock); } read_unlock(&hci_sk_list.lock); + mutex_unlock(&dev_event_lock); + while (put_count--) + hci_dev_put(hdev); } } @@ -838,6 +852,26 @@ static int hci_sock_release(struct socket *sock) if (!sk) return 0; + /* + * Wait for sk_for_each() in hci_sock_dev_event() to stop accessing + * this sk before unlinking. Need to unlink before lock_sock(), for + * hci_sock_dev_event() calls lock_sock() after incrementing + * event_in_progress counter. + */ + while (1) { + bool unlinked = true; + + mutex_lock(&dev_event_lock); + if (!hci_pi(sk)->event_in_progress) + bt_sock_unlink(&hci_sk_list, sk); + else + unlinked = false; + mutex_unlock(&dev_event_lock); + if (unlinked) + break; + schedule_timeout_uninterruptible(1); + } + lock_sock(sk); switch (hci_pi(sk)->channel) { @@ -859,8 +893,6 @@ static int hci_sock_release(struct socket *sock) break; } - bt_sock_unlink(&hci_sk_list, sk); - hdev = hci_pi(sk)->hdev; if (hdev) { if (hci_pi(sk)->channel == HCI_CHANNEL_USER) { @@ -2049,6 +2081,7 @@ static int hci_sock_create(struct net *net, struct socket *sock, int protocol, sock->state = SS_UNCONNECTED; sk->sk_state = BT_OPEN; + hci_pi(sk)->event_in_progress = 0; bt_sock_link(&hci_sk_list, sk); return 0; } ---------------------------------------- Patch for choice (3): ---------------------------------------- net/bluetooth/hci_sock.c | 35 +++++++++++++++++++++++++++++++++-- 1 file changed, 33 insertions(+), 2 deletions(-) diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c index b04a5a02ecf3..38146cf37378 100644 --- a/net/bluetooth/hci_sock.c +++ b/net/bluetooth/hci_sock.c @@ -758,22 +758,53 @@ void hci_sock_dev_event(struct hci_dev *hdev, int event) if (event == HCI_DEV_UNREG) { struct sock *sk; + int put_count = 0; /* Detach sockets from device */ +restart: read_lock(&hci_sk_list.lock); sk_for_each(sk, &hci_sk_list.head) { + /* This sock_hold(sk) is safe, for bt_sock_unlink(sk) + * is not called yet. + */ + sock_hold(sk); + read_unlock(&hci_sk_list.lock); lock_sock(sk); - if (hci_pi(sk)->hdev == hdev) { + write_lock(&hci_sk_list.lock); + /* Check that bt_sock_unlink(sk) is not called yet. */ + if (sk_hashed(sk) && hci_pi(sk)->hdev == hdev) { hci_pi(sk)->hdev = NULL; sk->sk_err = EPIPE; sk->sk_state = BT_OPEN; sk->sk_state_change(sk); - hci_dev_put(hdev); + put_count++; } + write_unlock(&hci_sk_list.lock); release_sock(sk); + read_lock(&hci_sk_list.lock); + /* If bt_sock_unlink(sk) is not called yet, we can + * continue iteration. We can use __sock_put(sk) here + * because hci_sock_release() will call sock_put(sk) + * after bt_sock_unlink(sk). + */ + if (sk_hashed(sk)) { + __sock_put(sk); + continue; + } + /* Otherwise, we need to restart iteration, for the + * next socket pointed by sk->next might be already + * gone. We can't use __sock_put(sk) here because + * hci_sock_release() might have already called + * sock_put(sk) after bt_sock_unlink(sk). + */ + read_unlock(&hci_sk_list.lock); + sock_put(sk); + goto restart; } read_unlock(&hci_sk_list.lock); + while (put_count--) + hci_dev_put(hdev); } } ----------------------------------------