Hi everyone, After reading large lines of code in the net directory of the kernel, I have the following thinkings and may need your guys' suggestions. >> >> Saw this and thought I'd offer my two cents. >> >> BUG: sleeping function called from invalid context >> >> This is the original problem that Tetsuo's patch was trying to fix. >> >> Yes. >> >> >> Under the hood of lock_sock, we call lock_sock_nested which might sleep >> >> because of the mutex_acquire. >> >> Both lock_sock() and lock_sock_nested() might sleep. >> >> >> But we shouldn't sleep while holding the rw spinlock. >> >> Right. In atomic context (e.g. inside interrupt handler, schedulable context >> with interrupts or preemption disabled, schedulable context inside RCU read >> critical section, schedulable context inside a spinlock section), we must not >> call functions (e.g. waiting for a mutex, waiting for a semaphore, waiting for >> a page fault) which are not atomic. >> >> >> So we either have to acquire a spinlock instead of a mutex as was done before, >> >> Regarding hci_sock_dev_event(HCI_DEV_UNREG) case, we can't use a spinlock. >> >> Like LinMa explained, lock_sock() has to be used in order to serialize functions >> (e.g. hci_sock_sendmsg()) which access hci_pi(sk)->hdev between lock_sock(sk) and >> release_sock(sk). And like I explained, we can't defer resetting hci_pi(sk)->hdev >> to NULL, for hci_sock_dev_event(HCI_DEV_UNREG) is responsible for resetting >> hci_pi(sk)->hdev to NULL because the caller of hci_sock_dev_event(HCI_DEV_UNREG) >> immediately destroys resources associated with this hdev. >> This is the critical part of the BUG here. As you can read some similar code, for example, code at /net/iucv/af_iucv.c. The iucv_sock_bind() function will bind the device: static int iucv_sock_bind(struct socket *sock, struct sockaddr *addr, int addr_len) { ... iucv->hs_dev = dev; And this field will be assigned as NULL only when the socket is closed. static void iucv_sock_close(struct sock *sk) { ... if (iucv->hs_dev) { dev_put(iucv->hs_dev); iucv->hs_dev = NULL; sk->sk_bound_dev_if = 0; } Even in the afiucv_netdev_event() function, there is non business with iucv->hs_dev. So why the hci_sock_dev_event(HCI_DEV_UNREG) need to set the NULL pointer and then decrease the ref-count? As Tetsuo said, because the hci_unregister_dev() function, which is the caller of hci_sock_dev_event() will reclaim the resource of the hdev object. It will destroy the workqueue and also clean up the sysfs. If we achieve our patches like the iucv stack, or some other ref-count idea (https://lkml.org/lkml/2021/6/22/1347) without care, the bad thing will happen. Because there is nothing useful in the hdev object, any changes to it make no sense. But wait, the write or dereference for this object can be illegal, but there should be some legal actions, like reading flags? Hence, we can still delay the release of the hdev object to hci_sock_release (like other net code does). We just need to take care of the checking part. One quick patch is shown below, my POC didn't trigger any warning but more checks are needed. diff --git a/net/bluetooth/hci_sock.c b/net/bluetooth/hci_sock.c index 251b9128f..db665f78a 100644 --- a/net/bluetooth/hci_sock.c +++ b/net/bluetooth/hci_sock.c @@ -764,12 +764,9 @@ void hci_sock_dev_event(struct hci_dev *hdev, int event) sk_for_each(sk, &hci_sk_list.head) { bh_lock_sock_nested(sk); if (hci_pi(sk)->hdev == hdev) { - hci_pi(sk)->hdev = NULL; sk->sk_err = EPIPE; sk->sk_state = BT_OPEN; sk->sk_state_change(sk); - - hci_dev_put(hdev); } bh_unlock_sock(sk); } @@ -880,6 +877,7 @@ static int hci_sock_release(struct socket *sock) atomic_dec(&hdev->promisc); hci_dev_put(hdev); + hci_pi(sk)->hdev = NULL; } sock_orphan(sk); @@ -1727,10 +1725,10 @@ static int hci_sock_sendmsg(struct socket *sock, struct msghdr *msg, break; case HCI_CHANNEL_MONITOR: err = -EOPNOTSUPP; - goto done; + goto donefast; case HCI_CHANNEL_LOGGING: err = hci_logging_frame(sk, msg, len); - goto done; + goto donefast; default: mutex_lock(&mgmt_chan_list_lock); chan = __hci_mgmt_chan_find(hci_pi(sk)->channel); @@ -1740,15 +1738,16 @@ static int hci_sock_sendmsg(struct socket *sock, struct msghdr *msg, err = -EINVAL; mutex_unlock(&mgmt_chan_list_lock); - goto done; + goto donefast; } hdev = hci_pi(sk)->hdev; if (!hdev) { err = -EBADFD; - goto done; + goto donefast; } + hci_dev_lock(hdev); if (!test_bit(HCI_UP, &hdev->flags)) { err = -ENETDOWN; goto done; @@ -1832,6 +1831,8 @@ static int hci_sock_sendmsg(struct socket *sock, struct msghdr *msg, err = len; done: + hci_dev_unlock(hdev); +donefast: release_sock(sk); return err; In short, this patch delays the hci_dev_put() to hci_sock_release() and keeps the old bh_lock_sock_nested(). Once we did that, the UAF in hci_sock_bound_ioctl() are fixed. ( The four different commands in hci_sock_bound_ioctl will just traverse a empty linked list ) For another UAF point: hci_sock_sendmsg(), this patch uses hci_dev_lock() to make sure the flags and resource in hdev will not be released till the sendmsg is finished. (Dislike the hci_sock_create(), the hci_sock_sendmsg() can sleep so the mutex lock is possible) Of course, more auditing is needed but I just want to share this to you. Any suggestions and discussions will be much appreciated. >> >> or we need to move lock_sock out of the rw spinlock critical section as Tetsuo proposes. >> >> Exactly. Since this is a regression introduced when fixing CVE-2021-3573, Linux >> distributors are waiting for this patch so that they can apply the fix for CVE-2021-3573. >> This patch should be sent to linux.git and stables as soon as possible. But due to little >> attention on this patch, I'm already testing this patch in linux-next.git via my tree. >> I'll drop when Bluetooth maintainers pick this patch up for linux-5.14-rcX. (Or should I >> directly send to Linus?) >> >> >> >> > >> > My bad, was thinking more about the problem and noticed your poc was for hci_sock_sendmsg, >> > not hci_sock_dev_event. >> >> I didn't catch this part. Are you talking about a different poc? >> As far as I'm aware, exp.c in POC.zip was for hci_sock_bound_ioctl(HCIUNBLOCKADDR). >> >> hci_sock_bound_ioctl(HCIUNBLOCKADDR) (which is called between lock_sock() and release_sock()) >> calls copy_from_user() which might cause page fault, and userfaultfd mechanism allows an attacker >> to slowdown page fault handling enough to hci_sock_dev_event(HCI_DEV_UNREG) to return without >> waiting for hci_sock_bound_ioctl(HCIUNBLOCKADDR) to call release_sock(). This race window >> results in UAF (doesn't it, LinMa?). >> >> > In this case, it's not clear to me why the atomic context is being violated. >> >> In atomic context (in hci_sock_dev_event(HCI_DEV_UNREG) case, between >> read_lock(&hci_sk_list.lock) and read_unlock(&hci_sk_list.lock)), we must not call >> lock_sock(sk) which might wait for hci_sock_bound_ioctl(HCIUNBLOCKADDR) to call release_sock(). >> >> > >> > Sorry for the noise. >> > >> >>> >> >>> The patch provided by Desmond adds the local_bh_disable() before the bh_lock_sock() so I also try that in >> >>> >> >>> --- a/net/bluetooth/hci_sock.c >> >>> +++ b/net/bluetooth/hci_sock.c >> >>> @@ -762,6 +762,7 @@ void hci_sock_dev_event(struct hci_dev *hdev, int event) >> >>> /* Detach sockets from device */ >> >>> read_lock(&hci_sk_list.lock); >> >>> sk_for_each(sk, &hci_sk_list.head) { >> >>> + local_bh_disable(); >> >>> bh_lock_sock_nested(sk); >> >>> if (hci_pi(sk)->hdev == hdev) { >> >>> hci_pi(sk)->hdev = NULL; >> >>> @@ -772,6 +773,7 @@ void hci_sock_dev_event(struct hci_dev *hdev, int event) >> >>> hci_dev_put(hdev); >> >>> } >> >>> bh_unlock_sock(sk); >> >>> + local_bh_enable(); >> >>> } >> >>> read_unlock(&hci_sk_list.lock); >> >>> } >> >>> >> >>> But this is not useful, the UAF still occurs >> >>> >> >> >> >> I might be very mistaken on this, but I believe the UAF still happens because >> >> you can't really mix bh_lock_sock* and lock_sock* to protect the same things. >> >> Right. https://www.kernel.org/doc/html/v5.13/kernel-hacking/locking.html >> >> >> The former holds the spinlock &sk->sk_lock.slock and synchronizes between >> >> user contexts and bottom halves, >> >> serializes access to resources which might be accessed from atomic (i.e. non-schedulable) contexts >> >> >> while the latter holds a mutex on &sk->sk_lock.dep_map to synchronize between >> >> multiple users. >> >> serializes access to resources which are accessed from only schedulable (i.e. non-atomic) contexts >> >> >> >> >> One option I can think of would be to switch instances of lock_sock to bh_lock_sock_nested >> >> for users that might race (such as hci_sock_sendmsg, hci_sock_bound_ioctl, and others as >> >> needed). But I'm not sure if that's quite what we want, plus we would need to ensure that >> >> sleeping functions aren't called between the bh_lock/unlock. >> >> We can't do it for hci_sock_dev_event(HCI_DEV_UNREG). Regards Lin Ma