The current guarantees that a given hci_conn is not freed concurrently appear to be: - hci_dev_lock is held, or, - rcu_read_lock is held (doesn't guarantee valid conn state), or, - hci_conn_get refcount held (doesn't guarantee valid conn state), - current task is running from hdev->workqueue (which is ordered) The last condition is not exactly true as hci_conn_del is currently called also elsewhere, but is assumed in hci_event.c and hci_core.c. It does not appear possible to assume in hdev->req_workqueue, especially in hci_sync callbacks, that a given conn stays alive at any time if none of the above precautions are taken. Similarly, any conn_hash iteration without locks there appears invalid. E.g. Disconnect events emitted from remote appear possible to occur at any time in hci_sync. There are also some hci_conn_del callsites not in hdev->workqueue which can run concurrently, eg in __iso_sock_close (which also doesn't hold hdev->lock). KASAN crashes are known on real HW, and the deletion race conditions can be hit in VHCI (see the other BlueZ tester patch series). *** What to do? One way to guard against using already freed conns is hci_conn_get + check that the connection was not deleted, inside suitable critical section. I didn't find a reliable existing check if hci_conn_del was run. To enable using hci_conn_get, the series here suggests again adding HCI_CONN_DELETED flag to indicate whether hci_conn_del has run (with hdev->lock or in hdev->workqueue). It also adds some helpers to make writing hci_sync callbacks that operate on a given hci_conn or need hdev->lock easier to write. The patches here add hci_conn_sync_queue, which check if hci_conn was deleted in meantime, and hold hdev->lock in the callback (but release it during waits), so that concurrent modification of hci_conn can only happen during event waits. This is something that is already assumed but might not be true eg. if the two workqueues run jobs on different CPU. The last two patches in the series are some motivating ISO related changes, for proper cleanup of CIS, lookup by handles doesn't quite work (and also doesn't protect against the deletion race). *** I tried to find some alternatives, but this seemed minimal one. I don't see how to make use of hci_conn_drop/hold here, as they appear to have different purpose, if we change that then how socket channels work seem to a new mechanism. E.g. nonzero refcnt from socket should not keep connection alive when eg. Disconnect event from remote occurs. In that case we also need to clean up the connection ASAP so the handle can be reused. One alternative that makes continuing conn_hash iteration a bit simpler would be to remove hci_conn from conn_hash only when hci_conn_get refcount hits zero, so holding hci_conn_get reference would keep the conn a valid iteration cursor. Also setting conn->type to INVALID_LINK on deletion could then make lookup functions skip deleted conns. However, one would need to take a look at all places where conn_hash is iterated and think it through. It maybe could also be possible to not allow events to be processed while hci_sync is running, except when it is waiting for an event (= more or less, take hci_cmd_sync_dev_lock in all callbacks so hdev->lock serializes things). However, it seems a deletion flag would be needed also here, as the conn might be gone while we are waiting for events. Pauli Virtanen (6): Bluetooth: hci_conn: hci_conn_cleanup is not needed, combine with del Bluetooth: hci_conn: add hci_conn_is_alive Bluetooth: hci_sync: add hci_conn_sync_queue and hci_cmd_sync_dev_(un)lock Bluetooth: hci_sync: fix locking in hci_conn_abort and hci_disconnect_all_sync Bluetooth: hci_sync: delete CIS in BT_OPEN/CONNECT/BOUND when aborting Bluetooth: ISO: handle bound CIS cleanup via hci_conn include/net/bluetooth/hci_core.h | 22 +++++ include/net/bluetooth/hci_sync.h | 3 + net/bluetooth/hci_conn.c | 157 ++++++++++++++++++++----------- net/bluetooth/hci_sync.c | 97 +++++++++++++++---- net/bluetooth/iso.c | 14 +-- 5 files changed, 211 insertions(+), 82 deletions(-) -- 2.41.0