Hi Cheng, On Mon, Dec 16, 2024 at 9:49 PM Cheng Jiang (IOE) <quic_chejiang@xxxxxxxxxxx> wrote: > > Hi Luiz, > > On 12/16/2024 10:42 PM, Luiz Augusto von Dentz wrote: > > Hi Cheng, > > > > On Mon, Dec 16, 2024 at 9:32 AM Luiz Augusto von Dentz > > <luiz.dentz@xxxxxxxxx> wrote: > >> > >> Hi Cheng, > >> > >> On Mon, Dec 16, 2024 at 3:08 AM Cheng Jiang <quic_chejiang@xxxxxxxxxxx> wrote: > >>> > >>> Sometimes, the remote device doesn't acknowledge the LL_TERMINATE_IND > >>> in time, requiring the controller to wait for the supervision timeout, > >>> which may exceed 2 seconds. In the current implementation, the > >>> HCI_EV_DISCONN_COMPLETE event is ignored if it arrives late, since > >>> the hci_abort_conn_sync has cleaned up the connection after 2 seconds. > >>> This causes the mgmt to get stuck, resulting in bluetoothd waiting > >>> indefinitely for the mgmt response to the disconnect. To recover, > >>> restarting bluetoothd is necessary. > >>> > >>> bluetoothctl log like this: > >>> [Designer Mouse]# disconnect D9:B5:6C:F2:51:91 > >>> Attempting to disconnect from D9:B5:6C:F2:51:91 > >>> [Designer Mouse]# > >>> [Designer Mouse]# power off > >>> [Designer Mouse]# > >>> Failed to set power off: org.freedesktop.DBus.Error.NoReply. > >>> > >>> Signed-off-by: Cheng Jiang <quic_chejiang@xxxxxxxxxxx> > >>> --- > >>> include/net/bluetooth/hci_core.h | 2 ++ > >>> net/bluetooth/hci_conn.c | 9 +++++++++ > >>> net/bluetooth/hci_event.c | 9 +++++++++ > >>> net/bluetooth/hci_sync.c | 18 ++++++++++++++++++ > >>> 4 files changed, 38 insertions(+) > >>> > >>> diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h > >>> index 734cd50cd..2ab079dcf 100644 > >>> --- a/include/net/bluetooth/hci_core.h > >>> +++ b/include/net/bluetooth/hci_core.h > >>> @@ -753,6 +753,8 @@ struct hci_conn { > >>> > >>> struct bt_codec codec; > >>> > >>> + struct completion disc_ev_comp; > >>> + > >>> void (*connect_cfm_cb) (struct hci_conn *conn, u8 status); > >>> void (*security_cfm_cb) (struct hci_conn *conn, u8 status); > >>> void (*disconn_cfm_cb) (struct hci_conn *conn, u8 reason); > >>> diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c > >>> index d097e308a..e0244e191 100644 > >>> --- a/net/bluetooth/hci_conn.c > >>> +++ b/net/bluetooth/hci_conn.c > >>> @@ -1028,6 +1028,15 @@ static struct hci_conn *__hci_conn_add(struct hci_dev *hdev, int type, bdaddr_t > >>> > >>> hci_conn_init_sysfs(conn); > >>> > >>> + /* This disc_ev_comp is inited when we send a disconnect request to > >>> + * the remote device but fail to receive the disconnect complete > >>> + * event within the expected time (2 seconds). This occurs because > >>> + * the remote device doesn't ack the terminate indication, forcing > >>> + * the controller to wait for the supervision timeout. > >>> + */ > >>> + init_completion(&conn->disc_ev_comp); > >>> + complete(&conn->disc_ev_comp); > >>> + > >>> return conn; > >>> } > >>> > >>> diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c > >>> index 2cc7a9306..60ecb2b18 100644 > >>> --- a/net/bluetooth/hci_event.c > >>> +++ b/net/bluetooth/hci_event.c > >>> @@ -3366,6 +3366,15 @@ static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data, > >>> if (!conn) > >>> goto unlock; > >>> > >>> + /* Wake up disc_ev_comp here is ok. Since we hold the hdev lock > >>> + * hci_abort_conn_sync will wait hdev lock release to continue. > >>> + */ > >>> + if (!completion_done(&conn->disc_ev_comp)) { > >>> + complete(&conn->disc_ev_comp); > >>> + /* Add some delay for hci_abort_conn_sync to handle the complete */ > >>> + usleep_range(100, 1000); > >>> + } > >>> + > >>> if (ev->status) { > >>> mgmt_disconnect_failed(hdev, &conn->dst, conn->type, > >>> conn->dst_type, ev->status); > >>> diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c > >>> index 0badec712..783d04b57 100644 > >>> --- a/net/bluetooth/hci_sync.c > >>> +++ b/net/bluetooth/hci_sync.c > >>> @@ -5590,6 +5590,24 @@ int hci_abort_conn_sync(struct hci_dev *hdev, struct hci_conn *conn, u8 reason) > >>> break; > >>> } > >>> > >>> + /* Check whether the connection is successfully disconnected. > >>> + * Sometimes the remote device doesn't acknowledge the > >>> + * LL_TERMINATE_IND in time, requiring the controller to wait > >>> + * for the supervision timeout, which may exceed 2 seconds. In > >>> + * this case, we need to wait for the HCI_EV_DISCONN_COMPLETE > >>> + * event before cleaning up the connection. > >>> + */ > >>> + if (err == -ETIMEDOUT) { > >>> + u32 idle_delay = msecs_to_jiffies(10 * conn->le_supv_timeout); > >>> + > >>> + reinit_completion(&conn->disc_ev_comp); > >>> + if (!wait_for_completion_timeout(&conn->disc_ev_comp, idle_delay)) { > >>> + bt_dev_warn(hdev, "Failed to get complete"); > >>> + mgmt_disconnect_failed(hdev, &conn->dst, conn->type, > >>> + conn->dst_type, conn->abort_reason); > >>> + } > >>> + } > >> > >> Why don't we just set the supervision timeout as timeout then? If we > >> will have to wait for it anyway just change hci_disconnect_sync to use > >> 10 * conn->le_supv_timeout as timeout instead. > >> > hci_disconnect_sync uses __hci_cmd_sync_status_sk to wait for the > HCI_EV_DISCONN_COMPLETE event, which will send the command in hci_cmd_work. > In hci_cmd_work, it will start a timer with HCI_CMD_TIMEOUT(2s) to wait > for the event. So even in hci_disconnect_sync we can set the timeout to > supervision timeout, hci_disconnect_sync still timeout after 2s. Wait, why are you talking about HCI_CMD_TIMEOUT when I told you to use the supervision timeout instead? If it still timeout after to 2 seconds then there is something still forcing HCI_CMD_TIMEOUT which shouldn't happen. > >> That said, we really need to fix bluetoothd if it is not able to be > >> cleaned up if SET_POWERED command fails, but it looks like it is > >> handling errors correctly so it sounds like something else is at play. > > > The issue arises after a 2-second timeout of hci_disconnect_sync. During > hci_abort_conn_sync, the connection is cleaned up by hci_conn_failed. > after supervision timeout, the disconnect complete event arrives, but > it returns at line 3370 since the connection has already been removed. > As a result, bluetoothd does not send the mgmt event for MGMT_OP_DISCONNECT > to the application layer (bluetoothctl), causing bluetoothctl to get stuck > and unable to perform other mgmt commands. The command shall have completed regardless if disconnect complete has been received or not, the is the whole point of having a timeout, so this makes no sense to me, or you are not describing what is happening here. Also there is no MGMT_OP_DISCONNECT pending, it is MGMT_OP_SET_POWERED, if you are talking about the DISCONNECTED event that is a totally different thing and perhaps that is the source of the problem because if we do cleanup hci_conn even in case the command fails/times out then we should be generating a disconnected event as well. > > 3355 static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data, > 3356 struct sk_buff *skb) > 3357 { > 3358 struct hci_ev_disconn_complete *ev = data; > 3359 u8 reason; > 3360 struct hci_conn_params *params; > 3361 struct hci_conn *conn; > 3362 bool mgmt_connected; > 3363 > 3364 bt_dev_dbg(hdev, "status 0x%2.2x", ev->status); > 3365 > 3366 hci_dev_lock(hdev); > 3367 > 3368 conn = hci_conn_hash_lookup_handle(hdev, __le16_to_cpu(ev->handle)); > 3369 if (!conn) > 3370 goto unlock; > 3371 > 3372 if (ev->status) { > 3373 mgmt_disconnect_failed(hdev, &conn->dst, conn->type, > 3374 conn->dst_type, ev->status); > 3375 goto unlock; > 3376 } > 3377 > 3378 conn->state = BT_CLOSED; > 3379 > 3380 mgmt_connected = test_and_clear_bit(HCI_CONN_MGMT_CONNECTED, &conn->flags); > 3381 > > > I double checked this and apparently this could no longer fail: > > > > + /* Disregard possible errors since hci_conn_del shall have been > > + * called even in case of errors had occurred since it would > > + * then cause hci_conn_failed to be called which calls > > + * hci_conn_del internally. > > + */ > > + hci_abort_conn_sync(hdev, conn, reason); > > > > So it will clean up the hci_conn no matter what is the timeout, so > > either you don't have this change or it is not working for some > > reason. > > > The issue is mgmt command is not repsonsed by bluetoothd, then the bluetootlctl is > blocked. It does not happen during the power off stage. It happened when disconnect > a BLE device, but the disconnect complete event sent from controller to host 2s later. > Then it causes the mgmt in bluetoothctl is blocked as decribed as above. There is a difference about a MGMT command, initiated by bluetoothd, versus a MGMT event initiated by the kernel, so the daemon is not blocked it just don't get a disconnection event, which is different than a command complete. What is the command sequence that you use to reproduce the problem? > >>> hci_dev_lock(hdev); > >>> > >>> /* Check if the connection has been cleaned up concurrently */ > >>> > >>> base-commit: e25c8d66f6786300b680866c0e0139981273feba > >>> -- > >>> 2.34.1 > >>> > >> > >> > >> -- > >> Luiz Augusto von Dentz > > > > > > > -- Luiz Augusto von Dentz