Hi Cheng, On Tue, Dec 17, 2024 at 12:51 AM Cheng Jiang (IOE) <quic_chejiang@xxxxxxxxxxx> wrote: > > Hi Luiz, > > On 12/17/2024 11:15 AM, Luiz Augusto von Dentz wrote: > > Hi Cheng, > > > > On Mon, Dec 16, 2024 at 9:49 PM Cheng Jiang (IOE) > > <quic_chejiang@xxxxxxxxxxx> wrote: > >> > >> Hi Luiz, > >> > >> On 12/16/2024 10:42 PM, Luiz Augusto von Dentz wrote: > >>> Hi Cheng, > >>> > >>> On Mon, Dec 16, 2024 at 9:32 AM Luiz Augusto von Dentz > >>> <luiz.dentz@xxxxxxxxx> wrote: > >>>> > >>>> Hi Cheng, > >>>> > >>>> On Mon, Dec 16, 2024 at 3:08 AM Cheng Jiang <quic_chejiang@xxxxxxxxxxx> wrote: > >>>>> > >>>>> Sometimes, the remote device doesn't acknowledge the LL_TERMINATE_IND > >>>>> in time, requiring the controller to wait for the supervision timeout, > >>>>> which may exceed 2 seconds. In the current implementation, the > >>>>> HCI_EV_DISCONN_COMPLETE event is ignored if it arrives late, since > >>>>> the hci_abort_conn_sync has cleaned up the connection after 2 seconds. > >>>>> This causes the mgmt to get stuck, resulting in bluetoothd waiting > >>>>> indefinitely for the mgmt response to the disconnect. To recover, > >>>>> restarting bluetoothd is necessary. > >>>>> > >>>>> bluetoothctl log like this: > >>>>> [Designer Mouse]# disconnect D9:B5:6C:F2:51:91 > >>>>> Attempting to disconnect from D9:B5:6C:F2:51:91 > >>>>> [Designer Mouse]# > >>>>> [Designer Mouse]# power off > >>>>> [Designer Mouse]# > >>>>> Failed to set power off: org.freedesktop.DBus.Error.NoReply. > >>>>> > >>>>> Signed-off-by: Cheng Jiang <quic_chejiang@xxxxxxxxxxx> > >>>>> --- > >>>>> include/net/bluetooth/hci_core.h | 2 ++ > >>>>> net/bluetooth/hci_conn.c | 9 +++++++++ > >>>>> net/bluetooth/hci_event.c | 9 +++++++++ > >>>>> net/bluetooth/hci_sync.c | 18 ++++++++++++++++++ > >>>>> 4 files changed, 38 insertions(+) > >>>>> > >>>>> diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h > >>>>> index 734cd50cd..2ab079dcf 100644 > >>>>> --- a/include/net/bluetooth/hci_core.h > >>>>> +++ b/include/net/bluetooth/hci_core.h > >>>>> @@ -753,6 +753,8 @@ struct hci_conn { > >>>>> > >>>>> struct bt_codec codec; > >>>>> > >>>>> + struct completion disc_ev_comp; > >>>>> + > >>>>> void (*connect_cfm_cb) (struct hci_conn *conn, u8 status); > >>>>> void (*security_cfm_cb) (struct hci_conn *conn, u8 status); > >>>>> void (*disconn_cfm_cb) (struct hci_conn *conn, u8 reason); > >>>>> diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c > >>>>> index d097e308a..e0244e191 100644 > >>>>> --- a/net/bluetooth/hci_conn.c > >>>>> +++ b/net/bluetooth/hci_conn.c > >>>>> @@ -1028,6 +1028,15 @@ static struct hci_conn *__hci_conn_add(struct hci_dev *hdev, int type, bdaddr_t > >>>>> > >>>>> hci_conn_init_sysfs(conn); > >>>>> > >>>>> + /* This disc_ev_comp is inited when we send a disconnect request to > >>>>> + * the remote device but fail to receive the disconnect complete > >>>>> + * event within the expected time (2 seconds). This occurs because > >>>>> + * the remote device doesn't ack the terminate indication, forcing > >>>>> + * the controller to wait for the supervision timeout. > >>>>> + */ > >>>>> + init_completion(&conn->disc_ev_comp); > >>>>> + complete(&conn->disc_ev_comp); > >>>>> + > >>>>> return conn; > >>>>> } > >>>>> > >>>>> diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c > >>>>> index 2cc7a9306..60ecb2b18 100644 > >>>>> --- a/net/bluetooth/hci_event.c > >>>>> +++ b/net/bluetooth/hci_event.c > >>>>> @@ -3366,6 +3366,15 @@ static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data, > >>>>> if (!conn) > >>>>> goto unlock; > >>>>> > >>>>> + /* Wake up disc_ev_comp here is ok. Since we hold the hdev lock > >>>>> + * hci_abort_conn_sync will wait hdev lock release to continue. > >>>>> + */ > >>>>> + if (!completion_done(&conn->disc_ev_comp)) { > >>>>> + complete(&conn->disc_ev_comp); > >>>>> + /* Add some delay for hci_abort_conn_sync to handle the complete */ > >>>>> + usleep_range(100, 1000); > >>>>> + } > >>>>> + > >>>>> if (ev->status) { > >>>>> mgmt_disconnect_failed(hdev, &conn->dst, conn->type, > >>>>> conn->dst_type, ev->status); > >>>>> diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c > >>>>> index 0badec712..783d04b57 100644 > >>>>> --- a/net/bluetooth/hci_sync.c > >>>>> +++ b/net/bluetooth/hci_sync.c > >>>>> @@ -5590,6 +5590,24 @@ int hci_abort_conn_sync(struct hci_dev *hdev, struct hci_conn *conn, u8 reason) > >>>>> break; > >>>>> } > >>>>> > >>>>> + /* Check whether the connection is successfully disconnected. > >>>>> + * Sometimes the remote device doesn't acknowledge the > >>>>> + * LL_TERMINATE_IND in time, requiring the controller to wait > >>>>> + * for the supervision timeout, which may exceed 2 seconds. In > >>>>> + * this case, we need to wait for the HCI_EV_DISCONN_COMPLETE > >>>>> + * event before cleaning up the connection. > >>>>> + */ > >>>>> + if (err == -ETIMEDOUT) { > >>>>> + u32 idle_delay = msecs_to_jiffies(10 * conn->le_supv_timeout); > >>>>> + > >>>>> + reinit_completion(&conn->disc_ev_comp); > >>>>> + if (!wait_for_completion_timeout(&conn->disc_ev_comp, idle_delay)) { > >>>>> + bt_dev_warn(hdev, "Failed to get complete"); > >>>>> + mgmt_disconnect_failed(hdev, &conn->dst, conn->type, > >>>>> + conn->dst_type, conn->abort_reason); > >>>>> + } > >>>>> + } > >>>> > >>>> Why don't we just set the supervision timeout as timeout then? If we > >>>> will have to wait for it anyway just change hci_disconnect_sync to use > >>>> 10 * conn->le_supv_timeout as timeout instead. > >>>> > >> hci_disconnect_sync uses __hci_cmd_sync_status_sk to wait for the > >> HCI_EV_DISCONN_COMPLETE event, which will send the command in hci_cmd_work. > >> In hci_cmd_work, it will start a timer with HCI_CMD_TIMEOUT(2s) to wait > >> for the event. So even in hci_disconnect_sync we can set the timeout to > >> supervision timeout, hci_disconnect_sync still timeout after 2s. > > > > Wait, why are you talking about HCI_CMD_TIMEOUT when I told you to use > > the supervision timeout instead? If it still timeout after to 2 > > seconds then there is something still forcing HCI_CMD_TIMEOUT which > > shouldn't happen. > > > Since the lower layer (hci_cmd_work) has set the timeout to HCI_CMD_TIMEOUT, so > even the upper layer set to a larger timeout value, like supervision timeout, > it still get the timeout after HCI_CMD_TIMEOUT. The timeout flow is: > hci_disconnect_sync -> __hci_cmd_sync_sk(wait_event_interruptible_timeout) -> > hci_cmd_work -> hci_cmd_timeout -> hci_cmd_sync_cancel_sync -> wake up the > wait_event_interruptible_timeout in __hci_cmd_sync_sk -> hci_disconnect_sync timeout. > > So even if we set a large timeout value in hci_disconnect_sync, it doesn't work > since it's waked up by other event, not the real timeout. > > What's more, in the hci_disconnect_sync, if the reason it not power_off, it waits > for the disconnect complete event rather than command status event. According to > BT core spec, disconnect complete event has to wait for remote's ack or wait until > supervision timeout. It's a valid case that the disconnect complete event taking > more than 2s. You seems to be confusing the role of 2 different timers: err = wait_event_interruptible_timeout(hdev->req_wait_q, hdev->req_status != HCI_REQ_PEND, timeout); and queue_delayed_work(hdev->workqueue, &hdev->cmd_timer, HCI_CMD_TIMEOUT); The former waits for a specific event, while the later waits for handle_cmd_cnt_and_timer, each can have a distinct timeout as in the code bellow: return __hci_cmd_sync_status_sk(hdev, HCI_OP_LE_EXT_CREATE_CONN, plen, data, HCI_EV_LE_ENHANCED_CONN_COMPLETE, conn->conn_timeout, NULL); The reason there are 2 timers is that we need to track the number of commands outstanding in the controller, and no you can't delay Command Status: When the Controller receives the HCI_Disconnect command, it _shall_ send the HCI_Command_Status event to the Host. So even if HCI_Disconnection_Complete is delayed the following shall still work provided the HCI_Command_Status is still being generated accordingly: index 0badec7120ab..0ab607fb6433 100644 --- a/net/bluetooth/hci_sync.c +++ b/net/bluetooth/hci_sync.c @@ -5444,7 +5444,7 @@ static int hci_disconnect_sync(struct hci_dev *hdev, struct hci_conn *conn, return __hci_cmd_sync_status_sk(hdev, HCI_OP_DISCONNECT, sizeof(cp), &cp, HCI_EV_DISCONN_COMPLETE, - HCI_CMD_TIMEOUT, NULL); + conn->disc_timeout, NULL); return __hci_cmd_sync_status(hdev, HCI_OP_DISCONNECT, sizeof(cp), &cp, HCI_CMD_TIMEOUT); Then we need to adjust conn->disc_timeout according to supervision timeout, that said it appears disc_timeout is actually acting as idle timeout, so perhaps we need a separate field, also it doesn't look like we track the supervision timeout for non-LE links. > >>>> That said, we really need to fix bluetoothd if it is not able to be > >>>> cleaned up if SET_POWERED command fails, but it looks like it is > >>>> handling errors correctly so it sounds like something else is at play. > >>> > >> The issue arises after a 2-second timeout of hci_disconnect_sync. During > >> hci_abort_conn_sync, the connection is cleaned up by hci_conn_failed. > >> after supervision timeout, the disconnect complete event arrives, but > >> it returns at line 3370 since the connection has already been removed. > >> As a result, bluetoothd does not send the mgmt event for MGMT_OP_DISCONNECT > >> to the application layer (bluetoothctl), causing bluetoothctl to get stuck > >> and unable to perform other mgmt commands. > > > > The command shall have completed regardless if disconnect complete has > > been received or not, the is the whole point of having a timeout, so > > this makes no sense to me, or you are not describing what is happening > > here. Also there is no MGMT_OP_DISCONNECT pending, it is > > MGMT_OP_SET_POWERED, if you are talking about the DISCONNECTED event > > that is a totally different thing and perhaps that is the source of > > the problem because if we do cleanup hci_conn even in case the command > > fails/times out then we should be generating a disconnected event as > > well. > > > Here is the flow describe the issue. For normal case: > ┌───────────┐ ┌──────────┐ ┌──────┐ ┌──────────┐ ┌──────┐ > │bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│ > └─────┬─────┘ └─────┬────┘ └───┬──┘ └─────┬────┘ └───┬──┘ > │ disconnect cmd │ │ │ │ > │─────────────────────>│ │ │ │ > │ │ │ │ │ > │ │ MGMT_OP │ │ │ > │ │ _DISCONNECT │ │ │ > │ │───────────────────>│ │ │ > │ │ │ │ │ > │ │ │ HCI_Disconnect │ │ > │ │ │───────────────────>│ │ > │ │ │ │ │ > │ │ │ │ LL_TERMINATE │ > │ │ │ │ _IND │ > │ │ │ │───────────────────>│ > │ │ │ │ │ > │ │ │ │ ACK │ > │ │ │ │<───────────────────│ > │ │ │ │ │ > │ │ │ Disc_Comp_Evt │ │ > │ │ │<───────────────────│ │ > │ │ │ │ │ > │ │ MGMT Response │ │ │ > │ │<───────────────────│ │ │ > │ │ │ │ │ > │ disc succ │ │ │ │ > │<─────────────────────│ │ │ │ > ┌─────┴─────┐ ┌─────┴────┐ ┌───┴──┐ ┌─────┴────┐ ┌───┴──┐ > │bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│ > └───────────┘ └──────────┘ └──────┘ └──────────┘ └──────┘ > > > The failure case like this: > > ┌───────────┐ ┌──────────┐ ┌──────┐ ┌──────────┐ ┌──────┐ > │bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│ > └─────┬─────┘ └─────┬────┘ └───┬──┘ └─────┬────┘ └───┬──┘ > │ disconnect │ │ │ │ > │ cmd │ │ │ │ > │─────────────────────>│ │ │ │ > │ │ │ │ │ > │ │ MGMT_OP_ │ │ │ > │ │ DISCONNECT │ │ │ > │ │───────────────────>│ │ │ > │ │ │ │ │ > │ │ ┌┴┐ HCI_ │ │ > │ │ │ │ Disconnect │ │ > │ │ │ │ ──────────────────>│ │ > │ │ │ │ │ │ > │ │ │ │ │ LL_TERMINATE ┌┴┐ > │ │ │ │ │ _IND │ │ > │ │ │ │ │─────────────────> │ │ > │ │ │ │ │ │ │ > │ │ │ │ 2s │ │ │ > │ │ │ │ │ │ │ > │ │ │ │ │ │ │ More > │ │ │ │ │ │ │ than > │ │ │ │ │ │ │ 2s > │ │ │ │ │ │ │ > │ │ │ │ │ │ │ > │ │ │ │ │ │ │ > │ │ └┬┘ │ │ │ > │ │ │────┐ │ │ │ > │ │ │ │ hci_disconnect │ │ │ > │ │ │<───┘ sync timeout, │ │ │ > │ │ │ del 'con' by │ │ │ > │ │ │ hci_conn_failed│ │ │ > │ │ │ │ └┬┘ > │ │ │ │ ACK │ > │ │ │ │<───────────────────│ > │ │ │ │ │ > │ │ │ Disc_Comp_Evt │ │ > │ │ │<────────────────────│ │ > │ │ │ │ │ > │ │ │────┐ │ │ > │ \│/ │ │ ignore the │ │ > │ X │<───┘ event since │ │ > │ /│\ │ 'con' has been │ │ > │ │ MGMT │ deleted │ │ > │ │ Response │ │ │ > │ │<─ ─ ─ ─ ─ ─ ─ ─ ─ ─│ │ │ > ┌─────┴─────┐ ┌─────┴────┐ ┌───┴──┐ ┌─────┴────┐ ┌───┴──┐ > │bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│ > └───────────┘ └──────────┘ └──────┘ └──────────┘ └──────┘ > > > So in the failure case, the bluetoothd is blocked by waiting the mgmt response from kernel. From > our test, bluetoothd can't accept any command related to mgmt from bluetothctl. > > >> > >> 3355 static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data, > >> 3356 struct sk_buff *skb) > >> 3357 { > >> 3358 struct hci_ev_disconn_complete *ev = data; > >> 3359 u8 reason; > >> 3360 struct hci_conn_params *params; > >> 3361 struct hci_conn *conn; > >> 3362 bool mgmt_connected; > >> 3363 > >> 3364 bt_dev_dbg(hdev, "status 0x%2.2x", ev->status); > >> 3365 > >> 3366 hci_dev_lock(hdev); > >> 3367 > >> 3368 conn = hci_conn_hash_lookup_handle(hdev, __le16_to_cpu(ev->handle)); > >> 3369 if (!conn) > >> 3370 goto unlock; > >> 3371 > >> 3372 if (ev->status) { > >> 3373 mgmt_disconnect_failed(hdev, &conn->dst, conn->type, > >> 3374 conn->dst_type, ev->status); > >> 3375 goto unlock; > >> 3376 } > >> 3377 > >> 3378 conn->state = BT_CLOSED; > >> 3379 > >> 3380 mgmt_connected = test_and_clear_bit(HCI_CONN_MGMT_CONNECTED, &conn->flags); > >> 3381 > >> > >>> I double checked this and apparently this could no longer fail: > >>> > >>> + /* Disregard possible errors since hci_conn_del shall have been > >>> + * called even in case of errors had occurred since it would > >>> + * then cause hci_conn_failed to be called which calls > >>> + * hci_conn_del internally. > >>> + */ > >>> + hci_abort_conn_sync(hdev, conn, reason); > >>> > >>> So it will clean up the hci_conn no matter what is the timeout, so > >>> either you don't have this change or it is not working for some > >>> reason. > >>> > >> The issue is mgmt command is not repsonsed by bluetoothd, then the bluetootlctl is > >> blocked. It does not happen during the power off stage. It happened when disconnect > >> a BLE device, but the disconnect complete event sent from controller to host 2s later. > >> Then it causes the mgmt in bluetoothctl is blocked as decribed as above. > > > > There is a difference about a MGMT command, initiated by bluetoothd, > > versus a MGMT event initiated by the kernel, so the daemon is not > > blocked it just don't get a disconnection event, which is different > > than a command complete. > > > > What is the command sequence that you use to reproduce the problem? > Here is the command log: > [CHG] Controller 8C:FD:F0:21:81:87 Pairable: yes > [bluetooth]# power on > Changing power on succeeded > [bluetooth]# connect CF:90:67:3C:7A:56 > Attempting to connect to CF:90:67:3C:7A:56 > [CHG] Device CF:90:67:3C:7A:56 Connected: yes > Connection successful > [CHG] Device CF:90:67:3C:7A:56 ServicesResolved: yes > [Designer Mouse]# > [Designer Mouse]# disconnect D9:B5:6C:F2:51:91 > Attempting to disconnect from D9:B5:6C:F2:51:91 ## no disconnection success response > [Designer Mouse]# > [Designer Mouse]# power off > [Designer Mouse]# > Failed to set power off: org.freedesktop.DBus.Error.NoReply > > To easily reproduce this issue, we use a firmware which always send the disconnect > complete event more than 2s. Then the issue occurred 100%. > > Actually, the root cause is the hci_disconnect_sync doesn't handle this case since it > relies on __hci_cmd_sync_status_sk, which maximum timeout value is constrained to 2s. > > > > >>>>> hci_dev_lock(hdev); > >>>>> > >>>>> /* Check if the connection has been cleaned up concurrently */ > >>>>> > >>>>> base-commit: e25c8d66f6786300b680866c0e0139981273feba > >>>>> -- > >>>>> 2.34.1 > >>>>> > >>>> > >>>> > >>>> -- > >>>> Luiz Augusto von Dentz > >>> > >>> > >>> > >> > > > > > -- Luiz Augusto von Dentz