Hi Luiz, On 12/17/2024 11:13 PM, Luiz Augusto von Dentz wrote: > Hi Cheng, > > On Tue, Dec 17, 2024 at 12:51 AM Cheng Jiang (IOE) > <quic_chejiang@xxxxxxxxxxx> wrote: >> >> Hi Luiz, >> >> On 12/17/2024 11:15 AM, Luiz Augusto von Dentz wrote: >>> Hi Cheng, >>> >>> On Mon, Dec 16, 2024 at 9:49 PM Cheng Jiang (IOE) >>> <quic_chejiang@xxxxxxxxxxx> wrote: >>>> >>>> Hi Luiz, >>>> >>>> On 12/16/2024 10:42 PM, Luiz Augusto von Dentz wrote: >>>>> Hi Cheng, >>>>> >>>>> On Mon, Dec 16, 2024 at 9:32 AM Luiz Augusto von Dentz >>>>> <luiz.dentz@xxxxxxxxx> wrote: >>>>>> >>>>>> Hi Cheng, >>>>>> >>>>>> On Mon, Dec 16, 2024 at 3:08 AM Cheng Jiang <quic_chejiang@xxxxxxxxxxx> wrote: >>>>>>> >>>>>>> Sometimes, the remote device doesn't acknowledge the LL_TERMINATE_IND >>>>>>> in time, requiring the controller to wait for the supervision timeout, >>>>>>> which may exceed 2 seconds. In the current implementation, the >>>>>>> HCI_EV_DISCONN_COMPLETE event is ignored if it arrives late, since >>>>>>> the hci_abort_conn_sync has cleaned up the connection after 2 seconds. >>>>>>> This causes the mgmt to get stuck, resulting in bluetoothd waiting >>>>>>> indefinitely for the mgmt response to the disconnect. To recover, >>>>>>> restarting bluetoothd is necessary. >>>>>>> >>>>>>> bluetoothctl log like this: >>>>>>> [Designer Mouse]# disconnect D9:B5:6C:F2:51:91 >>>>>>> Attempting to disconnect from D9:B5:6C:F2:51:91 >>>>>>> [Designer Mouse]# >>>>>>> [Designer Mouse]# power off >>>>>>> [Designer Mouse]# >>>>>>> Failed to set power off: org.freedesktop.DBus.Error.NoReply. >>>>>>> >>>>>>> Signed-off-by: Cheng Jiang <quic_chejiang@xxxxxxxxxxx> >>>>>>> --- >>>>>>> include/net/bluetooth/hci_core.h | 2 ++ >>>>>>> net/bluetooth/hci_conn.c | 9 +++++++++ >>>>>>> net/bluetooth/hci_event.c | 9 +++++++++ >>>>>>> net/bluetooth/hci_sync.c | 18 ++++++++++++++++++ >>>>>>> 4 files changed, 38 insertions(+) >>>>>>> >>>>>>> diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h >>>>>>> index 734cd50cd..2ab079dcf 100644 >>>>>>> --- a/include/net/bluetooth/hci_core.h >>>>>>> +++ b/include/net/bluetooth/hci_core.h >>>>>>> @@ -753,6 +753,8 @@ struct hci_conn { >>>>>>> >>>>>>> struct bt_codec codec; >>>>>>> >>>>>>> + struct completion disc_ev_comp; >>>>>>> + >>>>>>> void (*connect_cfm_cb) (struct hci_conn *conn, u8 status); >>>>>>> void (*security_cfm_cb) (struct hci_conn *conn, u8 status); >>>>>>> void (*disconn_cfm_cb) (struct hci_conn *conn, u8 reason); >>>>>>> diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c >>>>>>> index d097e308a..e0244e191 100644 >>>>>>> --- a/net/bluetooth/hci_conn.c >>>>>>> +++ b/net/bluetooth/hci_conn.c >>>>>>> @@ -1028,6 +1028,15 @@ static struct hci_conn *__hci_conn_add(struct hci_dev *hdev, int type, bdaddr_t >>>>>>> >>>>>>> hci_conn_init_sysfs(conn); >>>>>>> >>>>>>> + /* This disc_ev_comp is inited when we send a disconnect request to >>>>>>> + * the remote device but fail to receive the disconnect complete >>>>>>> + * event within the expected time (2 seconds). This occurs because >>>>>>> + * the remote device doesn't ack the terminate indication, forcing >>>>>>> + * the controller to wait for the supervision timeout. >>>>>>> + */ >>>>>>> + init_completion(&conn->disc_ev_comp); >>>>>>> + complete(&conn->disc_ev_comp); >>>>>>> + >>>>>>> return conn; >>>>>>> } >>>>>>> >>>>>>> diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c >>>>>>> index 2cc7a9306..60ecb2b18 100644 >>>>>>> --- a/net/bluetooth/hci_event.c >>>>>>> +++ b/net/bluetooth/hci_event.c >>>>>>> @@ -3366,6 +3366,15 @@ static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data, >>>>>>> if (!conn) >>>>>>> goto unlock; >>>>>>> >>>>>>> + /* Wake up disc_ev_comp here is ok. Since we hold the hdev lock >>>>>>> + * hci_abort_conn_sync will wait hdev lock release to continue. >>>>>>> + */ >>>>>>> + if (!completion_done(&conn->disc_ev_comp)) { >>>>>>> + complete(&conn->disc_ev_comp); >>>>>>> + /* Add some delay for hci_abort_conn_sync to handle the complete */ >>>>>>> + usleep_range(100, 1000); >>>>>>> + } >>>>>>> + >>>>>>> if (ev->status) { >>>>>>> mgmt_disconnect_failed(hdev, &conn->dst, conn->type, >>>>>>> conn->dst_type, ev->status); >>>>>>> diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c >>>>>>> index 0badec712..783d04b57 100644 >>>>>>> --- a/net/bluetooth/hci_sync.c >>>>>>> +++ b/net/bluetooth/hci_sync.c >>>>>>> @@ -5590,6 +5590,24 @@ int hci_abort_conn_sync(struct hci_dev *hdev, struct hci_conn *conn, u8 reason) >>>>>>> break; >>>>>>> } >>>>>>> >>>>>>> + /* Check whether the connection is successfully disconnected. >>>>>>> + * Sometimes the remote device doesn't acknowledge the >>>>>>> + * LL_TERMINATE_IND in time, requiring the controller to wait >>>>>>> + * for the supervision timeout, which may exceed 2 seconds. In >>>>>>> + * this case, we need to wait for the HCI_EV_DISCONN_COMPLETE >>>>>>> + * event before cleaning up the connection. >>>>>>> + */ >>>>>>> + if (err == -ETIMEDOUT) { >>>>>>> + u32 idle_delay = msecs_to_jiffies(10 * conn->le_supv_timeout); >>>>>>> + >>>>>>> + reinit_completion(&conn->disc_ev_comp); >>>>>>> + if (!wait_for_completion_timeout(&conn->disc_ev_comp, idle_delay)) { >>>>>>> + bt_dev_warn(hdev, "Failed to get complete"); >>>>>>> + mgmt_disconnect_failed(hdev, &conn->dst, conn->type, >>>>>>> + conn->dst_type, conn->abort_reason); >>>>>>> + } >>>>>>> + } >>>>>> >>>>>> Why don't we just set the supervision timeout as timeout then? If we >>>>>> will have to wait for it anyway just change hci_disconnect_sync to use >>>>>> 10 * conn->le_supv_timeout as timeout instead. >>>>>> >>>> hci_disconnect_sync uses __hci_cmd_sync_status_sk to wait for the >>>> HCI_EV_DISCONN_COMPLETE event, which will send the command in hci_cmd_work. >>>> In hci_cmd_work, it will start a timer with HCI_CMD_TIMEOUT(2s) to wait >>>> for the event. So even in hci_disconnect_sync we can set the timeout to >>>> supervision timeout, hci_disconnect_sync still timeout after 2s. >>> >>> Wait, why are you talking about HCI_CMD_TIMEOUT when I told you to use >>> the supervision timeout instead? If it still timeout after to 2 >>> seconds then there is something still forcing HCI_CMD_TIMEOUT which >>> shouldn't happen. >>> >> Since the lower layer (hci_cmd_work) has set the timeout to HCI_CMD_TIMEOUT, so >> even the upper layer set to a larger timeout value, like supervision timeout, >> it still get the timeout after HCI_CMD_TIMEOUT. The timeout flow is: >> hci_disconnect_sync -> __hci_cmd_sync_sk(wait_event_interruptible_timeout) -> >> hci_cmd_work -> hci_cmd_timeout -> hci_cmd_sync_cancel_sync -> wake up the >> wait_event_interruptible_timeout in __hci_cmd_sync_sk -> hci_disconnect_sync timeout. >> >> So even if we set a large timeout value in hci_disconnect_sync, it doesn't work >> since it's waked up by other event, not the real timeout. >> >> What's more, in the hci_disconnect_sync, if the reason it not power_off, it waits >> for the disconnect complete event rather than command status event. According to >> BT core spec, disconnect complete event has to wait for remote's ack or wait until >> supervision timeout. It's a valid case that the disconnect complete event taking >> more than 2s. > > You seems to be confusing the role of 2 different timers: > > err = wait_event_interruptible_timeout(hdev->req_wait_q, > hdev->req_status != HCI_REQ_PEND, > timeout); > > and > > queue_delayed_work(hdev->workqueue, &hdev->cmd_timer, > HCI_CMD_TIMEOUT); > > The former waits for a specific event, while the later waits for > handle_cmd_cnt_and_timer, each can have a distinct timeout as in the > code bellow: > > return __hci_cmd_sync_status_sk(hdev, HCI_OP_LE_EXT_CREATE_CONN, > plen, data, > HCI_EV_LE_ENHANCED_CONN_COMPLETE, > conn->conn_timeout, NULL); > > The reason there are 2 timers is that we need to track the number of > commands outstanding in the controller, and no you can't delay Command > Status: > > When the Controller receives the HCI_Disconnect command, it _shall_ send the > HCI_Command_Status event to the Host. > > So even if HCI_Disconnection_Complete is delayed the following shall > still work provided the HCI_Command_Status is still being generated > accordingly: > > index 0badec7120ab..0ab607fb6433 100644 > --- a/net/bluetooth/hci_sync.c > +++ b/net/bluetooth/hci_sync.c > @@ -5444,7 +5444,7 @@ static int hci_disconnect_sync(struct hci_dev > *hdev, struct hci_conn *conn, > return __hci_cmd_sync_status_sk(hdev, HCI_OP_DISCONNECT, > sizeof(cp), &cp, > HCI_EV_DISCONN_COMPLETE, > - HCI_CMD_TIMEOUT, NULL); > + conn->disc_timeout, NULL); > > return __hci_cmd_sync_status(hdev, HCI_OP_DISCONNECT, sizeof(cp), &cp, > HCI_CMD_TIMEOUT); > > Then we need to adjust conn->disc_timeout according to supervision > timeout, that said it appears disc_timeout is actually acting as idle > timeout, so perhaps we need a separate field, also it doesn't look > like we track the supervision timeout for non-LE links. > Yes, you are right. I mixed up the two timers. Thank you for the detail explanation. I just found you have fixed this issue by 227a0cdf4a028a73dc256d0f5144b4808d718893. Link: https://github.com/bluez/bluez/issues/932 We are using an old codebase. Sorry for this. BTW, do you think we should change the timeout value here? Set the timeout value based on the link type. @@ -5419,6 +5419,7 @@ static int hci_disconnect_sync(struct hci_dev *hdev, struct hci_conn *conn, u8 reason) { struct hci_cp_disconnect cp; + u32 disc_timeout; if (test_bit(HCI_CONN_BIG_CREATED, &conn->flags)) { /* This is a BIS connection, hci_conn_del will @@ -5440,11 +5441,18 @@ static int hci_disconnect_sync(struct hci_dev *hdev, struct hci_conn *conn, * used when suspending or powering off, where we don't want to wait * for the peer's response. */ - if (reason != HCI_ERROR_REMOTE_POWER_OFF) + + if (reason != HCI_ERROR_REMOTE_POWER_OFF) { + if (conn->type == LE_LINK) + disc_timeout = msecs_to_jiffies(10 * conn->le_supv_timeout); + else + disc_timeout = conn->disc_timeout; + return __hci_cmd_sync_status_sk(hdev, HCI_OP_DISCONNECT, sizeof(cp), &cp, HCI_EV_DISCONN_COMPLETE, - HCI_CMD_TIMEOUT, NULL); + disc_timeout, NULL); + } >>>>>> That said, we really need to fix bluetoothd if it is not able to be >>>>>> cleaned up if SET_POWERED command fails, but it looks like it is >>>>>> handling errors correctly so it sounds like something else is at play. >>>>> >>>> The issue arises after a 2-second timeout of hci_disconnect_sync. During >>>> hci_abort_conn_sync, the connection is cleaned up by hci_conn_failed. >>>> after supervision timeout, the disconnect complete event arrives, but >>>> it returns at line 3370 since the connection has already been removed. >>>> As a result, bluetoothd does not send the mgmt event for MGMT_OP_DISCONNECT >>>> to the application layer (bluetoothctl), causing bluetoothctl to get stuck >>>> and unable to perform other mgmt commands. >>> >>> The command shall have completed regardless if disconnect complete has >>> been received or not, the is the whole point of having a timeout, so >>> this makes no sense to me, or you are not describing what is happening >>> here. Also there is no MGMT_OP_DISCONNECT pending, it is >>> MGMT_OP_SET_POWERED, if you are talking about the DISCONNECTED event >>> that is a totally different thing and perhaps that is the source of >>> the problem because if we do cleanup hci_conn even in case the command >>> fails/times out then we should be generating a disconnected event as >>> well. >>> >> Here is the flow describe the issue. For normal case: >> ┌───────────┐ ┌──────────┐ ┌──────┐ ┌──────────┐ ┌──────┐ >> │bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│ >> └─────┬─────┘ └─────┬────┘ └───┬──┘ └─────┬────┘ └───┬──┘ >> │ disconnect cmd │ │ │ │ >> │─────────────────────>│ │ │ │ >> │ │ │ │ │ >> │ │ MGMT_OP │ │ │ >> │ │ _DISCONNECT │ │ │ >> │ │───────────────────>│ │ │ >> │ │ │ │ │ >> │ │ │ HCI_Disconnect │ │ >> │ │ │───────────────────>│ │ >> │ │ │ │ │ >> │ │ │ │ LL_TERMINATE │ >> │ │ │ │ _IND │ >> │ │ │ │───────────────────>│ >> │ │ │ │ │ >> │ │ │ │ ACK │ >> │ │ │ │<───────────────────│ >> │ │ │ │ │ >> │ │ │ Disc_Comp_Evt │ │ >> │ │ │<───────────────────│ │ >> │ │ │ │ │ >> │ │ MGMT Response │ │ │ >> │ │<───────────────────│ │ │ >> │ │ │ │ │ >> │ disc succ │ │ │ │ >> │<─────────────────────│ │ │ │ >> ┌─────┴─────┐ ┌─────┴────┐ ┌───┴──┐ ┌─────┴────┐ ┌───┴──┐ >> │bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│ >> └───────────┘ └──────────┘ └──────┘ └──────────┘ └──────┘ >> >> >> The failure case like this: >> >> ┌───────────┐ ┌──────────┐ ┌──────┐ ┌──────────┐ ┌──────┐ >> │bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│ >> └─────┬─────┘ └─────┬────┘ └───┬──┘ └─────┬────┘ └───┬──┘ >> │ disconnect │ │ │ │ >> │ cmd │ │ │ │ >> │─────────────────────>│ │ │ │ >> │ │ │ │ │ >> │ │ MGMT_OP_ │ │ │ >> │ │ DISCONNECT │ │ │ >> │ │───────────────────>│ │ │ >> │ │ │ │ │ >> │ │ ┌┴┐ HCI_ │ │ >> │ │ │ │ Disconnect │ │ >> │ │ │ │ ──────────────────>│ │ >> │ │ │ │ │ │ >> │ │ │ │ │ LL_TERMINATE ┌┴┐ >> │ │ │ │ │ _IND │ │ >> │ │ │ │ │─────────────────> │ │ >> │ │ │ │ │ │ │ >> │ │ │ │ 2s │ │ │ >> │ │ │ │ │ │ │ >> │ │ │ │ │ │ │ More >> │ │ │ │ │ │ │ than >> │ │ │ │ │ │ │ 2s >> │ │ │ │ │ │ │ >> │ │ │ │ │ │ │ >> │ │ │ │ │ │ │ >> │ │ └┬┘ │ │ │ >> │ │ │────┐ │ │ │ >> │ │ │ │ hci_disconnect │ │ │ >> │ │ │<───┘ sync timeout, │ │ │ >> │ │ │ del 'con' by │ │ │ >> │ │ │ hci_conn_failed│ │ │ >> │ │ │ │ └┬┘ >> │ │ │ │ ACK │ >> │ │ │ │<───────────────────│ >> │ │ │ │ │ >> │ │ │ Disc_Comp_Evt │ │ >> │ │ │<────────────────────│ │ >> │ │ │ │ │ >> │ │ │────┐ │ │ >> │ \│/ │ │ ignore the │ │ >> │ X │<───┘ event since │ │ >> │ /│\ │ 'con' has been │ │ >> │ │ MGMT │ deleted │ │ >> │ │ Response │ │ │ >> │ │<─ ─ ─ ─ ─ ─ ─ ─ ─ ─│ │ │ >> ┌─────┴─────┐ ┌─────┴────┐ ┌───┴──┐ ┌─────┴────┐ ┌───┴──┐ >> │bluetootctl│ │bluetoothd│ │kernel│ │controller│ │remote│ >> └───────────┘ └──────────┘ └──────┘ └──────────┘ └──────┘ >> >> >> So in the failure case, the bluetoothd is blocked by waiting the mgmt response from kernel. From >> our test, bluetoothd can't accept any command related to mgmt from bluetothctl. >> >>>> >>>> 3355 static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data, >>>> 3356 struct sk_buff *skb) >>>> 3357 { >>>> 3358 struct hci_ev_disconn_complete *ev = data; >>>> 3359 u8 reason; >>>> 3360 struct hci_conn_params *params; >>>> 3361 struct hci_conn *conn; >>>> 3362 bool mgmt_connected; >>>> 3363 >>>> 3364 bt_dev_dbg(hdev, "status 0x%2.2x", ev->status); >>>> 3365 >>>> 3366 hci_dev_lock(hdev); >>>> 3367 >>>> 3368 conn = hci_conn_hash_lookup_handle(hdev, __le16_to_cpu(ev->handle)); >>>> 3369 if (!conn) >>>> 3370 goto unlock; >>>> 3371 >>>> 3372 if (ev->status) { >>>> 3373 mgmt_disconnect_failed(hdev, &conn->dst, conn->type, >>>> 3374 conn->dst_type, ev->status); >>>> 3375 goto unlock; >>>> 3376 } >>>> 3377 >>>> 3378 conn->state = BT_CLOSED; >>>> 3379 >>>> 3380 mgmt_connected = test_and_clear_bit(HCI_CONN_MGMT_CONNECTED, &conn->flags); >>>> 3381 >>>> >>>>> I double checked this and apparently this could no longer fail: >>>>> >>>>> + /* Disregard possible errors since hci_conn_del shall have been >>>>> + * called even in case of errors had occurred since it would >>>>> + * then cause hci_conn_failed to be called which calls >>>>> + * hci_conn_del internally. >>>>> + */ >>>>> + hci_abort_conn_sync(hdev, conn, reason); >>>>> >>>>> So it will clean up the hci_conn no matter what is the timeout, so >>>>> either you don't have this change or it is not working for some >>>>> reason. >>>>> >>>> The issue is mgmt command is not repsonsed by bluetoothd, then the bluetootlctl is >>>> blocked. It does not happen during the power off stage. It happened when disconnect >>>> a BLE device, but the disconnect complete event sent from controller to host 2s later. >>>> Then it causes the mgmt in bluetoothctl is blocked as decribed as above. >>> >>> There is a difference about a MGMT command, initiated by bluetoothd, >>> versus a MGMT event initiated by the kernel, so the daemon is not >>> blocked it just don't get a disconnection event, which is different >>> than a command complete. >>> >>> What is the command sequence that you use to reproduce the problem? >> Here is the command log: >> [CHG] Controller 8C:FD:F0:21:81:87 Pairable: yes >> [bluetooth]# power on >> Changing power on succeeded >> [bluetooth]# connect CF:90:67:3C:7A:56 >> Attempting to connect to CF:90:67:3C:7A:56 >> [CHG] Device CF:90:67:3C:7A:56 Connected: yes >> Connection successful >> [CHG] Device CF:90:67:3C:7A:56 ServicesResolved: yes >> [Designer Mouse]# >> [Designer Mouse]# disconnect D9:B5:6C:F2:51:91 >> Attempting to disconnect from D9:B5:6C:F2:51:91 ## no disconnection success response >> [Designer Mouse]# >> [Designer Mouse]# power off >> [Designer Mouse]# >> Failed to set power off: org.freedesktop.DBus.Error.NoReply >> >> To easily reproduce this issue, we use a firmware which always send the disconnect >> complete event more than 2s. Then the issue occurred 100%. >> >> Actually, the root cause is the hci_disconnect_sync doesn't handle this case since it >> relies on __hci_cmd_sync_status_sk, which maximum timeout value is constrained to 2s. >> >>> >>>>>>> hci_dev_lock(hdev); >>>>>>> >>>>>>> /* Check if the connection has been cleaned up concurrently */ >>>>>>> >>>>>>> base-commit: e25c8d66f6786300b680866c0e0139981273feba >>>>>>> -- >>>>>>> 2.34.1 >>>>>>> >>>>>> >>>>>> >>>>>> -- >>>>>> Luiz Augusto von Dentz >>>>> >>>>> >>>>> >>>> >>> >>> >> > >