Re: [PATCH v1] Bluetooth: hci_sync: Fix disconnect complete event timeout issue

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Cheng,

On Tue, Dec 17, 2024 at 12:51 AM Cheng Jiang (IOE)
<quic_chejiang@xxxxxxxxxxx> wrote:
>
> Hi Luiz,
>
> On 12/17/2024 11:15 AM, Luiz Augusto von Dentz wrote:
> > Hi Cheng,
> >
> > On Mon, Dec 16, 2024 at 9:49 PM Cheng Jiang (IOE)
> > <quic_chejiang@xxxxxxxxxxx> wrote:
> >>
> >> Hi Luiz,
> >>
> >> On 12/16/2024 10:42 PM, Luiz Augusto von Dentz wrote:
> >>> Hi Cheng,
> >>>
> >>> On Mon, Dec 16, 2024 at 9:32 AM Luiz Augusto von Dentz
> >>> <luiz.dentz@xxxxxxxxx> wrote:
> >>>>
> >>>> Hi Cheng,
> >>>>
> >>>> On Mon, Dec 16, 2024 at 3:08 AM Cheng Jiang <quic_chejiang@xxxxxxxxxxx> wrote:
> >>>>>
> >>>>> Sometimes, the remote device doesn't acknowledge the LL_TERMINATE_IND
> >>>>> in time, requiring the controller to wait for the supervision timeout,
> >>>>> which may exceed 2 seconds. In the current implementation, the
> >>>>> HCI_EV_DISCONN_COMPLETE event is ignored if it arrives late, since
> >>>>> the hci_abort_conn_sync has cleaned up the connection after 2 seconds.
> >>>>> This causes the mgmt to get stuck, resulting in bluetoothd waiting
> >>>>> indefinitely for the mgmt response to the disconnect. To recover,
> >>>>> restarting bluetoothd is necessary.
> >>>>>
> >>>>> bluetoothctl log like this:
> >>>>> [Designer Mouse]# disconnect D9:B5:6C:F2:51:91
> >>>>> Attempting to disconnect from D9:B5:6C:F2:51:91
> >>>>> [Designer Mouse]#
> >>>>> [Designer Mouse]# power off
> >>>>> [Designer Mouse]#
> >>>>> Failed to set power off: org.freedesktop.DBus.Error.NoReply.
> >>>>>
> >>>>> Signed-off-by: Cheng Jiang <quic_chejiang@xxxxxxxxxxx>
> >>>>> ---
> >>>>>  include/net/bluetooth/hci_core.h |  2 ++
> >>>>>  net/bluetooth/hci_conn.c         |  9 +++++++++
> >>>>>  net/bluetooth/hci_event.c        |  9 +++++++++
> >>>>>  net/bluetooth/hci_sync.c         | 18 ++++++++++++++++++
> >>>>>  4 files changed, 38 insertions(+)
> >>>>>
> >>>>> diff --git a/include/net/bluetooth/hci_core.h b/include/net/bluetooth/hci_core.h
> >>>>> index 734cd50cd..2ab079dcf 100644
> >>>>> --- a/include/net/bluetooth/hci_core.h
> >>>>> +++ b/include/net/bluetooth/hci_core.h
> >>>>> @@ -753,6 +753,8 @@ struct hci_conn {
> >>>>>
> >>>>>         struct bt_codec codec;
> >>>>>
> >>>>> +       struct completion disc_ev_comp;
> >>>>> +
> >>>>>         void (*connect_cfm_cb)  (struct hci_conn *conn, u8 status);
> >>>>>         void (*security_cfm_cb) (struct hci_conn *conn, u8 status);
> >>>>>         void (*disconn_cfm_cb)  (struct hci_conn *conn, u8 reason);
> >>>>> diff --git a/net/bluetooth/hci_conn.c b/net/bluetooth/hci_conn.c
> >>>>> index d097e308a..e0244e191 100644
> >>>>> --- a/net/bluetooth/hci_conn.c
> >>>>> +++ b/net/bluetooth/hci_conn.c
> >>>>> @@ -1028,6 +1028,15 @@ static struct hci_conn *__hci_conn_add(struct hci_dev *hdev, int type, bdaddr_t
> >>>>>
> >>>>>         hci_conn_init_sysfs(conn);
> >>>>>
> >>>>> +       /* This disc_ev_comp is inited when we send a disconnect request to
> >>>>> +        * the remote device but fail to receive the disconnect complete
> >>>>> +        * event within the expected time (2 seconds). This occurs because
> >>>>> +        * the remote device doesn't ack the terminate indication, forcing
> >>>>> +        * the controller to wait for the supervision timeout.
> >>>>> +        */
> >>>>> +       init_completion(&conn->disc_ev_comp);
> >>>>> +       complete(&conn->disc_ev_comp);
> >>>>> +
> >>>>>         return conn;
> >>>>>  }
> >>>>>
> >>>>> diff --git a/net/bluetooth/hci_event.c b/net/bluetooth/hci_event.c
> >>>>> index 2cc7a9306..60ecb2b18 100644
> >>>>> --- a/net/bluetooth/hci_event.c
> >>>>> +++ b/net/bluetooth/hci_event.c
> >>>>> @@ -3366,6 +3366,15 @@ static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data,
> >>>>>         if (!conn)
> >>>>>                 goto unlock;
> >>>>>
> >>>>> +       /* Wake up disc_ev_comp here is ok. Since we hold the hdev lock
> >>>>> +        * hci_abort_conn_sync will wait hdev lock release to continue.
> >>>>> +        */
> >>>>> +       if (!completion_done(&conn->disc_ev_comp)) {
> >>>>> +               complete(&conn->disc_ev_comp);
> >>>>> +               /* Add some delay for hci_abort_conn_sync to handle the complete */
> >>>>> +               usleep_range(100, 1000);
> >>>>> +       }
> >>>>> +
> >>>>>         if (ev->status) {
> >>>>>                 mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
> >>>>>                                        conn->dst_type, ev->status);
> >>>>> diff --git a/net/bluetooth/hci_sync.c b/net/bluetooth/hci_sync.c
> >>>>> index 0badec712..783d04b57 100644
> >>>>> --- a/net/bluetooth/hci_sync.c
> >>>>> +++ b/net/bluetooth/hci_sync.c
> >>>>> @@ -5590,6 +5590,24 @@ int hci_abort_conn_sync(struct hci_dev *hdev, struct hci_conn *conn, u8 reason)
> >>>>>                 break;
> >>>>>         }
> >>>>>
> >>>>> +       /* Check whether the connection is successfully disconnected.
> >>>>> +        * Sometimes the remote device doesn't acknowledge the
> >>>>> +        * LL_TERMINATE_IND in time, requiring the controller to wait
> >>>>> +        * for the supervision timeout, which may exceed 2 seconds. In
> >>>>> +        * this case, we need to wait for the HCI_EV_DISCONN_COMPLETE
> >>>>> +        * event before cleaning up the connection.
> >>>>> +        */
> >>>>> +       if (err == -ETIMEDOUT) {
> >>>>> +               u32 idle_delay = msecs_to_jiffies(10 * conn->le_supv_timeout);
> >>>>> +
> >>>>> +               reinit_completion(&conn->disc_ev_comp);
> >>>>> +               if (!wait_for_completion_timeout(&conn->disc_ev_comp, idle_delay)) {
> >>>>> +                       bt_dev_warn(hdev, "Failed to get complete");
> >>>>> +                       mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
> >>>>> +                                              conn->dst_type, conn->abort_reason);
> >>>>> +               }
> >>>>> +       }
> >>>>
> >>>> Why don't we just set the supervision timeout as timeout then? If we
> >>>> will have to wait for it anyway just change hci_disconnect_sync to use
> >>>> 10 * conn->le_supv_timeout as timeout instead.
> >>>>
> >> hci_disconnect_sync uses __hci_cmd_sync_status_sk to wait for the
> >> HCI_EV_DISCONN_COMPLETE event, which will send the command in hci_cmd_work.
> >> In hci_cmd_work, it will start a timer with HCI_CMD_TIMEOUT(2s) to wait
> >> for the event. So even in hci_disconnect_sync we can set the timeout to
> >> supervision timeout, hci_disconnect_sync still timeout after 2s.
> >
> > Wait, why are you talking about HCI_CMD_TIMEOUT when I told you to use
> > the supervision timeout instead? If it still timeout after to 2
> > seconds then there is something still forcing HCI_CMD_TIMEOUT which
> > shouldn't happen.
> >
> Since the lower layer (hci_cmd_work) has set the timeout to HCI_CMD_TIMEOUT, so
> even the upper layer set to a larger timeout value, like supervision timeout,
> it still get the timeout after HCI_CMD_TIMEOUT. The timeout flow is:
> hci_disconnect_sync -> __hci_cmd_sync_sk(wait_event_interruptible_timeout) ->
> hci_cmd_work -> hci_cmd_timeout -> hci_cmd_sync_cancel_sync -> wake up the
> wait_event_interruptible_timeout in __hci_cmd_sync_sk -> hci_disconnect_sync timeout.
>
> So even if we set a large timeout value in hci_disconnect_sync, it doesn't work
> since it's waked up by other event, not the real timeout.
>
> What's more, in the hci_disconnect_sync, if the reason it not power_off, it waits
> for the disconnect complete event rather than command status event. According to
> BT core spec, disconnect complete event has to wait for remote's ack or wait until
> supervision timeout. It's a valid case that the disconnect complete event taking
> more than 2s.

You seems to be confusing the role of 2 different timers:

    err = wait_event_interruptible_timeout(hdev->req_wait_q,
                           hdev->req_status != HCI_REQ_PEND,
                           timeout);

and

            queue_delayed_work(hdev->workqueue, &hdev->cmd_timer,
                       HCI_CMD_TIMEOUT);

The former waits for a specific event, while the later waits for
handle_cmd_cnt_and_timer, each can have a distinct timeout as in the
code bellow:

    return __hci_cmd_sync_status_sk(hdev, HCI_OP_LE_EXT_CREATE_CONN,
                    plen, data,
                    HCI_EV_LE_ENHANCED_CONN_COMPLETE,
                    conn->conn_timeout, NULL);

The reason there are 2 timers is that we need to track the number of
commands outstanding in the controller, and no you can't delay Command
Status:

When the Controller receives the HCI_Disconnect command, it _shall_ send the
HCI_Command_Status event to the Host.

So even if HCI_Disconnection_Complete is delayed the following shall
still work provided the HCI_Command_Status is still being generated
accordingly:

index 0badec7120ab..0ab607fb6433 100644
--- a/net/bluetooth/hci_sync.c
+++ b/net/bluetooth/hci_sync.c
@@ -5444,7 +5444,7 @@ static int hci_disconnect_sync(struct hci_dev
*hdev, struct hci_conn *conn,
                return __hci_cmd_sync_status_sk(hdev, HCI_OP_DISCONNECT,
                                                sizeof(cp), &cp,
                                                HCI_EV_DISCONN_COMPLETE,
-                                               HCI_CMD_TIMEOUT, NULL);
+                                               conn->disc_timeout, NULL);

        return __hci_cmd_sync_status(hdev, HCI_OP_DISCONNECT, sizeof(cp), &cp,
                                     HCI_CMD_TIMEOUT);

Then we need to adjust conn->disc_timeout according to supervision
timeout, that said it appears disc_timeout is actually acting as idle
timeout, so perhaps we need a separate field, also it doesn't look
like we track the supervision timeout for non-LE links.

> >>>> That said, we really need to fix bluetoothd if it is not able to be
> >>>> cleaned up if SET_POWERED command fails, but it looks like it is
> >>>> handling errors correctly so it sounds like something else is at play.
> >>>
> >> The issue arises after a 2-second timeout of hci_disconnect_sync. During
> >> hci_abort_conn_sync, the connection is cleaned up by hci_conn_failed.
> >> after supervision timeout, the disconnect complete event arrives, but
> >> it returns at line 3370 since the connection has already been removed.
> >> As a result, bluetoothd does not send the mgmt event for MGMT_OP_DISCONNECT
> >> to the application layer (bluetoothctl), causing bluetoothctl to get stuck
> >> and unable to perform other mgmt commands.
> >
> > The command shall have completed regardless if disconnect complete has
> > been received or not, the is the whole point of having a timeout, so
> > this makes no sense to me, or you are not describing what is happening
> > here. Also there is no MGMT_OP_DISCONNECT pending, it is
> > MGMT_OP_SET_POWERED, if you are talking about the DISCONNECTED event
> > that is a totally different thing and perhaps that is the source of
> > the problem because if we do cleanup hci_conn even in case the command
> > fails/times out then we should be generating a disconnected event as
> > well.
> >
> Here is the flow describe the issue. For normal case:
> ┌───────────┐          ┌──────────┐           ┌──────┐           ┌──────────┐           ┌──────┐
> │bluetootctl│          │bluetoothd│           │kernel│           │controller│           │remote│
> └─────┬─────┘          └─────┬────┘           └───┬──┘           └─────┬────┘           └───┬──┘
>       │   disconnect cmd     │                    │                    │                    │
>       │─────────────────────>│                    │                    │                    │
>       │                      │                    │                    │                    │
>       │                      │    MGMT_OP         │                    │                    │
>       │                      │    _DISCONNECT     │                    │                    │
>       │                      │───────────────────>│                    │                    │
>       │                      │                    │                    │                    │
>       │                      │                    │  HCI_Disconnect    │                    │
>       │                      │                    │───────────────────>│                    │
>       │                      │                    │                    │                    │
>       │                      │                    │                    │   LL_TERMINATE     │
>       │                      │                    │                    │   _IND             │
>       │                      │                    │                    │───────────────────>│
>       │                      │                    │                    │                    │
>       │                      │                    │                    │        ACK         │
>       │                      │                    │                    │<───────────────────│
>       │                      │                    │                    │                    │
>       │                      │                    │   Disc_Comp_Evt    │                    │
>       │                      │                    │<───────────────────│                    │
>       │                      │                    │                    │                    │
>       │                      │   MGMT Response    │                    │                    │
>       │                      │<───────────────────│                    │                    │
>       │                      │                    │                    │                    │
>       │      disc succ       │                    │                    │                    │
>       │<─────────────────────│                    │                    │                    │
> ┌─────┴─────┐          ┌─────┴────┐           ┌───┴──┐           ┌─────┴────┐           ┌───┴──┐
> │bluetootctl│          │bluetoothd│           │kernel│           │controller│           │remote│
> └───────────┘          └──────────┘           └──────┘           └──────────┘           └──────┘
>
>
> The failure case like this:
>
> ┌───────────┐          ┌──────────┐           ┌──────┐            ┌──────────┐           ┌──────┐
> │bluetootctl│          │bluetoothd│           │kernel│            │controller│           │remote│
> └─────┬─────┘          └─────┬────┘           └───┬──┘            └─────┬────┘           └───┬──┘
>       │     disconnect       │                    │                     │                    │
>       │     cmd              │                    │                     │                    │
>       │─────────────────────>│                    │                     │                    │
>       │                      │                    │                     │                    │
>       │                      │    MGMT_OP_        │                     │                    │
>       │                      │    DISCONNECT      │                     │                    │
>       │                      │───────────────────>│                     │                    │
>       │                      │                    │                     │                    │
>       │                      │                   ┌┴┐     HCI_           │                    │
>       │                      │                   │ │     Disconnect     │                    │
>       │                      │                   │ │ ──────────────────>│                    │
>       │                      │                   │ │                    │                    │
>       │                      │                   │ │                    │  LL_TERMINATE     ┌┴┐
>       │                      │                   │ │                    │  _IND             │ │
>       │                      │                   │ │                    │─────────────────> │ │
>       │                      │                   │ │                    │                   │ │
>       │                      │                   │ │ 2s                 │                   │ │
>       │                      │                   │ │                    │                   │ │
>       │                      │                   │ │                    │                   │ │ More
>       │                      │                   │ │                    │                   │ │ than
>       │                      │                   │ │                    │                   │ │ 2s
>       │                      │                   │ │                    │                   │ │
>       │                      │                   │ │                    │                   │ │
>       │                      │                   │ │                    │                   │ │
>       │                      │                   └┬┘                    │                   │ │
>       │                      │                    │────┐                │                   │ │
>       │                      │                    │    │ hci_disconnect │                   │ │
>       │                      │                    │<───┘ sync timeout,  │                   │ │
>       │                      │                    │      del 'con' by   │                   │ │
>       │                      │                    │      hci_conn_failed│                   │ │
>       │                      │                    │                     │                   └┬┘
>       │                      │                    │                     │        ACK         │
>       │                      │                    │                     │<───────────────────│
>       │                      │                    │                     │                    │
>       │                      │                    │   Disc_Comp_Evt     │                    │
>       │                      │                    │<────────────────────│                    │
>       │                      │                    │                     │                    │
>       │                      │                    │────┐                │                    │
>       │                     \│/                   │    │ ignore the     │                    │
>       │                      X                    │<───┘ event since    │                    │
>       │                     /│\                   │      'con' has been │                    │
>       │                      │     MGMT           │      deleted        │                    │
>       │                      │     Response       │                     │                    │
>       │                      │<─ ─ ─ ─ ─ ─ ─ ─ ─ ─│                     │                    │
> ┌─────┴─────┐          ┌─────┴────┐           ┌───┴──┐            ┌─────┴────┐           ┌───┴──┐
> │bluetootctl│          │bluetoothd│           │kernel│            │controller│           │remote│
> └───────────┘          └──────────┘           └──────┘            └──────────┘           └──────┘
>
>
> So in the failure case, the bluetoothd is blocked by waiting the mgmt response from kernel. From
> our test, bluetoothd can't accept any command related to mgmt from bluetothctl.
>
> >>
> >> 3355 static void hci_disconn_complete_evt(struct hci_dev *hdev, void *data,
> >> 3356              struct sk_buff *skb)
> >> 3357 {
> >> 3358   struct hci_ev_disconn_complete *ev = data;
> >> 3359   u8 reason;
> >> 3360   struct hci_conn_params *params;
> >> 3361   struct hci_conn *conn;
> >> 3362   bool mgmt_connected;
> >> 3363
> >> 3364   bt_dev_dbg(hdev, "status 0x%2.2x", ev->status);
> >> 3365
> >> 3366   hci_dev_lock(hdev);
> >> 3367
> >> 3368   conn = hci_conn_hash_lookup_handle(hdev, __le16_to_cpu(ev->handle));
> >> 3369   if (!conn)
> >> 3370     goto unlock;
> >> 3371
> >> 3372   if (ev->status) {
> >> 3373     mgmt_disconnect_failed(hdev, &conn->dst, conn->type,
> >> 3374                conn->dst_type, ev->status);
> >> 3375     goto unlock;
> >> 3376   }
> >> 3377
> >> 3378   conn->state = BT_CLOSED;
> >> 3379
> >> 3380   mgmt_connected = test_and_clear_bit(HCI_CONN_MGMT_CONNECTED, &conn->flags);
> >> 3381
> >>
> >>> I double checked this and apparently this could no longer fail:
> >>>
> >>> +               /* Disregard possible errors since hci_conn_del shall have been
> >>> +                * called even in case of errors had occurred since it would
> >>> +                * then cause hci_conn_failed to be called which calls
> >>> +                * hci_conn_del internally.
> >>> +                */
> >>> +               hci_abort_conn_sync(hdev, conn, reason);
> >>>
> >>> So it will clean up the hci_conn no matter what is the timeout, so
> >>> either you don't have this change or it is not working for some
> >>> reason.
> >>>
> >> The issue is mgmt command is not repsonsed by bluetoothd, then the bluetootlctl is
> >> blocked. It does not happen during the power off stage. It happened when disconnect
> >> a BLE device, but the disconnect complete event sent from controller to host 2s later.
> >> Then it causes the mgmt in bluetoothctl is blocked as decribed as above.
> >
> > There is a difference about a MGMT command, initiated by bluetoothd,
> > versus a MGMT event initiated by the kernel, so the daemon is not
> > blocked it just don't get a disconnection event, which is different
> > than a command complete.
> >
> > What is the command sequence that you use to reproduce the problem?
> Here is the command log:
> [CHG] Controller 8C:FD:F0:21:81:87 Pairable: yes
> [bluetooth]# power on
> Changing power on succeeded
> [bluetooth]# connect CF:90:67:3C:7A:56
> Attempting to connect to CF:90:67:3C:7A:56
> [CHG] Device CF:90:67:3C:7A:56 Connected: yes
> Connection successful
> [CHG] Device CF:90:67:3C:7A:56 ServicesResolved: yes
> [Designer Mouse]#
> [Designer Mouse]# disconnect D9:B5:6C:F2:51:91
> Attempting to disconnect from D9:B5:6C:F2:51:91 ## no disconnection success response
> [Designer Mouse]#
> [Designer Mouse]# power off
> [Designer Mouse]#
> Failed to set power off: org.freedesktop.DBus.Error.NoReply
>
> To easily reproduce this issue, we use a firmware which always send the disconnect
> complete event more than 2s. Then the issue occurred 100%.
>
> Actually, the root cause is the hci_disconnect_sync doesn't handle this case since it
> relies on __hci_cmd_sync_status_sk, which maximum timeout value is constrained to 2s.
>
> >
> >>>>>         hci_dev_lock(hdev);
> >>>>>
> >>>>>         /* Check if the connection has been cleaned up concurrently */
> >>>>>
> >>>>> base-commit: e25c8d66f6786300b680866c0e0139981273feba
> >>>>> --
> >>>>> 2.34.1
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Luiz Augusto von Dentz
> >>>
> >>>
> >>>
> >>
> >
> >
>


-- 
Luiz Augusto von Dentz





[Index of Archives]     [Bluez Devel]     [Linux Wireless Networking]     [Linux Wireless Personal Area Networking]     [Linux ATH6KL]     [Linux USB Devel]     [Linux Media Drivers]     [Linux Audio Users]     [Linux Kernel]     [Linux SCSI]     [Big List of Linux Books]

  Powered by Linux