Hi Jonas, On Mon, Feb 12, 2024 at 7:29 PM Jonas Dreßler <verdre@xxxxxxx> wrote: > > Hi Luiz, > > On 09.02.24 14:36, Luiz Augusto von Dentz wrote: > > Hi Jonas, > > > > On Fri, Feb 9, 2024 at 7:37 AM Jonas Dreßler <verdre@xxxxxxx> wrote: > >> > >> Hi everyone! > >> > >> On 08.02.24 16:32, syzbot wrote: > >>> syzbot has bisected this issue to: > >>> > >>> commit 456561ba8e495e9320c1f304bf1cd3d1043cbe7b > >>> Author: Jonas Dreßler <verdre@xxxxxxx> > >>> Date: Tue Feb 6 11:08:13 2024 +0000 > >>> > >>> Bluetooth: hci_conn: Only do ACL connections sequentially > >>> > >>> bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=154f8550180000 > >>> start commit: b1d3a0e70c38 Add linux-next specific files for 20240208 > >>> git tree: linux-next > >>> final oops: https://syzkaller.appspot.com/x/report.txt?x=174f8550180000 > >>> console output: https://syzkaller.appspot.com/x/log.txt?x=134f8550180000 > >>> kernel config: https://syzkaller.appspot.com/x/.config?x=bb693ba195662a06 > >>> dashboard link: https://syzkaller.appspot.com/bug?extid=3f0a39be7a2035700868 > >>> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=11d95147e80000 > >>> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=107c2d8fe80000 > >>> > >>> Reported-by: syzbot+3f0a39be7a2035700868@xxxxxxxxxxxxxxxxxxxxxxxxx > >>> Fixes: 456561ba8e49 ("Bluetooth: hci_conn: Only do ACL connections sequentially") > >>> > >>> For information about bisection process see: https://goo.gl/tpsmEJ#bisection > >> > >> Hmm, looking at the backtraces, I think the issue that the introduction of the > >> sequential connect has introduced another async case: In hci_connect_acl(), when > >> we call hci_acl_create_connection_sync(), the conn state no longer immediately > >> gets set to BT_CONNECT, but remains in BT_OPEN or BT_CLOSED until the hci_sync > >> queue actually executes __hci_acl_create_connection_sync(). > > > > Need to double check but I think we do set BT_CONNECT in case of LE > > when it is queued so which shall prevent it to be queued multiple > > times. > > Nope, we set it only from within the hci_sync callback, see > hci_connect_le_sync(). IMO that makes sense, because in > hci_abort_conn_sync(), we send a HCI_OP_CREATE_CONN_CANCEL in case > of conn->state == BT_CONNECT, and this OP we wouldn't want to send > while the command is still waiting in the queue. Yep, well I did a change which should work regardless of BT_CONNECT only being set at the callback: https://patchwork.kernel.org/project/bluetooth/patch/20240209141612.3554051-1-luiz.dentz@xxxxxxxxx/ > > > >> This means that now hci_connect_acl() is happy to do multiple > >> hci_acl_create_connection_sync calls, and the hci_sync machinery will happily > >> execute them right after each other. Then the newly introduced hci_abort_conn_sync() > >> in __hci_acl_create_connection_sync() calls hci_conn_del() and frees the conn > >> object, so the second time we enter __hci_acl_create_connection_sync(), > >> things blow up. > >> > >> It looks to me like in theory the hci_connect_le_sync() logic is prone to a > >> similar issue, but in practice that's prohibited because in hci_connect_le_sync() > >> we lookup whether the conn object still exists and bail out if it doesn't. > >> > >> Even for LE though I think we can queue multiple hci_connect_le_sync() calls > >> and those will happily send HCI_OP_LE_CREATE_CONN no matter what the connection > >> state actually is? > >> > >> So assuming this analysis is correct, what do we do to fix this? It seems to me > >> that > >> > >> 1) we want a BT_CONNECT_QUEUED state for connections, so that the state > >> machine covers this additional stage that we have for ACL and LE connections now. > > > > BT_CONNECT already indicates that connection procedure is in progress. > > I still think an additional state is necessary. Alternatively (and maybe > a lot nicer than the extra state) would be to add some functions to hci_sync > to check for and remove queued/ongoing commands, I'm thinking of a new > > bool hci_cmd_sync_has(struct hci_dev *hdev, hci_cmd_sync_work_func_t func, void *data); > void hci_cmd_sync_cancel(struct hci_dev *hdev, hci_cmd_sync_work_func_t func, void *data); > > I think if we had those, in addition to not needing the additional state, > we could also simplify the hci_abort_conn() code quite a bit and possibly > get rid of the passing of connection handles to hci_connect_le_sync(). > > I'll give that a try and propose a small patch tomorrow. Yeah, I did something like in the past for hci_cmd_sync_queue_once, I will resend it since it had the following function as well: bool hci_cmd_sync_lookup(struct hci_dev *hdev, hci_cmd_sync_work_func_t func, void *data, hci_cmd_sync_work_destroy_t destroy, bool cancel); > Cheers, > Jonas > > > > >> 2) the conn object can still disappear while the __hci_acl_create_connection_sync() > >> is queued, so we need something like the "if conn doesn't exist anymore, bail out" > >> check from hci_connect_le_sync() in __hci_acl_create_connection_sync(), too. > > > > Btw, I'd probably clean up the connect function and create something > > like hci_connect/hci_connect_sync which takes care of the details > > internally like it was done to abort. > > > >> That said, the current check in hci_connect_le_sync() that's using the connection > >> handle to lookup the conn does not seem great, aren't these handles re-used > >> after connections are torn down? > > > > Well we could perhaps do a lookup by pointer to see if the connection > > hasn't been removed in the meantime, that said to force a clash on the > > handles it need to happen in between abort, which frees the handle, > > and connect, anyway the real culprit here is that we should be able to > > abort the cmd_sync callback like we do in LE: > > > > https://github.com/bluez/bluetooth-next/blob/master/net/bluetooth/hci_conn.c#L2943 > > > > That way we stop the connect callback to run and don't have to worry > > about handle re-use. > > > >> Cheers, > >> Jonas > > > > > > -- Luiz Augusto von Dentz