Hi Marleen, On Mon, Nov 13, 2023 at 9:23 AM Marleen Vos <marleen.vos@xxxxxxxxxxxxx> wrote: > > Hi Luiz, > > On Tue, Nov 7, 2023 at 4:10 PM Luiz Augusto von Dentz <luiz.dentz@xxxxxxxxx> wrote: >> >> Hi Marleen, >> >> On Tue, Nov 7, 2023 at 3:46 AM Marleen Vos <marleen.vos@xxxxxxxxxxxxx> wrote: >> > >> > Hi Luiz, >> > >> > Because Olivier has been and still is being swamped with other work, I'm >> > kind of trying to take over these patches. So far I can reproduce the >> > memleak on a recent kernel without the patches. >> > >> > Olivier told me you added tests to check for the memleak in >> > l2cap-tester. Can you point me towards more info on how exactly you run >> > these tests, as the info I find on the web is rather minimal. >> > >> > From what I read in the thread, it looks like your tests don't catch >> > the memleak? >> >> I'm currently on vacation, I should be back next week, try checking >> l2cap-tester to emulate the leak as you said it was still >> reproducible. > > > I hope you had a great vacation! > Using git bisect, I found that commit 4ab81f16c68a602b2b69e333ae08d8748a9398de is the one where the leaks stop being reproducable by the method we use. At first sight the changes you made seem unrelated to the memleak, so the main question for us is now: has the resolution been accidental or intentional? So it is not reproducible anymore? > The commit message says: > commit 4ab81f16c68a602b2b69e333ae08d8748a9398de > Author: Luiz Augusto von Dentz <luiz.von.dentz@xxxxxxxxx> > Date: Mon Jun 26 17:25:06 2023 -0700 > > Bluetooth: hci_conn: Consolidate code for aborting connections > > [ Upstream commit a13f316e90fdb1fb6df6582e845aa9b3270f3581 ] > > This consolidates code for aborting connections using > hci_cmd_sync_queue so it is synchronized with other threads, but > because of the fact that some commands may block the cmd_sync_queue > while waiting specific events this attempt to cancel those requests by > using hci_cmd_sync_cancel. > > Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@xxxxxxxxx> > Stable-dep-of: 94d9ba9f9888 ("Bluetooth: hci_sync: Fix UAF in hci_disconnect_all_sync") > Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx> > > Maybe you can shed some light on why this commit seems to fix the leaks? Well I suppose it is due to the cleanup process freeing l2cap_conn in the process, while perhaps the previous code didn't. > Kind regards, > Marleen > >> >> >> > Kind regards, >> > >> > Marleen >> > >> > >> > On 14/09/2023 23:17, Luiz Augusto von Dentz wrote: >> > > Hi Olivier, >> > > >> > > On Wed, Sep 13, 2023 at 3:25 PM Olivier L'Heureux >> > > <olivier.lheureux@xxxxxxx> wrote: >> > >> Hello Luiz, >> > >> >> > >> Thanks for your review. >> > >> >> > >> On 05/09/2023 22:42, Luiz Augusto von Dentz wrote: >> > >>> Hi Olivier, >> > >>> >> > >>> On Mon, Sep 4, 2023 at 3:12 PM Olivier L'Heureux >> > >>> <olivier.lheureux@xxxxxxx> wrote: >> > >>>> Request for Comments >> > >> [...] >> > >>>> The "ble-memleak-repro" program reproduces the memory leak, if the >> > >>>> kernel is not patched. Its source is in >> > >>>> "package/ble-memleak-repro/ble-memleak-repro.c" [18]. >> > >>> We should probably create a test in l2cap-tester using SO_SNDTIMEO >> > >>> first, so we can make sure CI test such case and we are able to detect >> > >>> if the problem is reintroduced later: >> > >>> >> > >>> https://github.com/bluez/bluez/blob/master/tools/l2cap-tester.c >> > >> I didn't know about "l2cap-tester.c". Agree, it would be great to >> > >> integrate my test app into it. I could try, but I don't know the test >> > >> framework yet. >> > >> >> > >>>> Memory Leak Fix >> > >>>> =============== >> > >>>> >> > >>>> We have fixed the memory leak, see the patch series in >> > >>>> "patches/linux/": >> > >>>> >> > >>>> 1. The first patch >> > >>>> "0001-ARM-dts-stm32-Add-Bluetooth-usart2-feature-on-stm32m.patch" [11] >> > >>>> enables Bluetooth on the DK2. >> > >>> This above should probably be sent separately. >> > >>> >> > >>>> 2. The second patch >> > >>>> "0002-Bluetooth-add-many-traces-for-allocation-free-refcou.patch" [12] >> > >>>> adds many dynamic debug traces that help understand the kernel >> > >>>> behaviour and freeing problems. >> > >>> I'm fine with this change, but we better use the likes of bt_dev_dbg >> > >>> instead of BT_DBG. >> > >> This commit intended to increase the visibility during debugging, and >> > >> I was not intending it for a permanent presence in the kernel. >> > >> But if you find it useful, I can submit a patch RFC v2 with >> > >> "bt_dev_dbg()" instead of "BT_DBG()". Note that there is currently no >> > >> "bt_dev_dbg()" in "l2cap_core.c" yet. >> > >> >> > >>>> "0003-Bluetooth-L2CAP-use-refcount-on-struct-l2cap_chan-co.patch" [13] >> > >>>> and >> > >>>> "0004-Bluetooth-L2CAP-free-the-leaking-struct-l2cap_conn.patch" [14] >> > >>>> fix the memory leak. >> > >>>> 4. Patches >> > >>>> "0005-Bluetooth-introduce-hci_conn_free-for-better-structu.patch" [15], >> > >>>> "0006-Bluetooth-L2CAP-inc-refcount-if-reuse-struct-l2cap_c.patch" [16] >> > >>>> and >> > >>>> "0007-Bluetooth-unlink-objects-to-avoid-use-after-free.patch" [17] >> > >>>> fixes the use-after-free that appears when the "struct l2cap_conn" >> > >>>> [5] and "struct hci_conn" [6] objects are freed. >> > >>> These I'm not very comfortable applying as they are, I'm afraid there >> > >>> could be regressions if they are not accompanied with proper tests, >> > >>> besides I think there are less intrusive ways to cleanup l2cap_conn, >> > >>> see below. >> > >>> >> > >>>> The commit messages explain why the commit is needed, which problem >> > >>>> the commit solves, and how. >> > >>>> >> > >>>> The first and second patches are present for the memory leak >> > >>>> reproduction only, they should not be part of a final fix. >> > >>>> >> > >>>> Patch Status >> > >>>> ============ >> > >>>> >> > >>>> As of writing, the memory leak is fixed. The fix opened the door to >> > >>>> other problems, especially use-after-free, sometimes followed by >> > >>>> crashes due to NULL dereferences. We think there are weak references >> > >>>> (i.e. pointers that don't increment the refcounter) previously >> > >>>> pointing to memory that was never freed, but now is freed. On kernels >> > >>>> v5.13 and v5.15, patches 0005, 0006 and 0007 seem to fix the >> > >>>> use-after-free and NULL dereferences, but not on kernel v6.5 yet. >> > >>> I think the problem is that the lifetime of l2cap_conn shall be hooked >> > >>> with hci_chan, but the likes of hci_chan_list_flush -> hci_chan_del >> > >>> don't actually trigger l2cap_conn_del, which imo is the culprit here, >> > >>> because connect_cfm/disconnect_cfm is not called when the connection >> > >>> procedure has been aborted prematurely, so perhaps we shall get rid of >> > >>> the likes of l2cap_connect_cfm/l2cap_disconn_cfm and instead do the >> > >>> cleanup with in the following order: >> > >>> hci_conn_cleanup->hci_chan_list_flush->hci_chan_del->l2cap_conn_del, >> > >>> that way we avoid having multiple code paths attempting to cleanup >> > >>> objects associated with hci_conn/hci_chan. >> > >> I fully agree with your analysis, which correspond to mine: we should >> > >> call "l2cap_conn_del()", it would properly clean the allocations in >> > >> "l2cap_conn_add()". >> > >> I have tried but it was not obvious to find the right place to call >> > >> "l2cap_conn_del()" with the proper locking. >> > >> As you write, connect_cfm/disconnect_cfm is not called when the >> > >> connection is aborted, and that is the root cause of the memory leak. >> > > Btw, I was trying to reproduce it with the following test set, but at >> > > least kmemleak was not able to detect leaks of l2cap_conn: >> > > >> > > https://patchwork.kernel.org/project/bluetooth/list/?series=784343 >> > > >> > >> Your proposal is most probably the best way to go. >> > >> >> > >>> I'd change hci_chan_create to take a del callback to avoid having >> > >>> circular dependencies on one another though. >> > >> Interesting, could you explain how you would do it? Perhaps point on >> > >> a callback example? >> > >> >> > >>>> Reproducing with Buildroot >> > >>>> ========================== >> > >>>> >> > >>>> We have reproduced and fixed the memory leak with Buildroot [7] and a >> > >>>> "ble-memleak-repro" test application on an ST's Discovery Kit DK2 [4]. >> > >>>> >> > >>>> The "ble-memleak-repro" repository [1] contains the sources of a >> > >>>> complete external Buildroot customisation [8], with the patches, a >> > >>>> README, and more explanations to reproduce the problem with Buildroot >> > >>>> on a DK2. >> > >>>> >> > >>>> References: >> > >>>> =========== >> > >>>> >> > >>>> - [1]:<https://gitlab.com/essensium-mind/ble-memleak-repro.git> >> > >>>> "ble-memleak-repro repository" >> > >> [...] >> > >> >> > >> -- >> > >> Olivier L'Heureux >> >> >> >> -- >> Luiz Augusto von Dentz -- Luiz Augusto von Dentz