Hi Olivier, On Wed, Sep 13, 2023 at 3:25 PM Olivier L'Heureux <olivier.lheureux@xxxxxxx> wrote: > > Hello Luiz, > > Thanks for your review. > > On 05/09/2023 22:42, Luiz Augusto von Dentz wrote: > > Hi Olivier, > > > > On Mon, Sep 4, 2023 at 3:12 PM Olivier L'Heureux > > <olivier.lheureux@xxxxxxx> wrote: > >> > >> Request for Comments > [...] > >> > >> The "ble-memleak-repro" program reproduces the memory leak, if the > >> kernel is not patched. Its source is in > >> "package/ble-memleak-repro/ble-memleak-repro.c" [18]. > > > > We should probably create a test in l2cap-tester using SO_SNDTIMEO > > first, so we can make sure CI test such case and we are able to detect > > if the problem is reintroduced later: > > > > https://github.com/bluez/bluez/blob/master/tools/l2cap-tester.c > > I didn't know about "l2cap-tester.c". Agree, it would be great to > integrate my test app into it. I could try, but I don't know the test > framework yet. > > >> Memory Leak Fix > >> =============== > >> > >> We have fixed the memory leak, see the patch series in > >> "patches/linux/": > >> > >> 1. The first patch > >> "0001-ARM-dts-stm32-Add-Bluetooth-usart2-feature-on-stm32m.patch" [11] > >> enables Bluetooth on the DK2. > > > > This above should probably be sent separately. > > > >> 2. The second patch > >> "0002-Bluetooth-add-many-traces-for-allocation-free-refcou.patch" [12] > >> adds many dynamic debug traces that help understand the kernel > >> behaviour and freeing problems. > > > > I'm fine with this change, but we better use the likes of bt_dev_dbg > > instead of BT_DBG. > > This commit intended to increase the visibility during debugging, and > I was not intending it for a permanent presence in the kernel. > But if you find it useful, I can submit a patch RFC v2 with > "bt_dev_dbg()" instead of "BT_DBG()". Note that there is currently no > "bt_dev_dbg()" in "l2cap_core.c" yet. > > >> "0003-Bluetooth-L2CAP-use-refcount-on-struct-l2cap_chan-co.patch" [13] > >> and > >> "0004-Bluetooth-L2CAP-free-the-leaking-struct-l2cap_conn.patch" [14] > >> fix the memory leak. > >> 4. Patches > >> "0005-Bluetooth-introduce-hci_conn_free-for-better-structu.patch" [15], > >> "0006-Bluetooth-L2CAP-inc-refcount-if-reuse-struct-l2cap_c.patch" [16] > >> and > >> "0007-Bluetooth-unlink-objects-to-avoid-use-after-free.patch" [17] > >> fixes the use-after-free that appears when the "struct l2cap_conn" > >> [5] and "struct hci_conn" [6] objects are freed. > > > > These I'm not very comfortable applying as they are, I'm afraid there > > could be regressions if they are not accompanied with proper tests, > > besides I think there are less intrusive ways to cleanup l2cap_conn, > > see below. > > > >> The commit messages explain why the commit is needed, which problem > >> the commit solves, and how. > >> > >> The first and second patches are present for the memory leak > >> reproduction only, they should not be part of a final fix. > >> > >> Patch Status > >> ============ > >> > >> As of writing, the memory leak is fixed. The fix opened the door to > >> other problems, especially use-after-free, sometimes followed by > >> crashes due to NULL dereferences. We think there are weak references > >> (i.e. pointers that don't increment the refcounter) previously > >> pointing to memory that was never freed, but now is freed. On kernels > >> v5.13 and v5.15, patches 0005, 0006 and 0007 seem to fix the > >> use-after-free and NULL dereferences, but not on kernel v6.5 yet. > > > > I think the problem is that the lifetime of l2cap_conn shall be hooked > > with hci_chan, but the likes of hci_chan_list_flush -> hci_chan_del > > don't actually trigger l2cap_conn_del, which imo is the culprit here, > > because connect_cfm/disconnect_cfm is not called when the connection > > procedure has been aborted prematurely, so perhaps we shall get rid of > > the likes of l2cap_connect_cfm/l2cap_disconn_cfm and instead do the > > cleanup with in the following order: > > hci_conn_cleanup->hci_chan_list_flush->hci_chan_del->l2cap_conn_del, > > that way we avoid having multiple code paths attempting to cleanup > > objects associated with hci_conn/hci_chan. > > I fully agree with your analysis, which correspond to mine: we should > call "l2cap_conn_del()", it would properly clean the allocations in > "l2cap_conn_add()". > I have tried but it was not obvious to find the right place to call > "l2cap_conn_del()" with the proper locking. > As you write, connect_cfm/disconnect_cfm is not called when the > connection is aborted, and that is the root cause of the memory leak. Btw, I was trying to reproduce it with the following test set, but at least kmemleak was not able to detect leaks of l2cap_conn: https://patchwork.kernel.org/project/bluetooth/list/?series=784343 > Your proposal is most probably the best way to go. > > > I'd change hci_chan_create to take a del callback to avoid having > > circular dependencies on one another though. > > Interesting, could you explain how you would do it? Perhaps point on > a callback example? > > >> Reproducing with Buildroot > >> ========================== > >> > >> We have reproduced and fixed the memory leak with Buildroot [7] and a > >> "ble-memleak-repro" test application on an ST's Discovery Kit DK2 [4]. > >> > >> The "ble-memleak-repro" repository [1] contains the sources of a > >> complete external Buildroot customisation [8], with the patches, a > >> README, and more explanations to reproduce the problem with Buildroot > >> on a DK2. > >> > >> References: > >> =========== > >> > >> - [1]: <https://gitlab.com/essensium-mind/ble-memleak-repro.git> > >> "ble-memleak-repro repository" > [...] > > -- > Olivier L'Heureux -- Luiz Augusto von Dentz