Hello Luiz,
Thanks for your review.
On 05/09/2023 22:42, Luiz Augusto von Dentz wrote:
Hi Olivier,
On Mon, Sep 4, 2023 at 3:12 PM Olivier L'Heureux
<olivier.lheureux@xxxxxxx> wrote:
Request for Comments
[...]
The "ble-memleak-repro" program reproduces the memory leak, if the
kernel is not patched. Its source is in
"package/ble-memleak-repro/ble-memleak-repro.c" [18].
We should probably create a test in l2cap-tester using SO_SNDTIMEO
first, so we can make sure CI test such case and we are able to detect
if the problem is reintroduced later:
https://github.com/bluez/bluez/blob/master/tools/l2cap-tester.c
I didn't know about "l2cap-tester.c". Agree, it would be great to
integrate my test app into it. I could try, but I don't know the test
framework yet.
Memory Leak Fix
===============
We have fixed the memory leak, see the patch series in
"patches/linux/":
1. The first patch
"0001-ARM-dts-stm32-Add-Bluetooth-usart2-feature-on-stm32m.patch" [11]
enables Bluetooth on the DK2.
This above should probably be sent separately.
2. The second patch
"0002-Bluetooth-add-many-traces-for-allocation-free-refcou.patch" [12]
adds many dynamic debug traces that help understand the kernel
behaviour and freeing problems.
I'm fine with this change, but we better use the likes of bt_dev_dbg
instead of BT_DBG.
This commit intended to increase the visibility during debugging, and
I was not intending it for a permanent presence in the kernel.
But if you find it useful, I can submit a patch RFC v2 with
"bt_dev_dbg()" instead of "BT_DBG()". Note that there is currently no
"bt_dev_dbg()" in "l2cap_core.c" yet.
"0003-Bluetooth-L2CAP-use-refcount-on-struct-l2cap_chan-co.patch" [13]
and
"0004-Bluetooth-L2CAP-free-the-leaking-struct-l2cap_conn.patch" [14]
fix the memory leak.
4. Patches
"0005-Bluetooth-introduce-hci_conn_free-for-better-structu.patch" [15],
"0006-Bluetooth-L2CAP-inc-refcount-if-reuse-struct-l2cap_c.patch" [16]
and
"0007-Bluetooth-unlink-objects-to-avoid-use-after-free.patch" [17]
fixes the use-after-free that appears when the "struct l2cap_conn"
[5] and "struct hci_conn" [6] objects are freed.
These I'm not very comfortable applying as they are, I'm afraid there
could be regressions if they are not accompanied with proper tests,
besides I think there are less intrusive ways to cleanup l2cap_conn,
see below.
The commit messages explain why the commit is needed, which problem
the commit solves, and how.
The first and second patches are present for the memory leak
reproduction only, they should not be part of a final fix.
Patch Status
============
As of writing, the memory leak is fixed. The fix opened the door to
other problems, especially use-after-free, sometimes followed by
crashes due to NULL dereferences. We think there are weak references
(i.e. pointers that don't increment the refcounter) previously
pointing to memory that was never freed, but now is freed. On kernels
v5.13 and v5.15, patches 0005, 0006 and 0007 seem to fix the
use-after-free and NULL dereferences, but not on kernel v6.5 yet.
I think the problem is that the lifetime of l2cap_conn shall be hooked
with hci_chan, but the likes of hci_chan_list_flush -> hci_chan_del
don't actually trigger l2cap_conn_del, which imo is the culprit here,
because connect_cfm/disconnect_cfm is not called when the connection
procedure has been aborted prematurely, so perhaps we shall get rid of
the likes of l2cap_connect_cfm/l2cap_disconn_cfm and instead do the
cleanup with in the following order:
hci_conn_cleanup->hci_chan_list_flush->hci_chan_del->l2cap_conn_del,
that way we avoid having multiple code paths attempting to cleanup
objects associated with hci_conn/hci_chan.
I fully agree with your analysis, which correspond to mine: we should
call "l2cap_conn_del()", it would properly clean the allocations in
"l2cap_conn_add()".
I have tried but it was not obvious to find the right place to call
"l2cap_conn_del()" with the proper locking.
As you write, connect_cfm/disconnect_cfm is not called when the
connection is aborted, and that is the root cause of the memory leak.
Your proposal is most probably the best way to go.
I'd change hci_chan_create to take a del callback to avoid having
circular dependencies on one another though.
Interesting, could you explain how you would do it? Perhaps point on
a callback example?
Reproducing with Buildroot
==========================
We have reproduced and fixed the memory leak with Buildroot [7] and a
"ble-memleak-repro" test application on an ST's Discovery Kit DK2 [4].
The "ble-memleak-repro" repository [1] contains the sources of a
complete external Buildroot customisation [8], with the patches, a
README, and more explanations to reproduce the problem with Buildroot
on a DK2.
References:
===========
- [1]:<https://gitlab.com/essensium-mind/ble-memleak-repro.git>
"ble-memleak-repro repository"
[...]
--
Olivier L'Heureux