On Tue, Jun 15, 2021 at 4:30 AM Pavel Skripkin <paskripkin@xxxxxxxxx> wrote: > > On Mon, 14 Jun 2021 23:04:03 +0800 > Dongliang Mu <mudongliangabcd@xxxxxxxxx> wrote: > > > On Mon, Jun 14, 2021 at 10:47 PM Pavel Skripkin > > <paskripkin@xxxxxxxxx> wrote: > > > > > > On Mon, 14 Jun 2021 22:40:55 +0800 > > > Dongliang Mu <mudongliangabcd@xxxxxxxxx> wrote: > > > > > > > On Mon, Jun 14, 2021 at 10:25 PM Pavel Skripkin > > > > <paskripkin@xxxxxxxxx> wrote: > > > > > > > > > > On Mon, 14 Jun 2021 22:19:10 +0800 > > > > > Dongliang Mu <mudongliangabcd@xxxxxxxxx> wrote: > > > > > > > > > > > On Mon, Jun 14, 2021 at 9:34 PM Pavel Skripkin > > > > > > <paskripkin@xxxxxxxxx> wrote: > > > > > > > > > > > > > > On Mon, 14 Jun 2021 21:22:43 +0800 > > > > > > > Dongliang Mu <mudongliangabcd@xxxxxxxxx> wrote: > > > > > > > > > > > > > > > Dear kernel developers, > > > > > > > > > > > > > > > > I was trying to debug the crash - memory leak in > > > > > > > > hwsim_add_one [1] recently. However, I encountered a > > > > > > > > disgusting issue: my breakpoint and printk/pr_alert in the > > > > > > > > functions that will be surely executed do not work. The > > > > > > > > stack trace is in the following. I wrote this email to > > > > > > > > ask for some suggestions on how to debug such cases? > > > > > > > > > > > > > > > > Thanks very much. Looking forward to your reply. > > > > > > > > > > > > > > > > > > > > > > Hi, Dongliang! > > > > > > > > > > > > > > This bug is not similar to others on the dashboard. I spent > > > > > > > some time debugging it a week ago. The main problem here, > > > > > > > that memory allocation happens in the boot time: > > > > > > > > > > > > > > > [<ffffffff84359255>] kernel_init+0xc/0x1a7 > > > > > > > > init/main.c:1447 > > > > > > > > > > > > > > > > > > > Oh, nice catch. No wonder why my debugging does not work. :( > > > > > > > > > > > > > and reproducer simply tries to > > > > > > > free this data. You can use ftrace to look at it. Smth like > > > > > > > this: > > > > > > > > > > > > > > $ echo 'hwsim_*' > $TRACE_DIR/set_ftrace_filter > > > > > > > > > > > > Thanks for your suggestion. > > > > > > > > > > > > Do you have any conclusions about this case? If you have found > > > > > > out the root cause and start writing patches, I will turn my > > > > > > focus to other cases. > > > > > > > > > > No, I had some busy days and I have nothing about this bug for > > > > > now. I've just traced the reproducer execution and that's all :) > > > > > > > > > > I guess, some error handling paths are broken, but Im not sure > > > > > > > > In the beginning, I agreed with you. However, after I manually > > > > checked functions: hwsim_probe (initialization) and hwsim_remove > > > > (cleanup), then things may be different. The cleanup looks > > > > correct to me. I would like to debug but stuck with the debugging > > > > process. > > > > > > > > And there is another issue: the cleanup function also does not > > > > output anything or hit the breakpoint. I don't quite understand > > > > it since the cleanup is not at the boot time. > > > > > > > > Any idea? > > > > > > > > > > Output from ftrace (syzkaller repro): > > > > > > root@syzkaller:~# cat /sys/kernel/tracing/trace > > > # tracer: function_graph > > > # > > > # CPU DURATION FUNCTION CALLS > > > # | | | | | | | > > > 1) | hwsim_del_radio_nl() { > > > 1) | hwsim_del() { > > > 1) | hwsim_edge_unsubscribe_me() { > > > 1) ! 310.041 us | hwsim_free_edge(); > > > 1) ! 665.221 us | } > > > 1) * 52999.05 us | } > > > 1) * 53035.38 us | } > > > > > > Cleanup function is not the case, I think :) > > > > It seems like I spot the incorrect cleanup function (hwsim_remove is > > the right one is in my mind). Let me learn how to use ftrace to log > > the executed functions and then discuss this case with you guys. > > > > Hmmm, I think, there is a mess with lists. > > I just want to share my debug results, I have no idea about the fix for > now. > > In hwsim_probe() edge for phy->idx = 1 is allocated, then reproduces > sends a request to delete phy with idx == 0, so this check in > hwsim_edge_unsubscribe_me(): > > if (e->endpoint->idx == phy->idx) { > ... clean up code ... > } > > won't be passed and edge won't be freed (because it was allocated for > phy with idx == 1). Allocated edge for phy 1 becomes leaked after > hwsim_del(). I can't really see the code where phy with idx == 1 can > be deleted from list... Thanks for sharing your debugging result. hwsim_phys | --------------------------------- | | sub0 (edges) sub1 (edges) ----> e (idx = 1) ----> e (idx = 0) hwsim_del_radio_nl will call hwsim_del to delete phy (idx:1). However, in this function, it only deletes the e in the edge list of sub1. Then it deletes phy (i.e., sub0) from the hwsim_phys list. So it leaves the e in the edge list of sub0 non-free. I proposed a patch and test it successfully in the syzbot dashboard. diff --git a/drivers/net/ieee802154/mac802154_hwsim.c b/drivers/net/ieee802154/mac802154_hwsim.c index da9135231c07..b05159cff33a 100644 --- a/drivers/net/ieee802154/mac802154_hwsim.c +++ b/drivers/net/ieee802154/mac802154_hwsim.c @@ -824,9 +824,16 @@ static int hwsim_add_one(struct genl_info *info, struct device *dev, static void hwsim_del(struct hwsim_phy *phy) { struct hwsim_pib *pib; + struct hwsim_edge *e; hwsim_edge_unsubscribe_me(phy); + // remove the edges in the list + list_for_each_entry_rcu(e, &phy->edges, list) { + list_del_rcu(&e->list); + hwsim_free_edge(e); + } + list_del(&phy->list); rcu_read_lock(); I will send a patch later. > > Maybe, it's kmemleak bug. Similar strange case was with this one > https://syzkaller.appspot.com/bug?id=3a325b8389fc41c1bc94de0f4ac437ed13cce584. > I find it strange, that I could reach leaked pointers after kmemleak reported a > leak. Im not familiar with kmemleak internals and I might be wrong > > > With regards, > Pavel Skripkin