Lukas Bulwahn <lukas.bulwahn@xxxxxxxxx> writes: > On Mon, 27 Jan 2020, Jouni Högander wrote: > >> Lukas Bulwahn <lukas.bulwahn@xxxxxxxxx> writes: >> >> > On Wed, 22 Jan 2020, Jouni Högander wrote: >> > >> >> Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx> writes: >> >> >> > Now queued up, I'll push out -rc2 versions with this fix. >> >> >> > >> >> >> > greg k-h >> >> >> >> >> >> We have also been informed about another regression these two commits >> >> >> are causing: >> >> >> >> >> >> https://lore.kernel.org/lkml/ace19af4-7cae-babd-bac5-cd3505dcd874@xxxxxxxxxxxxxxxxxxx/ >> >> >> >> >> >> I suggest to drop these two patches from this queue, and give us a >> >> >> week to shake out the regressions of the change, and once ready, we >> >> >> can include the complete set of fixes to stable (probably in a week or >> >> >> two). >> >> > >> >> > Ok, thanks for the information, I've now dropped them from all of the >> >> > queues that had them in them. >> >> > >> >> > greg k-h >> >> >> >> I have now run more extensive Syzkaller testing on following patches: >> >> >> >> cb626bf566eb net-sysfs: Fix reference count leak >> >> ddd9b5e3e765 net-sysfs: Call dev_hold always in rx_queue_add_kobject >> >> e0b60903b434 net-sysfs: Call dev_hold always in netdev_queue_add_kobje >> >> 48a322b6f996 net-sysfs: fix netdev_queue_add_kobject() breakage >> >> b8eb718348b8 net-sysfs: Fix reference count leak in rx|netdev_queue_add_kobject >> >> >> >> These patches are fixing couple of memory leaks including this one found >> >> by Syzbot: https://syzkaller.appspot.com/bug?extid=ad8ca40ecd77896d51e2 >> >> >> >> I can reproduce these memory leaks in following stable branches: 4.14, >> >> 4.19, and 5.4. >> >> >> >> These are all now merged into net/master tree and based on my testing >> >> they are ready to be taken into stable branches as well. >> >> >> > >> > + syzkaller list >> > Jouni et. al, please drop Linus in further responses; Linus, it was wrong >> > to add you to this thread in the first place (reason is explained below) >> > >> > Jouni, thanks for investigating. >> > >> > It raises the following questions and comments: >> > >> > - Does the memory leak NOT appear on 4.9 and earlier LTS branches (or did >> > you not check that)? If it does not appear, can you bisect it with the >> > reproducer to the commit between 4.14 and 4.9? >> >> I tested and these memory leaks are not reproucible in 4.9 and earlier. >> >> > >> > - Do the reproducers you found with your syzkaller testing show the same >> > behaviour (same bisection) as the reproducers from syzbot? >> >> Yes, they are same. >> >> > >> > - I fear syzbot's automatic bisection on is wrong, and Linus' commit >> > 0e034f5c4bc4 ("iwlwifi: fix mis-merge that breaks the driver") is not to >> > blame here; that commit did not cause the memory leak, but fixed some >> > unrelated issue that simply confuses syzbot's automatic bisection. >> > >> > Just FYI: Dmitry Vyukov's evaluation of the syzbot bisection shows that >> > about 50% are wrong, e.g., due to multiple bugs being triggered with one >> > reproducer and the difficulty of automatically identifying them of being >> > different due to different root causes (despite the smart heuristics of >> > syzkaller & syzbot). So, to identify the actual commit on which the memory >> > leak first appeared, you need to bisect manually with your own judgement >> > if the reported bug stack trace fits to the issue you investigating. Or >> > you use syzbot's automatic bisection but then with a reduced kernel config >> > that cannot be confused by other issues. You might possibly also hit a >> > "beginning of time" in your bisection, where KASAN was simply not >> > supported, then the initially causing commit can simply not determined by >> > bisection with the reproducer and needs some code inspection and >> > archaeology with git. Can you go ahead try to identify the correct commit >> > for this issue? >> >> These two commits (that are not in 4.9 and earlier) are intorducing these leaks: >> >> commit e331c9066901dfe40bea4647521b86e9fb9901bb >> Author: YueHaibing <yuehaibing@xxxxxxxxxx> >> Date: Tue Mar 19 10:16:53 2019 +0800 >> >> net-sysfs: call dev_hold if kobject_init_and_add success >> >> [ Upstream commit a3e23f719f5c4a38ffb3d30c8d7632a4ed8ccd9e ] >> >> In netdev_queue_add_kobject and rx_queue_add_kobject, >> if sysfs_create_group failed, kobject_put will call >> netdev_queue_release to decrease dev refcont, however >> dev_hold has not be called. So we will see this while >> unregistering dev: >> >> unregister_netdevice: waiting for bcsh0 to become free. Usage count = -1 >> >> Reported-by: Hulk Robot <hulkci@xxxxxxxxxx> >> Fixes: d0d668371679 ("net: don't decrement kobj reference count on init fail >> ure") >> Signed-off-by: YueHaibing <yuehaibing@xxxxxxxxxx> >> Signed-off-by: David S. Miller <davem@xxxxxxxxxxxxx> >> Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx> >> >> commit d0d6683716791b2a2761a1bb025c613eb73da6c3 >> Author: stephen hemminger <stephen@xxxxxxxxxxxxxxxxxx> >> Date: Fri Aug 18 13:46:19 2017 -0700 >> >> net: don't decrement kobj reference count on init failure >> >> If kobject_init_and_add failed, then the failure path would >> decrement the reference count of the queue kobject whose reference >> count was already zero. >> >> Fixes: 114cf5802165 ("bql: Byte queue limits") >> Signed-off-by: Stephen Hemminger <sthemmin@xxxxxxxxxxxxx> >> Signed-off-by: David S. Miller <davem@xxxxxxxxxxxxx> >> > > But, it seems that we now have just a long sequences of fix patches. > > This commit from 2011 seems to be the initial buggy one: > > commit 114cf5802165ee93e3ab461c9c505cd94a08b800 > Author: Tom Herbert <therbert@xxxxxxxxxx> > Date: Mon Nov 28 16:33:09 2011 +0000 > > bql: Byte queue limits > > And then we just have fixes over fixes: > > 114cf5802165ee93e3ab461c9c505cd94a08b800 > fixed by d0d6683716791b2a2761a1bb025c613eb73da6c3 > fixed by a3e23f719f5c4a38ffb3d30c8d7632a4ed8ccd9e > fixed by the sequence of your five patches, mentioned above > > > If that is right, we should be able to find a reproducer with syzkaller on > the versions before d0d668371679 ("net: don't decrement kobj reference > count on init failure") with fault injection enabled or some manually > injected fault by modifying the source code to always fail on init to > really trigger the init failure, and see the reference count go below > zero. > > All further issues should also have reproducers found with syzkaller. > If we have a good feeling on the reproducers and this series of fixes > really fixed the issue now here for all cases, we should suggest to > backport all of the fixes to 4.4 and 4.9. > > We should NOT just have Greg pick up a subset of the patches and backport > them to 4.4 and 4.9, that will likely break more than it fixes. Yes, this is the case. > > Jouni, did you see Greg's bot inform you that he would pick up your latest > patch for 4.4 and 4.9? Please respond to those emails to make sure a > complete set of patches is picked up, which we tested with all those > intermediate reproducers and an extensive syzkaller run hitting the > net-sysfs interface (e.g., by configuring the corpus and check > coverage). I already responded to not pick these patches into 4.4 and 4.9. > > If you cannot do this testing for 4.4 and 4.9 now quickly (you > potentially have less than 24 hours), we should hold those new patches > back for 4.4 and 4.9, as none of the fixes seem to be applied at all right > now and the users have not complained yet on 4.4 and 4.9. > Once testing of the whole fix sequence is done, we request to backport all > patches at once for 4.4 and 4.9. If we want to pick whole set including older patches I think I need more time for identifying which older patches (apart from these two I identified causing the memory leak) should be taken in and for testing. > > Lukas BR, Jouni Högander