Re: [bisected]kernel BUG at lib/list_debug.c:30! (list_add corruption. prev->next should be nex)

On 11/26/22 17:54, Waiman Long wrote:

On 11/26/22 10:53, Jens Axboe wrote:
On 11/26/22 7:29 AM, Yi Zhang wrote:
Hi Jens
Sorry for the delay as I couldn't reproduce it with the original
for-6.2/block branch.
Finally, I rebased the for-6.2/block branch on 6.1-rc6 and was able to
bisect it:


951d1e94801f95a3fc1c75ff342431c9f519dd14 is the first bad commit
commit 951d1e94801f95a3fc1c75ff342431c9f519dd14
Author: Waiman Long <longman@xxxxxxxxxx>
Date:   Fri Nov 4 20:59:02 2022 -0400

     blk-cgroup: Flush stats at blkgs destruction path

     As noted by Michal, the blkg_iostat_set's in the lockless list
     hold reference to blkg's to protect against their removal. Those
     blkg's hold reference to blkcg. When a cgroup is being destroyed,
     cgroup_rstat_flush() is only called at css_release_work_fn() which is
     called when the blkcg reference count reaches 0. This circular dependency
     will prevent blkcg from being freed until some other events cause
     cgroup_rstat_flush() to be called to flush out the pending blkcg stats.

     To prevent this delayed blkcg removal, add a new cgroup_rstat_css_flush()
     function to flush stats for a given css and cpu and call it at the blkgs
     destruction path, blkcg_destroy_blkgs(), whenever there are still some
     pending stats to be flushed. This will ensure that blkcg reference
     count can reach 0 ASAP.

     Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
     Acked-by: Tejun Heo <tj@xxxxxxxxxx>
     Link: https://lore.kernel.org/r/20221105005902.407297-4-longman@xxxxxxxxxx
     Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>

Waiman, let me know if you have an idea what is going on here and can
send in a fix, or if I need to revert this one. From looking at the
lists of commits after these reports came in, I did suspect this
commit. But I don't know enough about this area to render an opinion
on a fix without spending more time on it.

Sure. I will take a closer look at that. I will let you know the result of my investigation ASAP.

Thanks, Yi, for allowing me to access the system that can reproduce the bug. I found that the panic is fixed by moving the rstat flushing before the destruction of blkgs in blkcg_destroy_blkgs(). I will post another patch later to fix that bug. However, I want to spend a bit more time to see if I can figure out what causes the panic in the first place.

Cheers,
Longman
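
For readers following the thread, the reference cycle described in the quoted commit message can be modelled roughly as below. This is only a toy sketch in Python, not kernel code: the class names, the single pending list, and the destroy() helper are all assumptions made for illustration, and it shows only the delayed-free behaviour, not the list_add corruption itself, whose root cause is still open in this thread.

```python
# Toy model only: names are illustrative and do not match kernel APIs.

class Blkcg:
    def __init__(self):
        self.refs = 1        # base reference held by the cgroup
        self.pending = []    # stands in for the lockless list of pending stats
        self.freed = False

    def flush(self):
        # Flushing drops the blkg references held by pending stats.
        while self.pending:
            self.pending.pop().put()

    def put(self):
        self.refs -= 1
        if self.refs == 0:
            # Per the commit message, the flush was originally reached
            # only once the reference count had already dropped to 0.
            self.flush()
            self.freed = True


class Blkg:
    def __init__(self, blkcg):
        self.blkcg = blkcg
        self.refs = 1        # base reference
        blkcg.refs += 1      # each blkg pins its blkcg

    def queue_stat(self):
        self.refs += 1       # a pending stat pins the blkg
        self.blkcg.pending.append(self)

    def put(self):
        self.refs -= 1
        if self.refs == 0:
            self.blkcg.put() # last blkg reference releases the blkcg


def destroy(blkcg, blkgs, flush_first):
    """Tear down a blkcg; return True if it was actually freed."""
    if flush_first:
        blkcg.flush()        # flush pending stats in the destruction path
    for blkg in blkgs:
        blkg.put()           # drop each blkg's base reference
    blkcg.put()              # drop the cgroup's base reference
    return blkcg.freed
```

In this model, calling destroy() with flush_first=False on a blkcg that still has a queued stat leaves it unfreed, because the pending stat pins the blkg, the blkg pins the blkcg, and the flush is never reached. With flush_first=True the references unwind and the blkcg is freed.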



