Hi colyli and mingzhe.zou:

I am trying to reproduce this problem; it may be a sporadic issue that is
triggered only when an I/O error occurs while reading priorities. The same
operation was performed on three servers, replacing the 12T disk with a 16T
disk, and only one server triggered the bug.

The on-site operation steps are as follows:

1. Create a bcache device:
   make-bcache -C /dev/nvme2n1p1 -B /dev/sda --writeback --force --wipe-bcache
   /dev/sda is a 12T SATA disk.
   /dev/nvme2n1p1 is the first partition of the nvme disk; the partition size is 1024G.
   The partition command is:
   parted -s --align optimal /dev/nvme2n1 mkpart primary 2048s 1024GiB

2. Execute the fio test on bcache0:
   cat /home/script/run-fio-randrw.sh
   bcache_name=$1
   if [ -z "${bcache_name}" ];then
       echo bcache_name is empty
       exit -1
   fi
   fio --filename=/dev/${bcache_name} --ioengine=libaio --rw=randrw --bs=4k --size=100% --iodepth=128 --numjobs=4 --direct=1 --name=randrw --group_reporting --runtime=30 --ramp_time=5 --lockmem=1G | tee -a ./randrw-iops_k1.log

   Execute "bash run-fio-randrw.sh bcache0" multiple times.

3. Shut down the server (poweroff). No bcache data clearing operation was performed.

4. Replace the 12T SATA disk with a 16T SATA disk.
   After shutting down, unplug the 12T hard disk and replace it with a 16T hard disk.

5. Adjust the size of the nvme2n1 partition to 1536G:
   parted -s --align optimal /dev/nvme2n1 mkpart primary 2048s 1536GiB
   A kernel panic occurs after partitioning completes.

6. Restart the system. It cannot enter the system normally and is stuck in a reboot loop.

7. Enter rescue mode through the CD and clear the nvme2n1p1 super block information:
   wipefs -af /dev/nvme2n1p1
   After restarting again, the system can be entered normally.

8. Repartition again, which triggers the kernel panic again:
   parted -s --align optimal /dev/nvme2n1 mkpart primary 2048s 1536GiB

The same operation was performed on the other two servers, and no panic was
triggered. The problematic server was able to enter the system normally once
the root of the cache_set structure was checked for being empty.

I updated the description of the problem in the link below.
bugzilla: https://gitee.com/openeuler/kernel/issues/IB3YQZ

Your suggestion was correct. I removed the unnecessary IS_ERR_OR_NULL check
on btree_cache.

------------
If the bcache cache disk contains damaged data, directly operating on the
bcache cache disk partition triggers the systemd-udevd service to call the
bcache-register program to register the bcache device, resulting in a
kernel oops.

Signed-off-by: cheliequan <cheliequan@xxxxxxxxxx>
---
 drivers/md/bcache/super.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index fd97730479d8..c72f5576e4da 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -1741,8 +1741,10 @@ static void cache_set_flush(struct closure *cl)
 	if (!IS_ERR_OR_NULL(c->gc_thread))
 		kthread_stop(c->gc_thread);
 
-	if (!IS_ERR(c->root))
-		list_add(&c->root->list, &c->btree_cache);
+	if (!IS_ERR_OR_NULL(c->root)) {
+		if (!list_empty(&c->root->list))
+			list_add(&c->root->list, &c->btree_cache);
+	}
 
 	/*
 	 * Avoid flushing cached nodes if cache set is retiring
--
2.33.0

Coly Li <colyli@xxxxxxx> wrote on Wednesday, November 13, 2024 at 21:54:
>
>
>
> > On November 13, 2024, at 16:58, mingzhe.zou@xxxxxxxxxxxx wrote:
> >
> > Hi, cheliequan and Coly:
> >
> > I saw some dmesg printing information from https://gitee.com/openeuler/kernel/issues/IB3YQZ
> >
> > This is a distribution issue, not upstream bug.
> > The kernel is based on 5.10 kernel.
>
> If it can be confirmed to reproduce with latest upstream kernel, I can take a look.
>
> Thanks.
>
> Coly Li
>
>
>
> [snipped]
>
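
For reference, here is a minimal user-space sketch of why the old IS_ERR()-only
check is not enough in this scenario. It is only an illustration, not the bcache
code itself: IS_ERR, IS_ERR_OR_NULL, struct list_head and list_add below are
simplified stand-ins for the kernel helpers, and the file and variable names are
made up. The idea is that when registration fails before the root btree node is
read (as with the priorities read error above), c->root is still NULL;
IS_ERR(NULL) is false, so list_add() is reached and writes through the NULL
pointer, while IS_ERR_OR_NULL() catches both the error pointer and the NULL case.

/* null_root_sketch.c: user-space illustration only, not bcache code.      */
/* The macros and list helpers are simplified stand-ins for the kernel's   */
/* err.h and list.h; names are invented to mirror the patched hunk above.  */

#include <stdio.h>
#include <stddef.h>

#define MAX_ERRNO	4095
#define IS_ERR(p)		((unsigned long)(p) >= (unsigned long)-MAX_ERRNO)
#define IS_ERR_OR_NULL(p)	((p) == NULL || IS_ERR(p))

struct list_head { struct list_head *next, *prev; };

struct btree {
	struct list_head list;
	/* other fields omitted */
};

/* Same insertion order as the kernel's list_add(): new goes right after head. */
static void list_add(struct list_head *new, struct list_head *head)
{
	new->next = head->next;	/* writes through 'new': a NULL node crashes here */
	new->prev = head;
	head->next->prev = new;
	head->next = new;
}

int main(void)
{
	struct list_head btree_cache = { &btree_cache, &btree_cache };
	/* Registration failed before the root node was read, so root == NULL. */
	struct btree *root = NULL;

	printf("IS_ERR(root)         = %d\n", (int)IS_ERR(root));         /* 0: NULL slips past the old check */
	printf("IS_ERR_OR_NULL(root) = %d\n", (int)IS_ERR_OR_NULL(root)); /* 1: NULL is rejected */

	/*
	 * With the pre-patch guard "if (!IS_ERR(root))" the call below would
	 * be reached with root == NULL and dereference a NULL pointer, which
	 * is the oops seen from cache_set_flush().  The patched guard keeps
	 * NULL away from list_add().
	 */
	if (!IS_ERR_OR_NULL(root))
		list_add(&root->list, &btree_cache);

	printf("btree_cache still empty: %d\n", btree_cache.next == &btree_cache);
	return 0;
}

It builds with a plain "gcc null_root_sketch.c" and just prints the two macro
results, showing which guard lets the NULL root through.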