> On Nov 14, 2024, at 21:10, liequan che <liequanche@xxxxxxxxx> wrote:
>
> Hi colyli and mingzhe.zou:
> I am trying to reproduce this problem; maybe it is a random problem.
> It is triggered only when an IO error reading priorities occurs.
> The same operation was performed on three servers, replacing the 12T
> disk with a 16T disk. Only one server triggered the bug. The on-site

What do you mean by "replacing the 12T disk with a 16T disk"?

> operation steps are as follows:
> 1. Create a bcache device.
> make-bcache -C /dev/nvme2n1p1 -B /dev/sda --writeback --force --wipe-bcache
> /dev/sda is a 12T SATA disk.
> /dev/nvme2n1p1 is the first partition of the nvme disk. The partition
> size is 1024G.
> The partition command is parted -s --align optimal /dev/nvme2n1 mkpart
> primary 2048s 1024GiB
> 2. Execute the fio test on bcache0
>
> cat /home/script/run-fio-randrw.sh
> bcache_name=$1
> if [ -z "${bcache_name}" ];then
>     echo bcache_name is empty
>     exit -1
> fi
>
> fio --filename=/dev/${bcache_name} --ioengine=libaio --rw=randrw
> --bs=4k --size=100% --iodepth=128 --numjobs=4 --direct=1 --name=randrw
> --group_reporting --runtime=30 --ramp_time=5 --lockmem=1G | tee -a
> ./randrw-iops_k1.log
> Execute bash run-fio-randrw.sh bcache0 multiple times
> 2. Shutdown
> poweroff
> No bcache data clearing operation was performed

What is the "bcache data clearing operation" here?

> 3. Replace the 12T SATA disk with a 16T SATA disk
> After shutting down, unplug the 12T hard disk and replace it with a
> 16T hard disk.

It seems you did something bcache doesn't support: replacing the backing device...

> 4. Adjust the size of the nvme2n1 partition to 1536G
> parted -s --align optimal /dev/nvme2n1 mkpart primary 2048s 1536GiB
> Kernel panic occurs after partitioning is completed

Yes, it is expected; bcache doesn't support resizing the cache device. The operation results in a corrupted metadata layout, so the panic is expected.

> 5. Restart the system, but it cannot enter the system normally. It is
> always in the restart state.
> 6. Enter rescue mode through the CD and clear the nvme2n1p1 super
> block information. After restarting again, you can enter the system
> normally.
> wipefs -af /dev/nvme2n1p1

OK, the cache device is cleared.

> 7. Repartition again, triggering the kernel panic again.
> parted -s --align optimal /dev/nvme2n1 mkpart primary 2048s 1536GiB
> The same operation was performed on the other two servers, and no
> panic was triggered.

I guess this is another undefined operation. I assume the cache device is still referenced somewhere. A reboot should follow the wipefs.

> The server with the problem was able to enter the system normally
> after the root of the cache_set structure was determined to be empty.
> I updated the description of the problem in the link below.

No, if you clean up the partition, no cache device will exist. Cache registration won't treat it as a bcache device.

OK, from the above description, I see you replaced the backing device (and I don't know where the previous data went), then you extended the cache device size. These are all unsupported operations. It is very possible that the unsupported operations result in an undefined aftermath.

> bugzilla: https://gitee.com/openeuler/kernel/issues/IB3YQZ
> Your suggestion was correct. I removed the unnecessary btree_cache
> IS_ERR_OR_NULL check.

Here on the linux-bcache mailing list we don't handle distribution-specific bugs unless they exist upstream too. But from the above description, IMHO these are invalid operations, so I don't see a valid bug here.
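For reference, a rough sketch of a sequence that stays within supported operations, assuming the cache set UUID is known and the standard bcache sysfs knobs are available (<cset-uuid> is a placeholder for the directory under /sys/fs/bcache/), would be to tear the old setup down first and then re-create both devices from scratch, instead of swapping or resizing disks underneath bcache:

    # release the cache and stop the bcache device before pulling the disk
    echo 1 > /sys/block/bcache0/bcache/detach
    echo 1 > /sys/block/bcache0/bcache/stop
    echo 1 > /sys/fs/bcache/<cset-uuid>/unregister

    # after swapping the backing disk and repartitioning the nvme device,
    # wipe any stale super blocks, reboot, and build a fresh cache set
    # plus backing device rather than reusing the old metadata
    wipefs -af /dev/nvme2n1p1
    wipefs -af /dev/sda
    reboot
    make-bcache -C /dev/nvme2n1p1 -B /dev/sda --writeback

The point is that the old cache set gets unregistered and the super blocks get re-created; any path that leaves stale metadata behind and then resizes or swaps devices is undefined.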
> ------------
> If the bcache cache disk contains damaged data, when the bcache cache
> disk partition is operated on directly, the systemd-udevd service
> triggers the bcache-register program to register the bcache device,
> resulting in a kernel oops.
>
> Signed-off-by: cheliequan <cheliequan@xxxxxxxxxx>
>
> ---
>  drivers/md/bcache/super.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
> index fd97730479d8..c72f5576e4da 100644
> --- a/drivers/md/bcache/super.c
> +++ b/drivers/md/bcache/super.c
> @@ -1741,8 +1741,10 @@ static void cache_set_flush(struct closure *cl)
>  	if (!IS_ERR_OR_NULL(c->gc_thread))
>  		kthread_stop(c->gc_thread);
>
> -	if (!IS_ERR(c->root))
> -		list_add(&c->root->list, &c->btree_cache);
> +	if (!IS_ERR_OR_NULL(c->root)) {
> +		if (!list_empty(&c->root->list))
> +			list_add(&c->root->list, &c->btree_cache);
> +	}
>

The patch just avoids an explicit kernel panic of the undefined device status. More damage is on the way even if you try to veil this panic.

Thanks.

Coly Li

>  	/*
>  	 * Avoid flushing cached nodes if cache set is retiring
> --
> 2.33.0
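As a side note on the check itself, here is a minimal userspace sketch (not kernel code; ERR_PTR/IS_ERR below just mimic include/linux/err.h, and it assumes c->root is still NULL when registration aborts before the root node is ever read, e.g. on an IO error while reading priorities) showing why the current !IS_ERR() test lets a NULL root reach list_add():

    /*
     * Userspace sketch of the three states c->root can be in when
     * cache_set_flush() runs: valid node, ERR_PTR from a failed btree
     * read, or NULL because registration never got that far.
     */
    #include <stdio.h>
    #include <errno.h>

    #define MAX_ERRNO           4095
    #define ERR_PTR(err)        ((void *)(long)(err))
    #define IS_ERR(ptr)         ((unsigned long)(ptr) >= (unsigned long)-MAX_ERRNO)
    #define IS_ERR_OR_NULL(ptr) (!(ptr) || IS_ERR(ptr))

    static void check(const char *state, void *root)
    {
            /* old check filters ERR_PTR only, so a NULL root falls through
             * and the following list_add(&root->list, ...) would oops */
            printf("%-8s old !IS_ERR(): %d  new !IS_ERR_OR_NULL(): %d\n",
                   state, !IS_ERR(root), !IS_ERR_OR_NULL(root));
    }

    int main(void)
    {
            int node;

            check("valid", &node);           /* both pass: node re-added to btree_cache */
            check("ERR_PTR", ERR_PTR(-EIO)); /* both skip the list_add() */
            check("NULL", NULL);             /* old check passes: NULL dereference */
            return 0;
    }

That illustrates the symptom the patch papers over, but as said above, the metadata underneath is still corrupted by the unsupported operations.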