On Mon, Jul 03, 2023 at 09:44:03AM +0800, Yu Kuai wrote:
> Hi,
>
> On 2023/07/02 18:04, Xueshi Hu wrote:
> > When a raid device is reshaped, in-flight bios may reference the outdated
> > r1conf::raid_disks and r1bio::poolinfo. This can trigger a bug in
> > three possible paths:
> >
> > 1. In function "raid1d". If a bio fails to submit, it will be resent to
> > raid1d for retrying the submission, which increases r1conf::nr_queued.
> > If the reshape happens, the in-flight bio cannot be freed normally as
> > the old mempool has been destroyed.
> > 2. In raid1_write_request. If one raw device is blocked, the kernel will
> > allow the barrier and wait for the raw device to become ready, which makes
> > the raid reshape possible. Then, the local variable "disks" before the
> > label "retry_write" is outdated. Additionally, the kernel cannot reuse the
> > old r1bio.
> > 3. In raid_end_bio_io. The kernel must free the r1bio first and then
> > allow the barrier.
> >
> > By freezing the queue, we can ensure that there are no in-flight bios
> > during reshape. This prevents bios from referencing the outdated
> > r1conf::raid_disks or r1bio::poolinfo.
>
> I didn't look into the details of the problem you described, but even if
> the problem exists, freezing the queue can't help at all:
> blk_mq_freeze_queue() for a bio-based device can't guarantee that there
> are no in-flight bios.
>
> Thanks,
> Kuai
>
> > Signed-off-by: Xueshi Hu <xueshi.hu@xxxxxxxxxx>
> > ---
> >   drivers/md/raid1.c | 3 +++
> >   1 file changed, 3 insertions(+)
> >
> > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> > index dd25832eb045..d8d6825d0af6 100644
> > --- a/drivers/md/raid1.c
> > +++ b/drivers/md/raid1.c
> > @@ -3247,6 +3247,7 @@ static int raid1_reshape(struct mddev *mddev)
> >  	unsigned long flags;
> >  	int d, d2;
> >  	int ret;
> > +	struct request_queue *q = mddev->queue;
> >
> >  	memset(&newpool, 0, sizeof(newpool));
> >  	memset(&oldpool, 0, sizeof(oldpool));
> > @@ -3296,6 +3297,7 @@ static int raid1_reshape(struct mddev *mddev)
> >  		return -ENOMEM;
> >  	}
> >
> > +	blk_mq_freeze_queue(q);
> >  	freeze_array(conf, 0);
> >
> >  	/* ok, everything is stopped */
> > @@ -3333,6 +3335,7 @@ static int raid1_reshape(struct mddev *mddev)
> >  	md_wakeup_thread(mddev->thread);
> >
> >  	mempool_exit(&oldpool);
> > +	blk_mq_unfreeze_queue(q);
> >  	return 0;
> >  }

With the bash script below, it's easy to trigger the bug:
1. First, start fio to issue requests on the raid device
2. Set one of the raw devices' state to "blocked"
3. Reshape the raid device, then write "-blocked" to the raw device's state

```
parted -s /dev/sda -- mklabel gpt
parted /dev/sda -- mkpart primary 0G 1G
parted -s /dev/sdc -- mklabel gpt
parted /dev/sdc -- mkpart primary 0G 1G

yes | mdadm --create /dev/md10 --level=mirror \
	--force --raid-devices=2 /dev/sda1 /dev/sdc1
mdadm --wait /dev/md10

nohup fio fio.job &

device_num=2
for ((i = 0; i <= 100000; i = i + 1)); do
	sleep 1
	echo "blocked" >/sys/devices/virtual/block/md10/md/dev-sda1/state
	if [[ $((i % 2)) -eq 0 ]]; then
		device_num=2
	else
		device_num=1800
	fi
	mdadm --grow --force --raid-devices=$device_num /dev/md10
	sleep 1
	echo "-blocked" >/sys/devices/virtual/block/md10/md/dev-sda1/state
done
```

The fio configuration, file fio.job:

```
[global]
ioengine=libaio
bs=4k
numjobs=1
iodepth=128
direct=1
rate=1M,1M

[md10]
time_based
runtime=-1
rw=randwrite
filename=/dev/md10
```

The kernel crashed when trying to free the r1bio:

[ 116.977805]  ? __die+0x23/0x70
[ 116.977962]  ? page_fault_oops+0x181/0x470
[ 116.978148]  ? exc_page_fault+0x71/0x180
[ 116.978331]  ? asm_exc_page_fault+0x26/0x30
[ 116.978523]  ? bio_put+0xe/0x130
[ 116.978672]  raid_end_bio_io+0xa1/0xd0
[ 116.978854]  raid1_end_write_request+0x111/0x350
[ 116.979063]  blk_update_request+0x114/0x480
[ 116.979253]  ? __ata_sff_port_intr+0x9c/0x160
[ 116.979452]  scsi_end_request+0x27/0x1c0
[ 116.979633]  scsi_io_completion+0x5a/0x6a0
[ 116.979822]  blk_complete_reqs+0x3d/0x50
[ 116.980000]  __do_softirq+0x113/0x3aa
[ 116.980169]  irq_exit_rcu+0x8e/0xb0
[ 116.980334]  common_interrupt+0x86/0xa0
[ 116.980508]  </IRQ>
[ 116.980606]  <TASK>
[ 116.980704]  asm_common_interrupt+0x26/0x40
[ 116.980897] RIP: 0010:default_idle+0xf/0x20

As far as I know, when a request is allocated, request_queue::q_usage_counter
is increased; when the I/O finishes, request_queue::q_usage_counter is
decreased. Taking the nvme driver as an example:

nvme_complete_batch()
  blk_mq_end_request_batch()
    blk_mq_flush_tag_batch()
      percpu_ref_put_many(&q->q_usage_counter, nr_tags);

So, when blk_mq_freeze_queue() returns successfully, every in-flight I/O has
come back from the hardware, and new requests are blocked.
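To make that get/put pattern concrete, here is a minimal userspace sketch of
the idea, not kernel code: a plain counter protected by a mutex/condvar stands
in for the percpu_ref, and the helper names are only loosely modeled on the
block layer's queue enter/exit and freeze paths. Submitters bump the counter
before issuing I/O, drop it on completion, and the freeze call blocks new
submitters and waits for the counter to drain.

```
/*
 * Userspace sketch of the q_usage_counter pattern described above.
 * Not kernel code: a plain counter + mutex/condvar stands in for the
 * percpu_ref, and the names are only loosely modeled on the block
 * layer's queue enter/exit and freeze paths.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct queue {
	pthread_mutex_t lock;
	pthread_cond_t  drained;
	long            usage;   /* models q_usage_counter */
	bool            frozen;  /* set by the freeze path */
};

/* Taken by a submitter before issuing I/O; fails once the queue is frozen. */
static bool queue_enter(struct queue *q)
{
	pthread_mutex_lock(&q->lock);
	if (q->frozen) {
		pthread_mutex_unlock(&q->lock);
		return false;            /* new requests are blocked */
	}
	q->usage++;
	pthread_mutex_unlock(&q->lock);
	return true;
}

/* Dropped on I/O completion, cf. percpu_ref_put_many() in the chain above. */
static void queue_exit(struct queue *q)
{
	pthread_mutex_lock(&q->lock);
	if (--q->usage == 0)
		pthread_cond_signal(&q->drained);
	pthread_mutex_unlock(&q->lock);
}

/* Models blk_mq_freeze_queue(): block submitters, wait for usage to drain. */
static void queue_freeze(struct queue *q)
{
	pthread_mutex_lock(&q->lock);
	q->frozen = true;
	while (q->usage > 0)
		pthread_cond_wait(&q->drained, &q->lock);
	pthread_mutex_unlock(&q->lock);
}

int main(void)
{
	struct queue q = {
		.lock    = PTHREAD_MUTEX_INITIALIZER,
		.drained = PTHREAD_COND_INITIALIZER,
	};

	if (queue_enter(&q)) {   /* "allocate a request" */
		/* ... I/O would complete asynchronously here ... */
		queue_exit(&q);  /* "I/O finished" */
	}

	queue_freeze(&q);        /* returns only once usage has dropped to 0 */
	printf("frozen: no in-flight I/O, new submitters are blocked\n");
	return 0;
}
```

Thanks,
Hu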