On Mon, Jul 03, 2023 at 07:19:35PM +0800, Yu Kuai wrote:
> Hi,
>
> On 2023/07/03 17:47, Xueshi Hu wrote:
> > On Mon, Jul 03, 2023 at 09:44:03AM +0800, Yu Kuai wrote:
> > > Hi,
> > >
> > > On 2023/07/02 18:04, Xueshi Hu wrote:
> > > > When a raid device is reshaped, an in-flight bio may reference the
> > > > outdated r1conf::raid_disks and r1bio::poolinfo. This can trigger a
> > > > bug in three possible paths:
> > > >
> > > > 1. In raid1d(). If a bio fails to submit, it is resent to raid1d to
> > > > retry the submission, which increases r1conf::nr_queued. If a reshape
> > > > happens, the in-flight bio cannot be freed normally because the old
> > > > mempool has been destroyed.
> > > > 2. In raid1_write_request(). If one raw device is blocked, the kernel
> > > > will allow the barrier and wait for the raw device to become ready,
> > > > which makes a reshape possible. After that, the local variable "disks"
> > > > set before the "retry_write" label is outdated, and the kernel cannot
> > > > reuse the old r1bio.
> > > > 3. In raid_end_bio_io(). The kernel must free the r1bio first and only
> > > > then allow the barrier.
> > > >
> > > > By freezing the queue, we can ensure that there are no in-flight bios
> > > > during the reshape. This prevents bios from referencing the outdated
> > > > r1conf::raid_disks or r1bio::poolinfo.
> > >
> > > I didn't look into the details of the problem you described, but even if
> > > the problem exists, freezing the queue can't help at all:
> > > blk_mq_freeze_queue() for a bio-based device can't guarantee that there
> > > are no in-flight bios.
> > >
> > > Thanks,
> > > Kuai
> > >
> > > > Signed-off-by: Xueshi Hu <xueshi.hu@xxxxxxxxxx>
> > > > ---
> > > >  drivers/md/raid1.c | 3 +++
> > > >  1 file changed, 3 insertions(+)
> > > >
> > > > diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> > > > index dd25832eb045..d8d6825d0af6 100644
> > > > --- a/drivers/md/raid1.c
> > > > +++ b/drivers/md/raid1.c
> > > > @@ -3247,6 +3247,7 @@ static int raid1_reshape(struct mddev *mddev)
> > > >  	unsigned long flags;
> > > >  	int d, d2;
> > > >  	int ret;
> > > > +	struct request_queue *q = mddev->queue;
> > > >
> > > >  	memset(&newpool, 0, sizeof(newpool));
> > > >  	memset(&oldpool, 0, sizeof(oldpool));
> > > > @@ -3296,6 +3297,7 @@ static int raid1_reshape(struct mddev *mddev)
> > > >  		return -ENOMEM;
> > > >  	}
> > > >
> > > > +	blk_mq_freeze_queue(q);
> > > >  	freeze_array(conf, 0);
> > > >
> > > >  	/* ok, everything is stopped */
> > > > @@ -3333,6 +3335,7 @@ static int raid1_reshape(struct mddev *mddev)
> > > >  	md_wakeup_thread(mddev->thread);
> > > >
> > > >  	mempool_exit(&oldpool);
> > > > +	blk_mq_unfreeze_queue(q);
> > > >  	return 0;
> > > >  }
> > > >
> >
> > With the bash script below, it's easy to trigger the bug:
> > 1. First, start fio to issue requests to the raid device.
> > 2. Set one of the raw devices' state to "blocked".
> > 3. Reshape the raid device and then "-blocked" the raw device.
> >
> > ```
> > parted -s /dev/sda -- mklabel gpt
> > parted /dev/sda -- mkpart primary 0G 1G
> > parted -s /dev/sdc -- mklabel gpt
> > parted /dev/sdc -- mkpart primary 0G 1G
> >
> > yes | mdadm --create /dev/md10 --level=mirror \
> > 	--force --raid-devices=2 /dev/sda1 /dev/sdc1
> > mdadm --wait /dev/md10
> >
> > nohup fio fio.job &
> >
> > device_num=2
> > for ((i = 0; i <= 100000; i = i + 1)); do
> > 	sleep 1
> > 	echo "blocked" >/sys/devices/virtual/block/md10/md/dev-sda1/state
> > 	if [[ $((i % 2)) -eq 0 ]]; then
> > 		device_num=2
> > 	else
> > 		device_num=1800
> > 	fi
> > 	mdadm --grow --force --raid-devices=$device_num /dev/md10
> > 	sleep 1
> > 	echo "-blocked" >/sys/devices/virtual/block/md10/md/dev-sda1/state
> > done
> > ```
> >
> > The fio configuration, file fio.job:
> > ```
> > [global]
> > ioengine=libaio
> > bs=4k
> > numjobs=1
> > iodepth=128
> > direct=1
> > rate=1M,1M
> >
> > [md10]
> > time_based
> > runtime=-1
> > rw=randwrite
> > filename=/dev/md10
> > ```
> >
> > The kernel crashed when trying to free the r1bio:
> >
> > [ 116.977805]  ? __die+0x23/0x70
> > [ 116.977962]  ? page_fault_oops+0x181/0x470
> > [ 116.978148]  ? exc_page_fault+0x71/0x180
> > [ 116.978331]  ? asm_exc_page_fault+0x26/0x30
> > [ 116.978523]  ? bio_put+0xe/0x130
> > [ 116.978672]  raid_end_bio_io+0xa1/0xd0
> > [ 116.978854]  raid1_end_write_request+0x111/0x350
> > [ 116.979063]  blk_update_request+0x114/0x480
> > [ 116.979253]  ? __ata_sff_port_intr+0x9c/0x160
> > [ 116.979452]  scsi_end_request+0x27/0x1c0
> > [ 116.979633]  scsi_io_completion+0x5a/0x6a0
> > [ 116.979822]  blk_complete_reqs+0x3d/0x50
> > [ 116.980000]  __do_softirq+0x113/0x3aa
> > [ 116.980169]  irq_exit_rcu+0x8e/0xb0
> > [ 116.980334]  common_interrupt+0x86/0xa0
> > [ 116.980508]  </IRQ>
> > [ 116.980606]  <TASK>
> > [ 116.980704]  asm_common_interrupt+0x26/0x40
> > [ 116.980897] RIP: 0010:default_idle+0xf/0x20
>
> This looks like freeze_array() doesn't work as expected.
>
> >
> > As far as I know, when a request is allocated,
> > request_queue::q_usage_counter is increased, and when the I/O finishes,
> > request_queue::q_usage_counter is decreased. Take the nvme driver as an
> > example:
> >
> > nvme_complete_batch()
> > 	blk_mq_end_request_batch()
> > 		blk_mq_flush_tag_batch()
> > 			percpu_ref_put_many(&q->q_usage_counter, nr_tags);
> >
> > So, when blk_mq_freeze_queue() returns successfully, every in-flight
> > I/O has come back from the hardware, and new requests are blocked.
>
> This only works for rq-based devices, not bio-based devices.

I get your point now: raid is a bio-based device[^1], and its I/O path
diverges from that of rq-based devices at __submit_bio(). A bio-based
device calls ->submit_bio() and puts the q_usage_counter reference
immediately, without waiting for the in-flight I/O to come back.
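To make the divergence concrete, here is a rough sketch of how I read the
submission path (paraphrased from block/blk-core.c of a recent kernel, not
the literal upstream code; bio_queue_enter()/blk_queue_exit() are
block-layer internals and the exact structure varies by version):

```c
/*
 * Simplified paraphrase of __submit_bio(), only to show where bio-based
 * and rq-based devices part ways with respect to q_usage_counter.
 */
static void __submit_bio(struct bio *bio)
{
	struct gendisk *disk = bio->bi_bdev->bd_disk;

	if (!disk->fops->submit_bio) {
		/*
		 * rq-based (blk-mq) device: the allocated request keeps a
		 * reference on q->q_usage_counter until the request
		 * completes, so blk_mq_freeze_queue() really waits for the
		 * I/O to finish.
		 */
		blk_mq_submit_bio(bio);
	} else if (bio_queue_enter(bio) == 0) {
		/*
		 * bio-based device such as md/raid1: the reference taken by
		 * bio_queue_enter() is dropped as soon as ->submit_bio()
		 * returns, long before the cloned bios sent to the member
		 * disks complete.
		 */
		disk->fops->submit_bio(bio);
		blk_queue_exit(disk->queue);
	}
}
```

So for md, blk_mq_freeze_queue() only guarantees that no new bios can enter
->submit_bio(); it says nothing about bios that raid1 has already cloned
and handed down to the member disks.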
Thank you for catching my mistake.

[^1]: https://lwn.net/Articles/736534/

>
> Thanks,
> Kuai
>
> >
> > Thanks,
> > Hu
> >

Thanks,
Hu