Hi Guoqing, Sorry for the delay, I missed it. On Sat, 12 Feb 2022 09:47:38 +0800 Guoqing Jiang <guoqing.jiang@xxxxxxxxx> wrote: > On 1/27/22 11:39 PM, Mariusz Tkaczyk wrote: > > Raid456 module had allowed to achieve failed state. It was fixed by > > fb73b357fb9 ("raid5: block failing device if raid will be failed"). > > This fix introduces a bug, now if raid5 fails during IO, it may > > result with a hung task without completion. Faulty flag on the > > device is necessary to process all requests and is checked many > > times, mainly in analyze_stripe(). > > Allow to set faulty on drive again and set MD_BROKEN if raid is > > failed. > > > > As a result, this level is allowed to achieve failed state again, > > but communication with userspace (via -EBUSY status) will be > > preserved. > > > > This restores possibility to fail array via #mdadm --set-faulty > > command and will be fixed by additional verification on mdadm side. > > > > Again, you better to send mdadm change along with the series. I understand your objections. Unfortunately, I was unable to handle it yet. I focused on kernel part first because mdadm was in freeze. In mdadm, I need to block manual removal which should be simple, there is already enough() function defined. It is on top of my TODO lst. > > > Reproduction steps: > > mdadm -CR imsm -e imsm -n 3 /dev/nvme[0-2]n1 > > mdadm -CR r5 -e imsm -l5 -n3 /dev/nvme[0-2]n1 --assume-clean > > mkfs.xfs /dev/md126 -f > > mount /dev/md126 /mnt/root/ > > > > fio --filename=/mnt/root/file --size=5GB --direct=1 --rw=randrw > > --bs=64k --ioengine=libaio --iodepth=64 --runtime=240 --numjobs=4 > > --time_based --group_reporting --name=throughput-test-job > > --eta-newline=1 & > > > > echo 1 > /sys/block/nvme2n1/device/device/remove > > echo 1 > /sys/block/nvme1n1/device/device/remove > > > > [ 1475.787779] Call Trace: > > [ 1475.793111] __schedule+0x2a6/0x700 > > [ 1475.799460] schedule+0x38/0xa0 > > [ 1475.805454] raid5_get_active_stripe+0x469/0x5f0 [raid456] > > [ 1475.813856] ? finish_wait+0x80/0x80 > > [ 1475.820332] raid5_make_request+0x180/0xb40 [raid456] > > [ 1475.828281] ? finish_wait+0x80/0x80 > > [ 1475.834727] ? finish_wait+0x80/0x80 > > [ 1475.841127] ? finish_wait+0x80/0x80 > > [ 1475.847480] md_handle_request+0x119/0x190 > > [ 1475.854390] md_make_request+0x8a/0x190 > > [ 1475.861041] generic_make_request+0xcf/0x310 > > [ 1475.868145] submit_bio+0x3c/0x160 > > [ 1475.874355] iomap_dio_submit_bio.isra.20+0x51/0x60 > > [ 1475.882070] iomap_dio_bio_actor+0x175/0x390 > > [ 1475.889149] iomap_apply+0xff/0x310 > > [ 1475.895447] ? iomap_dio_bio_actor+0x390/0x390 > > [ 1475.902736] ? iomap_dio_bio_actor+0x390/0x390 > > [ 1475.909974] iomap_dio_rw+0x2f2/0x490 > > [ 1475.916415] ? iomap_dio_bio_actor+0x390/0x390 > > [ 1475.923680] ? atime_needs_update+0x77/0xe0 > > [ 1475.930674] ? xfs_file_dio_aio_read+0x6b/0xe0 [xfs] > > [ 1475.938455] xfs_file_dio_aio_read+0x6b/0xe0 [xfs] > > [ 1475.946084] xfs_file_read_iter+0xba/0xd0 [xfs] > > [ 1475.953403] aio_read+0xd5/0x180 > > [ 1475.959395] ? _cond_resched+0x15/0x30 > > [ 1475.965907] io_submit_one+0x20b/0x3c0 > > [ 1475.972398] __x64_sys_io_submit+0xa2/0x180 > > [ 1475.979335] ? do_io_getevents+0x7c/0xc0 > > [ 1475.986009] do_syscall_64+0x5b/0x1a0 > > [ 1475.992419] entry_SYSCALL_64_after_hwframe+0x65/0xca > > [ 1476.000255] RIP: 0033:0x7f11fc27978d > > [ 1476.006631] Code: Bad RIP value. > > [ 1476.073251] INFO: task fio:3877 blocked for more than 120 > > seconds. > > Does it also happen to non imsm array? And did you try to reproduce > it with revert fb73b357fb? I suppose fb73b357fb9 introduced the > regression given it is fixed by this one. > It is reproducible on native and IMSM. Yes, we can fix it by revert. > > > * In both cases, &MD_BROKEN will be set in &mddev->flags. > > */ > > static void raid1_error(struct mddev *mddev, struct md_rdev *rdev) > > diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c > > index 1240a5c16af8..bee953c8007f 100644 > > --- a/drivers/md/raid5.c > > +++ b/drivers/md/raid5.c > > @@ -686,17 +686,21 @@ int raid5_calc_degraded(struct r5conf *conf) > > return degraded; > > } > > > > -static int has_failed(struct r5conf *conf) > > +static bool has_failed(struct r5conf *conf) > > { > > - int degraded; > > + int degraded = conf->mddev->degraded; > > > > - if (conf->mddev->reshape_position == MaxSector) > > - return conf->mddev->degraded > conf->max_degraded; > > + if (test_bit(MD_BROKEN, &conf->mddev->flags)) > > + return true; > > If one member disk was set Faulty which caused BROKEN was set, is it > possible to re-add the same member disk again? > Is possible to re-add drive to failed raid5 array now? From my understanding of raid5_add_disk it is not possible. > [root@vm ~]# echo faulty > /sys/block/md0/md/dev-loop1/state > [root@vm ~]# cat /proc/mdstat > Personalities : [raid6] [raid5] [raid4] > md0 : active raid5 loop2[2] loop1[0](F) > 1046528 blocks super 1.2 level 5, 512k chunk, algorithm 2 > [2/1] [_U] bitmap: 0/1 pages [0KB], 65536KB chunk > > unused devices: <none> > [root@vm ~]# echo re-add > /sys/block/md0/md/dev-loop1/state > [root@vm ~]# cat /proc/mdstat > Personalities : [raid6] [raid5] [raid4] > md0 : active raid5 loop2[2] loop1[0] > 1046528 blocks super 1.2 level 5, 512k chunk, algorithm 2 > [2/2] [UU] bitmap: 0/1 pages [0KB], 65536KB chunk > > unused devices: <none> > > And have you run mdadm test against the series? > I run imsm test suite and our internal IMSM scope. I will take the challenge and will verify with native. Thanks for suggestion. > > - degraded = raid5_calc_degraded(conf); > > - if (degraded > conf->max_degraded) > > - return 1; > > - return 0; > > + if (conf->mddev->reshape_position != MaxSector) > > + degraded = raid5_calc_degraded(conf); > > + > > + if (degraded > conf->max_degraded) { > > + set_bit(MD_BROKEN, &conf->mddev->flags); > > Why not set BROKEN flags in err handler to align with other levels? Or > do it in md_error only. https://lore.kernel.org/linux-raid/3da9324e-01e7-2a07-4bcd-14245db56693@xxxxxxxxx/ You suggested that. Other levels doesn't have dedicates has_failed() routines. For raid5 it is reasonable to set it in has_failed(). I can't do that in md_error because I don't have such information in all cases. !test_bit("Faulty", rdev->flags) result varies. > > > + return true; > > + } > > + return false; > > } > > > > struct stripe_head * > > @@ -2877,34 +2881,29 @@ static void raid5_error(struct mddev > > *mddev, struct md_rdev *rdev) unsigned long flags; > > pr_debug("raid456: error called\n"); > > > > + pr_crit("md/raid:%s: Disk failure on %s, disabling > > device.\n", > > + mdname(mddev), bdevname(rdev->bdev, b)); > > + > > spin_lock_irqsave(&conf->device_lock, flags); > > + set_bit(Faulty, &rdev->flags); > > + clear_bit(In_sync, &rdev->flags); > > + mddev->degraded = raid5_calc_degraded(conf); > > > > - if (test_bit(In_sync, &rdev->flags) && > > - mddev->degraded == conf->max_degraded) { > > - /* > > - * Don't allow to achieve failed state > > - * Don't try to recover this device > > - */ > > + if (has_failed(conf)) { > > conf->recovery_disabled = > > mddev->recovery_disabled; > > - spin_unlock_irqrestore(&conf->device_lock, flags); > > - return; > > Ok, I think commit fb73b357fb985cc652a72a41541d25915c7f9635 is > effectively reverted by this hunk. So I would prefer to separate the > revert part from this patch, just my 0.02$. Song, do you want revert? Thanks, Mariusz