On 2019/7/26 18:21, Guoqing Jiang wrote:
On 7/26/19 8:00 AM, Yufen Yu wrote:
We get a NULL pointer dereference oops when testing raid1 as follows:
mdadm -CR /dev/md1 -l 1 -n 2 /dev/sd[ab]
mdadm /dev/md1 -f /dev/sda
mdadm /dev/md1 -r /dev/sda
mdadm /dev/md1 -a /dev/sda
sleep 5
mdadm /dev/md1 -f /dev/sdb
mdadm /dev/md1 -r /dev/sdb
mdadm /dev/md1 -a /dev/sdb
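One thing the command list does not show: to actually hit the race, the array needs write
I/O in flight while the fail/remove/add steps run, since the dereference happens in the
write path (raid1_write_request(), see below). Any steady writer works; a minimal sketch of
one is below, assuming only the /dev/md1 device created above:

#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUF_SIZE	(64 * 1024)

int main(void)
{
	void *buf;
	off_t off;
	int fd;

	/* O_DIRECT so every write() is submitted to raid1 right away */
	fd = open("/dev/md1", O_WRONLY | O_DIRECT);
	if (fd < 0) {
		perror("open /dev/md1");
		return 1;
	}
	if (posix_memalign(&buf, 4096, BUF_SIZE)) {
		perror("posix_memalign");
		return 1;
	}
	memset(buf, 0xab, BUF_SIZE);

	/* keep rewriting the first 64MB of the test array until interrupted */
	for (;;) {
		for (off = 0; off < 1024L * BUF_SIZE; off += BUF_SIZE) {
			if (pwrite(fd, buf, BUF_SIZE, off) < 0)
				perror("pwrite");
		}
	}
}

Build it with something like gcc -O2 -o md1-writer md1-writer.c and leave it running in a
second shell for the whole mdadm sequence.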
After a disk (/dev/sda) has been removed, we add it to the raid
array again, which triggers a recovery action. Since the rdev's
current state is 'spare', read/write bios can be issued to the disk.
Then we set the other disk (/dev/sdb) faulty. Since the raid array
is now in a degraded state and /dev/sdb is the only 'In_sync' disk,
raid1_error() returns without actually marking it Faulty.
I don't think you can set sdb to faulty since it is the last working
device in raid1 per the comment, unless you did some internal change
already.
/*
* If it is not operational, then we have already marked it as dead
* else if it is the last working disks, ignore the error, let the
* next level up know.
* else mark the drive as failed
*/
Yes, we cannot set sdb faulty *since the raid array is now in a degraded state
and /dev/sdb is the only In_sync disk*. raid1_error() returns in the following code,
*but without actually marking the device Faulty*:
if (test_bit(In_sync, &rdev->flags)
    && (conf->raid_disks - mddev->degraded) == 1) {
	/*
	 * Don't fail the drive, act as though we were just a
	 * normal single drive.
	 * However don't try a recovery from this drive as
	 * it is very likely to fail.
	 */
	conf->recovery_disabled = mddev->recovery_disabled;
	spin_unlock_irqrestore(&conf->device_lock, flags);
	return;
}
Here, 'conf->recovery_disabled' is assigned the value of 'mddev->recovery_disabled',
and md_error() sets the 'MD_RECOVERY_INTR' flag on 'mddev'.
Then md_check_recovery() can get to remove_and_add_spares() and try to remove the disk.
raid1_remove_disk()
	if (test_bit(In_sync, &rdev->flags) ||
	    atomic_read(&rdev->nr_pending)) {
		err = -EBUSY;
		goto abort;
	}
	/* Only remove non-faulty devices if recovery
	 * is not possible.
	 */
	if (!test_bit(Faulty, &rdev->flags) &&
	    mddev->recovery_disabled != conf->recovery_disabled &&
	    mddev->degraded < conf->raid_disks) {
		err = -EBUSY;
		goto abort;
	}
	p->rdev = NULL;
Because conf->recovery_disabled has just been set equal to mddev->recovery_disabled, the
second check cannot 'goto abort' either, so the function goes on to set 'p->rdev = NULL',
which leads to the race condition addressed by this patch.
BTW, we have not changed any internal implementation of raid.
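To make the window concrete, here is a minimal userspace model of the check-then-use
pattern; nothing below is raid1 code, the mirror/rdev names only mimic
conf->mirrors[i].rdev. One thread plays raid1_write_request() (check the pointer,
dereference it a bit later), the other plays raid1_remove_disk() clearing p->rdev in
between:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct rdev {
	long data_offset;
};

struct mirror {
	struct rdev * volatile rdev;	/* volatile only so the toy re-reads it */
};

static struct rdev sda = { .data_offset = 2048 };
static struct mirror mirror = { .rdev = &sda };

/* plays raid1_write_request(): check rdev now, dereference it "later" */
static void *writer(void *arg)
{
	(void)arg;
	if (mirror.rdev != NULL) {		/* the check */
		sleep(1);			/* widen the race window */
		/* the later access: the remover may have run by now */
		printf("data_offset = %ld\n", mirror.rdev->data_offset);
	}
	return NULL;
}

/* plays raid1_remove_disk(): p->rdev = NULL */
static void *remover(void *arg)
{
	(void)arg;
	mirror.rdev = NULL;
	return NULL;
}

int main(void)
{
	pthread_t w, r;

	pthread_create(&w, NULL, writer, NULL);
	usleep(100 * 1000);		/* let the writer pass its check first */
	pthread_create(&r, NULL, remover, NULL);

	pthread_join(w, NULL);		/* the writer crashes on the NULL rdev here */
	pthread_join(r, NULL);
	return 0;
}

Built with gcc -pthread, the writer faults on its second access; in md, accessing
conf->mirrors[i].rdev after raid1_remove_disk() has cleared it is exactly the window the
patch has to close.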
However, that can interrupt the recovery action, and
md_check_recovery() will try to call remove_and_add_spares() to
remove the spare disk. The race shown below between
remove_and_add_spares() and raid1_write_request() can then
cause a NULL pointer dereference on conf->mirrors[i].rdev:
raid1_write_request()               md_check_recovery()        raid1_error()
  rcu_read_lock()
  rdev != NULL
  !test_bit(Faulty, &rdev->flags)
                                                               conf->recovery_disabled =
                                                                   mddev->recovery_disabled;
                                                               return busy
                                    remove_and_add_spares()
                                      raid1_remove_disk()
I don't quite follow here; raid1_remove_disk() is only called under some conditions:
if ((this == NULL || rdev == this) &&
    rdev->raid_disk >= 0 &&
    !test_bit(Blocked, &rdev->flags) &&
    ((test_bit(RemoveSynchronized, &rdev->flags) ||
     (!test_bit(In_sync, &rdev->flags) &&
      !test_bit(Journal, &rdev->flags))) &&
    atomic_read(&rdev->nr_pending) == 0))
So rdev->raid_disk should not be a negative value, and a spare disk, shown
as '(S)', should meet the following requirement:
if (rdev->raid_disk < 0)
	seq_printf(seq, "(S)"); /* spare */
Maybe I have not expressed this clearly. I observe the 'spare' state via
'/sys/block/md1/md/dev-sda/state'.
state_show()
	if (!test_bit(Faulty, &flags) &&
	    !test_bit(Journal, &flags) &&
	    !test_bit(In_sync, &flags))
		len += sprintf(page+len, "spare%s", sep);
When /dev/sda is added to the array again, hot_add_disk() just adds it to the mddev->disks list.
Then md_check_recovery() will invoke remove_and_add_spares(), which actually adds the rdev to the
conf->mirrors[i] array.
After that, rdev->raid_disk is no longer a negative value and we can issue bios to the disk.
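For what it's worth, this is easy to watch from userspace while the reproducer runs. A
trivial sketch, assuming the md1/sda names from the test above ('state' is the state_show()
output quoted earlier; 'slot' should print rdev->raid_disk, or 'none' while it is still
negative):

#include <stdio.h>
#include <unistd.h>

static void show(const char *path)
{
	char buf[64] = "";
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return;
	}
	if (fgets(buf, sizeof(buf), f))
		printf("%-40s %s", path, buf);	/* sysfs value already ends in \n */
	fclose(f);
}

int main(void)
{
	for (;;) {
		show("/sys/block/md1/md/dev-sda/state");
		show("/sys/block/md1/md/dev-sda/slot");
		putchar('\n');
		sleep(1);
	}
}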