Re: [PATCH] md/raid1: fix NULL pointer dereference

NeilBrown <neilb@xxxxxxxx> · Sat, 24 Feb 2018 07:05:17 +1100

On Fri, Feb 23 2018, yuyufen wrote:

> We have a raid1 BUG, as follow:
>
> [  212.790981] BUG: unable to handle kernel NULL pointer dereference at 
> 0000000000000140
> [  212.791802] IP: [<ffffffff815bbbbd>] raid1d+0x2bd/0xfc0
> [  212.792334] PGD 12ab52067 PUD 12f587067 PMD 0
> [  212.792816] Oops: 0000 [#1] SMP
> [  212.793151] Modules linked in:
> [  212.793463] CPU: 1 PID: 2008 Comm: md3_raid1 Not tainted 4.1.44+ #130
> [  212.794084] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), 
> BIOS 1.10.2-1.fc26 04/01/2014
> [  212.794916] task: ffff8800b0820000 ti: ffff8800b5424000 task.ti: 
> ffff8800b5424000
> [  212.795621] RIP: 0010:[<ffffffff815bbbbd>] [<ffffffff815bbbbd>] 
> raid1d+0x2bd/0xfc0
> [  212.796361] RSP: 0018:ffff8800b5427d28  EFLAGS: 00010246
> [  212.796869] RAX: 0000000000000008 RBX: ffff8800bacfee40 RCX: 
> 0000000000000000
> [  212.798968] RDX: 0000000000000000 RSI: ffff88013fc4d528 RDI: 
> ffff88013fc4d528
> [  212.800965] RBP: ffff8800b5427e38 R08: 000000000000000a R09: 
> 00000000000006b0
> [  212.800965] R10: 00000000000006b0 R11: ffffffff81e7d86d R12: 
> 0000000000000008
> [  212.800965] R13: 0000000000000008 R14: 0000000000000000 R15: 
> ffff8801353ad040
> [  212.800965] FS:  0000000000000000(0000) GS:ffff88013fc40000(0000) 
> knlGS:0000000000000000
> [  212.800965] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  212.800965] CR2: 0000000000000140 CR3: 0000000096827000 CR4: 
> 00000000000006e0
> [  212.800965] Stack:
> [  212.800965]  0000000000000000 ffff8800b0820000 0000000000000001 
> ffff8800bacfee78
> [  212.800965]  ffff8800bacfee40 ffff8801353ad000 ffff880136f91800 
> 00000000000000e0
> [  212.800965]  000000012f54d970 ffff8800bacfee40 ffff880000000007 
> 0000000000000001
> [  212.800965] Call Trace:
> [  212.800965]  [<ffffffff81744677>] ? schedule+0x37/0x90
> [  212.800965]  [<ffffffff810da063>] ? prepare_to_wait_event+0x83/0xf0
> [  212.800965]  [<ffffffff815daa14>] md_thread+0x144/0x150
> [  212.800965]  [<ffffffff810da3a0>] ? wake_atomic_t_function+0x70/0x70
> [  212.800965]  [<ffffffff815da8d0>] ? md_start_sync+0xf0/0xf0
> [  212.800965]  [<ffffffff810b5508>] kthread+0xd8/0xf0
> [  212.800965]  [<ffffffff810b5430>] ? kthread_worker_fn+0x160/0x160
> [  212.800965]  [<ffffffff81748142>] ret_from_fork+0x42/0x70
> [  212.800965]  [<ffffffff810b5430>] ? kthread_worker_fn+0x160/0x160
> [  212.800965] Code: e8 48 c1 e1 05 45 8b 67 d8 48 89 8d 28 ff ff ff 48 
> 8b 18 48 89 85 50 ff ff ff 48 8b 43 08 4c 8b 34 08 4d 85
> f6 0f 84 a2 0b 00 00 <41> 8b 8e 40 01 00 00 85 c9 0f 88 2e 02 00 00 49 
> 8b 46 30 48 8b
> [  212.800965] RIP  [<ffffffff815bbbbd>] raid1d+0x2bd/0xfc0
> [  212.800965]  RSP <ffff8800b5427d28>
> [  212.800965] CR2: 0000000000000140
> [  212.800965] ---[ end trace 44e58fa3f2b31612 ]---
> [  212.800965] Kernel panic - not syncing: Fatal exception
>
> (gdb) l *(raid1d+0x2bd)
> 0xffffffff815bbbbd is in raid1d (drivers/md/raid1.c:2225).
> 2225            if (rdev->badblocks.shift < 0)
> 2226                    return 0;
> 2227
> 2228            block_sectors = roundup(1 << rdev->badblocks.shift,
> 2229 bdev_logical_block_size(rdev->bdev) >> 9);
>
> At line 2225, rdev is NULL, it causes a NULL pointer dereference BUG.
> Call path: raid1d()->handle_write_finished()->narrow_write_error().
> In handle_write_finished(), if r1_bio->bios[m] != NULL, it thinks
> conf->mirrors[m].rdev is also not NULL. But it is not always true.
>
> Even if some io hold replacement rdev(i.e. rdev->nr_pending.count > 0),
> raid1_remove_disk() can also set replacement rdev as NULL.
> So, when r1bio->bios[i] is not NULL and hold the rdev, but, it's 
> corresponding
> conf->mirrors[i].rdev is NULL, will cause NULL pointer dereference.
>
> Signed-off-by: Yufen Yu <yuyufen@xxxxxxxxxx>
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 6df398e3a008..f1d2684f62e5 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -1798,6 +1798,16 @@ static int raid1_remove_disk(struct mddev *mddev, 
> struct md_rdev *rdev)
>                          struct md_rdev *repl =
>                                  conf->mirrors[conf->raid_disks + 
> number].rdev;
>                          freeze_array(conf, 0);
> +                       if (atomic_read(&repl->nr_pending)) {
> +                               /* It means that some queued IO into 
> retry_list hold repl.
> +                                * Thus, we cannot set replacement as 
> NULL, avoiding
> +                                * rdev NULL pointer dereference in 
> sync_request_write
> +                                * and handle_write_finished.
> +                                */
> +                               err = -EBUSY;
> +                               unfreeze_array(conf);
> +                               goto abort;
> +                       }
>                          clear_bit(Replacement, &repl->flags);
>                          p->rdev = repl;
>                          conf->mirrors[conf->raid_disks + number].rdev = 
> NULL;
>

Thanks for the report and the patch.
I agree that something is needed and your patch is probably the best
solution.
I'm beginning to think that the current protocol for clearing
->mirrors[x]->rdev is less than ideal.  It seems too easy to get it
wrong.
I'm not sure yet what a better approach would look like.
Anyway:

  Reviewed-by: NeilBrown <neilb@xxxxxxxx>

Thanks,
NeilBrown

Attachment:
signature.asc

Description: PGP signature