Re: dm mirror: fix crash caused by NULL-pointer dereference

Mike Snitzer <snitzer@xxxxxxxxxx> · Mon, 26 Jun 2017 10:37:24 -0400

On Mon, Jun 26 2017 at  9:47am -0400,
Eric Ren <zren@xxxxxxxx> wrote:

> Hi,
> 
> On 06/26/2017 06:55 PM, Eric Ren wrote:
> >Hi Zdenek,
> >
> >
> >On 06/26/2017 05:46 PM, Zdenek Kabelac wrote:
> >>On 26.6.2017 at 11:08, Eric Ren wrote:
> >>>
> >>[... snip...]
> >>
> >>Hi
> >>
> >>Which kernel version is this?
> >>
> >>I thought we had already fixed this BZ for old mirrors:
> >>https://bugzilla.redhat.com/show_bug.cgi?id=1382382
> >>
> >>There is a similar BZ for md-raid based mirrors (--type raid1):
> >>https://bugzilla.redhat.com/show_bug.cgi?id=1416099
> >My base kernel version is 4.4.68, but with these two latest fixes applied:
> >
> >"""
> >Revert "dm mirror: use all available legs on multiple failures"
> >dm io: fix duplicate bio completion due to missing ref count
> 
> I am confused about this "dm io..." fix. The fix itself is good.
> 
> Without it, running mkfs.ext4 on a mirrored device whose primary mirror
> device has failed crashes the kernel on the discard operation issued by
> mkfs.ext4. However, mkfs.ext4 succeeds on a healthy mirrored device. This
> is what I don't understand: whether the mirrored device is healthy or not,
> there is always a duplicate bio completion without this fix, so
> write_callback() will be called twice, and the crash will occur in the
> second write_callback():

No, there is only a duplicate bio completion if the error path is taken
(e.g. underlying device doesn't support discard).

> """
> static void write_callback(unsigned long error, void *context)
> {
>         unsigned i;
>         struct bio *bio = (struct bio *) context;
>         struct mirror_set *ms;
>         int should_wake = 0;
>         unsigned long flags;
> 
>         ms = bio_get_m(bio)->ms;        ====> NULL pointer at the duplicate completion
>         bio_set_m(bio, NULL);
> """
> 
> Without this fix, I expected the DISCARD IO to always crash the
> kernel, but that does not happen when the mirrored device is healthy.
> I hope someone who happens to know the reason can give some hints ;-P

If the mirror is healthy then only one completion is returned to
dm-mirror (via write_callback).  The problem was that the error path wasn't
managing the reference count as needed, whereas dm-io's normal discard
IO path does.
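
To make the reference count point concrete, here is a minimal userspace
sketch in C.  It is not the dm-io code: struct io_model, dec_count() and
dispatch() as written here are hypothetical stand-ins, and it assumes that
a leg which cannot service the discard takes the error path while every
other leg completes normally after dispatch has finished.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one multi-region I/O; not the kernel structure. */
struct io_model {
	int count;     /* outstanding references */
	int callbacks; /* times the caller's completion callback fired */
};

/* Drop one reference; fire the callback when the count reaches exactly 0. */
static void dec_count(struct io_model *io)
{
	if (--io->count == 0)
		io->callbacks++;
}

/*
 * Issue a discard to every region and return how many times the
 * completion callback fired.  'fixed' selects whether the error path
 * takes its own reference before dropping one.
 */
static int dispatch(const bool *supports_discard, int nregions, bool fixed)
{
	struct io_model io = { .count = 1, .callbacks = 0 }; /* dispatcher's ref */
	int inflight = 0;

	assert(nregions >= 0);

	for (int i = 0; i < nregions; i++) {
		if (!supports_discard[i]) {
			/* Error path: report failure for this region. */
			if (fixed)
				io.count++; /* balance the dec_count() below */
			dec_count(&io);
			continue;
		}
		/* Normal path: one reference per submitted bio. */
		io.count++;
		inflight++;
	}

	dec_count(&io); /* drop the dispatcher's own reference */

	/* Submitted bios complete later; each drops its reference. */
	while (inflight-- > 0)
		dec_count(&io);

	return io.callbacks;
}
```

With both legs supporting discard, dispatch() reports a single callback
with or without the extra reference, which is why the healthy mirror never
crashed.  With the first leg on the error path, the unfixed variant
reports two callbacks; in dm-mirror the second one is the write_callback()
that finds bio_get_m(bio) already cleared.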

Mike

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel