Re: [question] solution for raid10 configuration concurrent with io

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Apr 27, 2023 at 3:13 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> 在 2023/04/27 14:53, Xiao Ni 写道:
> >> for example, null-ptr-dereference:
> >>
> >> t1:                             t2:
> >> raid10_write_request:
> >>
> >>    // read rdev
> >>    rdev = conf->mirros[].rdev;
> >>                                  raid10_remove_disk
> >>                                   p = conf->mirros + number;
> >>                                   rdevp = &p->rdev;
> >>                                   // reset rdev
> >>                                   *rdevp = NULL
> >>    raid10_write_one_disk
> >>     // reread rdev got NULL
> >>     rdev = conf->mirrors[devnum].rdev
> >>       // null-ptr-dereference
> >>      mbio = bio_alloc_clone(rdev->bdev...)
> >>                                   synchronize_rcu()
> >
> > Hi Yu kuai
> >
> > raid10_write_request adds the rdev->nr_pending with rcu lock
> > protection. Can this case happen? After adding ->nr_pending, the rdev
> > can't be removed.
>
> The current rcu protection really is a mess, many places access rdev
> after rcu_read_unlock()...

Hi Yu Kuai

It looks like a problem that is it safe to access rdev after adding
rdev->nr_pending and rcu_read_unlock. Because besides raid10, other
personalities still do in the same way. After reading the related
codes, I have the same question.

>
> For the above case, noted that raid10_remove_disk is called before
> nr_pending is increased, and raid10_write_one_disk() is called after
> rcu_read_unlock().
>
> t1:                             t2:
>
> raid10_write_request
>   rcu_read_lock
>   rdev = conf->mirros[].rdev
>                                 raid10_remove_disk
>                                  ......
>                                  // nr_pending is 0, remove disk
>   // read inside rcu
>   rcu_read_unlock
>
>   raid10_write_one_disk
>   // trigger null-ptr-dereference
>                                 synchronize_rcu()

Function remove_and_add_spares sets RemoveSynchronized, calls
synchronize_rcu and calls raid10_remove_disk. So the right position of
synchronize_rcu in your case should be before raid10_remove_disk?

But it still can't resolve the problem. raid10_remove_disk can set
rdev to NULL between rcu_dereference and adding ->nr_pending

raid10_write_request                     remove_and_add_spares

                                                       set RemoveSynchronized
                                                       synchronize_rcu
rcu_read_lock
rdev = rcu_dereference
                                                     <-------     *rdevp = NULL
atomic_inc rdev->nr_pending
rcu_read_unlock

raid10_write_one_disk
// trigger null-ptr-dereference

Can we check RemoveSynchronized in pers->make_request? It can't submit
bio to this rdev after synchronize_rcu. For the
pers->hot_remove_disk(raid10_remove_disk) side, as you said in the
second solution, maybe it needs a new method to know all
pers->make_request(raid10_make_request) exit.


Best Regards
Xiao Ni

>
> Thanks,
> Kuai
> >
> >>
> >> for example, data loss:
> >>
> >> t1:
> >> // assum that rdev is NULL, and replacement is not NULL
> >
> > How can trigger this? Could you give the detailed commands?
> >
> > Best Regards
> > Xiao Ni
>





[Index of Archives]     [Linux RAID Wiki]     [ATA RAID]     [Linux SCSI Target Infrastructure]     [Linux Block]     [Linux IDE]     [Linux SCSI]     [Linux Hams]     [Device Mapper]     [Device Mapper Cryptographics]     [Kernel]     [Linux Admin]     [Linux Net]     [GFS]     [RPM]     [git]     [Yosemite Forum]


  Powered by Linux