Re: [question] solution for raid10 configuration concurrent with io

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

在 2023/04/28 17:11, Xiao Ni 写道:
On Thu, Apr 27, 2023 at 3:13 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:

Hi,

在 2023/04/27 14:53, Xiao Ni 写道:
for example, null-ptr-dereference:

t1:                             t2:
raid10_write_request:

    // read rdev
    rdev = conf->mirros[].rdev;
                                  raid10_remove_disk
                                   p = conf->mirros + number;
                                   rdevp = &p->rdev;
                                   // reset rdev
                                   *rdevp = NULL
    raid10_write_one_disk
     // reread rdev got NULL
     rdev = conf->mirrors[devnum].rdev
       // null-ptr-dereference
      mbio = bio_alloc_clone(rdev->bdev...)
                                   synchronize_rcu()

Hi Yu kuai

raid10_write_request adds the rdev->nr_pending with rcu lock
protection. Can this case happen? After adding ->nr_pending, the rdev
can't be removed.

The current rcu protection really is a mess, many places access rdev
after rcu_read_unlock()...

Hi Yu Kuai

It looks like a problem that is it safe to access rdev after adding
rdev->nr_pending and rcu_read_unlock. Because besides raid10, other
personalities still do in the same way. After reading the related
codes, I have the same question.

I think it's not safe, that's basically what the first solution try to
avoid, but it's just in raid10, and I'm not quite familiar with all the
personality but I think they can be fixed likewise. I'll keep looking
into the details. 😉



For the above case, noted that raid10_remove_disk is called before
nr_pending is increased, and raid10_write_one_disk() is called after
rcu_read_unlock().

t1:                             t2:

raid10_write_request
   rcu_read_lock
   rdev = conf->mirros[].rdev
                                 raid10_remove_disk
                                  ......
                                  // nr_pending is 0, remove disk
   // read inside rcu
   rcu_read_unlock

   raid10_write_one_disk
   // trigger null-ptr-dereference
                                 synchronize_rcu()

Function remove_and_add_spares sets RemoveSynchronized, calls
synchronize_rcu and calls raid10_remove_disk. So the right position of
synchronize_rcu in your case should be before raid10_remove_disk?

But it still can't resolve the problem. raid10_remove_disk can set
rdev to NULL between rcu_dereference and adding ->nr_pending

raid10_write_request                     remove_and_add_spares

                                                        set RemoveSynchronized
                                                        synchronize_rcu
rcu_read_lock
rdev = rcu_dereference
                                                      <-------     *rdevp = NULL
atomic_inc rdev->nr_pending
rcu_read_unlock

raid10_write_one_disk
// trigger null-ptr-dereference

Can we check RemoveSynchronized in pers->make_request? It can't submit
bio to this rdev after synchronize_rcu. For the
pers->hot_remove_disk(raid10_remove_disk) side, as you said in the
second solution, maybe it needs a new method to know all
pers->make_request(raid10_make_request) exit.


First of all, this is just one example for many existing problems. Then,
there are some contions to set RemoveSynchronized in
remove_and_add_spares, check the flag can't cover all the case.

I do prefer second solution, however, there will be a lot of changes,
I'll wait for more response before I start working on that.

Thanks,
Kuai




[Index of Archives]     [Linux RAID Wiki]     [ATA RAID]     [Linux SCSI Target Infrastructure]     [Linux Block]     [Linux IDE]     [Linux SCSI]     [Linux Hams]     [Device Mapper]     [Device Mapper Cryptographics]     [Kernel]     [Linux Admin]     [Linux Net]     [GFS]     [RPM]     [git]     [Yosemite Forum]


  Powered by Linux