Hi,
在 2023/04/28 17:11, Xiao Ni 写道:
On Thu, Apr 27, 2023 at 3:13 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
Hi,
在 2023/04/27 14:53, Xiao Ni 写道:
for example, null-ptr-dereference:
t1: t2:
raid10_write_request:
// read rdev
rdev = conf->mirros[].rdev;
raid10_remove_disk
p = conf->mirros + number;
rdevp = &p->rdev;
// reset rdev
*rdevp = NULL
raid10_write_one_disk
// reread rdev got NULL
rdev = conf->mirrors[devnum].rdev
// null-ptr-dereference
mbio = bio_alloc_clone(rdev->bdev...)
synchronize_rcu()
Hi Yu kuai
raid10_write_request adds the rdev->nr_pending with rcu lock
protection. Can this case happen? After adding ->nr_pending, the rdev
can't be removed.
The current rcu protection really is a mess, many places access rdev
after rcu_read_unlock()...
Hi Yu Kuai
It looks like a problem that is it safe to access rdev after adding
rdev->nr_pending and rcu_read_unlock. Because besides raid10, other
personalities still do in the same way. After reading the related
codes, I have the same question.
I think it's not safe, that's basically what the first solution try to
avoid, but it's just in raid10, and I'm not quite familiar with all the
personality but I think they can be fixed likewise. I'll keep looking
into the details. 😉
For the above case, noted that raid10_remove_disk is called before
nr_pending is increased, and raid10_write_one_disk() is called after
rcu_read_unlock().
t1: t2:
raid10_write_request
rcu_read_lock
rdev = conf->mirros[].rdev
raid10_remove_disk
......
// nr_pending is 0, remove disk
// read inside rcu
rcu_read_unlock
raid10_write_one_disk
// trigger null-ptr-dereference
synchronize_rcu()
Function remove_and_add_spares sets RemoveSynchronized, calls
synchronize_rcu and calls raid10_remove_disk. So the right position of
synchronize_rcu in your case should be before raid10_remove_disk?
But it still can't resolve the problem. raid10_remove_disk can set
rdev to NULL between rcu_dereference and adding ->nr_pending
raid10_write_request remove_and_add_spares
set RemoveSynchronized
synchronize_rcu
rcu_read_lock
rdev = rcu_dereference
<------- *rdevp = NULL
atomic_inc rdev->nr_pending
rcu_read_unlock
raid10_write_one_disk
// trigger null-ptr-dereference
Can we check RemoveSynchronized in pers->make_request? It can't submit
bio to this rdev after synchronize_rcu. For the
pers->hot_remove_disk(raid10_remove_disk) side, as you said in the
second solution, maybe it needs a new method to know all
pers->make_request(raid10_make_request) exit.
First of all, this is just one example for many existing problems. Then,
there are some contions to set RemoveSynchronized in
remove_and_add_spares, check the flag can't cover all the case.
I do prefer second solution, however, there will be a lot of changes,
I'll wait for more response before I start working on that.
Thanks,
Kuai