On Thu, Apr 27, 2023 at 3:13 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote: > > Hi, > > 在 2023/04/27 14:53, Xiao Ni 写道: > >> for example, null-ptr-dereference: > >> > >> t1: t2: > >> raid10_write_request: > >> > >> // read rdev > >> rdev = conf->mirros[].rdev; > >> raid10_remove_disk > >> p = conf->mirros + number; > >> rdevp = &p->rdev; > >> // reset rdev > >> *rdevp = NULL > >> raid10_write_one_disk > >> // reread rdev got NULL > >> rdev = conf->mirrors[devnum].rdev > >> // null-ptr-dereference > >> mbio = bio_alloc_clone(rdev->bdev...) > >> synchronize_rcu() > > > > Hi Yu kuai > > > > raid10_write_request adds the rdev->nr_pending with rcu lock > > protection. Can this case happen? After adding ->nr_pending, the rdev > > can't be removed. > > The current rcu protection really is a mess, many places access rdev > after rcu_read_unlock()... Hi Yu Kuai It looks like a problem that is it safe to access rdev after adding rdev->nr_pending and rcu_read_unlock. Because besides raid10, other personalities still do in the same way. After reading the related codes, I have the same question. > > For the above case, noted that raid10_remove_disk is called before > nr_pending is increased, and raid10_write_one_disk() is called after > rcu_read_unlock(). > > t1: t2: > > raid10_write_request > rcu_read_lock > rdev = conf->mirros[].rdev > raid10_remove_disk > ...... > // nr_pending is 0, remove disk > // read inside rcu > rcu_read_unlock > > raid10_write_one_disk > // trigger null-ptr-dereference > synchronize_rcu() Function remove_and_add_spares sets RemoveSynchronized, calls synchronize_rcu and calls raid10_remove_disk. So the right position of synchronize_rcu in your case should be before raid10_remove_disk? But it still can't resolve the problem. raid10_remove_disk can set rdev to NULL between rcu_dereference and adding ->nr_pending raid10_write_request remove_and_add_spares set RemoveSynchronized synchronize_rcu rcu_read_lock rdev = rcu_dereference <------- *rdevp = NULL atomic_inc rdev->nr_pending rcu_read_unlock raid10_write_one_disk // trigger null-ptr-dereference Can we check RemoveSynchronized in pers->make_request? It can't submit bio to this rdev after synchronize_rcu. For the pers->hot_remove_disk(raid10_remove_disk) side, as you said in the second solution, maybe it needs a new method to know all pers->make_request(raid10_make_request) exit. Best Regards Xiao Ni > > Thanks, > Kuai > > > >> > >> for example, data loss: > >> > >> t1: > >> // assum that rdev is NULL, and replacement is not NULL > > > > How can trigger this? Could you give the detailed commands? > > > > Best Regards > > Xiao Ni >