Re: [PATCH-next] block: fix null-deref in percpu_ref_put

Ming Lei <ming.lei@xxxxxxxxxx> · Thu, 8 Dec 2022 16:55:26 +0800

On Wed, Dec 7, 2022 at 9:08 AM Dennis Zhou <dennis@xxxxxxxxxx> wrote:
>
> Hello,
>
> On Tue, Dec 06, 2022 at 05:09:39PM +0800, Zhong Jinghua wrote:
> > A problem was found in stable 5.10, and its root cause is as follows.
> >
> > For the q_usage_counter of a request_queue, blk_cleanup_queue uses
> > "wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter))"
> > to wait for q_usage_counter to become zero. However, if q_usage_counter
> > becomes zero quickly, percpu_ref_exit will execute and ref->data will
> > be freed, and another process may then hit a null-deref problem like
> > below:
> >
> >       CPU0                             CPU1
> > blk_mq_destroy_queue
> >  blk_freeze_queue
> >   blk_mq_freeze_queue_wait
> >                               scsi_end_request
> >                                percpu_ref_get
> >                                ...
> >                                percpu_ref_put
> >                                 atomic_long_sub_and_test
> >  blk_put_queue
> >   kobject_put
> >    kref_put
> >     blk_release_queue
> >      percpu_ref_exit
> >       ref->data -> NULL
> >                                  ref->data->release(ref) -> null-deref
> >
>
> I remember thinking about this a while ago. I don't think this fix works
> as nicely as it may seem. Please correct me if I'm wrong.
>
> q->q_usage_counter has the oddity that the lifetime of the percpu_ref
> object isn't managed by the release function. The freeing is handled by
> a separate path where it depends on the percpu_ref hitting 0. So here we
> have 2 concurrent paths racing to run with 1 destroying the object. We
> probably need blk_release_queue() to wait on percpu_ref's release
> finishing, not starting.
>
> I think the above works in this specific case because there is a
> call_rcu() in blk_release_queue(). If there wasn't a call_rcu(),
> then by the same logic we could delay ref->data->release(ref) further
> and that could potentially lead to a use after free.
>
> Ideally, I think fixing the race in q->q_usage_counter's pattern is
> better than masking it here as I think we're being saved by the
> call_rcu() call further down the object release path.

The problem is actually in percpu_ref_is_zero(), which can return true
before ->release() is called. Any wait_event(percpu_ref_is_zero())
pattern carries the same risk.
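
To make the window concrete, here is a minimal user-space sketch, not
kernel code. It assumes the counter is already in atomic mode and splits
the final put into its two steps so the state in between can be observed.
All model_* names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct model_ref;

/* Hypothetical stand-in for the data that ref->data points to. */
struct model_ref_data {
	atomic_long count;
	void (*release)(struct model_ref *ref);
};

struct model_ref {
	struct model_ref_data *data;
};

static int model_released;

static void model_release(struct model_ref *ref)
{
	(void)ref;
	model_released = 1;
}

/* Models percpu_ref_is_zero() in atomic mode: it only looks at the count. */
static bool model_ref_is_zero(struct model_ref *ref)
{
	return atomic_load(&ref->data->count) == 0;
}

/* Step 1 of the final put: the decrement. True if the count hit zero. */
static bool model_put_dec(struct model_ref *ref)
{
	return atomic_fetch_sub(&ref->data->count, 1) == 1;
}

/* Step 2 of the final put: call ->release() through ref->data. */
static void model_put_release(struct model_ref *ref)
{
	ref->data->release(ref);
}

/* Returns 1 if "is zero" was observed before ->release() had run. */
static int model_window_observed(void)
{
	struct model_ref_data data;
	struct model_ref ref = { .data = &data };
	bool last;
	int window;

	atomic_init(&data.count, 1);
	data.release = model_release;
	model_released = 0;

	last = model_put_dec(&ref);
	/* A waiter woken here may run the exit path and clear ref->data. */
	window = last && model_ref_is_zero(&ref) && !model_released;
	model_put_release(&ref);	/* would deref NULL if exit ran first */

	return window && model_released;
}
```

In this single-threaded model nothing clears ref.data inside the window,
so step 2 is safe; in the trace above, CPU0 does exactly that between the
two steps.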

It may not be easy to fix the issue cleanly in the block layer, but it
can be solved simply in percpu-refcount by adding a ->release_lock (a
spinlock) to the counter, held across atomic_long_sub_and_test() and
->release(), so that percpu_ref_exit() can take it to drain any
in-flight put. Or simply use percpu_ref_switch_lock.
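
A rough user-space sketch of the ->release_lock idea follows; it is not a
patch. It assumes the lock lives in the ref itself rather than in
ref->data, and it uses an atomic flag with a try-lock in place of a real
spinlock so that the interleaving can be stepped through in one thread.
All sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_ref;

struct sketch_ref_data {
	atomic_long count;
	void (*release)(struct sketch_ref *ref);
};

struct sketch_ref {
	struct sketch_ref_data *data;
	atomic_bool release_lock;	/* hypothetical; models a spinlock */
};

static int sketch_released;

static void sketch_release(struct sketch_ref *ref)
{
	(void)ref;
	sketch_released = 1;
}

/*
 * Exit path: clear ->data only while no put holds release_lock.
 * Returns false if a put is in flight; a real spinlock would spin here.
 */
static bool sketch_try_exit(struct sketch_ref *ref)
{
	if (atomic_exchange(&ref->release_lock, true))
		return false;
	ref->data = NULL;
	atomic_store(&ref->release_lock, false);
	return true;
}

/* Returns 1 if exit was held off until ->release() had finished. */
static int sketch_exit_is_drained(void)
{
	struct sketch_ref_data data;
	struct sketch_ref ref = { .data = &data };
	bool locked, last, exit_blocked, exit_ok;

	atomic_init(&data.count, 1);
	data.release = sketch_release;
	atomic_init(&ref.release_lock, false);
	sketch_released = 0;

	/* Put path: hold the lock across the decrement and ->release(). */
	locked = !atomic_exchange(&ref.release_lock, true);
	last = atomic_fetch_sub(&ref.data->count, 1) == 1;
	/* The count is zero now, but exit cannot clear ->data yet. */
	exit_blocked = !sketch_try_exit(&ref);
	if (last)
		ref.data->release(&ref);
	atomic_store(&ref.release_lock, false);

	/* With the put drained, exit goes through. */
	exit_ok = sketch_try_exit(&ref);

	return locked && last && exit_blocked && sketch_released &&
	       exit_ok && ref.data == NULL;
}
```

Taking a lock on every put in atomic mode has a cost; the sketch only
shows the ordering the lock is meant to enforce.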

Thanks,
Ming