On Wed, Jun 05, 2024 at 03:20:34PM +0800, Changhui Zhong wrote:
> On Wed, Jun 5, 2024 at 9:41 AM Li Nan <linan666@xxxxxxxxxxxxxxx> wrote:
> >
> > On 2024/6/4 9:32, Changhui Zhong wrote:
> > > On Mon, Jun 3, 2024 at 10:20 AM Li Nan <linan666@xxxxxxxxxxxxxxx> wrote:
> > >>
> > >> On 2024/6/3 8:39, Ming Lei wrote:
> > >>
> > >> [...]
> > >>
> > >>>> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > >>>> index 4e159948c912..99b621b2d40f 100644
> > >>>> --- a/drivers/block/ublk_drv.c
> > >>>> +++ b/drivers/block/ublk_drv.c
> > >>>> @@ -2630,7 +2630,8 @@ static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq)
> > >>>>  {
> > >>>>  	int i;
> > >>>>
> > >>>> -	WARN_ON_ONCE(!(ubq->ubq_daemon && ubq_daemon_is_dying(ubq)));
> > >>>> +	if (WARN_ON_ONCE(!(ubq->ubq_daemon && ubq_daemon_is_dying(ubq))))
> > >>>> +		return;
> > >>>
> > >>> Yeah, it is one bug. However, it could be addressed by adding the check in
> > >>> ublk_ctrl_start_recovery() and returning immediately in case of a NULL
> > >>> ubq->ubq_daemon. What do you think about this way?
> > >>>
> > >>
> > >> Checking ub->nr_queues_ready seems better. How about:
> > >>
> > >> @@ -2662,6 +2662,8 @@ static int ublk_ctrl_start_recovery(struct ublk_device *ub,
> > >>  	mutex_lock(&ub->mutex);
> > >>  	if (!ublk_can_use_recovery(ub))
> > >>  		goto out_unlock;
> > >> +	if (!ub->nr_queues_ready)
> > >> +		goto out_unlock;
> > >>  	/*
> > >>  	 * START_RECOVERY is only allowd after:
> > >>  	 *
> > >>
> > >>> Thanks,
> > >>> Ming
> > >>
> > >> --
> > >> Thanks,
> > >> Nan
> > >
> > > Hi, Nan
> > >
> > > After applying your new patch, I did not trigger the "NULL pointer
> > > dereference" or the "Warning", but I hit a task hung "Call Trace";
> > > please check:
> > >
> > > [13617.812306] running generic/004
> > > [13622.293674] blk_print_req_error: 91 callbacks suppressed
> > > [13622.293681] I/O error, dev ublkb4, sector 233256 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
> > > [13622.308145] I/O error, dev ublkb4, sector 233256 op 0x0:(READ) flags 0x0 phys_seg 2 prio class 0
> > > [13622.316923] I/O error, dev ublkb4, sector 233264 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
> > > [13622.326048] I/O error, dev ublkb4, sector 233272 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
> > > [13622.334828] I/O error, dev ublkb4, sector 233272 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
> > > [13622.343954] I/O error, dev ublkb4, sector 233312 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
> > > [13622.352733] I/O error, dev ublkb4, sector 233008 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
> > > [13622.361514] I/O error, dev ublkb4, sector 233112 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
> > > [13622.370292] I/O error, dev ublkb4, sector 233192 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
> > > [13622.379419] I/O error, dev ublkb4, sector 233120 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
> > > [13641.069695] INFO: task fio:174413 blocked for more than 122 seconds.
> > > [13641.076061]       Not tainted 6.10.0-rc1+ #1
> > > [13641.080338] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [13641.088164] task:fio state:D stack:0 pid:174413 tgid:174413 ppid:174386 flags:0x00004002
> > > [13641.088168] Call Trace:
> > > [13641.088170]  <TASK>
> > > [13641.088171]  __schedule+0x221/0x670
> > > [13641.088177]  schedule+0x23/0xa0
> > > [13641.088179]  io_schedule+0x42/0x70
> > > [13641.088181]  blk_mq_get_tag+0x118/0x2b0
> > > [13641.088185]  ? gup_fast_pgd_range+0x280/0x370
> > > [13641.088188]  ? __pfx_autoremove_wake_function+0x10/0x10
> > > [13641.088192]  __blk_mq_alloc_requests+0x194/0x3a0
> > > [13641.088194]  blk_mq_submit_bio+0x241/0x6c0
> > > [13641.088196]  __submit_bio+0x8a/0x1f0
> > > [13641.088199]  submit_bio_noacct_nocheck+0x168/0x250
> > > [13641.088201]  ? submit_bio_noacct+0x45/0x560
> > > [13641.088203]  __blkdev_direct_IO_async+0x167/0x1a0
> > > [13641.088206]  blkdev_write_iter+0x1c8/0x270
> > > [13641.088208]  aio_write+0x11c/0x240
> > > [13641.088212]  ? __rq_qos_issue+0x21/0x40
> > > [13641.088214]  ? blk_mq_start_request+0x34/0x1a0
> > > [13641.088216]  ? io_submit_one+0x68/0x380
> > > [13641.088218]  ? kmem_cache_alloc_noprof+0x4e/0x320
> > > [13641.088221]  ? fget+0x7c/0xc0
> > > [13641.088224]  ? io_submit_one+0xde/0x380
> > > [13641.088226]  io_submit_one+0xde/0x380
> > > [13641.088228]  __x64_sys_io_submit+0x80/0x160
> > > [13641.088229]  do_syscall_64+0x79/0x150
> > > [13641.088233]  ? syscall_exit_to_user_mode+0x6c/0x1f0
> > > [13641.088237]  ? do_io_getevents+0x8b/0xe0
> > > [13641.088238]  ? syscall_exit_work+0xf3/0x120
> > > [13641.088241]  ? syscall_exit_to_user_mode+0x6c/0x1f0
> > > [13641.088243]  ? do_syscall_64+0x85/0x150
> > > [13641.088245]  ? do_syscall_64+0x85/0x150
> > > [13641.088247]  ? blk_mq_flush_plug_list.part.0+0x108/0x160
> > > [13641.088249]  ? rseq_get_rseq_cs+0x1d/0x220
> > > [13641.088252]  ? rseq_ip_fixup+0x6d/0x1d0
> > > [13641.088254]  ? blk_finish_plug+0x24/0x40
> > > [13641.088256]  ? syscall_exit_to_user_mode+0x6c/0x1f0
> > > [13641.088258]  ? do_syscall_64+0x85/0x150
> > > [13641.088260]  ? syscall_exit_to_user_mode+0x6c/0x1f0
> > > [13641.088262]  ? do_syscall_64+0x85/0x150
> > > [13641.088264]  ? syscall_exit_to_user_mode+0x6c/0x1f0
> > > [13641.088266]  ? do_syscall_64+0x85/0x150
> > > [13641.088268]  ? do_syscall_64+0x85/0x150
> > > [13641.088270]  ? do_syscall_64+0x85/0x150
> > > [13641.088272]  ? clear_bhb_loop+0x45/0xa0
> > > [13641.088275]  ? clear_bhb_loop+0x45/0xa0
> > > [13641.088277]  ? clear_bhb_loop+0x45/0xa0
> > > [13641.088279]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > > [13641.088281] RIP: 0033:0x7ff92150713d
> > > [13641.088283] RSP: 002b:00007ffca1ef81f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
> > > [13641.088285] RAX: ffffffffffffffda RBX: 00007ff9217e2f70 RCX: 00007ff92150713d
> > > [13641.088286] RDX: 000055863b694fe0 RSI: 0000000000000010 RDI: 00007ff92164d000
> > > [13641.088287] RBP: 00007ff92164d000 R08: 00007ff91936d000 R09: 0000000000000180
> > > [13641.088288] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000010
> > > [13641.088289] R13: 0000000000000000 R14: 000055863b694fe0 R15: 000055863b6970c0
> > > [13641.088291]  </TASK>
> > >
> > > Thanks,
> > > Changhui
> >
> > After applying the previous patch, will the test environment continue to
> > execute test cases after the WARN?
>
> A few days ago, testing with the previous patch, the test environment
> continued to execute test cases after the WARN, and I terminated the test
> when I observed a WARN, so I did not observe the subsequent situation.
> > I am not sure whether this issue has always existed but was not tested
> > because of the WARN, or whether the new patch introduced it.
>
> Today I re-tested the previous patch and let it run for a long time; I
> observed the WARN and the task hung. It looks like this issue already
> existed and was not introduced by the new patch.

Hi Changhui,

The hang is actually expected because recovery fails.

Please pull the latest ublksrv and check whether the issue can still be
reproduced:

https://github.com/ublk-org/ublksrv

BTW, one ublksrv segfault and two test cleanup issues are fixed.

Thanks,
Ming
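For readers following the two proposals in this thread, below is a condensed,
untested sketch of how the two guards relate. Only the lines already shown in
the quoted hunks come from the thread; the surrounding shape of
ublk_ctrl_start_recovery() (its io_uring_cmd parameter, the per-queue loop
over ub->dev_info.nr_hw_queues, and the elided state checks) is assumed here
for illustration and may not match the actual ublk_drv.c.

/* Untested sketch only -- not a patch against the real driver. */

static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq)
{
	/* First proposal: turn the WARN into an early return. */
	if (WARN_ON_ONCE(!(ubq->ubq_daemon && ubq_daemon_is_dying(ubq))))
		return;

	/* ... per-queue reinit elided ... */
}

static int ublk_ctrl_start_recovery(struct ublk_device *ub,
		struct io_uring_cmd *cmd)
{
	int ret = -EINVAL;
	int i;

	mutex_lock(&ub->mutex);
	if (!ublk_can_use_recovery(ub))
		goto out_unlock;

	/*
	 * Second proposal: if no queue ever became ready, no ubq_daemon was
	 * ever set, so bail out before the reinit loop below can touch it.
	 */
	if (!ub->nr_queues_ready)
		goto out_unlock;

	/* ... existing START_RECOVERY state checks elided ... */

	for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
		ublk_queue_reinit(ub, ublk_get_queue(ub, i));

	/* ... remaining recovery setup elided ... */
	ret = 0;
out_unlock:
	mutex_unlock(&ub->mutex);
	return ret;
}

Either guard avoids dereferencing a NULL ubq->ubq_daemon when START_RECOVERY
is issued before any queue has registered a daemon; the nr_queues_ready check
simply stops the recovery path earlier, before the reinit loop runs, which is
the behaviour Li Nan prefers above.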