Re: [PATCH V3 0/5] blk-mq: improvement on handling IO during CPU hotplug

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Tue, Oct 08, 2019 at 06:15:52PM +0100, John Garry wrote:
> On 08/10/2019 10:06, John Garry wrote:
> > On 08/10/2019 05:18, Ming Lei wrote:
> > > Hi,
> > > 
> > > Thomas mentioned:
> > >     "
> > >      That was the constraint of managed interrupts from the very
> > > beginning:
> > > 
> > >       The driver/subsystem has to quiesce the interrupt line and the
> > > associated
> > >       queue _before_ it gets shutdown in CPU unplug and not fiddle
> > > with it
> > >       until it's restarted by the core when the CPU is plugged in again.
> > >     "
> > > 
> > > But no drivers or blk-mq do that before one hctx becomes dead(all
> > > CPUs for one hctx are offline), and even it is worse, blk-mq stills tries
> > > to run hw queue after hctx is dead, see blk_mq_hctx_notify_dead().
> > > 
> > > This patchset tries to address the issue by two stages:
> > > 
> > > 1) add one new cpuhp state of CPUHP_AP_BLK_MQ_ONLINE
> > > 
> > > - mark the hctx as internal stopped, and drain all in-flight requests
> > > if the hctx is going to be dead.
> > > 
> > > 2) re-submit IO in the state of CPUHP_BLK_MQ_DEAD after the hctx
> > > becomes dead
> > > 
> > > - steal bios from the request, and resubmit them via
> > > generic_make_request(),
> > > then these IO will be mapped to other live hctx for dispatch
> > > 
> > > Please comment & review, thanks!
> > > 
> > > John, I don't add your tested-by tag since V3 have some changes,
> > > and I appreciate if you may run your test on V3.
> > > 
> > 
> > Will do, Thanks
> 
> Hi Ming,
> 
> I got this warning once:
> 
> [  162.558185] CPU10: shutdown
> [  162.560994] psci: CPU10 killed.
> [  162.593939] CPU9: shutdown
> [  162.596645] psci: CPU9 killed.
> [  162.625838] CPU8: shutdown
> [  162.628550] psci: CPU8 killed.
> [  162.685790] CPU7: shutdown
> [  162.688496] psci: CPU7 killed.
> [  162.725771] CPU6: shutdown
> [  162.728486] psci: CPU6 killed.
> [  162.753884] CPU5: shutdown
> [  162.756591] psci: CPU5 killed.
> [  162.785584] irq_shutdown
> [  162.788277] IRQ 800: no longer affine to CPU4
> [  162.793267] CPU4: shutdown
> [  162.795975] psci: CPU4 killed.
> [  162.849680] run queue from wrong CPU 13, hctx active
> [  162.849692] CPU3: shutdown
> [  162.854649] CPU: 13 PID: 874 Comm: kworker/3:2H Not tainted
> 5.4.0-rc1-00012-gad025dd3d001 #1098
> [  162.854653] Hardware name: Huawei D06 /D06, BIOS Hisilicon D06 UEFI RC0 -
> V1.16.01 03/15/2019
> [  162.857362] psci: CPU3 killed.
> [  162.866039] Workqueue: kblockd blk_mq_run_work_fn
> [  162.882281] Call trace:
> [  162.884716]  dump_backtrace+0x0/0x150
> [  162.888365]  show_stack+0x14/0x20
> [  162.891668]  dump_stack+0xb0/0xf8
> [  162.894970]  __blk_mq_run_hw_queue+0x11c/0x128
> [  162.899400]  blk_mq_run_work_fn+0x1c/0x28
> [  162.903397]  process_one_work+0x1e0/0x358
> [  162.907393]  worker_thread+0x40/0x488
> [  162.911042]  kthread+0x118/0x120
> [  162.914257]  ret_from_fork+0x10/0x18

What is the HBA? If it is Hisilicon SAS, it isn't strange, because
this patch can't fix single hw queue with multiple private reply queue
yet, that can be one follow-up job of this patchset.

Thanks,
Ming



[Index of Archives]     [Linux RAID]     [Linux SCSI]     [Linux ATA RAID]     [IDE]     [Linux Wireless]     [Linux Kernel]     [ATH6KL]     [Linux Bluetooth]     [Linux Netdev]     [Kernel Newbies]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Device Mapper]

  Powered by Linux