On Wed, May 29, 2019 at 6:11 PM Ming Lei <ming.lei@xxxxxxxxxx> wrote:
>
> On Wed, May 29, 2019 at 10:42:00AM +0100, John Garry wrote:
> > On 29/05/2019 03:42, Ming Lei wrote:
> > > On Wed, May 29, 2019 at 10:28:52AM +0800, Ming Lei wrote:
> > > > On Tue, May 28, 2019 at 05:50:40PM +0100, John Garry wrote:
> > > > > On 27/05/2019 16:02, Ming Lei wrote:
> > > > > > Managed interrupts cannot migrate their affinity when their CPUs
> > > > > > go offline. If a CPU is allowed to shut down before the commands
> > > > > > dispatched to its managed queues are returned, those commands
> > > > > > won't be able to complete through their irq handlers.
> > > > > >
> > > > > > Wait in the cpu hotplug handler until all inflight requests on
> > > > > > the tags are completed or a timeout expires. Wait once per tags
> > > > > > instance, so we can save time in the case of shared tags.
> > > > > >
> > > > > > Based on the following patch from Keith, but using a simple
> > > > > > delay-spin instead:
> > > > > >
> > > > > > https://lore.kernel.org/linux-block/20190405215920.27085-1-keith.busch@xxxxxxxxx/
> > > > > >
> > > > > > Some SCSI devices may have a single blk_mq hw queue and multiple
> > > > > > private completion queues; in that case, wait until all requests
> > > > > > on the private completion queue are completed.
> > > > >
> > > > > Hi Ming,
> > > > >
> > > > > I'm a bit concerned that this approach won't work due to ordering:
> > > > > it seems that the IRQ would be shut down prior to the CPU dead
> > > > > notification for the
> > > >
> > > > Managed IRQ shutdown is run in irq_migrate_all_off_this_cpu(), which is
> > > > called in the callback of takedown_cpu(). And the CPU dead notification
> > > > is always sent after that CPU becomes offline, see cpuhp_invoke_callback().
> > >
> > > Hmm, looks like we are both saying the same thing.
> > >
> > > Yeah, it is too late to drain requests in the cpu hotplug DEAD handler;
> > > maybe we can try to move managed IRQ shutdown to after sending the dead
> > > notification.
> >
> > Even if the IRQ is shut down later, all of its CPUs would still be dead,
> > so none would be available to receive the interrupt or do the work of
> > draining the queue.
> >
> > > I need to think about it further.
> >
> > It would seem that we just need to be informed of CPU offlining earlier,
> > and plug the drain in there.
>
> Yes, it looks like blk-mq has to be notified before the CPU is unplugged
> to handle this issue.
>
> And we should be careful to handle the multiple reply queue case, given
> that the queue shouldn't be stopped or quiesced while other reply queues
> are still active.
>
> The new CPUHP state for blk-mq should be invoked after the to-be-offline
> CPU is quiesced and before it becomes offline.

Hi John,

Thinking about this issue further, so far one doable solution is to expose
reply queues as blk-mq hw queues, as done by the following patchset:

https://lore.kernel.org/linux-block/20180205152035.15016-1-ming.lei@xxxxxxxxxx/

In that series, global host-wide tags are shared across all blk-mq hw
queues. It also lets us remove all the reply_map stuff in drivers and then
solve the problem of draining in-flight requests during CPU unplug in a
generic way.

Last time, it was reported that the patchset causes a performance
regression, which is actually caused by duplicated IO accounting in
blk_mq_queue_tag_busy_iter() and should be easy to fix.

What do you think of this approach?

Thanks,
Ming Lei
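
P.S. Just to illustrate the per-tags delay-spin drain mentioned in the
quoted commit message, here is a rough, untested sketch (not the posted
patch). blk_mq_drain_tags() and blk_mq_tags_inflight_rqs() are made-up
names; the latter stands for whatever helper ends up counting in-flight
requests on a tags instance (for example via the busy tag iterator), while
msleep()/jiffies/time_before() are the usual kernel APIs. It assumes it
would live somewhere like block/blk-mq.c:

        /*
         * Spin with a short delay until no request on this tags instance
         * is in flight, or give up once the timeout expires.  Called once
         * per tags instance, so shared tags are only waited on once.
         */
        static void blk_mq_drain_tags(struct blk_mq_tags *tags)
        {
                unsigned long deadline = jiffies + msecs_to_jiffies(5000);

                while (blk_mq_tags_inflight_rqs(tags) &&
                       time_before(jiffies, deadline))
                        msleep(5);
        }

Such a helper would then be called from the new CPUHP handler before the
CPU actually goes offline, which matches the ordering discussed above.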