On 2/17/20 12:55 PM, Kashyap Desai wrote:
> High-performance HBAs under the SCSI layer can reach more than 3.0M IOPS;
> the MegaRAID Aero controller can achieve 3.3M IOPS, and future requirements
> may push past 6.0M IOPS.
> One of the key bottlenecks is serving an interrupt for each IO completion.
> The block layer has the blk_poll interface, which can be used as a
> zero-interrupt poll queue. Extending blk_poll to the SCSI midlayer helps;
> with it I was able to reach the same max IOPS as the nvme <poll_queues>
> interface.
>
> blk_poll is currently tied to the io_uring interface, and adopting it
> requires application changes.
>
> This RFC covers the logic of handling irq polling in the driver using the
> threaded ISR interface. The changes in this RFC are described below:
>
> - Use the threaded ISR interface.
>   - The primary ISR handler runs from hard-irq context.
>   - The secondary ISR handler runs from thread context.
> - The driver drains the reply queue from the primary ISR handler for every
>   interrupt it receives.
> - The primary handler decides whether to call the secondary handler.
>   This interface can be optimized later if the driver or the block layer
>   keeps submission and completion stats per h/w queue. The current
>   megaraid_sas driver is single-h/w-queue based, so I picked the following
>   heuristic: if a SCSI device has more than 8 outstanding commands, mark
>   that msix index as "attempt_irq_poll".
> - Every time the secondary ISR handler runs, the primary handler disables
>   the IRQ; once the secondary handler completes its work, it re-enables
>   the IRQ. If there is no completion, wait for some time and retry
>   polling, since enabling/disabling an irq is an expensive operation.
>   Without this wait in threaded IRQ polling, the submitter would get no
>   CPU time to pump more IO.
>
> The NVMe driver is also trying something similar to reduce ISR overhead;
> discussion started in Dec-2019:
> https://lore.kernel.org/linux-nvme/20191209175622.1964-1-kbusch@xxxxxxxxxx/

I would actually like to have something more generic; threaded irq polling
looks like something most high-performance drivers would benefit from. So I
think it might be worthwhile to post a topic for LSF/MM to have a broader
discussion.

Thing is, I wonder if it wouldn't be more efficient for high-performance
devices to first try for completions in-line, ie start with polling _first_,
then enable the interrupt handler, and then shift back to polling for more
completions. But this would involve timeouts which would probably need to be
tweaked per hardware/driver; one could even look into disabling individual
stages completely (if you disable the first and the last stage you are back
to the original implementation; if you disable only the first, it is the
algorithm you proposed).

But as I said, that probably warrants a wider discussion.

Cheers,

Hannes
--
Dr. Hannes Reinecke            Kernel Storage Architect
hare@xxxxxxx                   +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer
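
For reference, a minimal sketch of the primary/secondary handler split the
RFC describes, assuming a hypothetical driver context: the mydrv_* names,
the stubbed reply processing, and the 50-100us backoff window are
illustrative assumptions, not the actual megaraid_sas code.

#include <linux/interrupt.h>
#include <linux/delay.h>
#include <linux/atomic.h>

/* Hypothetical per-vector context; names are illustrative only. */
struct mydrv_queue {
	int irq;
	atomic_t outstanding;	/* commands in flight on this vector */
	/* reply ring pointers etc. elided */
};

#define MYDRV_POLL_THRESHOLD	8	/* outstanding-command heuristic from the RFC */

/* Stub: walk the reply ring, complete commands, decrement q->outstanding.
 * Hardware-specific; returns the number of completions reaped.
 */
static int mydrv_process_replies(struct mydrv_queue *q)
{
	return 0;
}

static bool mydrv_should_poll(struct mydrv_queue *q)
{
	return atomic_read(&q->outstanding) > MYDRV_POLL_THRESHOLD;
}

/* Primary handler, hard-irq context: always drain the reply queue once,
 * then decide whether to hand off to the polling thread.
 */
static irqreturn_t mydrv_isr(int irq, void *data)
{
	struct mydrv_queue *q = data;

	mydrv_process_replies(q);

	if (!mydrv_should_poll(q))
		return IRQ_HANDLED;

	/* Mask the vector while the thread polls; _nosync because we are
	 * running inside the handler for this very irq line.
	 */
	disable_irq_nosync(irq);
	return IRQ_WAKE_THREAD;
}

/* Secondary handler, thread context: poll until the backlog drains,
 * sleeping briefly on empty runs instead of bouncing enable/disable_irq,
 * which is expensive and would starve the submitting CPU.
 */
static irqreturn_t mydrv_isr_thread(int irq, void *data)
{
	struct mydrv_queue *q = data;

	while (mydrv_should_poll(q)) {
		if (!mydrv_process_replies(q))
			usleep_range(50, 100);	/* backoff window: assumed */
	}

	enable_irq(irq);
	return IRQ_HANDLED;
}

static int mydrv_setup_irq(struct mydrv_queue *q)
{
	/* No IRQF_ONESHOT: the handlers mask and unmask the vector
	 * themselves, as the RFC describes, so the line stays disabled
	 * for a whole polling run rather than per hard-irq invocation.
	 */
	return request_threaded_irq(q->irq, mydrv_isr, mydrv_isr_thread,
				    0, "mydrv", q);
}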