On Mon, Sep 09, 2019 at 08:10:07PM -0700, Sagi Grimberg wrote:
> Hey Ming,
>
> > > > Ok, so the real problem is per-cpu bounded tasks.
> > > >
> > > > I share Thomas opinion about a NAPI like approach.
> > >
> > > We already have that, it's irq_poll, but it seems that for this
> > > use-case we get lower performance for some reason. I'm not
> > > entirely sure why that is, maybe it's because we need to mask
> > > interrupts because we don't have an "arm" register in nvme like
> > > network devices have?
> >
> > I have long observed that IOPS also drops a lot when switching to
> > threaded irq. If softirqd is woken up for handling softirq, the
> > performance shouldn't be better than threaded irq.
>
> It's true that it shouldn't be any faster, but what irqpoll already has,
> and we don't need to reinvent, is a proper budgeting mechanism that
> needs to kick in when multiple devices map irq vectors to the same cpu
> core.
>
> irqpoll already maintains a percpu list and dispatches ->poll with
> a budget that the backend enforces, and irqpoll multiplexes between them.
> Having this mechanism in irq (hard or threaded) context sounds
> a bit unnecessary.
>
> It seems like we're attempting to stay in irq context for as long as we
> can instead of scheduling to softirq/thread context if we have more than
> a minimal amount of work to do. Without at least understanding why
> softirq/thread degrades us so much, this code seems like the wrong
> approach to me. Interrupt context will always be faster, but that is
> not a sufficient reason to spend as much time as possible there, is it?

If extra latency is added in the IO completion path, that latency shows up
in the submission path too, because the hw queue depth is fixed and often
small. Especially in the case of multiple submission contexts sharing a
single completion context, the whole set of hw queue tags can be exhausted
easily. I guess there is no such effect for networking IO.

>
> We should also keep in mind that the networking stack has been doing
> this for years; I would try to understand why this cannot work for nvme
> before dismissing it.

The above may be one reason.

Thanks,
Ming
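
For reference, the irq_poll budgeting model discussed above looks roughly
like the following from a consumer driver's point of view. This is only a
minimal sketch against the in-tree irq_poll API; the my_* names and the
mask/unmask helpers are made up for illustration and are not the actual
nvme code.

#include <linux/interrupt.h>
#include <linux/irq_poll.h>

#define MY_POLL_BUDGET	64

struct my_queue {
	struct irq_poll iop;
	/* device-specific completion queue state lives here */
};

/* hard irq handler: mask this queue's interrupt and defer to irq_poll */
static irqreturn_t my_irq_handler(int irq, void *data)
{
	struct my_queue *q = data;

	my_mask_queue_irq(q);		/* made-up device-specific mask */
	irq_poll_sched(&q->iop);	/* add to this cpu's irq_poll list */
	return IRQ_HANDLED;
}

/*
 * ->poll callback: runs in softirq context with a budget that irq_poll
 * shares among all instances queued on this cpu.
 */
static int my_poll(struct irq_poll *iop, int budget)
{
	struct my_queue *q = container_of(iop, struct my_queue, iop);
	int done;

	done = my_process_cq(q, budget);	/* made-up: reap up to 'budget' completions */
	if (done < budget) {
		/* no more work: leave the percpu list and unmask the irq */
		irq_poll_complete(iop);
		my_unmask_queue_irq(q);		/* made-up unmask */
	}
	return done;
}

static void my_queue_init(struct my_queue *q)
{
	irq_poll_init(&q->iop, MY_POLL_BUDGET, my_poll);
}

The point of the sketch is only that the budget and the percpu
multiplexing come for free from irq_poll, rather than having to be
enforced in hard irq or threaded irq context.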