>Thanks for the clarification.
>
>The problem with what Ming is proposing, in my mind (and it's an existing
>problem that exists today), is that nvme takes precedence over anything
>else until it absolutely cannot hog the cpu in hardirq.
>
>In the thread, Ming referenced a case where today, if the cpu core has net
>softirq activity, it cannot make forward progress. So with Ming's
>suggestion, net softirq will eventually make progress, but it creates an
>inherent fairness issue. Who said that nvme completions should come faster
>than the net rx/tx or another I/O device (or hrtimers or sched events...)?
>
>As much as I'd like nvme to complete as soon as possible, I might have
>other activities in the system that are as important, if not more. So I
>don't think we can solve this with something that is not cooperative or
>fair with the rest of the system.
>
>>> If we are context switching too much, it means the soft-irq operation
>>> is not efficient, not necessarily that the completion path
>>> is running in soft-irq..
>>>
>>> Is your kernel compiled with full preemption or voluntary preemption?
>>
>> The tests are based on the Ubuntu 18.04 kernel configuration. Here are
>> the parameters:
>>
>> # CONFIG_PREEMPT_NONE is not set
>> CONFIG_PREEMPT_VOLUNTARY=y
>> # CONFIG_PREEMPT is not set
>
>I see, so it seems that irq_poll_softirq is still not efficient in reaping
>completions. Reaping the completions on its own is pretty much the same in
>hard and soft irq, so it's really the scheduling part that is creating the
>overhead (which does not exist in hard irq).
>
>Question:
>when you test without the patch (completions are coming in hard-irq), do
>the fio threads that run on the cpu cores that are handling interrupts get
>substantially lower throughput than the rest of the fio threads? I would
>expect the fio threads running on the first 32 cores to get very low iops
>(overpowered by the nvme interrupts) and the rest to do much more, given
>that nvme has almost no limit on how much time it can spend processing
>completions.
>
>If need_resched() is causing us to context switch too aggressively, does
>changing that to local_softirq_pending() make things better?
>--
>diff --git a/lib/irq_poll.c b/lib/irq_poll.c
>index d8eab563fa77..05d524fcaf04 100644
>--- a/lib/irq_poll.c
>+++ b/lib/irq_poll.c
>@@ -116,7 +116,7 @@ static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
>                /*
>                 * If softirq window is exhausted then punt.
>                 */
>-               if (need_resched())
>+               if (local_softirq_pending())
>                        break;
>        }
>--
>
>Although, this can potentially prevent other threads from making forward
>progress.. If it is better, perhaps we also need a time limit as well.

Thanks for this patch. The IOPS was about the same (it tends to fluctuate
more, but within a 3% variation).

I captured the following from one of the CPUs; all CPUs tend to have
similar numbers. The numbers were captured over 5 seconds and averaged:

Context switches/s:
  Without any patch:        5
  With the previous patch:  640
  With this patch:          522

Process migrations/s:
  Without any patch:        0.6
  With the previous patch:  104
  With this patch:          121

>
>Perhaps we should add statistics/tracing on how many completions we are
>reaping per invocation...

I'll look a bit more into how many completions we reap per invocation (a
rough sketch of what I'm thinking of instrumenting is at the end of this
mail). From the numbers, I think the increased number of context
switches/migrations is what hurts performance the most.

Thanks,
Long
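
For reference, here is the rough, untested sketch mentioned above. It is
only meant to make the "time limit + per-invocation statistics" idea
concrete against the reap loop in irq_poll_softirq(): the 2 ms budget is
an arbitrary value I picked, and the trace_printk() is just a placeholder
for proper statistics/tracing, so please don't read it as a proposed
patch.

/*
 * Rough sketch only, not a patch: inside irq_poll_softirq(), bound the
 * softirq window by a time budget in addition to the weight budget, and
 * count how many completions one invocation actually reaps.
 */
struct list_head *list = this_cpu_ptr(&blk_cpu_iopoll);
int budget = irq_poll_budget;
unsigned long time_limit = jiffies + msecs_to_jiffies(2);   /* arbitrary 2 ms */
int reaped = 0;

while (!list_empty(list)) {
        struct irq_poll *iop;
        int work, weight;

        iop = list_entry(list->next, struct irq_poll, list);
        weight = iop->weight;

        /* (irq enable/disable around the poll call elided) */
        work = 0;
        if (test_bit(IRQ_POLL_F_SCHED, &iop->state))
                work = iop->poll(iop, weight);

        budget -= work;
        reaped += work;

        /* ... re-queue or complete iop, exactly as today ... */

        /*
         * Punt when either the weight budget or the time budget is
         * exhausted, instead of deciding on need_resched() /
         * local_softirq_pending() alone.
         */
        if (budget <= 0 || time_after(jiffies, time_limit))
                break;
}

/* Placeholder: how much work did this softirq invocation actually do? */
trace_printk("irq_poll_softirq: reaped %d completions\n", reaped);

If the per-invocation counts turn out to be small, that would point at the
scheduling overhead rather than the reaping itself, which would match the
context switch numbers above.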