> On Fri, 31 Aug 2018, Kashyap Desai wrote:
> > > Ok. I misunderstood the whole thing a bit. So your real issue is that you
> > > want to have reply queues which are instantaneous, the per cpu ones, and
> > > then the extra 16 which do batching and are shared over a set of CPUs,
> > > right?
> >
> > Yes that is correct. Extra 16 or whatever should be shared over set of
> > CPUs of *local* numa node of the PCI device.
>
> Why restricting it to the local NUMA node of the device? That doesn't
> really make sense if you queue lots of requests from CPUs on a different
> node.

The penalty of crossing NUMA nodes is minimal when higher interrupt
coalescing is used in hardware. We see the cross-NUMA-traffic penalty for
lower-IOPS workloads; in this particular case we take care of cross-NUMA
traffic via higher interrupt coalescing.

> Why don't you spread these extra interrupts accross all nodes and keep the
> locality for the request/reply?

I assume you are referring to spreading the MSI-X vectors across all NUMA
nodes the way "pci_alloc_irq_vectors" does. Spreading the extra 16 reply
queues across nodes has a negative impact. Take the example of an 8-node
system with 128 logical CPUs in total. If the 16 reply queues are spread
across the NUMA nodes, each node ends up with only 2 reply queues, i.e. a
total of 8 logical CPUs are mapped to 1 reply queue. Running I/O from one
NUMA node will then use only 2 reply queues, and performance drops
drastically. This is the typical problem when the CPU-to-MSI-X mapping
becomes N:1 because there are fewer MSI-X vectors than online CPUs.

Mapping the extra 16 reply queues to the local NUMA node always makes sure
the driver round-robins over all 16 reply queues irrespective of the
originating CPU. We validated this method by sending I/O from a remote
node and did not observe a performance penalty.
> That also would allow to make them properly managed interrupts as you
> could shutdown the per node batching interrupts when all CPUs of that
> node are offlined and you'd avoid the whole affinity hint irq balancer
> hackery.

One more clarification: I am using "for-4.19/block", and this particular
patch "a0c9259 irq/matrix: Spread interrupts on allocation" is included.
I can see that the 16 extra reply queues allocated via pre_vectors are
still assigned to CPU 0 (effective affinity):

irq 33, cpu list 0-71
irq 34, cpu list 0-71
irq 35, cpu list 0-71
irq 36, cpu list 0-71
irq 37, cpu list 0-71
irq 38, cpu list 0-71
irq 39, cpu list 0-71
irq 40, cpu list 0-71
irq 41, cpu list 0-71
irq 42, cpu list 0-71
irq 43, cpu list 0-71
irq 44, cpu list 0-71
irq 45, cpu list 0-71
irq 46, cpu list 0-71
irq 47, cpu list 0-71
irq 48, cpu list 0-71

# cat /sys/kernel/debug/irq/irqs/34
handler:  handle_edge_irq
device:   0000:86:00.0
status:   0x00004000
istate:   0x00000000
ddepth:   0
wdepth:   0
dstate:   0x01608200
            IRQD_ACTIVATED
            IRQD_IRQ_STARTED
            IRQD_SINGLE_TARGET
            IRQD_MOVE_PCNTXT
            IRQD_AFFINITY_MANAGED
node:     0
affinity: 0-71
effectiv: 0
pending:
domain:  INTEL-IR-MSI-1-2
 hwirq:   0x4300001
 chip:    IR-PCI-MSI
  flags:   0x10
             IRQCHIP_SKIP_SET_WAKE
 parent:
    domain:  INTEL-IR-1
     hwirq:   0x40000
     chip:    INTEL-IR
      flags:   0x0
     parent:
        domain:  VECTOR
         hwirq:   0x22
         chip:    APIC
          flags:   0x0
         Vector:    46
         Target:     0
         move_in_progress: 0
         is_managed:       1
         can_reserve:      0
         has_reserved:     0
         cleanup_pending:  0

# cat /sys/kernel/debug/irq/irqs/35
handler:  handle_edge_irq
device:   0000:86:00.0
status:   0x00004000
istate:   0x00000000
ddepth:   0
wdepth:   0
dstate:   0x01608200
            IRQD_ACTIVATED
            IRQD_IRQ_STARTED
            IRQD_SINGLE_TARGET
            IRQD_MOVE_PCNTXT
            IRQD_AFFINITY_MANAGED
node:     0
affinity: 0-71
effectiv: 0
pending:
domain:  INTEL-IR-MSI-1-2
 hwirq:   0x4300002
 chip:    IR-PCI-MSI
  flags:   0x10
             IRQCHIP_SKIP_SET_WAKE
 parent:
    domain:  INTEL-IR-1
     hwirq:   0x50000
     chip:    INTEL-IR
      flags:   0x0
     parent:
        domain:  VECTOR
         hwirq:   0x23
         chip:    APIC
          flags:   0x0
         Vector:    47
         Target:     0
         move_in_progress: 0
         is_managed:       1
         can_reserve:      0
         has_reserved:     0
         cleanup_pending:  0

Ideally, what we are looking for is that the 16 extra pre_vectors reply
queues have an "effective affinity" within the local NUMA node as long as
that node has online CPUs. If not, we are OK with the effective CPU coming
from any node.

> Thanks,
>
>       tglx