> >
> > It is not yet finalized, but it can be based on per sdev outstanding,
> > shost_busy etc.
> > We want to use special 16 reply queue for IO acceleration (these queues are
> > working interrupt coalescing mode. This is a h/w feature)
>
> TBH, this does not make any sense whatsoever. Why are you trying to have
> extra interrupts for coalescing instead of doing the following:

Thomas,

We are using this feature mainly for performance, not for CPU hotplug.
As I read them, your points #1 to #4 below mostly address CPU hotplug.
Right? We also want to make sure that if we convert the megaraid_sas
driver from managed to non-managed interrupts, we can still meet the CPU
hotplug requirement. If we use pci_enable_msix_range() and set the
affinity manually in the driver with irq_set_affinity_hint(), CPU hotplug
works as expected: irqbalance retains the old mapping, and whenever an
offlined CPU comes back, irqbalance restores that same mapping.

If we use all 72 reply queues (all in interrupt coalescing mode) without
any extra reply queues, we have no issues with the cpu-msix mapping or
with CPU hotplug. Our major problem with that method is that latency is
very bad at low QD and/or in the single-worker case.

To solve that problem we have added 16 extra reply queues (a special h/w
feature, for performance only) which work in interrupt coalescing mode,
while the existing 72 reply queues work without any interrupt coalescing.
The best way to map the additional 16 reply queues is to map them to the
local NUMA node.

I understand that this is a unique requirement, but at the same time we
may be able to do it gracefully (in the irq subsystem), since, as you
mentioned, irq_set_affinity_hint() should be avoided in low-level drivers.

> 1) Allocate 72 reply queues which get nicely spread out to every CPU on the
>    system with affinity spreading.
>
> 2) Have a configuration for your reply queues which allows them to be
>    grouped, e.g. by physical package.
>
> 3) Have a mechanism to mark a reply queue offline/online and handle that on
>    CPU hotplug. That means on unplug you have to wait for the reply queue
>    which is associated to the outgoing CPU to be empty and no new requests
>    to be queued, which has to be done for the regular per CPU reply queues
>    anyway.
>
> 4) On queueing the request, flag it 'coalescing' which causes the
>    hard/firmware to direct the reply to the first online reply queue in the
>    group.
>
> If the last CPU of a group goes offline, then the normal hotplug mechanism
> takes effect and the whole thing is put 'offline' as well. This works
> nicely for all kind of scenarios even if you have more CPUs than queues. No
> extras, no magic affinity hints, it just works.
>
> Hmm?
>
> > Yes. We did not use "pci_alloc_irq_vectors_affinity".
> > We used "pci_enable_msix_range" and manually set affinity in driver using
> > irq_set_affinity_hint.
>
> I still regret the day when I merged that abomination.

Is it possible to have a mapping similar to the one below in the managed
interrupt case?

	for (i = 0; i < 16; i++)
		irq_set_affinity_hint(pci_irq_vector(instance->pdev, i),
				      cpumask_of_node(local_numa_node));

Currently, with managed interrupts, we always see an affinity of 0-71 for
the pre-vectors, and the effective CPU is always 0. We would like a change
in the current API that allows us to pass a flag (like *local numa
affinity*), so that the cpu-msix mapping comes from the local NUMA node
and the effective CPUs are spread across the local NUMA node.

>
> Thanks,
>
> tglx