On 19/07/2018 18:28, Radim Krčmář wrote:
>> +
>> +	kvm_hypercall3(KVM_HC_SEND_IPI, ipi_bitmap_low, ipi_bitmap_high, vector);
> and
>
>   kvm_hypercall3(KVM_HC_SEND_IPI, ipi_bitmap[0], ipi_bitmap[1], vector);
>
> Still, the main problem is that we can only address 128 APICs.
>
> A simple improvement would reuse the vector field (as we need only 8
> bits) and put a 'offset' in the rest.  The offset would say which
> cluster of 128 are we addressing.  24 bits of offset results in 2^31
> total addressable CPUs (we probably should even use that many bits).
> The downside of this is that we can only address 128 at a time.
>
> It's basically the same as x2apic cluster mode, only with 128 cluster
> size instead of 16, so the code should be a straightforward port.
> And because x2apic code doesn't seem to use any division by the cluster
> size, we could even try to use kvm_hypercall4, add ipi_bitmap[2], and
> make the cluster size 192. :)

I did suggest an offset earlier in the discussion.

The main problem is that consecutive CPU ids do not map to consecutive
APIC ids.  But still, we could do a hypercall whenever the total range
exceeds 64.  Something like

	u64 ipi_bitmap = 0;
	for_each_cpu(cpu, mask) {
		if (!ipi_bitmap) {
			/* first id of a new window */
			min = max = cpu;
		} else if (cpu < min && max - cpu < 64) {
			/* extend the window downwards, rebasing bit 0 */
			ipi_bitmap <<= min - cpu;
			min = cpu;
		} else if (cpu >= min && cpu < min + 64) {
			max = cpu < max ? max : cpu;
		} else {
			/* ... send hypercall... */
			min = max = cpu;
			ipi_bitmap = 0;
		}
		__set_bit(cpu - min, (unsigned long *)&ipi_bitmap);
	}
	if (ipi_bitmap) {
		/* ... send hypercall... */
	}

We could keep the cluster size of 128, but it would be more complicated
to do the left shift in the first "else if".  If the limit is 64, you
can keep the two arguments in the hypercall, and just pass 0 as the
"high" bitmap on 64-bit kernels.

Paolo

> But because it is very similar to x2apic, I'd really need some real
> performance data to see if this benefits a real workload.
> Hardware could further optimize LAPIC (apicv, vapic) in the future,
> which we'd lose by using paravirt.
>
> e.g. AMD's acceleration should be superior to this when using < 8 VCPUs
> as they can use logical xAPIC and send without VM exits (when all VCPUs
> are running).
>
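
For anyone who wants to play with the 64-wide windowing idea outside the
kernel, here is a rough userspace sketch.  It is only an illustration under
my own assumptions: struct ipi_call, struct ipi_log, record_call() and
batch_ipis() are hypothetical names, the input is a plain array of ids
rather than a cpumask, and "sending the hypercall" is replaced by appending
(min, bitmap) to a log.  An id below the current window that cannot be
reached by shifting simply starts a new window.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one simulated "send IPI" hypercall. */
struct ipi_call {
	uint32_t min;     /* lowest id covered by this call */
	uint64_t bitmap;  /* bit n set => id (min + n) is a target */
};

#define MAX_CALLS 16

struct ipi_log {
	struct ipi_call calls[MAX_CALLS];
	size_t n;
};

/* Stand-in for the hypercall: just remember what would have been sent. */
static void record_call(struct ipi_log *log, uint32_t min, uint64_t bitmap)
{
	assert(log->n < MAX_CALLS);
	log->calls[log->n].min = min;
	log->calls[log->n].bitmap = bitmap;
	log->n++;
}

/* Group ids into windows spanning at most 64 consecutive values. */
static void batch_ipis(const uint32_t *ids, size_t count, struct ipi_log *log)
{
	uint64_t bitmap = 0;
	uint32_t min = 0, max = 0;

	log->n = 0;
	for (size_t i = 0; i < count; i++) {
		uint32_t id = ids[i];

		if (!bitmap) {
			/* first id of a new window */
			min = max = id;
		} else if (id < min && max - id < 64) {
			/* grow the window downwards; bit 0 now means 'id' */
			bitmap <<= min - id;
			min = id;
		} else if (id >= min && id - min < 64) {
			/* fits in the current window */
			if (id > max)
				max = id;
		} else {
			/* out of range: flush and start a new window */
			record_call(log, min, bitmap);
			min = max = id;
			bitmap = 0;
		}
		bitmap |= UINT64_C(1) << (id - min);
	}
	if (bitmap)
		record_call(log, min, bitmap);
}
```

Feeding it the ids 3, 1, 66, 200 should produce three simulated calls: one
based at 1 covering ids 1 and 3, then one each for 66 and 200, since 66 is
65 above the base of the first window.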