On 2/5/2023 12:51 am, Sean Christopherson wrote:
On Fri, Apr 28, 2023, bugzilla-daemon@xxxxxxxxxx wrote:
https://bugzilla.kernel.org/show_bug.cgi?id=217379
Bug ID: 217379
Summary: Latency issues in irq_bypass_register_consumer
Product: Virtualization
Version: unspecified
Hardware: Intel
OS: Linux
Status: NEW
Severity: normal
Priority: P3
Component: kvm
Assignee: virtualization_kvm@xxxxxxxxxxxxxxxxxxxx
Reporter: zhuangel570@xxxxxxxxx
Regression: No
We found latency issues in high-density, high-concurrency scenarios. We are
using Cloud Hypervisor as the VMM for lightweight VMs, with VIRTIO net and
block devices. In our tests, we saw about 50ms to 100ms+ of latency when
creating a VM and registering irqfds. After tracing with funclatency (a tool
from bcc-tools, https://github.com/iovisor/bcc), we found the latency is
introduced by the following function:
- irq_bypass_register_consumer introduces more than 60ms per VM.
This function is called when registering an irqfd; it registers the irqfd as
a consumer with irqbypass and waits for connections from irqbypass producers
such as VFIO or VDPA. In our test, a single irqfd registration takes about
4ms, and 5 devices with 16 irqfds in total introduce more than 60ms of
latency.
Here is a simple test case which can emulate the latency issue (the real
latency is larger). The test creates 800 idle VMs in the background, then
repeatedly creates 20 VMs and destroys them after 400ms. Every VM does a
simple thing: create an in-kernel irqchip and register 15 irqfds (emulating
5 devices with 3 irqfds each). Just trace the "irq_bypass_register_consumer"
latency and you will reproduce this kind of latency issue. Here is a trace
log from a Xeon(R) Platinum 8255C server (96 cores, 2 sockets) running
Linux 6.2.20.
Reproduce Case
https://github.com/zhuangel/misc/blob/main/test/kvm_irqfd_fork/kvm_irqfd_fork.c
Reproduce log
https://github.com/zhuangel/misc/blob/main/test/kvm_irqfd_fork/test.log
I don't have a graceful method to fix these latencies; my simple idea is to
give the user a chance to avoid them, e.g. a new flag to disable irqbypass
for each irqfd.
Any suggestion to fix the issue is welcome.
Looking at the code, it's not surprising that irq_bypass_register_consumer() can
exhibit high latencies. The producers and consumers are stored in simple linked
lists, and a single mutex is held while traversing the lists *and* connecting
a consumer to a producer (and vice versa).
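For reference, the registration path in virt/lib/irqbypass.c is roughly
shaped like this (a simplified sketch; module refcounting and sanity checks
omitted):

	static LIST_HEAD(producers);
	static LIST_HEAD(consumers);
	static DEFINE_MUTEX(lock);	/* one global mutex for all (un)registration */

	int irq_bypass_register_consumer(struct irq_bypass_consumer *consumer)
	{
		struct irq_bypass_consumer *tmp;
		struct irq_bypass_producer *producer;
		int ret = 0;

		mutex_lock(&lock);

		/* Linear walk to reject duplicate tokens. */
		list_for_each_entry(tmp, &consumers, node) {
			if (tmp->token == consumer->token) {
				ret = -EBUSY;
				goto out;
			}
		}

		/* Linear walk to find a matching producer; the connection is
		 * established while still holding the global mutex.
		 */
		list_for_each_entry(producer, &producers, node) {
			if (producer->token == consumer->token) {
				ret = __connect(producer, consumer);
				if (ret)
					goto out;
				break;
			}
		}

		list_add(&consumer->node, &consumers);
	out:
		mutex_unlock(&lock);
		return ret;
	}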
There are two obvious optimizations that can be done to reduce latency in
irq_bypass_register_consumer():
- Use a different data type to track the producers and consumers so that lookups
don't require a linear walk. AIUI, the "tokens" used to match producers and
consumers are just kernel pointers, so I _think_ XArray would perform reasonably
well.
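As a rough sketch (hypothetical and untested; the XArray name here is
illustrative), keying an XArray on the token pointer could replace the
duplicate-detection walk:

	/* Hypothetical sketch: consumers tracked in an XArray instead of a
	 * list. Tokens are kernel pointers, so they can serve directly as
	 * the index.
	 */
	static DEFINE_XARRAY(consumers_xa);

	int irq_bypass_register_consumer(struct irq_bypass_consumer *consumer)
	{
		unsigned long index = (unsigned long)consumer->token;

		/* xa_insert() returns -EBUSY if the token is already present,
		 * replacing the O(n) list walk for duplicate detection. The
		 * producer match would similarly become an xa_load() on a
		 * producers XArray before connecting.
		 */
		return xa_insert(&consumers_xa, index, consumer, GFP_KERNEL);
	}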
My measurements show that there is little performance gain from optimizing lookups.
- Connect producers and consumers outside of a global mutex.
In scenarios where a large number of VMs are created, contention on this
global mutex is very painful, especially when users on different NUMA nodes
are concurrently walking this critical path.
Wait time to acquire this lock (on 2.70GHz ICX):
- avg = 117.855314 ms
- min = 20 ns
- max = 11428.340858 ms
Before we optimize this path with rewrites, could we first adopt a conservative
approach:
- introduce the KVM_IRQFD_FLAG_NO_BYPASS [*] (sketched below), or
- introduce module_param_cb(kvm_irq_bypass ...) (644abbb254b1), or
- introduce an extra Kconfig knob for "select IRQ_BYPASS_MANAGER";
[*]
https://lore.kernel.org/kvm/bug-217379-28872-KU8tTDkhtT@xxxxxxxxxxxxxxxxxxxxxxxxx%2F/
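A rough illustration of the first option (the flag name comes from the
proposal above; the flag value and exact placement are illustrative, not a
merged kernel API) would be to gate the consumer registration in
kvm_irqfd_assign():

	#define KVM_IRQFD_FLAG_NO_BYPASS	(1 << 2)	/* illustrative value */

	/* Inside kvm_irqfd_assign(), the irqbypass registration would become
	 * conditional, letting userspace skip it for irqfds that will never
	 * be connected to a VFIO/VDPA producer.
	 */
		if (!(args->flags & KVM_IRQFD_FLAG_NO_BYPASS)) {
			irqfd->consumer.token = (void *)irqfd->eventfd;
			irqfd->consumer.add_producer = kvm_arch_irq_bypass_add_producer;
			irqfd->consumer.del_producer = kvm_arch_irq_bypass_del_producer;
			irqfd->consumer.stop = kvm_arch_irq_bypass_stop;
			irqfd->consumer.start = kvm_arch_irq_bypass_start;
			irq_bypass_register_consumer(&irqfd->consumer);
		}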
Any better move?
Unfortunately, because .add_producer() and .add_consumer() can fail, and because
connections can be established by adding a consumer _or_ a producer, getting the
locking right without a global mutex is quite difficult. It's certainly doable
to move the (dis)connect logic out of a global lock, but it's going to require a
dedicated effort, i.e. not something that can be sketched out in a few minutes
(I played around with the code for the better part of an hour trying to do just
that and kept running into edge case race conditions).