On Mon, 13 Jun 2022 07:55:47 +0100, "zhangfei.gao@xxxxxxxxxxx" <zhangfei.gao@xxxxxxxxxxx> wrote:
>
> Hi, Paul
>
> On 2022/6/13 12:16 PM, Paul E. McKenney wrote:
> > On Sun, Jun 12, 2022 at 08:57:11PM -0700, Paul E. McKenney wrote:
> >> On Mon, Jun 13, 2022 at 11:04:39AM +0800, zhangfei.gao@xxxxxxxxxxx wrote:
> >>> Hi, Paul
> >>>
> >>> On 2022/6/13 2:49 AM, Paul E. McKenney wrote:
> >>>> On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote:
> >>>>> On 6/12/22 18:40, Paul E. McKenney wrote:
> >>>>>>> Do these reserved memory regions really need to be allocated separately?
> >>>>>>> (For example, are they really all non-contiguous?  If not, that is, if
> >>>>>>> there are a lot of contiguous memory regions, could you sort the IORT
> >>>>>>> by address and do one ioctl() for each set of contiguous memory regions?)
> >>>>>>>
> >>>>>>> Are all of these reserved memory regions set up before init is spawned?
> >>>>>>>
> >>>>>>> Are all of these reserved memory regions set up while there is only a
> >>>>>>> single vCPU up and running?
> >>>>>>>
> >>>>>>> Is the SRCU grace period really needed in this case?  (I freely confess
> >>>>>>> to not being all that familiar with KVM.)
> >>>>>> Oh, and there was a similar many-requests problem with networking many
> >>>>>> years ago.  This was solved by adding a new syscall/ioctl()/whatever
> >>>>>> that permitted many requests to be presented to the kernel with a single
> >>>>>> system call.
> >>>>>>
> >>>>>> Could a new ioctl() be introduced that requested a large number
> >>>>>> of these memory regions in one go so as to make each call to
> >>>>>> synchronize_rcu_expedited() cover a useful fraction of your 9000+
> >>>>>> requests?  Adding a few of the KVM guys on CC for their thoughts.
> >>>>> Unfortunately not.  Apart from this specific case, in general the calls to
> >>>>> KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the
> >>>>> guest, and those writes then map to an ioctl.  Typically the guest sets up
> >>>>> one device at a time, and each setup step causes a synchronize_srcu()---and
> >>>>> an expedited one at that.
> >>>> I was afraid of something like that...
> >>>>
> >>>>> KVM has two SRCUs:
> >>>>>
> >>>>> 1) kvm->irq_srcu hardly relies on the "sleepable" part; it has readers
> >>>>> that are very, very small, but it needs extremely fast detection of grace
> >>>>> periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
> >>>>> KVM_SET_GSI_ROUTING", 2014-05-05), which split it off from kvm->srcu.
> >>>>> Readers are not so frequent.
> >>>>>
> >>>>> 2) kvm->srcu is nastier because there are readers all the time.  The
> >>>>> read-side critical sections are still short-ish, but they need the
> >>>>> sleepable part because they access user memory.
> >>>> Which one of these two is in play in this case?
> >>>>
> >>>>> Writers are not frequent per se; the problem is that they come in very
> >>>>> large bursts when a guest boots.  And while the whole boot path overall
> >>>>> can be quadratic, O(n) expensive calls to synchronize_srcu() can have a
> >>>>> larger impact on runtime than the O(n^2) parts, as demonstrated here.
> >>>>>
> >>>>> Therefore, we operated on the assumption that the callers of
> >>>>> synchronize_srcu_expedited() were _anyway_ busy running CPU-bound guest
> >>>>> code and the desire was to get past the booting phase as fast as
> >>>>> possible.  If the guest wants to eat host CPU it can "for(;;)" as much
> >>>>> as it wants; therefore, as long as expedited GPs didn't eat CPU
> >>>>> *throughout the whole system*, a preemptible busy wait in
> >>>>> synchronize_srcu_expedited() was not problematic.
> >>>>>
> >>>>> These assumptions did match the SRCU code when kvm->srcu and
> >>>>> kvm->irq_srcu were introduced (in 2009 and 2014, respectively).  But
> >>>>> perhaps they no longer hold, now that each SRCU is not as independent
> >>>>> as it used to be in those years and grace periods are driven by
> >>>>> workqueues instead?
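For readers following the KVM details: the kvm->srcu pattern Paolo describes
boils down to roughly the following minimal sketch (illustrative only; the
example_* names are placeholders, not actual KVM code):

#include <linux/srcu.h>

/* Stand-in for kvm->srcu; DEFINE_STATIC_SRCU() handles initialization. */
DEFINE_STATIC_SRCU(example_srcu);

/* Read side: runs all the time and is short-ish, but may touch user
 * memory and therefore sleep, hence a *sleepable* RCU flavor. */
static void example_reader(void)
{
        int idx;

        idx = srcu_read_lock(&example_srcu);
        /* ... look up memslots, access guest/user memory, may fault ... */
        srcu_read_unlock(&example_srcu, idx);
}

/* Write side: rare in steady state, but a booting guest performs one of
 * these per device-setup step -- thousands in a burst -- and each one
 * waits for an expedited SRCU grace period. */
static void example_update_memslots(void)
{
        /* ... publish the new memslot array ... */
        synchronize_srcu_expedited(&example_srcu);
}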
> >>>> The problem was not internal to SRCU, but rather due to the fact
> >>>> that kernel live patching (KLP) had problems with the CPU-bound tasks
> >>>> resulting from repeated synchronize_rcu_expedited() invocations.  So I
> >>>> added heuristics to get the occasional sleep in there for KLP's benefit.
> >>>> Perhaps these heuristics need to be less aggressive about adding sleep.
> >>>>
> >>>> These heuristics have the following aspects:
> >>>>
> >>>> 1.	The longer readers persist in an expedited SRCU grace period,
> >>>>	the longer the wait between successive checks of the reader
> >>>>	state.  Roughly speaking, we wait as long as the grace period
> >>>>	has currently been in effect, capped at ten jiffies.
> >>>>
> >>>> 2.	SRCU grace periods have several phases.  We reset so that each
> >>>>	phase starts by not waiting (new phase, new set of readers,
> >>>>	so don't penalize this set for the sins of the previous set).
> >>>>	But once we get to the point of adding delay, we add the
> >>>>	delay based on the beginning of the full grace period.
> >>>>
> >>>> Right now, the checking for grace-period length does not allow for the
> >>>> possibility that a grace period might start just before the jiffies
> >>>> counter gets incremented (because I didn't realize that anyone cared),
> >>>> so that is one possible thing to change.  I can also allow more no-delay
> >>>> checks per SRCU grace-period phase.
> >>>>
> >>>> Zhangfei, does something like the patch shown below help?
> >>>>
> >>>> Additional adjustments are likely needed to avoid re-breaking KLP,
> >>>> but we have to start somewhere...
> >>>>
> >>>>							Thanx, Paul
> >>>>
> >>>> ------------------------------------------------------------------------
> >>>>
> >>>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> >>>> index 50ba70f019dea..6a354368ac1d1 100644
> >>>> --- a/kernel/rcu/srcutree.c
> >>>> +++ b/kernel/rcu/srcutree.c
> >>>> @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> >>>>
> >>>>  #define SRCU_INTERVAL		1	// Base delay if no expedited GPs pending.
> >>>>  #define SRCU_MAX_INTERVAL	10	// Maximum incremental delay from slow readers.
> >>>> -#define SRCU_MAX_NODELAY_PHASE	1	// Maximum per-GP-phase consecutive no-delay instances.
> >>>> +#define SRCU_MAX_NODELAY_PHASE	3	// Maximum per-GP-phase consecutive no-delay instances.
> >>>>  #define SRCU_MAX_NODELAY	100	// Maximum consecutive no-delay instances.
> >>>>
> >>>>  /*
> >>>> @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> >>>>   */
> >>>>  static unsigned long srcu_get_delay(struct srcu_struct *ssp)
> >>>>  {
> >>>> +	unsigned long gpstart;
> >>>> +	unsigned long j;
> >>>>  	unsigned long jbase = SRCU_INTERVAL;
> >>>>
> >>>>  	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> >>>>  		jbase = 0;
> >>>> -	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> >>>> -		jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
> >>>> +	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> >>>> +		j = jiffies - 1;
> >>>> +		gpstart = READ_ONCE(ssp->srcu_gp_start);
> >>>> +		if (time_after(j, gpstart))
> >>>> +			jbase += j - gpstart;
> >>>> +	}
> >>>>  	if (!jbase) {
> >>>>  		WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> >>>>  		if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
> >>>>  			jbase = 1;
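The "j = jiffies - 1" above guards against the race Paul describes; the
numbers below are made-up values for illustration, not from the thread:

/*
 * Suppose a grace period starts just before the jiffies counter ticks:
 *
 *   ssp->srcu_gp_start = 1000;   // recorded immediately before the tick
 *   jiffies            = 1001;   // incremented a moment later
 *
 * Old code: jbase += jiffies - gp_start == 1, so a grace period that is
 * only microseconds old is already charged one full jiffy of delay.
 *
 * Patched code:
 *   j = jiffies - 1;             // j == 1000
 *   if (time_after(j, gpstart))  // 1000 is not after 1000...
 *           jbase += j - gpstart;  // ...so no delay is added
 *
 * A just-started grace period thus stays in the no-delay fast path
 * instead of being penalized for an accident of timer phase.
 */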
> >>> Unfortunately, this patch does not help.
> >>>
> >>> I then re-added the debug info.  During the qemu boot:
> >>>
> >>> [  232.997667] __synchronize_srcu loop=1000
> >>>
> >>> [  361.094493] __synchronize_srcu loop=9000
> >>> [  361.094501] Call trace:
> >>> [  361.094502]  dump_backtrace+0xe4/0xf0
> >>> [  361.094505]  show_stack+0x20/0x70
> >>> [  361.094507]  dump_stack_lvl+0x8c/0xb8
> >>> [  361.094509]  dump_stack+0x18/0x34
> >>> [  361.094511]  __synchronize_srcu+0x120/0x128
> >>> [  361.094514]  synchronize_srcu_expedited+0x2c/0x40
> >>> [  361.094515]  kvm_swap_active_memslots+0x130/0x198
> >>> [  361.094519]  kvm_activate_memslot+0x40/0x68
> >>> [  361.094520]  kvm_set_memslot+0x2f8/0x3b0
> >>> [  361.094523]  __kvm_set_memory_region+0x2e4/0x438
> >>> [  361.094524]  kvm_set_memory_region+0x78/0xb8
> >>> [  361.094526]  kvm_vm_ioctl+0x5a0/0x13e0
> >>> [  361.094528]  __arm64_sys_ioctl+0xb0/0xf8
> >>> [  361.094530]  invoke_syscall+0x4c/0x110
> >>> [  361.094533]  el0_svc_common.constprop.0+0x68/0x128
> >>> [  361.094536]  do_el0_svc+0x34/0xc0
> >>> [  361.094538]  el0_svc+0x30/0x98
> >>> [  361.094541]  el0t_64_sync_handler+0xb8/0xc0
> >>> [  361.094544]  el0t_64_sync+0x18c/0x190
> >>> [  363.942817] kvm_set_memory_region loop=6000
> >> Huh.
> >>
> >> One possibility is that the "if (!jbase)" block needs to be nested
> >> within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {" block.
>
> I tested this diff, and it does not help either:
>
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 50ba70f019de..36286a4b74e6 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
>
>  #define SRCU_INTERVAL		1	// Base delay if no expedited GPs pending.
>  #define SRCU_MAX_INTERVAL	10	// Maximum incremental delay from slow readers.
> -#define SRCU_MAX_NODELAY_PHASE	1	// Maximum per-GP-phase consecutive no-delay instances.
> +#define SRCU_MAX_NODELAY_PHASE	3	// Maximum per-GP-phase consecutive no-delay instances.
>  #define SRCU_MAX_NODELAY	100	// Maximum consecutive no-delay instances.
>
>  /*
> @@ -522,16 +522,23 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
>   */
>  static unsigned long srcu_get_delay(struct srcu_struct *ssp)
>  {
> +	unsigned long gpstart;
> +	unsigned long j;
>  	unsigned long jbase = SRCU_INTERVAL;
>
>  	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>  		jbase = 0;
> -	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> -		jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
> -	if (!jbase) {
> -		WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> -		if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
> -			jbase = 1;
> +	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> +		j = jiffies - 1;
> +		gpstart = READ_ONCE(ssp->srcu_gp_start);
> +		if (time_after(j, gpstart))
> +			jbase += j - gpstart;
> +
> +		if (!jbase) {
> +			WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> +			if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
> +				jbase = 1;
> +		}
> +	}
>  }
>
> > And when I run 10,000 consecutive synchronize_rcu_expedited() calls, the
> > above change reduces the overhead by more than an order of magnitude.
> > Except that the overhead of the series is far less than one second,
> > not the several minutes that you are seeing.  So the per-call overhead
> > decreases from about 17 microseconds to a bit more than one microsecond.
> >
> > I could imagine an extra order of magnitude if you are running HZ=100
> > instead of the HZ=1000 that I am running.  But that only gets up to a
> > few seconds.
> >
> >> One additional debug is to apply the patch below on top of the one you
> apply the patch below?
> >> just now kindly tested, then use whatever debug technique you wish to
> >> work out what fraction of the time during that critical interval
> >> srcu_get_delay() returns non-zero.
>
> Sorry, I am confused -- there is no patch attached, right?
> Should I just measure how often srcu_get_delay() returns non-zero?
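One way to take that measurement, as a sketch only (the dbg_* counters
and the wrapper are made up for illustration, not a patch from this
thread):

/* Hypothetical debug instrumentation: count how often srcu_get_delay()
 * returns non-zero during the boot window, and dump the ratio once per
 * 1000 calls. */
static atomic_long_t dbg_calls;
static atomic_long_t dbg_nonzero;

static unsigned long srcu_get_delay_dbg(struct srcu_struct *ssp)
{
        unsigned long d = srcu_get_delay(ssp);

        if (d)
                atomic_long_inc(&dbg_nonzero);
        if (atomic_long_inc_return(&dbg_calls) % 1000 == 0)
                pr_info("srcu_get_delay: %ld of %ld calls returned non-zero\n",
                        atomic_long_read(&dbg_nonzero),
                        atomic_long_read(&dbg_calls));
        return d;
}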
> By the way, the issue seems to be related only to QEMU ACPI, not to the
> RMR feature.

No, this also occurs if you supply the guest's EFI with an empty set of
persistent variables.  EFI goes and zeroes it, which results in a
read-only memslot write access being taken to userspace, the memslot
being unmapped from the guest, QEMU doing a little dance, and eventually
restoring the memslot back to the guest.  Rinse, repeat.  Do that one
byte at a time over 64MB, and your boot time for EFI alone goes from 39s
to 3m50s (that's on a speed-challenged Synquacer box), which completely
kills the "deploy a new VM" use case.

Thanks,

	M.

--
Without deviation from the norm, progress is not possible.