On Mon, 13 Jun 2022 07:55:47 +0100, "zhangfei.gao@xxxxxxxxxxx" <zhangfei.gao@xxxxxxxxxxx> wrote:
>
> Hi, Paul
>
> On 2022/6/13 12:16 PM, Paul E. McKenney wrote:
> > On Sun, Jun 12, 2022 at 08:57:11PM -0700, Paul E. McKenney wrote:
> >> On Mon, Jun 13, 2022 at 11:04:39AM +0800, zhangfei.gao@xxxxxxxxxxx wrote:
> >>> Hi, Paul
> >>>
> >>> On 2022/6/13 2:49 AM, Paul E. McKenney wrote:
> >>>> On Sun, Jun 12, 2022 at 07:29:30PM +0200, Paolo Bonzini wrote:
> >>>>> On 6/12/22 18:40, Paul E. McKenney wrote:
> >>>>>>> Do these reserved memory regions really need to be allocated separately?
> >>>>>>> (For example, are they really all non-contiguous?  If not, that is, if
> >>>>>>> there are a lot of contiguous memory regions, could you sort the IORT
> >>>>>>> by address and do one ioctl() for each set of contiguous memory regions?)
> >>>>>>>
> >>>>>>> Are all of these reserved memory regions set up before init is spawned?
> >>>>>>>
> >>>>>>> Are all of these reserved memory regions set up while there is only a
> >>>>>>> single vCPU up and running?
> >>>>>>>
> >>>>>>> Is the SRCU grace period really needed in this case?  (I freely confess
> >>>>>>> to not being all that familiar with KVM.)
> >>>>>> Oh, and there was a similar many-requests problem with networking many
> >>>>>> years ago.  This was solved by adding a new syscall/ioctl()/whatever
> >>>>>> that permitted many requests to be presented to the kernel with a single
> >>>>>> system call.
> >>>>>>
> >>>>>> Could a new ioctl() be introduced that requested a large number
> >>>>>> of these memory regions in one go so as to make each call to
> >>>>>> synchronize_rcu_expedited() cover a useful fraction of your 9000+
> >>>>>> requests?  Adding a few of the KVM guys on CC for their thoughts.
> >>>>> Unfortunately not.  Apart from this specific case, in general the calls to
> >>>>> KVM_SET_USER_MEMORY_REGION are triggered by writes to I/O registers in the
> >>>>> guest, and those writes then map to an ioctl.  Typically the guest sets up
> >>>>> one device at a time, and each setup step causes a synchronize_srcu()---and
> >>>>> an expedited one at that.
> >>>> I was afraid of something like that...
> >>>>
> >>>>> KVM has two SRCUs:
> >>>>>
> >>>>> 1) kvm->irq_srcu hardly relies on the "sleepable" part; it has readers
> >>>>> that are very, very small, but it needs extremely fast detection of grace
> >>>>> periods; see commit 719d93cd5f5c ("kvm/irqchip: Speed up
> >>>>> KVM_SET_GSI_ROUTING", 2014-05-05), which split it off from kvm->srcu.
> >>>>> Readers are not so frequent.
> >>>>>
> >>>>> 2) kvm->srcu is nastier because there are readers all the time.  The
> >>>>> read-side critical sections are still short-ish, but they need the
> >>>>> sleepable part because they access user memory.
> >>>> Which one of these two is in play in this case?
> >>>>
> >>>>> Writers are not frequent per se; the problem is that they come in very
> >>>>> large bursts when a guest boots.  And while the whole boot path overall
> >>>>> can be quadratic, O(n) expensive calls to synchronize_srcu() can have a
> >>>>> larger impact on runtime than the O(n^2) parts, as demonstrated here.
> >>>>>
> >>>>> Therefore, we operated on the assumption that the callers of
> >>>>> synchronize_srcu_expedited() were _anyway_ busy running CPU-bound guest
> >>>>> code and the desire was to get past the booting phase as fast as
> >>>>> possible.  If the guest wants to eat host CPU it can "for(;;)" as much
> >>>>> as it wants; therefore, as long as expedited GPs didn't eat CPU
> >>>>> *throughout the whole system*, a preemptible busy wait in
> >>>>> synchronize_srcu_expedited() was not problematic.
> >>>>>
> >>>>> These assumptions did match the SRCU code when kvm->srcu and
> >>>>> kvm->irq_srcu were introduced (in 2009 and 2014, respectively).  But
> >>>>> perhaps they no longer hold, now that each SRCU is not as independent
> >>>>> as it used to be in those years and grace periods are driven by
> >>>>> workqueues instead?
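For readers following the KVM details: the kvm->srcu pattern Paolo describes
boils down to roughly the following minimal sketch (illustrative only; the
example_* names are placeholders, not actual KVM code):

#include <linux/srcu.h>

/* Stand-in for kvm->srcu; DEFINE_STATIC_SRCU() handles initialization. */
DEFINE_STATIC_SRCU(example_srcu);

/* Read side: runs all the time and is short-ish, but may touch user
 * memory and therefore sleep, hence a *sleepable* RCU flavor. */
static void example_reader(void)
{
        int idx;

        idx = srcu_read_lock(&example_srcu);
        /* ... look up memslots, access guest/user memory, may fault ... */
        srcu_read_unlock(&example_srcu, idx);
}

/* Write side: rare in steady state, but a booting guest performs one of
 * these per device-setup step -- thousands in a burst -- and each one
 * waits for an expedited SRCU grace period. */
static void example_update_memslots(void)
{
        /* ... publish the new memslot array ... */
        synchronize_srcu_expedited(&example_srcu);
}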
> >>>> The problem was not internal to SRCU, but rather due to the fact
> >>>> that kernel live patching (KLP) had problems with the CPU-bound tasks
> >>>> resulting from repeated synchronize_rcu_expedited() invocations.  So I
> >>>> added heuristics to get the occasional sleep in there for KLP's benefit.
> >>>> Perhaps these heuristics need to be less aggressive about adding sleep.
> >>>>
> >>>> These heuristics have the following aspects:
> >>>>
> >>>> 1.	The longer readers persist in an expedited SRCU grace period,
> >>>>	the longer the wait between successive checks of the reader
> >>>>	state.  Roughly speaking, we wait as long as the grace period
> >>>>	has currently been in effect, capped at ten jiffies.
> >>>>
> >>>> 2.	SRCU grace periods have several phases.  We reset so that each
> >>>>	phase starts by not waiting (new phase, new set of readers,
> >>>>	so don't penalize this set for the sins of the previous set).
> >>>>	But once we get to the point of adding delay, we add the
> >>>>	delay based on the beginning of the full grace period.
> >>>>
> >>>> Right now, the checking for grace-period length does not allow for the
> >>>> possibility that a grace period might start just before the jiffies
> >>>> counter gets incremented (because I didn't realize that anyone cared),
> >>>> so that is one possible thing to change.  I can also allow more no-delay
> >>>> checks per SRCU grace-period phase.
> >>>>
> >>>> Zhangfei, does something like the patch shown below help?
> >>>>
> >>>> Additional adjustments are likely needed to avoid re-breaking KLP,
> >>>> but we have to start somewhere...
> >>>>
> >>>>							Thanx, Paul
> >>>>
> >>>> ------------------------------------------------------------------------
> >>>>
> >>>> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> >>>> index 50ba70f019dea..6a354368ac1d1 100644
> >>>> --- a/kernel/rcu/srcutree.c
> >>>> +++ b/kernel/rcu/srcutree.c
> >>>> @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> >>>>
> >>>>  #define SRCU_INTERVAL		1	// Base delay if no expedited GPs pending.
> >>>>  #define SRCU_MAX_INTERVAL	10	// Maximum incremental delay from slow readers.
> >>>> -#define SRCU_MAX_NODELAY_PHASE	1	// Maximum per-GP-phase consecutive no-delay instances.
> >>>> +#define SRCU_MAX_NODELAY_PHASE	3	// Maximum per-GP-phase consecutive no-delay instances.
> >>>>  #define SRCU_MAX_NODELAY	100	// Maximum consecutive no-delay instances.
> >>>>
> >>>>  /*
> >>>> @@ -522,12 +522,18 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
> >>>>   */
> >>>>  static unsigned long srcu_get_delay(struct srcu_struct *ssp)
> >>>>  {
> >>>> +	unsigned long gpstart;
> >>>> +	unsigned long j;
> >>>>  	unsigned long jbase = SRCU_INTERVAL;
> >>>>
> >>>>  	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
> >>>>  		jbase = 0;
> >>>> -	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> >>>> -		jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
> >>>> +	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> >>>> +		j = jiffies - 1;
> >>>> +		gpstart = READ_ONCE(ssp->srcu_gp_start);
> >>>> +		if (time_after(j, gpstart))
> >>>> +			jbase += j - gpstart;
> >>>> +	}
> >>>>  	if (!jbase) {
> >>>>  		WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> >>>>  		if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
> >>>>  			jbase = 1;
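The "j = jiffies - 1" above guards against the race Paul describes; the
numbers below are made-up values for illustration, not from the thread:

/*
 * Suppose a grace period starts just before the jiffies counter ticks:
 *
 *   ssp->srcu_gp_start = 1000;   // recorded immediately before the tick
 *   jiffies            = 1001;   // incremented a moment later
 *
 * Old code: jbase += jiffies - gp_start == 1, so a grace period that is
 * only microseconds old is already charged one full jiffy of delay.
 *
 * Patched code:
 *   j = jiffies - 1;             // j == 1000
 *   if (time_after(j, gpstart))  // 1000 is not after 1000...
 *           jbase += j - gpstart;  // ...so no delay is added
 *
 * A just-started grace period thus stays in the no-delay fast path
 * instead of being penalized for an accident of timer phase.
 */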
> >>> Unfortunately, this patch does not help.
> >>>
> >>> I then re-added the debug info.  During the qemu boot:
> >>>
> >>> [  232.997667] __synchronize_srcu loop=1000
> >>>
> >>> [  361.094493] __synchronize_srcu loop=9000
> >>> [  361.094501] Call trace:
> >>> [  361.094502]  dump_backtrace+0xe4/0xf0
> >>> [  361.094505]  show_stack+0x20/0x70
> >>> [  361.094507]  dump_stack_lvl+0x8c/0xb8
> >>> [  361.094509]  dump_stack+0x18/0x34
> >>> [  361.094511]  __synchronize_srcu+0x120/0x128
> >>> [  361.094514]  synchronize_srcu_expedited+0x2c/0x40
> >>> [  361.094515]  kvm_swap_active_memslots+0x130/0x198
> >>> [  361.094519]  kvm_activate_memslot+0x40/0x68
> >>> [  361.094520]  kvm_set_memslot+0x2f8/0x3b0
> >>> [  361.094523]  __kvm_set_memory_region+0x2e4/0x438
> >>> [  361.094524]  kvm_set_memory_region+0x78/0xb8
> >>> [  361.094526]  kvm_vm_ioctl+0x5a0/0x13e0
> >>> [  361.094528]  __arm64_sys_ioctl+0xb0/0xf8
> >>> [  361.094530]  invoke_syscall+0x4c/0x110
> >>> [  361.094533]  el0_svc_common.constprop.0+0x68/0x128
> >>> [  361.094536]  do_el0_svc+0x34/0xc0
> >>> [  361.094538]  el0_svc+0x30/0x98
> >>> [  361.094541]  el0t_64_sync_handler+0xb8/0xc0
> >>> [  361.094544]  el0t_64_sync+0x18c/0x190
> >>> [  363.942817] kvm_set_memory_region loop=6000
> >> Huh.
> >>
> >> One possibility is that the "if (!jbase)" block needs to be nested
> >> within the "if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {" block.
>
> I tested this diff, and it does not help either:
>
> diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
> index 50ba70f019de..36286a4b74e6 100644
> --- a/kernel/rcu/srcutree.c
> +++ b/kernel/rcu/srcutree.c
> @@ -513,7 +513,7 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
>
>  #define SRCU_INTERVAL		1	// Base delay if no expedited GPs pending.
>  #define SRCU_MAX_INTERVAL	10	// Maximum incremental delay from slow readers.
> -#define SRCU_MAX_NODELAY_PHASE	1	// Maximum per-GP-phase consecutive no-delay instances.
> +#define SRCU_MAX_NODELAY_PHASE	3	// Maximum per-GP-phase consecutive no-delay instances.
>  #define SRCU_MAX_NODELAY	100	// Maximum consecutive no-delay instances.
>
>  /*
> @@ -522,16 +522,23 @@ static bool srcu_readers_active(struct srcu_struct *ssp)
>   */
>  static unsigned long srcu_get_delay(struct srcu_struct *ssp)
>  {
> +	unsigned long gpstart;
> +	unsigned long j;
>  	unsigned long jbase = SRCU_INTERVAL;
>
>  	if (ULONG_CMP_LT(READ_ONCE(ssp->srcu_gp_seq), READ_ONCE(ssp->srcu_gp_seq_needed_exp)))
>  		jbase = 0;
> -	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq)))
> -		jbase += jiffies - READ_ONCE(ssp->srcu_gp_start);
> -	if (!jbase) {
> -		WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> -		if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
> -			jbase = 1;
> +	if (rcu_seq_state(READ_ONCE(ssp->srcu_gp_seq))) {
> +		j = jiffies - 1;
> +		gpstart = READ_ONCE(ssp->srcu_gp_start);
> +		if (time_after(j, gpstart))
> +			jbase += j - gpstart;
> +
> +		if (!jbase) {
> +			WRITE_ONCE(ssp->srcu_n_exp_nodelay, READ_ONCE(ssp->srcu_n_exp_nodelay) + 1);
> +			if (READ_ONCE(ssp->srcu_n_exp_nodelay) > SRCU_MAX_NODELAY_PHASE)
> +				jbase = 1;
> +		}
> +	}
>  }
>
> > And when I run 10,000 consecutive synchronize_rcu_expedited() calls, the
> > above change reduces the overhead by more than an order of magnitude.
> > Except that the overhead of the series is far less than one second,
> > not the several minutes that you are seeing.  So the per-call overhead
> > decreases from about 17 microseconds to a bit more than one microsecond.
> >
> > I could imagine an extra order of magnitude if you are running HZ=100
> > instead of the HZ=1000 that I am running.  But that only gets up to a
> > few seconds.
> >
> >> One additional debug is to apply the patch below on top of the one you
> apply the patch below?
> >> just now kindly tested, then use whatever debug technique you wish to
> >> work out what fraction of the time during that critical interval
> >> srcu_get_delay() returns non-zero.
>
> Sorry, I am confused -- there is no patch attached, right?
> Should I just measure how often srcu_get_delay() returns non-zero?
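One way to take that measurement, as a sketch only (the dbg_* counters
and the wrapper are made up for illustration, not a patch from this
thread):

/* Hypothetical debug instrumentation: count how often srcu_get_delay()
 * returns non-zero during the boot window, and dump the ratio once per
 * 1000 calls. */
static atomic_long_t dbg_calls;
static atomic_long_t dbg_nonzero;

static unsigned long srcu_get_delay_dbg(struct srcu_struct *ssp)
{
        unsigned long d = srcu_get_delay(ssp);

        if (d)
                atomic_long_inc(&dbg_nonzero);
        if (atomic_long_inc_return(&dbg_calls) % 1000 == 0)
                pr_info("srcu_get_delay: %ld of %ld calls returned non-zero\n",
                        atomic_long_read(&dbg_nonzero),
                        atomic_long_read(&dbg_calls));
        return d;
}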
> By the way, the issue seems to be related only to QEMU ACPI, not to the
> RMR feature.

No, this also occurs if you supply the guest's EFI with an empty set of
persistent variables.  EFI goes and zeroes it, which results in a
read-only memslot write access being taken to userspace, the memslot
being unmapped from the guest, QEMU doing a little dance, and eventually
restoring the memslot back to the guest.  Rinse, repeat.  Do that one
byte at a time over 64MB, and your boot time for EFI alone goes from 39s
to 3m50s (that's on a speed-challenged Synquacer box), which completely
kills the "deploy a new VM" use case.

Thanks,

	M.

--
Without deviation from the norm, progress is not possible.