On Mon, Nov 20, 2017 at 6:15 AM, Christoffer Dall <cdall@xxxxxxxxxx> wrote:
> On Thu, Nov 16, 2017 at 03:30:39PM -0500, Jintack Lim wrote:
>> Hi Christoffer,
>>
>> On Fri, Oct 20, 2017 at 7:49 AM, Christoffer Dall
>> <christoffer.dall@xxxxxxxxxx> wrote:
>> > From: Christoffer Dall <cdall@xxxxxxxxxx>
>> >
>> > We don't need to save and restore the hardware timer state and examine
>> > if it generates interrupts on every entry/exit to the guest.  The
>> > timer hardware is perfectly capable of telling us when it has expired
>> > by signaling interrupts.
>> >
>> > When taking a vtimer interrupt in the host, we don't want to mess with
>> > the timer configuration, we just want to forward the physical interrupt
>> > to the guest as a virtual interrupt.  We can use the split priority drop
>> > and deactivate feature of the GIC to do this, which leaves an EOI'ed
>> > interrupt active on the physical distributor, making sure we don't keep
>> > taking timer interrupts which would prevent the guest from running.  We
>> > can then forward the physical interrupt to the VM using the HW bit in
>> > the LR of the GIC, like we do already, which lets the guest directly
>> > deactivate both the physical and virtual timer simultaneously, allowing
>> > the timer hardware to exit the VM and generate a new physical interrupt
>> > when the timer output is again asserted later on.
>> >
>> > We do need to capture this state when migrating VCPUs between physical
>> > CPUs, however, which we use the vcpu put/load functions for, which are
>> > called through preempt notifiers whenever the thread is scheduled away
>> > from the CPU or called directly if we return from the ioctl to
>> > userspace.
>> >
>> > One caveat is that we have to save and restore the timer state in both
>> > kvm_timer_vcpu_[put/load] and kvm_timer_[schedule/unschedule], because
>> > we can have the following flows:
>> >
>> >   1. kvm_vcpu_block
>> >   2. kvm_timer_schedule
>> >   3. schedule
>> >   4. kvm_timer_vcpu_put (preempt notifier)
>> >   5. schedule (vcpu thread gets scheduled back)
>> >   6. kvm_timer_vcpu_load (preempt notifier)
>> >   7. kvm_timer_unschedule
>> >
>> > And a version where we don't actually call schedule:
>> >
>> >   1. kvm_vcpu_block
>> >   2. kvm_timer_schedule
>> >   7. kvm_timer_unschedule
>> >
>> > Since kvm_timer_[schedule/unschedule] may not be followed by put/load,
>> > but put/load also may be called independently, we call the timer
>> > save/restore functions from both paths.  Since they rely on the loaded
>> > flag to never save/restore when unnecessary, this doesn't cause any
>> > harm, and we ensure that all invocations of either set of functions work
>> > as intended.
>> >
>> > An added benefit beyond not having to read and write the timer sysregs
>> > on every entry and exit is that we no longer have to actively write the
>> > active state to the physical distributor, because we configured the
>> > irq for the vtimer to only get a priority drop when handling the
>> > interrupt in the GIC driver (we called irq_set_vcpu_affinity()), and
>> > the interrupt stays active after firing on the host.
>> >
>> > Signed-off-by: Christoffer Dall <cdall@xxxxxxxxxx>
>> > ---
>> >
>> > Notes:
>> >     Changes since v3:
>> >      - Added comments explaining the 'loaded' flag and made other clarifying
>> >        comments.
>> >      - No longer rely on the armed flag to conditionally save/restore state,
>> >        as we already rely on the 'loaded' flag to not repetitively
>> >        save/restore state.
>> >      - Reworded parts of the commit message.
>> >      - Removed renames not belonging to this patch.
>> >      - Added warning in kvm_arch_timer_handler in case we see spurious
>> >        interrupts, for example if the hardware doesn't retire the
>> >        level-triggered timer signal fast enough.
>> >
>> >  include/kvm/arm_arch_timer.h |  16 ++-
>> >  virt/kvm/arm/arch_timer.c    | 237 +++++++++++++++++++++++++++----------------
>> >  virt/kvm/arm/arm.c           |  19 +++-
>> >  3 files changed, 178 insertions(+), 94 deletions(-)
>> >
>> > diff --git a/include/kvm/arm_arch_timer.h b/include/kvm/arm_arch_timer.h
>> > index 184c3ef2df93..c538f707e1c1 100644
>> > --- a/include/kvm/arm_arch_timer.h
>> > +++ b/include/kvm/arm_arch_timer.h
>> > @@ -31,8 +31,15 @@ struct arch_timer_context {
>> >     /* Timer IRQ */
>> >     struct kvm_irq_level            irq;
>> >
>> > -   /* Active IRQ state caching */
>> > -   bool                            active_cleared_last;
>> > +   /*
>> > +    * We have multiple paths which can save/restore the timer state
>> > +    * onto the hardware, so we need some way of keeping track of
>> > +    * where the latest state is.
>> > +    *
>> > +    * loaded == true:  State is loaded on the hardware registers.
>> > +    * loaded == false: State is stored in memory.
>> > +    */
>> > +   bool                            loaded;
>> >
>> >     /* Virtual offset */
>> >     u64                             cntvoff;
>> > @@ -78,10 +85,15 @@ void kvm_timer_unschedule(struct kvm_vcpu *vcpu);
>> >
>> >  u64 kvm_phys_timer_read(void);
>> >
>> > +void kvm_timer_vcpu_load(struct kvm_vcpu *vcpu);
>> >  void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu);
>> >
>> >  void kvm_timer_init_vhe(void);
>> >
>> >  #define vcpu_vtimer(v) (&(v)->arch.timer_cpu.vtimer)
>> >  #define vcpu_ptimer(v) (&(v)->arch.timer_cpu.ptimer)
>> > +
>> > +void enable_el1_phys_timer_access(void);
>> > +void disable_el1_phys_timer_access(void);
>> > +
>> >  #endif
>> > diff --git a/virt/kvm/arm/arch_timer.c b/virt/kvm/arm/arch_timer.c
>> > index eac1b3d83a86..ec685c1f3b78 100644
>> > --- a/virt/kvm/arm/arch_timer.c
>> > +++ b/virt/kvm/arm/arch_timer.c
>> > @@ -46,10 +46,9 @@ static const struct kvm_irq_level default_vtimer_irq = {
>> >     .level  = 1,
>> >  };
>> >
>> > -void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
>> > -{
>> > -   vcpu_vtimer(vcpu)->active_cleared_last = false;
>> > -}
>> > +static bool kvm_timer_irq_can_fire(struct arch_timer_context *timer_ctx);
>> > +static void kvm_timer_update_irq(struct kvm_vcpu *vcpu, bool new_level,
>> > +                            struct arch_timer_context *timer_ctx);
>> >
>> >  u64 kvm_phys_timer_read(void)
>> >  {
>> > @@ -69,17 +68,45 @@ static void soft_timer_cancel(struct hrtimer *hrt, struct work_struct *work)
>> >     cancel_work_sync(work);
>> >  }
>> >
>> > -static irqreturn_t kvm_arch_timer_handler(int irq, void *dev_id)
>> > +static void kvm_vtimer_update_mask_user(struct kvm_vcpu *vcpu)
>> >  {
>> > -   struct kvm_vcpu *vcpu = *(struct kvm_vcpu **)dev_id;
>> > +   struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
>> >
>> >     /*
>> > -    * We disable the timer in the world switch and let it be
>> > -    * handled by kvm_timer_sync_hwstate(). Getting a timer
>> > -    * interrupt at this point is a sure sign of some major
>> > -    * breakage.
>> > +    * When using a userspace irqchip with the architected timers, we must
>> > +    * prevent continuously exiting from the guest, and therefore mask the
>> > +    * physical interrupt by disabling it on the host interrupt controller
>> > +    * when the virtual level is high, such that the guest can make
>> > +    * forward progress.  Once we detect the output level being
>> > +    * de-asserted, we unmask the interrupt again so that we exit from the
>> > +    * guest when the timer fires.
>> >      */
>> > -   pr_warn("Unexpected interrupt %d on vcpu %p\n", irq, vcpu);
>> > +   if (vtimer->irq.level)
>> > +           disable_percpu_irq(host_vtimer_irq);
>> > +   else
>> > +           enable_percpu_irq(host_vtimer_irq, 0);
>> > +}
>> > +
>> > +static irqreturn_t kvm_arch_timer_handler(int irq, void *dev_id)
>> > +{
>> > +   struct kvm_vcpu *vcpu = *(struct kvm_vcpu **)dev_id;
>> > +   struct arch_timer_context *vtimer;
>> > +
>> > +   if (!vcpu) {
>> > +           pr_warn_once("Spurious arch timer IRQ on non-VCPU thread\n");
>> > +           return IRQ_NONE;
>> > +   }
>> > +   vtimer = vcpu_vtimer(vcpu);
>> > +
>> > +   if (!vtimer->irq.level) {
>> > +           vtimer->cnt_ctl = read_sysreg_el0(cntv_ctl);
>> > +           if (kvm_timer_irq_can_fire(vtimer))
>> > +                   kvm_timer_update_irq(vcpu, true, vtimer);
>> > +   }
>> > +
>> > +   if (unlikely(!irqchip_in_kernel(vcpu->kvm)))
>> > +           kvm_vtimer_update_mask_user(vcpu);
>> > +
>> >     return IRQ_HANDLED;
>> >  }
>> >
>> > @@ -215,7 +242,6 @@ static void kvm_timer_update_irq(struct kvm_vcpu *vcpu, bool new_level,
>> >  {
>> >     int ret;
>> >
>> > -   timer_ctx->active_cleared_last = false;
>> >     timer_ctx->irq.level = new_level;
>> >     trace_kvm_timer_update_irq(vcpu->vcpu_id, timer_ctx->irq.irq,
>> >                                timer_ctx->irq.level);
>> > @@ -271,10 +297,16 @@ static void phys_timer_emulate(struct kvm_vcpu *vcpu,
>> >     soft_timer_start(&timer->phys_timer, kvm_timer_compute_delta(timer_ctx));
>> >  }
>> >
>> > -static void timer_save_state(struct kvm_vcpu *vcpu)
>> > +static void vtimer_save_state(struct kvm_vcpu *vcpu)
>> >  {
>> >     struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
>> >     struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
>> > +   unsigned long flags;
>> > +
>> > +   local_irq_save(flags);
>> > +
>> > +   if (!vtimer->loaded)
>> > +           goto out;
>> >
>> >     if (timer->enabled) {
>> >             vtimer->cnt_ctl = read_sysreg_el0(cntv_ctl);
>> > @@ -283,6 +315,10 @@ static void timer_save_state(struct kvm_vcpu *vcpu)
>> >
>> >     /* Disable the virtual timer */
>> >     write_sysreg_el0(0, cntv_ctl);
>> > +
>> > +   vtimer->loaded = false;
>> > +out:
>> > +   local_irq_restore(flags);
>> >  }
>> >
>> >  /*
>> > @@ -296,6 +332,8 @@ void kvm_timer_schedule(struct kvm_vcpu *vcpu)
>> >     struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
>> >     struct arch_timer_context *ptimer = vcpu_ptimer(vcpu);
>> >
>> > +   vtimer_save_state(vcpu);
>> > +
>> >     /*
>> >      * No need to schedule a background timer if any guest timer has
>> >      * already expired, because kvm_vcpu_block will return before putting
>> > @@ -318,22 +356,34 @@ void kvm_timer_schedule(struct kvm_vcpu *vcpu)
>> >     soft_timer_start(&timer->bg_timer, kvm_timer_earliest_exp(vcpu));
>> >  }
>> >
>> > -static void timer_restore_state(struct kvm_vcpu *vcpu)
>> > +static void vtimer_restore_state(struct kvm_vcpu *vcpu)
>> >  {
>> >     struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
>> >     struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
>> > +   unsigned long flags;
>> > +
>> > +   local_irq_save(flags);
>> > +
>> > +   if (vtimer->loaded)
>> > +           goto out;
>> >
>> >     if (timer->enabled) {
>> >             write_sysreg_el0(vtimer->cnt_cval, cntv_cval);
>> >             isb();
>> >             write_sysreg_el0(vtimer->cnt_ctl, cntv_ctl);
>> >     }
>> > +
>> > +   vtimer->loaded = true;
>> > +out:
>> > +   local_irq_restore(flags);
>> >  }
>> >
>> >  void kvm_timer_unschedule(struct kvm_vcpu *vcpu)
>> >  {
>> >     struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
>> >
>> > +   vtimer_restore_state(vcpu);
>> > +
>> >     soft_timer_cancel(&timer->bg_timer, &timer->expired);
>> >  }
>> >
>> > @@ -352,61 +402,45 @@ static void set_cntvoff(u64 cntvoff)
>> >     kvm_call_hyp(__kvm_timer_set_cntvoff, low, high);
>> >  }
>> >
>> > -static void kvm_timer_flush_hwstate_vgic(struct kvm_vcpu *vcpu)
>> > +static void kvm_timer_vcpu_load_vgic(struct kvm_vcpu *vcpu)
>> >  {
>> >     struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
>> >     bool phys_active;
>> >     int ret;
>> >
>> > -   /*
>> > -    * If we enter the guest with the virtual input level to the VGIC
>> > -    * asserted, then we have already told the VGIC what we need to, and
>> > -    * we don't need to exit from the guest until the guest deactivates
>> > -    * the already injected interrupt, so therefore we should set the
>> > -    * hardware active state to prevent unnecessary exits from the guest.
>> > -    *
>> > -    * Also, if we enter the guest with the virtual timer interrupt active,
>> > -    * then it must be active on the physical distributor, because we set
>> > -    * the HW bit and the guest must be able to deactivate the virtual and
>> > -    * physical interrupt at the same time.
>> > -    *
>> > -    * Conversely, if the virtual input level is deasserted and the virtual
>> > -    * interrupt is not active, then always clear the hardware active state
>> > -    * to ensure that hardware interrupts from the timer triggers a guest
>> > -    * exit.
>> > -    */
>> >     phys_active = vtimer->irq.level ||
>> > -                 kvm_vgic_map_is_active(vcpu, vtimer->irq.irq);
>> > -
>> > -   /*
>> > -    * We want to avoid hitting the (re)distributor as much as
>> > -    * possible, as this is a potentially expensive MMIO access
>> > -    * (not to mention locks in the irq layer), and a solution for
>> > -    * this is to cache the "active" state in memory.
>> > -    *
>> > -    * Things to consider: we cannot cache an "active set" state,
>> > -    * because the HW can change this behind our back (it becomes
>> > -    * "clear" in the HW). We must then restrict the caching to
>> > -    * the "clear" state.
>> > -    *
>> > -    * The cache is invalidated on:
>> > -    * - vcpu put, indicating that the HW cannot be trusted to be
>> > -    *   in a sane state on the next vcpu load,
>> > -    * - any change in the interrupt state
>> > -    *
>> > -    * Usage conditions:
>> > -    * - cached value is "active clear"
>> > -    * - value to be programmed is "active clear"
>> > -    */
>> > -   if (vtimer->active_cleared_last && !phys_active)
>> > -           return;
>> > +                 kvm_vgic_map_is_active(vcpu, vtimer->irq.irq);
>> >
>> >     ret = irq_set_irqchip_state(host_vtimer_irq,
>> >                                 IRQCHIP_STATE_ACTIVE,
>> >                                 phys_active);
>> >     WARN_ON(ret);
>> > +}
>> >
>> > -   vtimer->active_cleared_last = !phys_active;
>> > +static void kvm_timer_vcpu_load_user(struct kvm_vcpu *vcpu)
>> > +{
>> > +   kvm_vtimer_update_mask_user(vcpu);
>> > +}
>> > +
>> > +void kvm_timer_vcpu_load(struct kvm_vcpu *vcpu)
>> > +{
>> > +   struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
>> > +   struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
>> > +
>> > +   if (unlikely(!timer->enabled))
>> > +           return;
>> > +
>> > +   if (unlikely(!irqchip_in_kernel(vcpu->kvm)))
>> > +           kvm_timer_vcpu_load_user(vcpu);
>> > +   else
>> > +           kvm_timer_vcpu_load_vgic(vcpu);
>> > +
>> > +   set_cntvoff(vtimer->cntvoff);
>> > +
>> > +   vtimer_restore_state(vcpu);
>> > +
>> > +   if (has_vhe())
>> > +           disable_el1_phys_timer_access();
>>
>> Same question here :)
>>
>
> Same answer as below.
>
>> > }
>> >
>> >  bool kvm_timer_should_notify_user(struct kvm_vcpu *vcpu)
>> > @@ -426,23 +460,6 @@ bool kvm_timer_should_notify_user(struct kvm_vcpu *vcpu)
>> >             ptimer->irq.level != plevel;
>> >  }
>> >
>> > -static void kvm_timer_flush_hwstate_user(struct kvm_vcpu *vcpu)
>> > -{
>> > -   struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
>> > -
>> > -   /*
>> > -    * To prevent continuously exiting from the guest, we mask the
>> > -    * physical interrupt such that the guest can make forward progress.
>> > -    * Once we detect the output level being deasserted, we unmask the
>> > -    * interrupt again so that we exit from the guest when the timer
>> > -    * fires.
>> > -    */
>> > -   if (vtimer->irq.level)
>> > -           disable_percpu_irq(host_vtimer_irq);
>> > -   else
>> > -           enable_percpu_irq(host_vtimer_irq, 0);
>> > -}
>> > -
>> >  /**
>> >   * kvm_timer_flush_hwstate - prepare timers before running the vcpu
>> >   * @vcpu: The vcpu pointer
>> > @@ -455,23 +472,61 @@ static void kvm_timer_flush_hwstate_user(struct kvm_vcpu *vcpu)
>> >  void kvm_timer_flush_hwstate(struct kvm_vcpu *vcpu)
>> >  {
>> >     struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
>> > -   struct arch_timer_context *vtimer = vcpu_vtimer(vcpu);
>> > +   struct arch_timer_context *ptimer = vcpu_ptimer(vcpu);
>> >
>> >     if (unlikely(!timer->enabled))
>> >             return;
>> >
>> > -   kvm_timer_update_state(vcpu);
>> > +   if (kvm_timer_should_fire(ptimer) != ptimer->irq.level)
>> > +           kvm_timer_update_irq(vcpu, !ptimer->irq.level, ptimer);
>> >
>> >     /* Set the background timer for the physical timer emulation. */
>> >     phys_timer_emulate(vcpu, vcpu_ptimer(vcpu));
>> > +}
>> >
>> > -   if (unlikely(!irqchip_in_kernel(vcpu->kvm)))
>> > -           kvm_timer_flush_hwstate_user(vcpu);
>> > -   else
>> > -           kvm_timer_flush_hwstate_vgic(vcpu);
>> > +void kvm_timer_vcpu_put(struct kvm_vcpu *vcpu)
>> > +{
>> > +   struct arch_timer_cpu *timer = &vcpu->arch.timer_cpu;
>> >
>> > -   set_cntvoff(vtimer->cntvoff);
>> > -   timer_restore_state(vcpu);
>> > +   if (unlikely(!timer->enabled))
>> > +           return;
>> > +
>> > +   if (has_vhe())
>> > +           enable_el1_phys_timer_access();
>>
>> I wonder why we need to enable the EL1 physical timer access on VHE
>> systems (assuming TGE bit is set at this point)? EL2 can access it
>> regardless of EL1PTEN bit status, and EL0 access is controlled by
>> EL0PTEN.
>
> Yeah, my code is bogus, you already addressed that.  I think I wrote the
> first version of these patches prior to you fixing the physical timer
> trap configuration for VHE systems.
>
>>
>> In any case, since the cnthctl_el2 format is changed when E2H == 1, don't
>> we need to consider this in the enable_el1_phys_timer_access() function
>> implementation?
>>
>
> You are indeed right.  Nice catch!

Thanks for the confirmation!

> Fix incoming.
>
> -Christoffer
>