Re: [PATCH v3 1/3] arm/arm64: KVM: Use set/way op trapping to track the state of the caches

Marc Zyngier <marc.zyngier@xxxxxxx> · Tue, 27 Jan 2015 13:44:05 +0000

On 27/01/15 13:17, Christoffer Dall wrote:
> On Tue, Jan 27, 2015 at 11:21:38AM +0000, Marc Zyngier wrote:
>> On 26/01/15 22:58, Christoffer Dall wrote:
>>> On Wed, Jan 21, 2015 at 06:39:46PM +0000, Marc Zyngier wrote:
>>>> Trying to emulate the behaviour of set/way cache ops is fairly
>>>> pointless, as there are too many ways we can end-up missing stuff.
>>>> Also, there is some system caches out there that simply ignore
>>>> set/way operations.
>>>>
>>>> So instead of trying to implement them, let's convert it to VA ops,
>>>> and use them as a way to re-enable the trapping of VM ops. That way,
>>>> we can detect the point when the MMU/caches are turned off, and do
>>>> a full VM flush (which is what the guest was trying to do anyway).
>>>>
>>>> This allows a 32bit zImage to boot on the APM thingy, and will
>>>> probably help bootloaders in general.
>>>>
>>>> Signed-off-by: Marc Zyngier <marc.zyngier@xxxxxxx>
>>>
>>> This had some conflicts with dirty page logging.  I fixed it up here,
>>> and also removed some trailing white space and mixed spaces/tabs that
>>> patch complained about:
>>>
>>> http://git.linaro.org/people/christoffer.dall/linux-kvm-arm.git mm-fixes
>>
>> Thanks for doing so.
>>
>>>> ---
>>>>  arch/arm/include/asm/kvm_emulate.h   | 10 +++++
>>>>  arch/arm/include/asm/kvm_host.h      |  3 --
>>>>  arch/arm/include/asm/kvm_mmu.h       |  3 +-
>>>>  arch/arm/kvm/arm.c                   | 10 -----
>>>>  arch/arm/kvm/coproc.c                | 64 ++++++------------------------
>>>>  arch/arm/kvm/coproc_a15.c            |  2 +-
>>>>  arch/arm/kvm/coproc_a7.c             |  2 +-
>>>>  arch/arm/kvm/mmu.c                   | 70 ++++++++++++++++++++++++++++++++-
>>>>  arch/arm/kvm/trace.h                 | 39 +++++++++++++++++++
>>>>  arch/arm64/include/asm/kvm_emulate.h | 10 +++++
>>>>  arch/arm64/include/asm/kvm_host.h    |  3 --
>>>>  arch/arm64/include/asm/kvm_mmu.h     |  3 +-
>>>>  arch/arm64/kvm/sys_regs.c            | 75 +++++-------------------------------
>>>>  13 files changed, 155 insertions(+), 139 deletions(-)
>>>>
>>>> diff --git a/arch/arm/include/asm/kvm_emulate.h b/arch/arm/include/asm/kvm_emulate.h
>>>> index 66ce176..7b01523 100644
>>>> --- a/arch/arm/include/asm/kvm_emulate.h
>>>> +++ b/arch/arm/include/asm/kvm_emulate.h
>>>> @@ -38,6 +38,16 @@ static inline void vcpu_reset_hcr(struct kvm_vcpu *vcpu)
>>>>       vcpu->arch.hcr = HCR_GUEST_MASK;
>>>>  }
>>>>
>>>> +static inline unsigned long vcpu_get_hcr(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +     return vcpu->arch.hcr;
>>>> +}
>>>> +
>>>> +static inline void vcpu_set_hcr(struct kvm_vcpu *vcpu, unsigned long hcr)
>>>> +{
>>>> +     vcpu->arch.hcr = hcr;
>>>> +}
>>>> +
>>>>  static inline bool vcpu_mode_is_32bit(struct kvm_vcpu *vcpu)
>>>>  {
>>>>       return 1;
>>>> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
>>>> index 254e065..04b4ea0 100644
>>>> --- a/arch/arm/include/asm/kvm_host.h
>>>> +++ b/arch/arm/include/asm/kvm_host.h
>>>> @@ -125,9 +125,6 @@ struct kvm_vcpu_arch {
>>>>        * Anything that is not used directly from assembly code goes
>>>>        * here.
>>>>        */
>>>> -     /* dcache set/way operation pending */
>>>> -     int last_pcpu;
>>>> -     cpumask_t require_dcache_flush;
>>>>
>>>>       /* Don't run the guest on this vcpu */
>>>>       bool pause;
>>>> diff --git a/arch/arm/include/asm/kvm_mmu.h b/arch/arm/include/asm/kvm_mmu.h
>>>> index 63e0ecc..286644c 100644
>>>> --- a/arch/arm/include/asm/kvm_mmu.h
>>>> +++ b/arch/arm/include/asm/kvm_mmu.h
>>>> @@ -190,7 +190,8 @@ static inline void coherent_cache_guest_page(struct kvm_vcpu *vcpu, hva_t hva,
>>>>
>>>>  #define kvm_virt_to_phys(x)          virt_to_idmap((unsigned long)(x))
>>>>
>>>> -void stage2_flush_vm(struct kvm *kvm);
>>>> +void kvm_set_way_flush(struct kvm_vcpu *vcpu);
>>>> +void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled);
>>>>
>>>>  #endif       /* !__ASSEMBLY__ */
>>>>
>>>> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
>>>> index 2d6d910..0b0d58a 100644
>>>> --- a/arch/arm/kvm/arm.c
>>>> +++ b/arch/arm/kvm/arm.c
>>>> @@ -281,15 +281,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>>>>       vcpu->cpu = cpu;
>>>>       vcpu->arch.host_cpu_context = this_cpu_ptr(kvm_host_cpu_state);
>>>>
>>>> -     /*
>>>> -      * Check whether this vcpu requires the cache to be flushed on
>>>> -      * this physical CPU. This is a consequence of doing dcache
>>>> -      * operations by set/way on this vcpu. We do it here to be in
>>>> -      * a non-preemptible section.
>>>> -      */
>>>> -     if (cpumask_test_and_clear_cpu(cpu, &vcpu->arch.require_dcache_flush))
>>>> -             flush_cache_all(); /* We'd really want v7_flush_dcache_all() */
>>>> -
>>>>       kvm_arm_set_running_vcpu(vcpu);
>>>>  }
>>>>
>>>> @@ -541,7 +532,6 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *run)
>>>>               ret = kvm_call_hyp(__kvm_vcpu_run, vcpu);
>>>>
>>>>               vcpu->mode = OUTSIDE_GUEST_MODE;
>>>> -             vcpu->arch.last_pcpu = smp_processor_id();
>>>>               kvm_guest_exit();
>>>>               trace_kvm_exit(*vcpu_pc(vcpu));
>>>>               /*
>>>> diff --git a/arch/arm/kvm/coproc.c b/arch/arm/kvm/coproc.c
>>>> index 7928dbd..0afcc00 100644
>>>> --- a/arch/arm/kvm/coproc.c
>>>> +++ b/arch/arm/kvm/coproc.c
>>>> @@ -189,82 +189,40 @@ static bool access_l2ectlr(struct kvm_vcpu *vcpu,
>>>>       return true;
>>>>  }
>>>>
>>>> -/* See note at ARM ARM B1.14.4 */
>>>> +/*
>>>> + * See note at ARMv7 ARM B1.14.4 (TL;DR: S/W ops are not easily virtualized).
>>>> + */
>>>>  static bool access_dcsw(struct kvm_vcpu *vcpu,
>>>>                       const struct coproc_params *p,
>>>>                       const struct coproc_reg *r)
>>>>  {
>>>> -     unsigned long val;
>>>> -     int cpu;
>>>> -
>>>>       if (!p->is_write)
>>>>               return read_from_write_only(vcpu, p);
>>>>
>>>> -     cpu = get_cpu();
>>>> -
>>>> -     cpumask_setall(&vcpu->arch.require_dcache_flush);
>>>> -     cpumask_clear_cpu(cpu, &vcpu->arch.require_dcache_flush);
>>>> -
>>>> -     /* If we were already preempted, take the long way around */
>>>> -     if (cpu != vcpu->arch.last_pcpu) {
>>>> -             flush_cache_all();
>>>> -             goto done;
>>>> -     }
>>>> -
>>>> -     val = *vcpu_reg(vcpu, p->Rt1);
>>>> -
>>>> -     switch (p->CRm) {
>>>> -     case 6:                 /* Upgrade DCISW to DCCISW, as per HCR.SWIO */
>>>> -     case 14:                /* DCCISW */
>>>> -             asm volatile("mcr p15, 0, %0, c7, c14, 2" : : "r" (val));
>>>> -             break;
>>>> -
>>>> -     case 10:                /* DCCSW */
>>>> -             asm volatile("mcr p15, 0, %0, c7, c10, 2" : : "r" (val));
>>>> -             break;
>>>> -     }
>>>> -
>>>> -done:
>>>> -     put_cpu();
>>>> -
>>>> +     kvm_set_way_flush(vcpu);
>>>>       return true;
>>>>  }
>>>>
>>>>  /*
>>>>   * Generic accessor for VM registers. Only called as long as HCR_TVM
>>>> - * is set.
>>>> + * is set.  If the guest enables the MMU, we stop trapping the VM
>>>> + * sys_regs and leave it in complete control of the caches.
>>>> + *
>>>> + * Used by the cpu-specific code.
>>>>   */
>>>>  static bool access_vm_reg(struct kvm_vcpu *vcpu,
>>>>                         const struct coproc_params *p,
>>>>                         const struct coproc_reg *r)
>>>>  {
>>>> +     bool was_enabled = vcpu_has_cache_enabled(vcpu);
>>>> +
>>>>       BUG_ON(!p->is_write);
>>>>
>>>>       vcpu->arch.cp15[r->reg] = *vcpu_reg(vcpu, p->Rt1);
>>>>       if (p->is_64bit)
>>>>               vcpu->arch.cp15[r->reg + 1] = *vcpu_reg(vcpu, p->Rt2);
>>>>
>>>> -     return true;
>>>> -}
>>>> -
>>>> -/*
>>>> - * SCTLR accessor. Only called as long as HCR_TVM is set.  If the
>>>> - * guest enables the MMU, we stop trapping the VM sys_regs and leave
>>>> - * it in complete control of the caches.
>>>> - *
>>>> - * Used by the cpu-specific code.
>>>> - */
>>>> -bool access_sctlr(struct kvm_vcpu *vcpu,
>>>> -               const struct coproc_params *p,
>>>> -               const struct coproc_reg *r)
>>>> -{
>>>
>>> I think you left in a prototype in arch/arm/kvm/coproc.h
>>>
>>>> -     access_vm_reg(vcpu, p, r);
>>>> -
>>>> -     if (vcpu_has_cache_enabled(vcpu)) {     /* MMU+Caches enabled? */
>>>> -             vcpu->arch.hcr &= ~HCR_TVM;
>>>> -             stage2_flush_vm(vcpu->kvm);
>>>> -     }
>>>> -
>>>> +     kvm_toggle_cache(vcpu, was_enabled);
>>>>       return true;
>>>>  }
>>>>
>>>> diff --git a/arch/arm/kvm/coproc_a15.c b/arch/arm/kvm/coproc_a15.c
>>>> index e6f4ae4..a713675 100644
>>>> --- a/arch/arm/kvm/coproc_a15.c
>>>> +++ b/arch/arm/kvm/coproc_a15.c
>>>> @@ -34,7 +34,7 @@
>>>>  static const struct coproc_reg a15_regs[] = {
>>>>       /* SCTLR: swapped by interrupt.S. */
>>>>       { CRn( 1), CRm( 0), Op1( 0), Op2( 0), is32,
>>>> -                     access_sctlr, reset_val, c1_SCTLR, 0x00C50078 },
>>>> +                     access_vm_reg, reset_val, c1_SCTLR, 0x00C50078 },
>>>>  };
>>>>
>>>>  static struct kvm_coproc_target_table a15_target_table = {
>>>> diff --git a/arch/arm/kvm/coproc_a7.c b/arch/arm/kvm/coproc_a7.c
>>>> index 17fc7cd..b19e46d 100644
>>>> --- a/arch/arm/kvm/coproc_a7.c
>>>> +++ b/arch/arm/kvm/coproc_a7.c
>>>> @@ -37,7 +37,7 @@
>>>>  static const struct coproc_reg a7_regs[] = {
>>>>       /* SCTLR: swapped by interrupt.S. */
>>>>       { CRn( 1), CRm( 0), Op1( 0), Op2( 0), is32,
>>>> -                     access_sctlr, reset_val, c1_SCTLR, 0x00C50878 },
>>>> +                     access_vm_reg, reset_val, c1_SCTLR, 0x00C50878 },
>>>
>>> I'm confused how you tested this, this doesn't seem to compile for me
>>> (access_vm_reg is a static function).
>>
>> That's worrying. It looks like I dropped a fixup while rebasing the
>> series. Nothing major, but still... I'll fix that.
>>
> 
> no worries, just wanted to make sure we have tested the right thing.
> 
>>>>  };
>>>>
>>>>  static struct kvm_coproc_target_table a7_target_table = {
>>>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>>>> index 1dc9778..106737e 100644
>>>> --- a/arch/arm/kvm/mmu.c
>>>> +++ b/arch/arm/kvm/mmu.c
>>>> @@ -278,7 +278,7 @@ static void stage2_flush_memslot(struct kvm *kvm,
>>>>   * Go through the stage 2 page tables and invalidate any cache lines
>>>>   * backing memory already mapped to the VM.
>>>>   */
>>>> -void stage2_flush_vm(struct kvm *kvm)
>>>> +static void stage2_flush_vm(struct kvm *kvm)
>>>>  {
>>>>       struct kvm_memslots *slots;
>>>>       struct kvm_memory_slot *memslot;
>>>> @@ -1411,3 +1411,71 @@ void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
>>>>       unmap_stage2_range(kvm, gpa, size);
>>>>       spin_unlock(&kvm->mmu_lock);
>>>>  }
>>>> +
>>>> +/*
>>>> + * See note at ARMv7 ARM B1.14.4 (TL;DR: S/W ops are not easily virtualized).
>>>> + *
>>>> + * Main problems:
>>>> + * - S/W ops are local to a CPU (not broadcast)
>>>> + * - We have line migration behind our back (speculation)
>>>> + * - System caches don't support S/W at all (damn!)
>>>> + *
>>>> + * In the face of the above, the best we can do is to try and convert
>>>> + * S/W ops to VA ops. Because the guest is not allowed to infer the
>>>> + * S/W to PA mapping, it can only use S/W to nuke the whole cache,
>>>> + * which is a rather good thing for us.
>>>> + *
>>>> + * Also, it is only used when turning caches on/off ("The expected
>>>> + * usage of the cache maintenance instructions that operate by set/way
>>>> + * is associated with the cache maintenance instructions associated
>>>> + * with the powerdown and powerup of caches, if this is required by
>>>> + * the implementation.").
>>>> + *
>>>> + * We use the following policy:
>>>> + *
>>>> + * - If we trap a S/W operation, we enable VM trapping to detect
>>>> + *   caches being turned on/off, and do a full clean.
>>>> + *
>>>> + * - We flush the caches on both caches being turned on and off.
>>>> + *
>>>> + * - Once the caches are enabled, we stop trapping VM ops.
>>>> + */
>>>> +void kvm_set_way_flush(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +     unsigned long hcr = vcpu_get_hcr(vcpu);
>>>> +
>>>> +     /*
>>>> +      * If this is the first time we do a S/W operation
>>>> +      * (i.e. HCR_TVM not set) flush the whole memory, and set the
>>>> +      * VM trapping.
>>>
>>> what about when the guest boots, wouldn't it be using S/W operations to
>>> flush the whole cache before turning it on?  But we ignore this case now
>>> because we're already trapping VM regs so this is deferred until caches
>>> are turned on.
>>
>> Exactly. Doing S/W ops with caches off can always be postponed until we
>> turn it on (this is the point where it actually matters).
>>
>>> However, when we are turning off caches, we both flush the cache when
>>> doing the actual S/W operation and when the caches are then turned off.
>>> Why?
>>
>> Think of it as a "belt and braces" measure. The guest tries to do
>> something *at that point*, and we should respect that. It doesn't mean
>> it makes sense from our point of view though. This is what the 32bit
>> kernel does, BTW (have a look at early_cachepolicy -- Mark Rutland is
>> trying to get rid of this).
> 
> ok, and we don't need the "belt and braces" when the MMU is off because
> the guest is not accessing the caches anyhow?

Yes. We don't hit in the cache (since they are off), so we can safely
defer this until caches are turned on.

> So shouldn't we be detecting the guest MMU state instead? But we found
> this to be unstable before (for a still unknown reason), so we instead
> rely on always having trapping TVM set when the MMU is off?

Tracking the MMU state would be the ideal thing to do. It is not so much
that it is unreliable, but that detecting the MMU being turned off is
extremely expensive (TVM traps all accessed to the VM regs).

Also, as you mentioned, there is still a code path I don't really
understand in the 32bit decompressor, where we somehow failed to notice
that the MMU had been turned off (it was with another version of the
patch, and things have changed quite a bit since).

> So is there any conceivable guest (UEFI or something else) that could
> correctly be turning off the MMU without calling invalidate by set/way
> and creating a situation that we don't catch now?  Or is that not a
> problem because in the worst case we'll just be flushing too often?

If the guest flushes by VA, then we're safe. If it doesn't flush at all,
it is a blatant bug. In both cases, trapping the MMU access doesn't buy
us much.

> What about the opposite case, if we have trapping set, but we actually
> do need to flush the caches, how are we sure this never happens?

You lost me here. What is the exact case?

>>
>> So you can think of these as two different, rather unrelated events:
>> - Guest does S/W ops with caches on: we flush, knowing that this is a
>> best effort thing, probably doomed, but hey...
> 
> ok, so Linux is basically doing something architecturally invalid?

Indeed. That area of the code is fairly old and covers all versions of
the architecture, without much distinctions as to what is valid where...

>> - Guest turns its caches off: we flush, knowing that this is the right
>> time to do it.
>>
>> Hope this helps,
>>
> 
> Yes a bit, but this does feel incredibly brittle, but I know it may be
> the best option we have....

Yup, this is probably the worse part of the architecture. We have to
work our way through both architectural issues (inherited from previous
revisions), and legacy code that has hardly evolved since ARMv4 (and
basically works by luck, and the lack of a system cache).

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html