Re: [PATCH 6/6] KVM: x86: Emulator support for TF

Gleb Natapov <gleb@xxxxxxxxxx> · Tue, 23 Feb 2010 13:00:54 +0200



On Tue, Feb 23, 2010 at 11:37:21AM +0100, Jan Kiszka wrote:
> Gleb Natapov wrote:
> > On Tue, Feb 23, 2010 at 11:10:57AM +0100, Jan Kiszka wrote:
> >> Gleb Natapov wrote:
> >>> On Mon, Feb 22, 2010 at 06:51:23PM +0100, Jan Kiszka wrote:
> >>>> Support both guest- as well as host-owned EFLAGS.TF while emulating
> >>>> instructions. For guest-owned TF, we simply inject DB and update DR6.BS
> >>>> after completing an instruction that has TF set on entry. To support
> >>>> guest single-stepping under host control, we store the pending step
> >>>> along with its CS and RIP and trigger a corresponding user space exit
> >>>> once guest execution is about to resume. This check is is also required
> >>>> in the VMX emulation loop during invalid guest states.
> >>>>
> >>> Emulator currently is a total mess. It is not a good time to add more mess
> >>> there right now IMO.
> >> Then let's clean up what you consider "mess" in this feature. Unless
> >> there are plans to clean up the emulator for the next or next-but-one
> >> kernel release, I do not want to wait for this.
> >>
> > There are plans to cleanup the emulator.
> 
> When?
ASAP :) I am looking into that, but it will not be easy.

> 
> > 
> >>>> Signed-off-by: Jan Kiszka <jan.kiszka@xxxxxxxxxxx>
> >>>> ---
> >>>>  arch/x86/include/asm/kvm_host.h |    5 +++
> >>>>  arch/x86/kvm/vmx.c              |    6 +++
> >>>>  arch/x86/kvm/x86.c              |   65 ++++++++++++++++++++++++++++++++++-----
> >>>>  3 files changed, 68 insertions(+), 8 deletions(-)
> >>>>
> >>>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> >>>> index d46e791..d69d8aa 100644
> >>>> --- a/arch/x86/include/asm/kvm_host.h
> >>>> +++ b/arch/x86/include/asm/kvm_host.h
> >>>> @@ -362,8 +362,11 @@ struct kvm_vcpu_arch {
> >>>>  	u64 *mce_banks;
> >>>>  
> >>>>  	/* used for guest single stepping over the given code position */
> >>>> +	bool singlestep_pending;
> >>>>  	u16 singlestep_cs;
> >>>> +	u16 singlestep_pending_cs;
> >>>>  	unsigned long singlestep_rip;
> >>>> +	unsigned long singlestep_pending_rip;
> >>> If we are going to have many of those rip/cs pairs may be it is better
> >>> to add structure linear_ip and have functions is_same_ip().
> >> Agreed.
> >>
> >>>  
> >>>>  	/* fields used by HYPER-V emulation */
> >>>>  	u64 hv_vapic;
> >>>>  };
> >>>> @@ -820,4 +823,6 @@ int kvm_cpu_get_interrupt(struct kvm_vcpu *v);
> >>>>  void kvm_define_shared_msr(unsigned index, u32 msr);
> >>>>  void kvm_set_shared_msr(unsigned index, u64 val, u64 mask);
> >>>>  
> >>>> +int kvm_check_guest_singlestep(struct kvm_vcpu *vcpu);
> >>>> +
> >>>>  #endif /* _ASM_X86_KVM_HOST_H */
> >>>> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> >>>> index d772476..317828f 100644
> >>>> --- a/arch/x86/kvm/vmx.c
> >>>> +++ b/arch/x86/kvm/vmx.c
> >>>> @@ -3489,6 +3489,12 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
> >>>>  			goto out;
> >>>>  		if (need_resched())
> >>>>  			schedule();
> >>>> +
> >>>> +		if (unlikely(vcpu->arch.singlestep_pending)) {
> >>>> +			ret = kvm_check_guest_singlestep(vcpu);
> >>>> +			if (ret == 0)
> >>>> +				goto out;
> >>>> +		}
> >>>>  	}
> >>>>  
> >>>>  	vmx->emulation_required = 0;
> >>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> >>>> index 19e8b28..6ebebb9 100644
> >>>> --- a/arch/x86/kvm/x86.c
> >>>> +++ b/arch/x86/kvm/x86.c
> >>>> @@ -3441,6 +3441,27 @@ static void cache_all_regs(struct kvm_vcpu *vcpu)
> >>>>  	vcpu->arch.regs_dirty = ~0;
> >>>>  }
> >>>>  
> >>>> +static u16 get_segment_selector(struct kvm_vcpu *vcpu, int seg)
> >>>> +{
> >>>> +	struct kvm_segment kvm_seg;
> >>>> +
> >>>> +	kvm_get_segment(vcpu, &kvm_seg, seg);
> >>>> +	return kvm_seg.selector;
> >>>> +}
> >>>> +
> >>>> +static void queue_singlestep(struct kvm_vcpu *vcpu)
> >>>> +{
> >>>> +	if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) {
> >>>> +		vcpu->arch.singlestep_pending = true;
> >>>> +		vcpu->arch.singlestep_pending_cs =
> >>>> +			get_segment_selector(vcpu, VCPU_SREG_CS);
> >>>> +		vcpu->arch.singlestep_pending_rip = kvm_rip_read(vcpu);
> >>> Why should we remember rip where TF happened? We should exit
> >>> immediately to userspace anyway, no?
> >> I think MMIO exits takes precedence, so this is intended to exit after
> >> they completed, ie. after the instruction is fully finished.
> >>
> >>>> +	} else {
> >>>> +		vcpu->arch.dr6 |= DR6_BS;
> >>>> +		kvm_queue_exception(vcpu, DB_VECTOR);
> >>> What if instruction emulation generated fault?
> >> Fault-like exceptions will trigger before that, and the instruction
> >> won't complete. Do we have any trap-like exceptions to worry about?
> >>
> > They will not trigger before that. They will be queued for the next
> > entry and queuing another one will either overwrite the previous one,
> > or will queue double fault (depending on what what the first exception).
> 
> The will not stack as the instruction failed, thus no singlestep will be
> queued as well.
Instruction failed doesn't mean emulation failed, so lets see what
happens when you single step over instruction that generates page fault.
#PF is queued and x86_emulate_insn() returns 0 to emulate_instruction()
no you call queue_singlestep() which calls kvm_queue_exception(vcpu, DB_VECTOR);
and this cause #DF to be injected.

> 
> > 
> >>>> +	}
> >>>> +}
> >>>> +
> >>>>  int emulate_instruction(struct kvm_vcpu *vcpu,
> >>>>  			unsigned long cr2,
> >>>>  			u16 error_code,
> >>>> @@ -3449,6 +3470,7 @@ int emulate_instruction(struct kvm_vcpu *vcpu,
> >>>>  	int r, shadow_mask;
> >>>>  	struct decode_cache *c;
> >>>>  	struct kvm_run *run = vcpu->run;
> >>>> +	bool singlestep;
> >>>>  
> >>>>  	kvm_clear_exception_queue(vcpu);
> >>>>  	vcpu->arch.mmio_fault_cr2 = cr2;
> >>>> @@ -3515,8 +3537,12 @@ int emulate_instruction(struct kvm_vcpu *vcpu,
> >>>>  		}
> >>>>  	}
> >>>>  
> >>>> +	singlestep = vcpu->arch.emulate_ctxt.eflags & X86_EFLAGS_TF;
> >>>> +
> >>>>  	if (emulation_type & EMULTYPE_SKIP) {
> >>>>  		kvm_rip_write(vcpu, vcpu->arch.emulate_ctxt.decode.eip);
> >>>> +		if (singlestep)
> >>>> +			queue_singlestep(vcpu);
> >>> Instruction that wasn't emulated shouldn't generate faults.
> >>>
> >> Skipping here doesn't mean it's not emulated. A valid question might be
> >> if we should catch it here or in skip_emulated_instruction.
> > In skip_emulated_instruction() or even above that. Only at the point we
> > are sure we actually emulate instruction.
> > 
> >>>>  		return EMULATE_DONE;
> >>>>  	}
> >>>>  
> >>>> @@ -3549,6 +3575,9 @@ int emulate_instruction(struct kvm_vcpu *vcpu,
> >>>>  
> >>>>  	kvm_x86_ops->set_rflags(vcpu, vcpu->arch.emulate_ctxt.eflags);
> >>>>  
> >>>> +	if (singlestep)
> >>>> +		queue_singlestep(vcpu);
> >>>> +
> >>> if vcpu->mmio_is_write == true we can still exit with DO_MMIO, so
> >>> instruction is not yes completely executed. This queue_singlestep
> >>> mechanism looks bogus anyway. emulate_instruction() caller should
> >>> initiate exit to userspace space if required.
> >> That's while it is _queued_, not immediately delivered: MMIO exits will
> >> continue to take precedence.
> > That is bogus. We should queue TF only after instruction is completely
This should have been "We should queue #DB ..".

> > emulated, not during emulation. Instruction may generate dozen MMIO
> > exits and eventually be aborted by exception, so DB shouldn't be generated
> > at all in this case.
> 
> How to detect this? This potential looping over user space caused sever
> headache will designing the TF feature. I thought I got all cases.
> 
How to detect that emulation is complete? emulate_instruction() should
return EMULATE_DONE in this case.

> The current model is: if we first execute a different instruction than
> the one that singlestep targets at, singlestep should be cleared without
> having any effect. Re-executing the stepped instruction after a
> potential exception resolution should then reassert singlestep and
> requeue it properly.
> 
> > 
> >>>>  	if (vcpu->mmio_is_write) {
> >>>>  		vcpu->mmio_needed = 0;
> >>>>  		return EMULATE_DO_MMIO;
> >>>> @@ -4450,6 +4479,26 @@ out:
> >>>>  	return r;
> >>>>  }
> >>>>  
> >>>> +int kvm_check_guest_singlestep(struct kvm_vcpu *vcpu)
> >>>> +{
> >>>> +	unsigned long rip = kvm_rip_read(vcpu);
> >>>> +
> >>>> +	vcpu->arch.singlestep_pending = false;
> >>>> +
> >>>> +	if (vcpu->arch.singlestep_pending_cs !=
> >>>> +		get_segment_selector(vcpu, VCPU_SREG_CS) ||
> >>>> +	    vcpu->arch.singlestep_pending_rip != rip)
> >>>> +		return 1;
> >>>> +
> >>> Again how this check can be false?
> >> E.g. someone fiddled with the VCPU state, resetting the guest.
> > How is this someone? It should reset this state too then.
> 
> If somehow possible, I do not want to make this state part of the user
> visible VCPU state but rather translate it into a pending #DB + set
> DR6.BS (for guest-owned TF) or drop it (for host-owned - not critical here).
As far as I can see for guest owned TF you always inject #DB immediately
in queue_singlestep().  And I think this should always be the case. At
the end of instruction emulator should inject #DB. When we enter vcpu
 or emulator again  if #DB interception is enables and #DB is pending
intercept function should be called.

> 
> > 
> >>>> +	vcpu->run->debug.arch.dr6 = DR6_BS | DR6_FIXED_1;
> >>>> +	vcpu->run->debug.arch.dr7 = 0;
> >>>> +	vcpu->run->exit_reason = KVM_EXIT_DEBUG;
> >>>> +	vcpu->run->debug.arch.pc = get_segment_base(vcpu, VCPU_SREG_CS) + rip;
> >>>> +	vcpu->run->debug.arch.exception = DB_VECTOR;
> >>>> +
> >>>> +	return 0;
> >>>> +}
> >>>> +EXPORT_SYMBOL_GPL(kvm_check_guest_singlestep);
> >>>>  
> >>>>  static int __vcpu_run(struct kvm_vcpu *vcpu)
> >>>>  {
> >>>> @@ -4471,6 +4520,12 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
> >>>>  
> >>>>  	r = 1;
> >>>>  	while (r > 0) {
> >>>> +		if (unlikely(vcpu->arch.singlestep_pending)) {
> >>>> +			r = kvm_check_guest_singlestep(vcpu);
> >>>> +			if (r == 0)
> >>>> +				break;
> >>>> +		}
> >>>> +
> >>> Why not use existing mechanism to cause run loop to exit to userspace
> >>> i.e return 0 from vcpu_enter_guest(), instead of adding special cases here?
> >>>
> >> I wanted to exit in case of vcpu->arch.mp_state != KVM_MP_STATE_RUNNABLE
> >> as well, but thinking about this again it's actually more reasonable to
> >> exit once the VCPU unblocks again, e.g. once halt resumes.
> >>
> > We will exit immediately after halt if return value is 0, why would you
> > want to exit on vcpu entry after that?
> 
> As I said: I will move this into vcpu_enter_guest.
> 
> Jan
> 
> -- 
> Siemens AG, Corporate Technology, CT T DE IT 1
> Corporate Competence Center Embedded Linux

--
			Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html