Re: [PATCH 2/2] KVM: PPC: booke: Add watchdog emulation

Alexander Graf <agraf@xxxxxxx> · Sat, 7 Jul 2012 09:50:51 +0200

On 07.07.2012, at 01:37, Scott Wood wrote:

> On 07/06/2012 08:17 AM, Alexander Graf wrote:
>> On 28.06.2012, at 08:17, Bharat Bhushan wrote:
>>> +/*
>>> + * The timer system can almost deal with LONG_MAX timeouts, except that
>>> + * when you get very close to LONG_MAX, the slack added can cause overflow.
>>> + *
>>> + * LONG_MAX/2 is a conservative threshold, but it should be adequate for
>>> + * any realistic use.
>>> + */
>>> +#define MAX_TIMEOUT (LONG_MAX/2)
>> 
>> Should this really be in kvm code?
> 
> It looks like we can use NEXT_TIMER_MAX_DELTA for this.
> 
>>> +	mask = 1ULL << (63 - period);
>>> +	tb = get_tb();
>>> +	if (tb & mask)
>>> +		nr_jiffies += mask;
>> 
>> To be honest, you lost me here. nr_jiffies is jiffies, right? While
>> mask is basically in timebase granularity. I suppose you're just
>> reusing the variable here for no good reason - the compiler will
>> gladly optimize things for you if you write things a bit more verbose
>> :).
> 
> Probably due to the way do_div() works, but yeah, it's confusing.  Maybe
> something generic like "ticks", "interval", "remaining", etc. would be
> better, with a comment on the do_div saying it's converting timebase
> ticks into jiffies.

Well, you could start off with a variable "delta_tb", then do

  nr_jiffies = delta_tb;
  x = do_div(...);

and things would suddenly become readable :). Of course I don't object to comments along the code either :).

> 
>>> +static void arm_next_watchdog(struct kvm_vcpu *vcpu)
>>> +{
>>> +	unsigned long nr_jiffies;
>>> +
>>> +	nr_jiffies = watchdog_next_timeout(vcpu);
>>> +	if (nr_jiffies < MAX_TIMEOUT)
>>> +		mod_timer(&vcpu->arch.wdt_timer, jiffies + nr_jiffies);
>>> +	else
>>> +		del_timer(&vcpu->arch.wdt_timer);
>> 
>> Can you del a timer that's not armed? Could that ever happen in this case?
> 
> "del_timer() deactivates a timer - this works on both active and
> inactive timers."

Ah, good :).

> 
>> Also, could the timer possibly be running somewhere, so do we need del_timer_sync? Or don't we need to care?
> 
> This can be called in the context of the timer, so del_timer_sync()
> would hang.
> 
> As for what would happen if a caller from a different context were to
> race with a timer, I think you could end up with the timer armed based
> on an old TCR.  del_timer_sync() won't help though, unless you replace
> mod_timer() with del_timer_sync() plus add_timer() (with a check to see
> whether it's running in timer context).  A better solution is probably
> to use a spinlock in arm_next_watchdog().

Yup. Either way, we have a race that the guest might not expect.

> 
>>> +void kvmppc_watchdog_func(unsigned long data)
>>> +{
>>> +	struct kvm_vcpu *vcpu = (struct kvm_vcpu *)data;
>>> +	u32 tsr, new_tsr;
>>> +	int final;
>>> +
>>> +	do {
>>> +		new_tsr = tsr = vcpu->arch.tsr;
>>> +		final = 0;
>>> +
>>> +		/* Time out event */
>>> +		if (tsr & TSR_ENW) {
>>> +			if (tsr & TSR_WIS) {
>>> +				new_tsr = (tsr & ~TCR_WRC_MASK) |
>>> +					  (vcpu->arch.tcr & TCR_WRC_MASK);
>>> +				vcpu->arch.tcr &= ~TCR_WRC_MASK;
>>> +				final = 1;
>>> +			} else {
>>> +				new_tsr = tsr | TSR_WIS;
>>> +			}
>>> +		} else {
>>> +			new_tsr = tsr | TSR_ENW;
>>> +		}
>>> +	} while (cmpxchg(&vcpu->arch.tsr, tsr, new_tsr) != tsr);
>>> +
>>> +	if (new_tsr & (TSR_WIS | TCR_WRC_MASK)) {
>>> +		smp_wmb();
>>> +		kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
>>> +		kvm_vcpu_kick(vcpu);
>>> +	}
>>> +
>>> +	/*
>>> +	 * Avoid getting a storm of timers if the guest sets
>>> +	 * the period very short.  We'll restart it if anything
>>> +	 * changes.
>>> +	 */
>>> +	if (!final)
>>> +		arm_next_watchdog(vcpu);
>> 
>> Mind to explain this part a bit further?
> 
> The whole function, or some subset near the end?
> 
> The "if (!final)" check means that we stop running the timer after final
> expiration, to prevent the host from being flooded with timers if the
> guest sets a short period but does not have TCR set to exit to QEMU.
> Timers will resume the next time TSR/TCR is updated.

Ah. The semantics make sense. The comment however is slightly too short. Please explain this in a more verbose way, so someone who didn't write the code knows what's going on :).

> 
>>> @@ -1106,7 +1213,14 @@ static int set_sregs_base(struct kvm_vcpu *vcpu,
>>> 	}
>>> 
>>> 	if (sregs->u.e.update_special & KVM_SREGS_E_UPDATE_TSR) {
>>> +		u32 old_tsr = vcpu->arch.tsr;
>>> +
>>> 		vcpu->arch.tsr = sregs->u.e.tsr;
>>> +
>>> +		if ((old_tsr ^ vcpu->arch.tsr) &
>>> +		    (TSR_ENW | TSR_WIS | TCR_WRC_MASK))
>>> +			arm_next_watchdog(vcpu);
>> 
>> Why isn't this one guarded by vcpu->arch.watchdog_enable?
> 
> I'm not sure that any of them should be -- there's no reason for the
> watchdog interrupt mechanism to be dependent on QEMU, only the
> heavyweight exit on final expiration.

Well - I like the concept of having new features switchable. Overlapping the "watchdog is implemented" feature with "user space wants watchdog exits" makes sense. But I definitely want to have a switch for the former, because we otherwise differ quite substantially from the emulation we had before.

> 
>>> +			spr_val &= ~TCR_WRC_MASK;
>>> +		kvmppc_set_tcr(vcpu,
>>> +		               spr_val | (TCR_WRC_MASK & vcpu->arch.tcr));
>> 
>> In fact, what you're trying to do here is keep TCR_WRC always on when it was enabled once. So all you need is the OR here. No need for the mask above.
> 
> WRC is a 2-bit field that is supposed to preserve its value once written
> to be non-zero.  Not that we actually do anything different based on the
> specific non-zero value, but still we should implement the architected
> semantics.

Ah, being 2 bits wide, the above code suddenly makes more sense :). How about

/* WRC is a 2-bit field that is supposed to preserve its value once written to be non-zero */
spr_val &= ~TCR_WRC_MASK;
spr_val |= vcpu->arch.tcr & TCR_WRC_MASK;
kvmppc_set_tcr(vcpu, spr_val);

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html