Re: [PATCH 2/2] KVM: PPC: booke: Add watchdog emulation

Alexander Graf <agraf@xxxxxxx> · Mon, 9 Jul 2012 19:29:41 +0200

On 09.07.2012, at 19:09, Scott Wood wrote:

> On 07/07/2012 02:50 AM, Alexander Graf wrote:
>> 
>> On 07.07.2012, at 01:37, Scott Wood wrote:
>> 
>>> On 07/06/2012 08:17 AM, Alexander Graf wrote:
>>>> On 28.06.2012, at 08:17, Bharat Bhushan wrote:
>>>>> +/*
>>>>> + * The timer system can almost deal with LONG_MAX timeouts, except that
>>>>> + * when you get very close to LONG_MAX, the slack added can cause overflow.
>>>>> + *
>>>>> + * LONG_MAX/2 is a conservative threshold, but it should be adequate for
>>>>> + * any realistic use.
>>>>> + */
>>>>> +#define MAX_TIMEOUT (LONG_MAX/2)
>>>> 
>>>> Should this really be in kvm code?
>>> 
>>> It looks like we can use NEXT_TIMER_MAX_DELTA for this.
>>> 
>>>>> +	mask = 1ULL << (63 - period);
>>>>> +	tb = get_tb();
>>>>> +	if (tb & mask)
>>>>> +		nr_jiffies += mask;
>>>> 
>>>> To be honest, you lost me here. nr_jiffies is jiffies, right? While
>>>> mask is basically in timebase granularity. I suppose you're just
>>>> reusing the variable here for no good reason - the compiler will
>>>> gladly optimize things for you if you write things a bit more verbose
>>>> :).
>>> 
>>> Probably due to the way do_div() works, but yeah, it's confusing.  Maybe
>>> something generic like "ticks", "interval", "remaining", etc. would be
>>> better, with a comment on the do_div saying it's converting timebase
>>> ticks into jiffies.
>> 
>> Well, you could start off with a variable "delta_tb", then do
>> 
>>  nr_jiffies = delta_tb;
>>  x = do_div(...);
>> 
>> and things would suddenly become readable :). Of course I don't object to comments along the code either :).
>> 
>>> 
>>>>> +static void arm_next_watchdog(struct kvm_vcpu *vcpu)
>>>>> +{
>>>>> +	unsigned long nr_jiffies;
>>>>> +
>>>>> +	nr_jiffies = watchdog_next_timeout(vcpu);
>>>>> +	if (nr_jiffies < MAX_TIMEOUT)
>>>>> +		mod_timer(&vcpu->arch.wdt_timer, jiffies + nr_jiffies);
>>>>> +	else
>>>>> +		del_timer(&vcpu->arch.wdt_timer);
>>>> 
>>>> Can you del a timer that's not armed? Could that ever happen in this case?
>>> 
>>> "del_timer() deactivates a timer - this works on both active and
>>> inactive timers."
>> 
>> Ah, good :).
>> 
>>> 
>>>> Also, could the timer possibly be running somewhere, so do we need del_timer_sync? Or don't we need to care?
>>> 
>>> This can be called in the context of the timer, so del_timer_sync()
>>> would hang.
>>> 
>>> As for what would happen if a caller from a different context were to
>>> race with a timer, I think you could end up with the timer armed based
>>> on an old TCR.  del_timer_sync() won't help though, unless you replace
>>> mod_timer() with del_timer_sync() plus add_timer() (with a check to see
>>> whether it's running in timer context).  A better solution is probably
>>> to use a spinlock in arm_next_watchdog().
>> 
>> Yup. Either way, we have a race that the guest might not expect.
>> 
>>> 
>>>>> +void kvmppc_watchdog_func(unsigned long data)
>>>>> +{
>>>>> +	struct kvm_vcpu *vcpu = (struct kvm_vcpu *)data;
>>>>> +	u32 tsr, new_tsr;
>>>>> +	int final;
>>>>> +
>>>>> +	do {
>>>>> +		new_tsr = tsr = vcpu->arch.tsr;
>>>>> +		final = 0;
>>>>> +
>>>>> +		/* Time out event */
>>>>> +		if (tsr & TSR_ENW) {
>>>>> +			if (tsr & TSR_WIS) {
>>>>> +				new_tsr = (tsr & ~TCR_WRC_MASK) |
>>>>> +					  (vcpu->arch.tcr & TCR_WRC_MASK);
>>>>> +				vcpu->arch.tcr &= ~TCR_WRC_MASK;
>>>>> +				final = 1;
>>>>> +			} else {
>>>>> +				new_tsr = tsr | TSR_WIS;
>>>>> +			}
>>>>> +		} else {
>>>>> +			new_tsr = tsr | TSR_ENW;
>>>>> +		}
>>>>> +	} while (cmpxchg(&vcpu->arch.tsr, tsr, new_tsr) != tsr);
>>>>> +
>>>>> +	if (new_tsr & (TSR_WIS | TCR_WRC_MASK)) {
>>>>> +		smp_wmb();
>>>>> +		kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
>>>>> +		kvm_vcpu_kick(vcpu);
>>>>> +	}
>>>>> +
>>>>> +	/*
>>>>> +	 * Avoid getting a storm of timers if the guest sets
>>>>> +	 * the period very short.  We'll restart it if anything
>>>>> +	 * changes.
>>>>> +	 */
>>>>> +	if (!final)
>>>>> +		arm_next_watchdog(vcpu);
>>>> 
>>>> Mind to explain this part a bit further?
>>> 
>>> The whole function, or some subset near the end?
>>> 
>>> The "if (!final)" check means that we stop running the timer after final
>>> expiration, to prevent the host from being flooded with timers if the
>>> guest sets a short period but does not have TCR set to exit to QEMU.
>>> Timers will resume the next time TSR/TCR is updated.
>> 
>> Ah. The semantics make sense. The comment however is slightly too short. Please explain this in a more verbose way, so someone who didn't write the code knows what's going on :).
>> 
>>> 
>>>>> @@ -1106,7 +1213,14 @@ static int set_sregs_base(struct kvm_vcpu *vcpu,
>>>>> 	}
>>>>> 
>>>>> 	if (sregs->u.e.update_special & KVM_SREGS_E_UPDATE_TSR) {
>>>>> +		u32 old_tsr = vcpu->arch.tsr;
>>>>> +
>>>>> 		vcpu->arch.tsr = sregs->u.e.tsr;
>>>>> +
>>>>> +		if ((old_tsr ^ vcpu->arch.tsr) &
>>>>> +		    (TSR_ENW | TSR_WIS | TCR_WRC_MASK))
>>>>> +			arm_next_watchdog(vcpu);
>>>> 
>>>> Why isn't this one guarded by vcpu->arch.watchdog_enable?
>>> 
>>> I'm not sure that any of them should be -- there's no reason for the
>>> watchdog interrupt mechanism to be dependent on QEMU, only the
>>> heavyweight exit on final expiration.
>> 
>> Well - I like the concept of having new features switchable.
> 
> The code quickly becomes an unmantainable mess if every new feature is
> switchable (and where is the line between new feature and bugfix when it
> comes to filling in missing emulation?).  The reason for switchability
> in this case is limited to QEMU interface compatibility.
> 
> The internal-to-KVM part of the watchdog isn't much different from the
> FIT -- we're not going to conditionalize that, are we?
> 
>> Overlapping the "watchdog is implemented" feature with "user space
>> wants watchdog exits" makes sense. But I definitely want to have a
>> switch for the former, because we otherwise differ quite
>> substantially from the emulation we had before.
> 
> It differs in that it is now a bit closer to being correct.

If we can maintain user space backwards compatibility, I'm fine with not switching it, since we're effectively fixing a bug wrt the implementation of the announced features, yes.

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html