Re: [PATCH 2/2] KVM: PPC: booke: Add watchdog emulation

Scott Wood <scottwood@xxxxxxxxxxxxx> · Mon, 9 Jul 2012 12:09:10 -0500

On 07/07/2012 02:50 AM, Alexander Graf wrote:
> 
> On 07.07.2012, at 01:37, Scott Wood wrote:
> 
>> On 07/06/2012 08:17 AM, Alexander Graf wrote:
>>> On 28.06.2012, at 08:17, Bharat Bhushan wrote:
>>>> +/*
>>>> + * The timer system can almost deal with LONG_MAX timeouts, except that
>>>> + * when you get very close to LONG_MAX, the slack added can cause overflow.
>>>> + *
>>>> + * LONG_MAX/2 is a conservative threshold, but it should be adequate for
>>>> + * any realistic use.
>>>> + */
>>>> +#define MAX_TIMEOUT (LONG_MAX/2)
>>>
>>> Should this really be in kvm code?
>>
>> It looks like we can use NEXT_TIMER_MAX_DELTA for this.
>>
>>>> +	mask = 1ULL << (63 - period);
>>>> +	tb = get_tb();
>>>> +	if (tb & mask)
>>>> +		nr_jiffies += mask;
>>>
>>> To be honest, you lost me here. nr_jiffies is jiffies, right? While
>>> mask is basically in timebase granularity. I suppose you're just
>>> reusing the variable here for no good reason - the compiler will
>>> gladly optimize things for you if you write things a bit more verbose
>>> :).
>>
>> Probably due to the way do_div() works, but yeah, it's confusing.  Maybe
>> something generic like "ticks", "interval", "remaining", etc. would be
>> better, with a comment on the do_div saying it's converting timebase
>> ticks into jiffies.
> 
> Well, you could start off with a variable "delta_tb", then do
> 
>   nr_jiffies = delta_tb;
>   x = do_div(...);
> 
> and things would suddenly become readable :). Of course I don't object to comments along the code either :).
> 
>>
>>>> +static void arm_next_watchdog(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	unsigned long nr_jiffies;
>>>> +
>>>> +	nr_jiffies = watchdog_next_timeout(vcpu);
>>>> +	if (nr_jiffies < MAX_TIMEOUT)
>>>> +		mod_timer(&vcpu->arch.wdt_timer, jiffies + nr_jiffies);
>>>> +	else
>>>> +		del_timer(&vcpu->arch.wdt_timer);
>>>
>>> Can you del a timer that's not armed? Could that ever happen in this case?
>>
>> "del_timer() deactivates a timer - this works on both active and
>> inactive timers."
> 
> Ah, good :).
> 
>>
>>> Also, could the timer possibly be running somewhere, so do we need del_timer_sync? Or don't we need to care?
>>
>> This can be called in the context of the timer, so del_timer_sync()
>> would hang.
>>
>> As for what would happen if a caller from a different context were to
>> race with a timer, I think you could end up with the timer armed based
>> on an old TCR.  del_timer_sync() won't help though, unless you replace
>> mod_timer() with del_timer_sync() plus add_timer() (with a check to see
>> whether it's running in timer context).  A better solution is probably
>> to use a spinlock in arm_next_watchdog().
> 
> Yup. Either way, we have a race that the guest might not expect.
> 
>>
>>>> +void kvmppc_watchdog_func(unsigned long data)
>>>> +{
>>>> +	struct kvm_vcpu *vcpu = (struct kvm_vcpu *)data;
>>>> +	u32 tsr, new_tsr;
>>>> +	int final;
>>>> +
>>>> +	do {
>>>> +		new_tsr = tsr = vcpu->arch.tsr;
>>>> +		final = 0;
>>>> +
>>>> +		/* Time out event */
>>>> +		if (tsr & TSR_ENW) {
>>>> +			if (tsr & TSR_WIS) {
>>>> +				new_tsr = (tsr & ~TCR_WRC_MASK) |
>>>> +					  (vcpu->arch.tcr & TCR_WRC_MASK);
>>>> +				vcpu->arch.tcr &= ~TCR_WRC_MASK;
>>>> +				final = 1;
>>>> +			} else {
>>>> +				new_tsr = tsr | TSR_WIS;
>>>> +			}
>>>> +		} else {
>>>> +			new_tsr = tsr | TSR_ENW;
>>>> +		}
>>>> +	} while (cmpxchg(&vcpu->arch.tsr, tsr, new_tsr) != tsr);
>>>> +
>>>> +	if (new_tsr & (TSR_WIS | TCR_WRC_MASK)) {
>>>> +		smp_wmb();
>>>> +		kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
>>>> +		kvm_vcpu_kick(vcpu);
>>>> +	}
>>>> +
>>>> +	/*
>>>> +	 * Avoid getting a storm of timers if the guest sets
>>>> +	 * the period very short.  We'll restart it if anything
>>>> +	 * changes.
>>>> +	 */
>>>> +	if (!final)
>>>> +		arm_next_watchdog(vcpu);
>>>
>>> Mind to explain this part a bit further?
>>
>> The whole function, or some subset near the end?
>>
>> The "if (!final)" check means that we stop running the timer after final
>> expiration, to prevent the host from being flooded with timers if the
>> guest sets a short period but does not have TCR set to exit to QEMU.
>> Timers will resume the next time TSR/TCR is updated.
> 
> Ah. The semantics make sense. The comment however is slightly too short. Please explain this in a more verbose way, so someone who didn't write the code knows what's going on :).
> 
>>
>>>> @@ -1106,7 +1213,14 @@ static int set_sregs_base(struct kvm_vcpu *vcpu,
>>>> 	}
>>>>
>>>> 	if (sregs->u.e.update_special & KVM_SREGS_E_UPDATE_TSR) {
>>>> +		u32 old_tsr = vcpu->arch.tsr;
>>>> +
>>>> 		vcpu->arch.tsr = sregs->u.e.tsr;
>>>> +
>>>> +		if ((old_tsr ^ vcpu->arch.tsr) &
>>>> +		    (TSR_ENW | TSR_WIS | TCR_WRC_MASK))
>>>> +			arm_next_watchdog(vcpu);
>>>
>>> Why isn't this one guarded by vcpu->arch.watchdog_enable?
>>
>> I'm not sure that any of them should be -- there's no reason for the
>> watchdog interrupt mechanism to be dependent on QEMU, only the
>> heavyweight exit on final expiration.
> 
> Well - I like the concept of having new features switchable.

The code quickly becomes an unmantainable mess if every new feature is
switchable (and where is the line between new feature and bugfix when it
comes to filling in missing emulation?).  The reason for switchability
in this case is limited to QEMU interface compatibility.

The internal-to-KVM part of the watchdog isn't much different from the
FIT -- we're not going to conditionalize that, are we?

> Overlapping the "watchdog is implemented" feature with "user space
> wants watchdog exits" makes sense. But I definitely want to have a
> switch for the former, because we otherwise differ quite
> substantially from the emulation we had before.

It differs in that it is now a bit closer to being correct.

-Scott

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html