On 2015-07-09 12:22, Christoffer Dall wrote: > Hi Peter and Marc, > > [cc'ing Paolo for his input on x86 timekeeping] > > On Wed, Jul 08, 2015 at 08:13:59PM +0100, Peter Maydell wrote: >> On 8 July 2015 at 17:37, Marc Zyngier <marc.zyngier@xxxxxxx> wrote: >>> On 08/07/15 17:06, Peter Maydell wrote: >>>> I'd prefer it if somebody could investigate to see why QEMU >>>> is actually doing this -- so far we just have speculation. >>> >>> I'd prefer that too, but so far people seem to be more comfortable >>> waiting for the issue to fix itself. In the meantime, VMs are broken in >>> weird and wonderful ways, and I don't think the current status-quo helps >>> anyone. >> >> Putting in a patch which might not be the right fix isn't >> necessarily a good plan either... >> >> Does has_run_once get cleared if we do a re-VCPU_INIT >> of a CPU that's run before? (We need to allow rewriting >> of guest state at that point so that "reset VM and >> load migration state" behaves correctly.) > > no, it does not, has_run_once is set the first time a VCPU is run and is > currently *never* cleared. > >> >> I suspect Jan is right and we really need to distinguish >> the KVM_PUT_*_STATE levels in ARM QEMU. This probably >> implies some kind of whitelist/override mechanism, since >> by and large we neither know nor want to know the >> semantics for system registers, we leave that up to the >> kernel. >> >> Q: if you have a running VM, and you pause it for >> an hour, what should the CNTVCT register do? Presumably >> it should not advance, but how do we arrange for that >> to happen? >> > > I think the CNTVCT should not advance when the VM is not scheduled, so > if we pause the VM or starve all the VCPUs for enough time, the guest > should not see time progressing, since otherwise the guest scheduler > cannot maintain fairness and you're bound to see spurious RCU stalls > etc. > > That's exactly why a guest can read both a virtual and physical counter > and it is an area where you simply want some level of > paravirtualization. I haven't studied how/if Linux deals with this at > all. > > So I think adjusting CNTVOFF should be managed by the kernel for the > pause/starvation scenario (which I think Avi once told me x86 does too - > does anyone know the current state of the art?). > > So the only situation where I think userspace should adjust the CNTVOFF > value is for migration where we are talking about a brand new VM with > has_run_once clear. > > Thus, if we were designing this from scratch now, the API should > be to return an error when trying to set KVM_REG_ARM_TIMER_CNT after the > VM has run once, but it's too late for that as we would break userspace. > The best alternative IMHO would be to merge Marc's patch and fix CNTVOFF > in the kernel side as well, and finally also fix QEMU so that it doesn't > try to do the thing that future kernels will ignore. Fixing QEMU to only write on KVM_PUT_FULL_STATE - yes, that should be done, but I don't think the approach for the kernel is generally right. The kernel should not do any policing on user space requests to change the VCPU or VM state unless - security is affected - userspace lacks information, thus cannot decide correctly - legacy userspace has a bug, we can detect it and want to fix that up without affecting future userspace that has a reason to do it differently Regarding CNTVOFF, the first two criteria do not apply for sure. Maybe the last one, don't know. Just think of the hypothetical scenario that a userspace VM debugger wants to inject certain register manipulations. If you block this by some hidden VM state like proposed, that feature would no longer be implementable easily. Jan
Attachment:
signature.asc
Description: OpenPGP digital signature