On Fri, 17 Feb 2023 22:11:36 +0000,
Oliver Upton <oliver.upton@xxxxxxxxx> wrote:
>
> On Fri, Feb 17, 2023 at 10:17:27AM +0000, Marc Zyngier wrote:
> > Hi Oliver,
> >
> > On Thu, 16 Feb 2023 22:09:47 +0000,
> > Oliver Upton <oliver.upton@xxxxxxxxx> wrote:
> > >
> > > Hi Marc,
> > >
> > > On Thu, Feb 16, 2023 at 02:21:15PM +0000, Marc Zyngier wrote:
> > > > And this is the moment you have all been waiting for: setting the
> > > > counter offsets from userspace.
> > > >
> > > > We expose a brand new capability that reports the ability to set
> > > > the offsets for both the virtual and physical sides, independently.
> > > >
> > > > In keeping with the architecture, the offsets are expressed as
> > > > a delta that is subtracted from the physical counter value.
> > > >
> > > > Once this new API is used, there is no going back, and the counters
> > > > cannot be written to to set the offsets implicitly (the writes
> > > > are instead ignored).
> > >
> > > Is there any particular reason to use an explicit ioctl as opposed to
> > > the KVM_{GET,SET}_DEVICE_ATTR ioctls? Dunno where you stand on it, but I
> > > quite like that interface for simple state management. We also avoid
> > > eating up more UAPI bits in the global namespace.
> >
> > The problem with that is that it requires yet another KVM device for
> > this, and I'm lazy. It also makes it a bit harder for the VMM to buy
> > into this (need to track another FD, for example).
>
> You can also accept the device ioctls on the actual VM FD, quite like
> we do for the vCPU right now. And hey, I've got a patch that gets you
> most of the way there!
>
> https://lore.kernel.org/kvmarm/20230211013759.3556016-3-oliver.upton@xxxxxxxxx/

Huh... I don't know yet if I love it or hate it. At the end of the day,
this is just another ioctl, so I don't care either way.

> > > Is there any reason why we can't just order this ioctl before vCPU
> > > creation altogether, or is there a need to do this at runtime?
> > > We're about to tolerate multiple writers to the offset value, and I
> > > think the only thing we need to guarantee is that the below flag is
> > > set before vCPU ioctls have a chance to run.
> >
> > Again, we don't know for sure whether the final offset is available
> > before vcpu creation time. My idea for QEMU would be to perform the
> > offset adjustment as late as possible, right before executing the VM,
> > after having restored the vcpus with whatever value they had.
>
> So how does userspace work out an offset based on available information?
> The part that hasn't clicked for me yet is where userspace gets the
> current value of the true physical counter to calculate an offset.

What's wrong with CNTVCT_EL0?

> We could make it ABI that the guest's physical counter matches that of
> the host by default. Of course, that has been the case since the
> beginning of time but it is now directly user-visible.
>
> The only part I don't like about that is that we aren't fully creating
> an abstraction around host and guest system time. So here's my current
> mental model of how we represent the generic timer to userspace:
>
>                         +-----------------------+
>                         |                       |
>                         |  Host System Counter  |
>                         |          (1)          |
>                         +-----------------------+
>                                     |
>                         +-----------+-----------+
>                         |                       |
> +-----------------+  +-----+                 +-----+  +--------------------+
> | (2) CNTPOFF_EL2 |--| sub |                 | sub |--| (3) CNTVOFF_EL2    |
> +-----------------+  +-----+                 +-----+  +--------------------+
>                         |                       |
>                         |                       |
>                +-----------------+     +----------------+
>                | (5) CNTPCT_EL0  |     | (4) CNTVCT_EL0 |
>                +-----------------+     +----------------+
>
> AFAICT, this UAPI exposes abstractions for (2) and (3) to userspace, but
> userspace cannot directly get at (1).

Of course it can! CNTVCT_EL0 is accessible from userspace, and is
guaranteed to have an offset of 0 on a host.
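
To make the arithmetic concrete, here is a minimal Python sketch of how
a VMM could derive an offset from a counter value read on the host. It
assumes the counter and the offset behave as 64-bit values that wrap
modulo 2^64. The function names and the numbers are hypothetical and
are not part of the proposed UAPI.

```python
MASK64 = (1 << 64) - 1  # assumption: counter and offset wrap modulo 2**64


def offset_for(host_counter: int, wanted_guest_counter: int) -> int:
    """Offset that makes the guest read wanted_guest_counter right now."""
    return (host_counter - wanted_guest_counter) & MASK64


def guest_view(host_counter: int, offset: int) -> int:
    """Guest-visible counter: the offset is subtracted from the host value."""
    return (host_counter - offset) & MASK64


# Hypothetical numbers: the host counter (as read from CNTVCT_EL0 on the
# host, where the offset is 0) is 5_000_000, and the guest counter was
# saved at 1_234 before the VM was stopped.
host_now = 5_000_000
saved_guest = 1_234
off = offset_for(host_now, saved_guest)
```

With an offset of 0 the guest simply sees the host counter, which
matches the default behaviour discussed above for the physical side.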
>
> Chewing on this a bit more, I don't think userspace has any business
> messing with virtual and physical time independently, especially when
> nested virtualization comes into play.

Well, NV already ignores the virtual offset completely (see how the
virtual timer gets its offset reassigned at reset time).

>
> I think the illusion to userspace needs to be built around the notion of
> a system counter:
>
>                         +-----------------------+
>                         |                       |
>                         |  Host System Counter  |
>                         |          (1)          |
>                         +-----------------------+
>                                     |
>                                     |
>                                  +-----+   +-------------------+
>                                  | sub |---| (6) system_offset |
>                                  +-----+   +-------------------+
>                                     |
>                                     |
>                         +-----------------------+
>                         |                       |
>                         |  Guest System Counter |
>                         |          (7)          |
>                         +-----------------------+
>                                     |
>                         +-----------+-----------+
>                         |                       |
> +-----------------+  +-----+                 +-----+  +--------------------+
> | (2) CNTPOFF_EL2 |--| sub |                 | sub |--| (3) CNTVOFF_EL2    |
> +-----------------+  +-----+                 +-----+  +--------------------+
>                         |                       |
>                         |                       |
>                +-----------------+     +----------------+
>                | (5) CNTPCT_EL0  |     | (4) CNTVCT_EL0 |
>                +-----------------+     +----------------+
>
> And from a UAPI perspective, we would either expose (1) and (6) to let
> userspace calculate an offset or simply allow (7) to be directly
> read/written.

I previously toyed with this idea, and I really like it. However, the
problem with this is that it breaks the current behaviour of having
two different values for CNTVCT and CNTPCT in the guest, and CNTPCT
representing the counter value on the host. Such a VM cannot be
migrated *today*, but not everybody cares about migration.

My "dual offset" approach allows the current behaviour to persist, and
such a VM to be migrated. The luser even gets the choice of preserving
counter continuity in the guest or to stay without a physical offset
and reflect the host's counter.

Is it a good behaviour? Of course not. Does anyone depend on it? I
have no idea, but odds are that someone does. Can we break their toys?
The jury is still out.
>
> That frees up the meaning of the counter offsets as being purely a
> virtual EL2 thing. These registers would reset to 0, and non-NV guests
> could never change their value.
>
> Under the hood KVM would program the true offset registers as:
>
>     CNT{P,V}OFF_EL2 = 'virtual CNT{P,V}OFF_EL2' + system_offset
>
> With this we would effectively configure CNTPCT = CNTVCT = 0 at the
> point of VM creation. Only crappy thing is it requires full physical
> counter/timer emulation for non-ECV systems, but the guest shouldn't be
> using the physical counter in the first place.

And I think that's the point where we differ. I can completely imagine
some in-VM code using the physical counter to export some timestamping
to the host (for tracing purposes, amongst other things).

> Yes, this sucks for guests running on hosts w/ NV but not ECV. If anyone
> can tell me how an L0 hypervisor is supposed to do NV without ECV, I'm
> all ears.

You absolutely can run with NV2 without ECV. You just get a bad
quality of emulation for the EL0 timers. But that's about it.

> Does any of what I've written make remote sense or have I gone entirely
> off the rails with my ASCII art? :)

Your ASCII art is beautiful, only a tad too wide! ;-) What you suggest
makes a lot of sense, but it leaves existing behaviours in the lurch.
Can we pretend they don't exist? You tell me!

Thanks,

    M.

-- 
Without deviation from the norm, progress is not possible.