On Thu, Jul 25, 2024 at 08:35:40PM +0100, David Woodhouse wrote:
> On Thu, 2024-07-25 at 12:38 -0400, Michael S. Tsirkin wrote:
> > On Thu, Jul 25, 2024 at 04:18:43PM +0100, David Woodhouse wrote:
> > > The use case isn't necessarily for all users of gettimeofday(), of
> > > course; this is for those applications which *need* precision time.
> > > Like distributed databases which rely on timestamps for coherency, and
> > > users who get fined millions of dollars when LM messes up their clocks
> > > and they put wrong timestamps on financial transactions.
> >
> > I would however worry that with all this pass through,
> > applications have to be coded to each hypervisor or even
> > version of the hypervisor.
>
> Yes, that would be a problem. Which is why I feel it's so important to
> harmonise the contents of the shared memory, and I'm implementing it
> both QEMU and $DAYJOB, as well as aligning with virtio-rtc.

Writing an actual spec for this would be another thing that might help.

> I don't think the structure should be changing between hypervisors (and
> especially versions). We *will* see a progression from simply providing
> the disruption signal, to providing the full clock information so that
> guests don't have to abort transactions while they resync their clock.
> But that's perfectly fine.
>
> And it's also entirely agnostic to the mechanism by which the memory
> region is *discovered*. It doesn't matter if it's ACPI, DT, a
> hypervisor enlightenment, a BAR of a simple PCI device, virtio, or
> anything else.
>
> ACPI is one of the *simplest* options for a hypervisor and guest to
> implement, and doesn't prevent us from using the same structure in
> virtio-rtc. I'm happy enough using ACPI and letting virtio-rtc come
> along later.
>
> > virtio has been developed with the painful experience that we keep
> > making mistakes, or coming up with new needed features,
> > and that maintaining forward and backward compatibility
> > becomes a whole lot harder than it seems in the beginning.
>
> Yes. But as you note, this shared memory structure is a userspace ABI
> all of its own, so we get to make a completely *different* kind of
> mistake :)
>

So, something I still don't completely understand.
Can't the VDSO thing be written to by kernel?
Let's say on LM, an interrupt triggers and kernel copies
data from a specific device to the VDSO.

Is that problematic somehow? I imagine there is a race where
userspace reads vdso after lm but before kernel updated
vdso - is that the concern?

Then can't we fix it by interrupting all CPUs right after LM?

To me that seems like a cleaner approach - we then compartmentalize
the ABI issue - kernel has its own ABI against userspace,
devices have their own ABI against kernel.
It'd mean we need a way to detect that interrupt was sent,
maybe yet another counter inside that structure.

WDYT?

By the way the same idea would work for snapshots -
some people wanted to expose that info to userspace, too.

-- 
MST
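
To make the race window described above concrete, here is a rough,
single-threaded sketch in C. It is only one possible reading of the
"kernel copies device data into the vDSO, plus a counter" idea. All
structure and function names (dev_clock_page, vdso_clock_page,
kernel_sync_vdso and so on) are hypothetical and are not taken from
the patches or from any existing kernel or hypervisor interface, and
offset_ns is just a stand-in for the real clock data.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device-side page, written by the hypervisor. */
struct dev_clock_page {
    atomic_uint generation;        /* bumped by the hypervisor on each LM */
    _Atomic uint64_t offset_ns;    /* stand-in for the real clock data */
};

/* Hypothetical kernel-owned copy that userspace reads via the vDSO. */
struct vdso_clock_page {
    atomic_uint seq;               /* odd while the kernel is mid-update */
    atomic_uint generation;        /* device generation last copied */
    _Atomic uint64_t offset_ns;
};

/* Hypervisor side (simulated): LM changes the data, then bumps the counter. */
static void hypervisor_live_migrate(struct dev_clock_page *dev,
                                    uint64_t new_offset_ns)
{
    atomic_store(&dev->offset_ns, new_offset_ns);
    atomic_fetch_add(&dev->generation, 1);
}

/* Kernel side: would run from the post-LM interrupt to refresh the copy. */
static void kernel_sync_vdso(struct dev_clock_page *dev,
                             struct vdso_clock_page *vdso)
{
    unsigned gen = atomic_load(&dev->generation);
    uint64_t off = atomic_load(&dev->offset_ns);

    atomic_fetch_add(&vdso->seq, 1);          /* odd: readers must retry */
    atomic_store(&vdso->offset_ns, off);
    atomic_store(&vdso->generation, gen);
    atomic_fetch_add(&vdso->seq, 1);          /* even: copy is consistent */
}

/* Userspace side: seqcount-style read of the kernel's copy. */
static uint64_t vdso_read_offset(struct vdso_clock_page *vdso,
                                 unsigned *gen_out)
{
    unsigned start;
    uint64_t off;

    do {
        start = atomic_load(&vdso->seq);
        off = atomic_load(&vdso->offset_ns);
        *gen_out = atomic_load(&vdso->generation);
    } while ((start & 1) || start != atomic_load(&vdso->seq));

    return off;
}

/* The race window: the device has moved on, the kernel has not copied yet. */
static bool vdso_is_stale(struct dev_clock_page *dev,
                          struct vdso_clock_page *vdso)
{
    return atomic_load(&dev->generation) != atomic_load(&vdso->generation);
}
```

Between hypervisor_live_migrate() and the next kernel_sync_vdso(),
vdso_read_offset() still returns the pre-LM value, which is the window
in question. Comparing the two generation counters is one way to tell
that the update has not landed yet. Whether that comparison belongs in
the kernel or has to be visible to userspace is exactly the ABI
question raised above, and this sketch does not settle it.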