Re: [PATCH] ptp: Add vDSO-style vmclock support

David Woodhouse <dwmw2@xxxxxxxxxxxxx> · Thu, 25 Jul 2024 22:00:24 +0100



On Thu, 2024-07-25 at 16:50 -0400, Michael S. Tsirkin wrote:
> On Thu, Jul 25, 2024 at 08:35:40PM +0100, David Woodhouse wrote:
> > On Thu, 2024-07-25 at 12:38 -0400, Michael S. Tsirkin wrote:
> > > On Thu, Jul 25, 2024 at 04:18:43PM +0100, David Woodhouse wrote:
> > > > The use case isn't necessarily for all users of gettimeofday(), of
> > > > course; this is for those applications which *need* precision time.
> > > > Like distributed databases which rely on timestamps for coherency, and
> > > > users who get fined millions of dollars when LM messes up their clocks
> > > > and they put wrong timestamps on financial transactions.
> > > 
> > > I would however worry that with all this pass through,
> > > applications have to be coded to each hypervisor or even
> > > version of the hypervisor.
> > 
> > Yes, that would be a problem. Which is why I feel it's so important to
> > harmonise the contents of the shared memory, and I'm implementing it
> > in both QEMU and $DAYJOB, as well as aligning with virtio-rtc.
> 
> 
> Writing an actual spec for this would be another thing that might help.
> 

> > I don't think the structure should be changing between hypervisors (and
> > especially versions). We *will* see a progression from simply providing
> > the disruption signal, to providing the full clock information so that
> > guests don't have to abort transactions while they resync their clock.
> > But that's perfectly fine.
> > 
> > And it's also entirely agnostic to the mechanism by which the memory
> > region is *discovered*. It doesn't matter if it's ACPI, DT, a
> > hypervisor enlightenment, a BAR of a simple PCI device, virtio, or
> > anything else.
> > 
> > ACPI is one of the *simplest* options for a hypervisor and guest to
> > implement, and doesn't prevent us from using the same structure in
> > virtio-rtc. I'm happy enough using ACPI and letting virtio-rtc come
> > along later.
> > 
> > > virtio has been developed with the painful experience that we keep
> > > making mistakes, or coming up with new needed features,
> > > and that maintaining forward and backward compatibility
> > > becomes a whole lot harder than it seems in the beginning.
> > 
> > Yes. But as you note, this shared memory structure is a userspace ABI
> > all of its own, so we get to make a completely *different* kind of
> > mistake :)
> > 
> 
> 
> So, something I still don't completely understand.
> Can't the VDSO thing be written to by kernel?
> Let's say on LM, an interrupt triggers and kernel copies
> data from a specific device to the VDSO.
> 
> Is that problematic somehow? I imagine there is a race where
> userspace reads vdso after lm but before kernel updated
> vdso - is that the concern?
> 
> Then can't we fix it by interrupting all CPUs right after LM?
> 
> To me that seems like a cleaner approach - we then compartmentalize
> the ABI issue - kernel has its own ABI against userspace,
> devices have their own ABI against kernel.
> It'd mean we need a way to detect that interrupt was sent,
> maybe yet another counter inside that structure.
> 
> WDYT?
> 
> By the way the same idea would work for snapshots -
> some people wanted to expose that info to userspace, too.
> 
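
For illustration only, here is a minimal sketch of the "yet another
counter inside that structure" idea quoted above. It assumes a
seqlock-style sequence count plus a disruption counter that the writer
bumps after live migration or a snapshot. The struct layout and the names
shared_clock, clock_publish and clock_read are hypothetical; they are
not taken from the patch or from virtio-rtc.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout for illustration; NOT the structure from the patch. */
struct shared_clock {
	_Atomic uint32_t seq;        /* odd while the writer is mid-update */
	_Atomic uint64_t disruption; /* bumped after each LM or snapshot */
	_Atomic uint64_t time_ns;    /* stand-in for the real clock parameters */
};

/*
 * Writer side: whoever owns the page (the hypervisor directly, or the
 * kernel copying from a device in an interrupt handler).
 */
static void clock_publish(struct shared_clock *c, uint64_t time_ns,
			  bool disrupted)
{
	uint32_t s = atomic_load_explicit(&c->seq, memory_order_relaxed);

	/* Make the count odd so readers know an update is in progress. */
	atomic_store_explicit(&c->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);

	if (disrupted)
		atomic_fetch_add_explicit(&c->disruption, 1,
					  memory_order_relaxed);
	atomic_store_explicit(&c->time_ns, time_ns, memory_order_relaxed);

	/* Even again: the update is complete and visible. */
	atomic_store_explicit(&c->seq, s + 2, memory_order_release);
}

/*
 * Reader side: returns true if no disruption has been published since
 * the value in *last_seen was recorded. On false, *last_seen is updated
 * and the caller should treat earlier timestamps as suspect.
 */
static bool clock_read(struct shared_clock *c, uint64_t *last_seen,
		       uint64_t *time_ns)
{
	uint32_t s0, s1;
	uint64_t d, t;

	do {
		s0 = atomic_load_explicit(&c->seq, memory_order_acquire);
		d = atomic_load_explicit(&c->disruption,
					 memory_order_relaxed);
		t = atomic_load_explicit(&c->time_ns, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s1 = atomic_load_explicit(&c->seq, memory_order_relaxed);
	} while ((s0 & 1) || s0 != s1); /* retry on a torn read */

	*time_ns = t;
	if (d != *last_seen) {
		*last_seen = d;
		return false;
	}
	return true;
}
```

Note that a reader can only notice a disruption once the writer has
bumped the counter. If the kernel is the writer and does so from an
interrupt delivered after LM, a read that lands before that handler
runs still sees the old counter value, which is the race described in
the quoted text.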
