Re: [PATCH v7] posix-timers: add clock_compare system call

Sagi Maimon <maimon.sagi@xxxxxxxxx> · Wed, 20 Mar 2024 16:42:31 +0200

On Thu, Mar 14, 2024 at 8:08 PM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>
> Sagi!
>
> On Thu, Mar 14 2024 at 14:19, Sagi Maimon wrote:
> > On Thu, Mar 14, 2024 at 1:12 PM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> >> On Thu, Mar 14 2024 at 11:05, Sagi Maimon wrote:
> >> > Some user space applications need to read a couple of different clocks.
> >> > Each read requires moving from user space to kernel space.
> >> > Reading each clock separately (syscall) introduces extra
> >> > unpredictable/unmeasurable delay. Minimizing this delay contributes to user
> >> > space actions on these clocks (e.g. synchronization etc).
> >>
> >> I asked for a proper description of the actual problem several times now
> >> and you still provide some handwaving blurb. Feel free to ignore me, but
> >> then please don't be surprised if I ignore you too.
> >>
> > Nobody is ignoring your notes, and I address any notes given by any
> > maintainer in the most serious way.
> > As far as explaining the actual problem this is the best that I can,
> > but let me try to explain better:
> > We did many tests with different CPU loading and compared sampling the
> > same clock twice,
> > once in user space and once by using the system call.
> > We have noticed an improvement up to hundreds of nanoseconds while
> > using the system call.
> > Those results improved our ability to sync different PHCs
>
> So let me express how I understand the problem - as far as I decoded it
> from your writeups:
>
>   Synchronizing two PHCs requires to read timestamps from both and
>   correlate them. Currently this requires several seperate system calls.
>   This is subject to unmeasurable delays due to system call overhead,
>   preemption and interrupts which makes the correlation imprecise.
>
>   Therefore you want a system call, which samples two clocks at once, to
>   make the correlation more precise.
>
> Right? For the further comments I assume this is what you are trying to
> say and to solve
You are right.
>
> So far so good, except that I do not agree with that reasoning at all:
>
>    1. The delays are measurable and as precise as the cross time stamp
>       mechanism (hardware or software based) allows.
>

Most of the PHCs do not support crosstime stamps.

>    2. The system call overhead is completely irrelevant.
>

You are right in case of long preemption, but in other cases it is relevant.

>    3. The time deltas between the sample points are irrelevant within a
>       reasonable upper bound to the time delta between the two outer
>       sample points.
>

See below

>    4. The alledged higher precision is based on a guesstimate and not on
>       correctness. Just because it behaves slightly better in testing
>       does not make it any more correct.
>

See below

>    5. The problem can be solved with maximal possible accuracy by using
>       the existing PTP IOCTLs.
>
> See below.
>
> >> Also why does reading two random clocks make any sense at all? Your code
> >> allows to read CLOCK_TAI and CLOCK_THREAD_CPUTIME_ID. What for?
> >>
> > Initially we needed to sync some different PHCs for our user space
> > application, that is why we came with this idea.
> > The first idea was an IOCTL that returned samples of several PHCs for
> > the need of synchronization.
> > Richard Cochran suggested a system call instead, which will add the
> > ability to get various system clocks, while this
> > implementation is more complex then IOCTL, I think that he was right,
> > for future usage.
>
> Which future usage? We are not introducing swiss army knife interfaces
> just because there might be an illusional use case somewhere in the
> unspecified future.
>
> Adding a system call needs a proper design and justification. Handwaving
> future usage is not enough.
>
> Documentation/process/adding-syscalls.rst is very clear about what is
> required for a new system call.
>
> >> This needs to be split up into:
> >>
> >>      1) Infrastructure in posix-timers.c
> >>      2) Wire up the syscall in x86
> >>      3) Infrastructure in posix-clock.c
> >>      4) Usage in ptp_clock.c
> >>
> >> and not as a big lump of everything.
> >>
> > I know, but I think the benefit worth it
>
> It's worth it because it makes review easier. It's well documented in
> the process documentation that patches should do one thing and not a
> whole lump of changes.
>

No problem it will be split into two different patches.

> >> > +             if (!error) {
> >> > +                     if (clock_b == CLOCK_MONOTONIC_RAW) {
> >> > +                             ts_b = ktime_to_timespec64(xtstamp_a1.sys_monoraw);
> >> > +                             ts_a1 = ktime_to_timespec64(xtstamp_a1.device);
> >> > +                             goto out;
> >> > +                     } else if (clock_b == CLOCK_REALTIME) {
> >> > +                             ts_b = ktime_to_timespec64(xtstamp_a1.sys_realtime);
> >> > +                             ts_a1 = ktime_to_timespec64(xtstamp_a1.device);
> >> > +                             goto out;
> >> > +                     } else {
> >> > +                             crosstime_support_a = true;
> >>
> >> Right. If clock_b is anything else than CLOCK_MONOTONIC_RAW or
> >> CLOCK_REALTIME then this is true.
> >>
> >> > +                     }
> >> > +             }
> >>
> >> So in case of an error, this just keeps going with an uninitialized
> >> xtstamp_a1 and if the clock_b part succeeds it continues and operates on
> >> garbage.
> >>
> > On error  xtstamp_a1 will be taken again using clock_get_crosstimespec
> > so no one will be operating on garbage.
>
> It will not, because crosstime_support_a == false. It will therefore
> fall back to kc_a->clock_get_timespec(), no?
>
> Sorry, I misread the code vs. using the uninitialized value, but this is
> just unneccesary hard to follow.
>
> >> > +     if (crosstime_support_a)
> >> > +             error = kc_a->clock_get_crosstimespec(clock_a, &xtstamp_a2);
> >> > +     else
> >> > +             error = kc_a->clock_get_timespec(clock_a, &ts_a2);
> >> > +
> >> > +     if (error)
> >> > +             return error;
> >>
> >> The logic and the code flow here are unreadable garbage and there are
> >> zero comments what this is supposed to do.
> >>
> > I will add comments.
> > please no need to use negative words like "garbage" (not the first time),
> > please keep it professional and civilized.
>
> Let me rephrase:
>
> The code and the logic is incomprehensible unless I waste an unjustified
> amount of time to decode it. Sorry, I don't have that time.
>
> >> > +     if (crosstime_support_a) {
> >> > +             ktime_a = ktime_sub(xtstamp_a2.device, xtstamp_a1.device);
> >> > +             ts_offs_err = ktime_divns(ktime_a, 2);
> >> > +             ktime_a = ktime_add_ns(xtstamp_a1.device, (u64)ts_offs_err);
> >> > +             ts_a1 = ktime_to_timespec64(ktime_a);
> >>
> >> This is just wrong.
> >>
> >>      read(a1);
> >>      read(b);
> >>      read(a2);
> >>
> >> You _CANNOT_ assume that (a1 + ((a2 - a1) / 2) is anywhere close to the
> >> point in time where 'b' is read. This code is preemtible and
> >> interruptible. I explained this to you before.
> >>
> >> Your explanation in the comment above the function is just wishful
> >> thinking.
> >>
> > you explained it before, but still it is better then two consecutive
> > user space calls which are also preemptible
> > and the userspace to kernel context switch time is added.
>
> It might be marginally better, but it is still just _pretending_ that it
> does the right thing, is correct and better than the existing IOCTLs.
>
> If your user space implementation has the same algorithm, then I'm
> absolutely not surprised that the results are not useful. Why?
>
> You simply cannot use the midpoint of the outer samples if you want to
> have precise results if there is no guarantee that b was sampled exactly
> in the midpoint of a1 and a2. A hardware implementation might give that
> guarantee, but the kernel cannot.
>
> But why using the midpoint in the first place?
>
> There is absolutely no reason to do so because the sampling points a1, b
> and a2 can be precisely determined with the precision of the cross time
> stamp mechanism, which is best with a hardware based cross time stamp
> obviously.
>
> The whole point of ptp::info::getcrosststamp() is to get properly
> correlated clock samples of
>
>       1) PHC clock
>       2) CLOCK_MONOTONIC_RAW
>       3) CLOCK_REALTIME
>
> So if you take 3 samples:
>
>    get_cross_timestamp(a1);
>    get_cross_timestamp(b);
>    get_cross_timestamp(a2);
>
> then each of them provides:
>
>     - device time
>     - correlated CLOCK_MONOTONIC_RAW
>     - correlated CLOCK_REALTIME
>
> Ergo the obvious thing to do is:
>
>     d1 = b.sys_monoraw - a1.sys_monoraw;
>     d2 = a2.sys_monoraw - a1.sys_monoraw;
>
>     tsa = a1.device + ((a2.device - a1.device) * d1) / d2;
>
> Which is maximaly precise under the assumption that in the time between
> the sample points a1 and a2 neither the system clock nor the PCH clocks
> are changing their frequency significantly. That is a valid assumption
> when you put a reasonable upper bound on d2.
>

You are right.
In fact, we are running this calculation on a user space application.
We use the new system call to get pairs of mono and PHC and then run
that calculation in user space.
That is why the system call returns pairs of clock samples and not the
diff between them.

> Even when the device does not implement getcrosststamp() then loop based
> sampling like it is implemented in the PTP_SYS_OFFSET[_EXTENDED] IOTCLs
> is providing reasonably accurate results to the extent possible.
>
> Your algorithm is imprecise by definition and you can apply as much
> testing as you want, it won't become magically correct. It's still a
> guesstimate, i.e. an estimate made without using adequate or complete
> information.
>
> Now why a new syscall?
>
> This can be done from user space with existing interfaces and the very
> same precision today:
>
>      ioctl(fda, PTP_SYS_OFFSET*, &a1);
>      ioctl(fdb, PTP_SYS_OFFSET*, &b);
>      ioctl(fda, PTP_SYS_OFFSET*, &a2);
>
>      u64 d1 = timespec_delta_ns(b.sys_monoraw, a1.sys_monoraw);
>      u64 d2 = timespec_delta_ns(a2.sys_monoraw, a1.sys_monoraw);
>      u64 td = (timespec_delta_ns(a2.device, a1.device) * d1) / d2
>
>      tsa = timespec_add_ns(a1.device, td);
>      tsb = b.device;
>
> with the extra benefit of:
>
>      1) The correct CLOCK_REALTIME at that sample point,
>         i.e. b.sys_realtime
>
>      2) The correct CLOCK_MONOTONIC_RAW at that sample point,
>         i.e. b.sys_monoraw
>
If PTP_SYS_OFFSET IOCTL returns sys_monoraw, then you are right, but
unfortunately the only IOCTL that returns sys_monoraw is
PTP_SYS_OFFSET_PRECISE (getcrosststamp)
And most of the drivers does not support it.

> It works with PTP_SYS_OFFSET_PRECISE and PTP_SYS_OFFSET[_EXTENDED], with
> the obvious limitations of PTP_SYS_OFFSET[_EXTENDED], which are still
> vastly superior to your proposed (a2 - a1) / 2 guestimate which is just
> reading the PCH clocks with clock_get_timespec().
>
It only works with PTP_SYS_OFFSET_PRECISE (which most of the NIC
drivers does not support and I take it under consideration in my
system call),
PTP_SYS_OFFSET_EXTENDED ioctl returns system time before, PHC, system
time after , and no monotic raw.

> It is completely independent of the load, the syscall overhead and the
> actual time delta between the sample points when you apply a reasonable
> upper bound for d2, i.e. the time delta between the sample points a1 and
> a2 to eliminate the issue that system clock and/or the PCH clocks change
> their frequency significantly during that time. You'd need to do that in
> the kernel too.
>
> The actual frequency difference between PCH A and system clock is
> completely irrelevant when the frequencies of both are stable accross
> the sample period.
>
> You might still argue that the time delta between the sample points a1
> and a2 matters and is slightly shorter in the kernel, but that is a
> non-argument because:
>
>   1) The kernel implementation does not guarantee atomicity of the
>      consecutive samples either. The time delta is just statistically
>      better, which is obviously useless when you want to make
>      guarantees.
>
>   2) It does not matter when the time delta is slightly larger because
>      you need a large frequency change of the involved clocks in the
>      sample interval between the sample points a1 and a2 to make an
>      actual difference in the resulting accuracy.
>
>      A typical temperature drift of a high quality cyrstal is less than
>      1ppm per degree Celsius and even if you assume that the overall
>      system drift is 10ppm per degree Celsius then still the actual
>      error for a bound time delta between the sample points a1 and a2 is
>      just somewhere in the irrelevant noise, unless you manage to blow
>      torch or ice spray your crystals during the sample interval.
>
>      If your clocks are not stable enough then nothing can cure it and
>      you cannot do high precision timekeeping with them.
>

You are right in case of long preemption (which still the system call
is better), but in other cases it is relevant.

> So what is your new syscall solving that can't be done with the existing
> IOCTLs other than providing worse precision results based on
> guesstimates and some handwavy future use for random clock ids?
>
> Nothing as far as I can tell, but I might be missing something important
> here.
>
Few points to consider:
1) Most of the PHCs do not support cross time stamping.
2) Users can implement your suggested code in user space while using
the new system call to get pairs of mono and PHC
     . This is what we did already in user space.
3) User with less tight requirement will benefit high accuracy with
the new system call

> Thanks,
>
>         tglx
> ---
> arch/x86/kernel/tsc.c:119: "Math is hard, let's go shopping." - John Stultz