Sagi! On Thu, Mar 14 2024 at 14:19, Sagi Maimon wrote: > On Thu, Mar 14, 2024 at 1:12 PM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote: >> On Thu, Mar 14 2024 at 11:05, Sagi Maimon wrote: >> > Some user space applications need to read a couple of different clocks. >> > Each read requires moving from user space to kernel space. >> > Reading each clock separately (syscall) introduces extra >> > unpredictable/unmeasurable delay. Minimizing this delay contributes to user >> > space actions on these clocks (e.g. synchronization etc). >> >> I asked for a proper description of the actual problem several times now >> and you still provide some handwaving blurb. Feel free to ignore me, but >> then please don't be surprised if I ignore you too. >> > Nobody is ignoring your notes, and I address any notes given by any > maintainer in the most serious way. > As far as explaining the actual problem this is the best that I can, > but let me try to explain better: > We did many tests with different CPU loading and compared sampling the > same clock twice, > once in user space and once by using the system call. > We have noticed an improvement up to hundreds of nanoseconds while > using the system call. > Those results improved our ability to sync different PHCs So let me express how I understand the problem - as far as I decoded it from your writeups: Synchronizing two PHCs requires to read timestamps from both and correlate them. Currently this requires several seperate system calls. This is subject to unmeasurable delays due to system call overhead, preemption and interrupts which makes the correlation imprecise. Therefore you want a system call, which samples two clocks at once, to make the correlation more precise. Right? For the further comments I assume this is what you are trying to say and to solve. So far so good, except that I do not agree with that reasoning at all: 1. The delays are measurable and as precise as the cross time stamp mechanism (hardware or software based) allows. 2. The system call overhead is completely irrelevant. 3. The time deltas between the sample points are irrelevant within a reasonable upper bound to the time delta between the two outer sample points. 4. The alledged higher precision is based on a guesstimate and not on correctness. Just because it behaves slightly better in testing does not make it any more correct. 5. The problem can be solved with maximal possible accuracy by using the existing PTP IOCTLs. See below. >> Also why does reading two random clocks make any sense at all? Your code >> allows to read CLOCK_TAI and CLOCK_THREAD_CPUTIME_ID. What for? >> > Initially we needed to sync some different PHCs for our user space > application, that is why we came with this idea. > The first idea was an IOCTL that returned samples of several PHCs for > the need of synchronization. > Richard Cochran suggested a system call instead, which will add the > ability to get various system clocks, while this > implementation is more complex then IOCTL, I think that he was right, > for future usage. Which future usage? We are not introducing swiss army knife interfaces just because there might be an illusional use case somewhere in the unspecified future. Adding a system call needs a proper design and justification. Handwaving future usage is not enough. Documentation/process/adding-syscalls.rst is very clear about what is required for a new system call. >> This needs to be split up into: >> >> 1) Infrastructure in posix-timers.c >> 2) Wire up the syscall in x86 >> 3) Infrastructure in posix-clock.c >> 4) Usage in ptp_clock.c >> >> and not as a big lump of everything. >> > I know, but I think the benefit worth it It's worth it because it makes review easier. It's well documented in the process documentation that patches should do one thing and not a whole lump of changes. >> > + if (!error) { >> > + if (clock_b == CLOCK_MONOTONIC_RAW) { >> > + ts_b = ktime_to_timespec64(xtstamp_a1.sys_monoraw); >> > + ts_a1 = ktime_to_timespec64(xtstamp_a1.device); >> > + goto out; >> > + } else if (clock_b == CLOCK_REALTIME) { >> > + ts_b = ktime_to_timespec64(xtstamp_a1.sys_realtime); >> > + ts_a1 = ktime_to_timespec64(xtstamp_a1.device); >> > + goto out; >> > + } else { >> > + crosstime_support_a = true; >> >> Right. If clock_b is anything else than CLOCK_MONOTONIC_RAW or >> CLOCK_REALTIME then this is true. >> >> > + } >> > + } >> >> So in case of an error, this just keeps going with an uninitialized >> xtstamp_a1 and if the clock_b part succeeds it continues and operates on >> garbage. >> > On error xtstamp_a1 will be taken again using clock_get_crosstimespec > so no one will be operating on garbage. It will not, because crosstime_support_a == false. It will therefore fall back to kc_a->clock_get_timespec(), no? Sorry, I misread the code vs. using the uninitialized value, but this is just unneccesary hard to follow. >> > + if (crosstime_support_a) >> > + error = kc_a->clock_get_crosstimespec(clock_a, &xtstamp_a2); >> > + else >> > + error = kc_a->clock_get_timespec(clock_a, &ts_a2); >> > + >> > + if (error) >> > + return error; >> >> The logic and the code flow here are unreadable garbage and there are >> zero comments what this is supposed to do. >> > I will add comments. > please no need to use negative words like "garbage" (not the first time), > please keep it professional and civilized. Let me rephrase: The code and the logic is incomprehensible unless I waste an unjustified amount of time to decode it. Sorry, I don't have that time. >> > + if (crosstime_support_a) { >> > + ktime_a = ktime_sub(xtstamp_a2.device, xtstamp_a1.device); >> > + ts_offs_err = ktime_divns(ktime_a, 2); >> > + ktime_a = ktime_add_ns(xtstamp_a1.device, (u64)ts_offs_err); >> > + ts_a1 = ktime_to_timespec64(ktime_a); >> >> This is just wrong. >> >> read(a1); >> read(b); >> read(a2); >> >> You _CANNOT_ assume that (a1 + ((a2 - a1) / 2) is anywhere close to the >> point in time where 'b' is read. This code is preemtible and >> interruptible. I explained this to you before. >> >> Your explanation in the comment above the function is just wishful >> thinking. >> > you explained it before, but still it is better then two consecutive > user space calls which are also preemptible > and the userspace to kernel context switch time is added. It might be marginally better, but it is still just _pretending_ that it does the right thing, is correct and better than the existing IOCTLs. If your user space implementation has the same algorithm, then I'm absolutely not surprised that the results are not useful. Why? You simply cannot use the midpoint of the outer samples if you want to have precise results if there is no guarantee that b was sampled exactly in the midpoint of a1 and a2. A hardware implementation might give that guarantee, but the kernel cannot. But why using the midpoint in the first place? There is absolutely no reason to do so because the sampling points a1, b and a2 can be precisely determined with the precision of the cross time stamp mechanism, which is best with a hardware based cross time stamp obviously. The whole point of ptp::info::getcrosststamp() is to get properly correlated clock samples of 1) PHC clock 2) CLOCK_MONOTONIC_RAW 3) CLOCK_REALTIME So if you take 3 samples: get_cross_timestamp(a1); get_cross_timestamp(b); get_cross_timestamp(a2); then each of them provides: - device time - correlated CLOCK_MONOTONIC_RAW - correlated CLOCK_REALTIME Ergo the obvious thing to do is: d1 = b.sys_monoraw - a1.sys_monoraw; d2 = a2.sys_monoraw - a1.sys_monoraw; tsa = a1.device + ((a2.device - a1.device) * d1) / d2; Which is maximaly precise under the assumption that in the time between the sample points a1 and a2 neither the system clock nor the PCH clocks are changing their frequency significantly. That is a valid assumption when you put a reasonable upper bound on d2. Even when the device does not implement getcrosststamp() then loop based sampling like it is implemented in the PTP_SYS_OFFSET[_EXTENDED] IOTCLs is providing reasonably accurate results to the extent possible. Your algorithm is imprecise by definition and you can apply as much testing as you want, it won't become magically correct. It's still a guesstimate, i.e. an estimate made without using adequate or complete information. Now why a new syscall? This can be done from user space with existing interfaces and the very same precision today: ioctl(fda, PTP_SYS_OFFSET*, &a1); ioctl(fdb, PTP_SYS_OFFSET*, &b); ioctl(fda, PTP_SYS_OFFSET*, &a2); u64 d1 = timespec_delta_ns(b.sys_monoraw, a1.sys_monoraw); u64 d2 = timespec_delta_ns(a2.sys_monoraw, a1.sys_monoraw); u64 td = (timespec_delta_ns(a2.device, a1.device) * d1) / d2 tsa = timespec_add_ns(a1.device, td); tsb = b.device; with the extra benefit of: 1) The correct CLOCK_REALTIME at that sample point, i.e. b.sys_realtime 2) The correct CLOCK_MONOTONIC_RAW at that sample point, i.e. b.sys_monoraw It works with PTP_SYS_OFFSET_PRECISE and PTP_SYS_OFFSET[_EXTENDED], with the obvious limitations of PTP_SYS_OFFSET[_EXTENDED], which are still vastly superior to your proposed (a2 - a1) / 2 guestimate which is just reading the PCH clocks with clock_get_timespec(). It is completely independent of the load, the syscall overhead and the actual time delta between the sample points when you apply a reasonable upper bound for d2, i.e. the time delta between the sample points a1 and a2 to eliminate the issue that system clock and/or the PCH clocks change their frequency significantly during that time. You'd need to do that in the kernel too. The actual frequency difference between PCH A and system clock is completely irrelevant when the frequencies of both are stable accross the sample period. You might still argue that the time delta between the sample points a1 and a2 matters and is slightly shorter in the kernel, but that is a non-argument because: 1) The kernel implementation does not guarantee atomicity of the consecutive samples either. The time delta is just statistically better, which is obviously useless when you want to make guarantees. 2) It does not matter when the time delta is slightly larger because you need a large frequency change of the involved clocks in the sample interval between the sample points a1 and a2 to make an actual difference in the resulting accuracy. A typical temperature drift of a high quality cyrstal is less than 1ppm per degree Celsius and even if you assume that the overall system drift is 10ppm per degree Celsius then still the actual error for a bound time delta between the sample points a1 and a2 is just somewhere in the irrelevant noise, unless you manage to blow torch or ice spray your crystals during the sample interval. If your clocks are not stable enough then nothing can cure it and you cannot do high precision timekeeping with them. So what is your new syscall solving that can't be done with the existing IOCTLs other than providing worse precision results based on guesstimates and some handwavy future use for random clock ids? Nothing as far as I can tell, but I might be missing something important here. Thanks, tglx --- arch/x86/kernel/tsc.c:119: "Math is hard, let's go shopping." - John Stultz