Arul, On Sat, 17 Aug 2019, Arul Jeniston wrote: > Do you agree the possibility of returning zero value from timerfd_read()? Obviosuly it happens. > > That has absolutely nothing to do with CLOCK_REALTIME. Your machines > > TSC is either going backwards or not synchronized between cores. > > > > Hint: Dell has a track record of BIOS doing the wrong things to TSC in > > order to hide their 'value add' features stealing CPU time. > > We haven't seen any issue in Dell hardware but we would definitely check > the possibility of hardware bug. BIOS is the more likely candidate. > Let us say, due to some reason the tsc goes backwards. Isn't it handled in > clocksource_delta(). No. clocksource_delta() does damage limitation. It prevents insane large time jumps which would result if the read out TSC value is less than the base value which is used to calculate the delta. It cannot do more than that. > Is timerfd_read expected to return 0 if tsc drifts backwards? If so, why it > is not documented? > Being a system call, we expect all return values of read() on timerfd to be > documented in the man page. We expect TSC not to go backwards. If it does it's broken and not usable as a clocksource for the kernel. The problem is that this is not necessarily easy to detect. Fact is, that your machines TSC goes backwards or is not properly synchronized between the cores. Otherwise the problem would not exist at all. That's the problem which needs to be fixed and not papered over with crude hacks and weird justifications. > > ktime_get() is CLOCK_MONOTONIC and not CLOCK_REALTIME. > > We see the same base used for CLOCK_MONOTONIC, CLOCK_REALTIME timers here. > both MONOTONIC, REALTIME timers hits the following code flow. we confirmed > it through instrumentation. > timerfd_read()-->hrtimer_forward_now()-->ktime_get()-->timekeeping_get_ns()-->timekeeping_get_delta()-->clocksource_delta(). > Do you want me to share any other logs to confirm it? No. That's the case when you use a relative timer with CLOCK_REALTIME because only absolute timers are affected by modifications of CLOCK_REALTIME. So it's NOT an issue of a CLOCK_REALTIME modification via settimeofday() or adjtimex(). > > It's a bug, but either a hardware or a BIOS bug and you are trying to > > paper over it at the place where you observe the symptom, which is > > obviously the wrong place because: > > > > 1) Any other time related function even in timerfd is affected as well > > > > 2) We do not cure symptoms, we cure the root cause. And clearly the root > > cause hase not been explained and addressed. > > if we don't fix this in kernel, can we document this return value in > timerfd read() man page? Again: You cannot fix a hardware problem by hacking around it at exactly one place where you can observe it. If that problem exists on your machine, then any other time related function is affected as well. Are you going to submit patches against _ALL_ time{r} related syscalls to fix^Wpaper over this? Either against the kernel or against the man pages? As this is a 4 core Rangely, it has a properly synchronized TSC on all 4 cores which runs with constant frequency and is not affected by deeper C-States. Here is the flow: timerfd_read() waitfortick() timer interrupt() time = ktime_get() expire timer time >= timer_time tick++; wakeup_reader() hrtimer_forward_now() now = ktime_get() In the failure case now < timer_time i.e. time went backwards since the timer was expired. That's absolutely unexpected behaviour and no, we are not papering over it. Did you ever quantify how much time goes backwards in that case? Is the timer expiry and the timerfd_read() on the same CPU or on different ones? Can you please provide a full dmesg from boot to after the point where this failure happens? Thanks, tglx