Hello Thomas, On 3/30/20 12:50 AM, Thomas Gleixner wrote: > Micheal, > > "Michael Kerrisk (man-pages)" <mtk.manpages@xxxxxxxxx> writes: >> [Greetings, Thomas; now I recall a conversation we had in Lyon :-) ] > > Hehe. > >> I think this patch does not really capture the details >> properly. The immediately preceding paragraph says: >> >> If the associated clock is either CLOCK_REALTIME or >> CLOCK_REALTIME_ALARM, the timer is absolute >> (TFD_TIMER_ABSTIME), and the flag TFD_TIMER_CANCEL_ON_SET >> was specified when calling timerfd_settime(), then read(2) >> fails with the error ECANCELED if the real-time clock >> undergoes a discontinuous change. (This allows the reading >> application to discover such discontinuous changes to the >> clock.) >> >> Following on from that, I think we should have a pargraph that says >> something like: >> >> If the associated clock is either CLOCK_REALTIME or >> CLOCK_REALTIME_ALARM, the timer is absolute >> (TFD_TIMER_ABSTIME), and the flag TFD_TIMER_CANCEL_ON_SET >> was not specified when calling timerfd_settime(), then a >> discontinuous negative change to the clock >> (e.g., clock_settime(2)) may cause read(2) to unblock, but >> return a value of 0 (i.e., no bytes read), if the clock >> change occurs after the time expired, but before the >> read(2) on the timerfd file descriptor. > > Yes, that's correct. Accurate as always! Thanks. (It took me a while to nut it out, actually.) > This is pretty much in line with clock_nanosleep(CLOCK_REALTIME, > TIMER_ABSTIME) which has a similar problem vs. observability in user > space. > > clock_nanosleep(2) mutters: > > "POSIX.1 specifies that after changing the value of the CLOCK_REALTIME > clock via clock_settime(2), the new clock value shall be used to > determine the time at which a thread blocked on an absolute > clock_nanosleep() will wake up; if the new clock value falls past the > end of the sleep interval, then the clock_nanosleep() call will return > immediately." > > which can be interpreted as guarantee that clock_nanosleep() never > returns prematurely, <nod> > i.e. the assert() in the below code would indicate > a kernel failure: > > ret = clock_nanosleep(CLOCK_REALTIME, TIMER_ABSTIME, &expiry, NULL); > if (!ret) { > clock_gettime(CLOCK_REALTIME, &now); > assert(now >= expiry); > } > > But that assert can trigger when CLOCK_REALTIME was modified after the > timer fired and the kernel decided to wake up the task and let it return > to user space. > > clock_nanosleep(..., &expiry) > arm_timer(expires); > schedule(); > > -> timer interrupt > now = ktime_get_real(); > if (expires <= now) > -------------------------------- After this point > wakeup(); clock_settime(2) or > adjtimex(2) which > makes CLOCK_REALTIME > jump back far enough will > cause the above assert > to trigger. > > ... > return from syscall (retval == 0) > > There is no guarantee against clock_settime() coming after the > wakeup. Even if we put another check into the return to user path then > we won't catch a clock_settime() which comes right after that and before > user space invokes clock_gettime(). <nod> > POSIX spec Issue 7 (2018 edition) says: > > The suspension for the absolute clock_nanosleep() function (that is, > with the TIMER_ABSTIME flag set) shall be in effect at least until the > value of the corresponding clock reaches the absolute time specified by > rqtp. > > And that's what the kernel implements for clock_nanosleep() and timerfd > behaves exactly the same way. > > The wakeup of the waiter, i.e. task blocked in clock_nanosleep(2), > read(2), poll(2), is not happening _before_ the absolute time specified > is reached. > > If clock_settime() happens right before the expiry check, then it does > the right thing, but any modification to the clock after the wakeup > cannot be mitigated. At least not in a way which would make the assert() > in the example code above a reliable indicator for a kernel fail. > > That's the reason why I rejected the attempt to mitigate that particular > 0 tick issue in timerfd as it would just scratch a particular itch but > still not provide any guarantee. So having the '0' return documented is > the right way to go. Thanks for the incredibly detailed follow-up Thomas (which all landed in my commit message). I've applied a patch almost exactly the same as the text I showed above (and it's pushed to Git). All of 2020 is a bust, I expect. Perhaps we see us at a conference in 2021 ;-). Cheers, Michael -- Michael Kerrisk Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ Linux/UNIX System Programming Training: http://man7.org/training/