On 22/02/2017, Jason Vas Dias <jason.vas.dias@xxxxxxxxx> wrote:
> RE:
>>> 4.10 has new code which utilizes the TSC_ADJUST MSR.
>
> I just built an unpatched linux v4.10 with tglx's TSC improvements -
> much else improved in this kernel (like iwlwifi) - thanks!
>
> I have attached an updated version of the test program which
> doesn't print the bogus "Nominal TSC Frequency" (the previous
> version printed it, but equally ignored it).
>
> The clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency has improved by
> a factor of 2 - it used to be @140ns and is now @ 70ns ! Wow! :
>
> $ uname -r
> 4.10.0
> $ ./ttsc1
> max_extended_leaf: 80000008
> has tsc: 1 constant: 1
> Invariant TSC is enabled: Actual TSC freq: 2.893299GHz.
> ts2 - ts1: 144 ts3 - ts2: 96 ns1: 0.000000588 ns2: 0.000002599
> ts3 - ts2: 178 ns1: 0.000000592
> ts3 - ts2: 14 ns1: 0.000000577
> ts3 - ts2: 14 ns1: 0.000000651
> ts3 - ts2: 17 ns1: 0.000000625
> ts3 - ts2: 17 ns1: 0.000000677
> ts3 - ts2: 17 ns1: 0.000000626
> ts3 - ts2: 17 ns1: 0.000000627
> ts3 - ts2: 17 ns1: 0.000000627
> ts3 - ts2: 18 ns1: 0.000000655
> ts3 - ts2: 17 ns1: 0.000000631
> t1 - t0: 89067 - ns2: 0.000091411
>

Oops, going blind in my old age. These latencies are actually about 4 times
greater than under 4.8 !!

Under 4.8, the program printed latencies of about 140ns for clock_gettime,
as shown in bug 194609 as the 'ns1' (timespec_b - timespec_a) value:

ts3 - ts2: 24 ns1: 0.000000162
ts3 - ts2: 17 ns1: 0.000000143
ts3 - ts2: 17 ns1: 0.000000146
ts3 - ts2: 17 ns1: 0.000000149
ts3 - ts2: 17 ns1: 0.000000141
ts3 - ts2: 16 ns1: 0.000000142

Now the clock_gettime(CLOCK_MONOTONIC_RAW,&ts) latency is about 600ns,
about 4 times more than under 4.8.

But I'm glad the TSC_ADJUST problems are fixed.

Will programs reading:

$ cat /sys/devices/msr/events/tsc
event=0x00

read a new event for each setting of the TSC_ADJUST MSR or a wrmsr on the TSC?

> I think this is because under Linux 4.8, the CPU got a fault every
> time it read the TSC_ADJUST MSR.

Maybe it still is!
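
For anyone wanting to reproduce an 'ns1'-style figure without the attached
program, here is a minimal sketch of timing back-to-back
clock_gettime(CLOCK_MONOTONIC_RAW) calls. It is not the ttsc1 program: the
helper names ts_to_ns() and mono_raw_latency_ns() are made up for
illustration, and it assumes Linux with glibc.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* expose clock_gettime() and CLOCK_MONOTONIC_RAW */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Convert a struct timespec to a single nanosecond count. */
int64_t ts_to_ns(const struct timespec *ts)
{
    return (int64_t)ts->tv_sec * 1000000000LL + ts->tv_nsec;
}

/*
 * Hypothetical helper: average cost in nanoseconds of one
 * clock_gettime(CLOCK_MONOTONIC_RAW) call, measured over n
 * back-to-back calls. Returns -1.0 if the clock cannot be read.
 */
double mono_raw_latency_ns(unsigned n)
{
    struct timespec first, last;

    assert(n > 0);
    if (clock_gettime(CLOCK_MONOTONIC_RAW, &first) != 0)
        return -1.0;
    last = first;
    for (unsigned i = 0; i < n; i++)
        if (clock_gettime(CLOCK_MONOTONIC_RAW, &last) != 0)
            return -1.0;
    /* n calls happened between 'first' and the final 'last' reading. */
    return (double)(ts_to_ns(&last) - ts_to_ns(&first)) / n;
}
```

Calling mono_raw_latency_ns(1000000) and printing the result gives an average
per-call cost; single-pair differences like the 'ns1' column will be noisier
than this average.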
> But user programs wanting to use the TSC and correlate its value to
> clock_gettime(CLOCK_MONOTONIC_RAW) values accurately like the above
> program still have to dig the TSC frequency value out of the kernel
> with objdump - this was really the point of the bug #194609.
>
> I would still like to investigate exporting 'tsc_khz' & 'mult' +
> 'shift' values via sysfs.
>
> Regards,
> Jason.
>
>
> On 21/02/2017, Jason Vas Dias <jason.vas.dias@xxxxxxxxx> wrote:
>> Thank You for enlightening me -
>>
>> I was just having a hard time believing that Intel would ship a chip
>> that features a monotonic, fixed frequency timestamp counter
>> without specifying in either documentation or on-chip or in ACPI what
>> precisely that hard-wired frequency is, but I now know that to
>> be the case for the unfortunate i7-4910MQ - I mean, how can the CPU
>> assert CPUID:80000007[8] ( InvariantTSC ) which it does, which is
>> difficult to reconcile with the statement in the SDM :
>>   17.16.4 Invariant Time-Keeping
>>   The invariant TSC is based on the invariant timekeeping hardware
>>   (called Always Running Timer or ART), that runs at the core crystal clock
>>   frequency. The ratio defined by CPUID leaf 15H expresses the frequency
>>   relationship between the ART hardware and TSC. If CPUID.15H:EBX[31:0] != 0
>>   and CPUID.80000007H:EDX[InvariantTSC] = 1, the following linearity
>>   relationship holds between TSC and the ART hardware:
>>     TSC_Value = (ART_Value * CPUID.15H:EBX[31:0] )
>>                 / CPUID.15H:EAX[31:0] + K
>>   Where 'K' is an offset that can be adjusted by a privileged agent*2.
>>   When ART hardware is reset, both invariant TSC and K are also reset.
>>
>> So I'm just trying to figure out what CPUID.15H:EBX[31:0] and
>> CPUID.15H:EAX[31:0] are for my hardware. I assumed (incorrectly) that
>> the "Nominal TSC Frequency" formulae in the manual must apply to all
>> CPUs with InvariantTSC .
>>
>> Do I understand correctly , that since I do have InvariantTSC , the
>> TSC_Value is in fact calculated according to the above formula, but with
>> a "hidden" ART Value, & Core Crystal Clock frequency & its ratio to
>> TSC frequency ?
>> It was obvious this nominal TSC Frequency had nothing to do with the
>> actual TSC frequency used by Linux, which is 'tsc_khz' .
>> I guess wishful thinking led me to believe CPUID:15h was actually
>> supported somehow , because I thought InvariantTSC meant it had ART
>> hardware .
>>
>> I do strongly suggest that Linux exports its calibrated TSC Khz
>> somewhere to user space .
>>
>> I think the best long-term solution would be to allow programs to
>> somehow read the TSC without invoking
>> clock_gettime(CLOCK_MONOTONIC_RAW,&ts), &
>> having to enter the kernel, which incurs an overhead of > 120ns on my
>> system .
>>
>>
>> Couldn't linux export its 'tsc_khz' and / or 'clocksource->mult' and
>> 'clocksource->shift' values to sysfs somehow ?
>>
>> For instance , only if the 'current_clocksource' is 'tsc', then these
>> values could be exported as:
>>   /sys/devices/system/clocksource/clocksource0/shift
>>   /sys/devices/system/clocksource/clocksource0/mult
>>   /sys/devices/system/clocksource/clocksource0/freq
>>
>> So user-space programs could know that the value returned by
>>   clock_gettime(CLOCK_MONOTONIC_RAW)
>> would be
>>   { .tv_sec  = ( ( rdtsc() * mult ) >> shift ) >> 32,
>>     .tv_nsec = ( ( rdtsc() * mult ) >> shift ) &~0U
>>   }
>> and that represents ticks of period (1.0 / ( freq * 1000 )) S.
>>
>> That would save user-space programs from having to know 'tsc_khz' by
>> parsing the 'Refined TSC' frequency from log files or by examining the
>> running kernel with objdump to obtain this value & figure out 'mult' &
>> 'shift' themselves.
>>
>> And why not a
>>   /sys/devices/system/clocksource/clocksource0/value
>> file that actually prints this ( ( rdtsc() * mult ) >> shift )
>> expression as a long integer?
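
A small sketch of the conversion this proposal implies, assuming the exported
'mult' and 'shift' would carry the same meaning as in the kernel's
clocksource code: nanoseconds = (cycles * mult) >> shift, then split into
seconds and nanoseconds by division, as IA64_s_ns_since_start() further down
does. The function names are hypothetical, and unsigned __int128 is a
GCC/Clang extension used here only to keep the intermediate product from
overflowing. It converts a cycle delta, not an absolute clock value.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Scale a TSC cycle delta to nanoseconds: ns = (cycles * mult) >> shift.
 * The 128-bit intermediate is an assumption of this sketch (GCC/Clang
 * extension); it avoids overflow of the 64-bit product.
 */
uint64_t cycles_to_ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
    assert(shift < 64);
    return (uint64_t)(((unsigned __int128)cycles * mult) >> shift);
}

/* Split a nanosecond count into the tv_sec / tv_nsec pair of a timespec. */
struct timespec ns_to_timespec(uint64_t ns)
{
    struct timespec ts;

    ts.tv_sec  = (time_t)(ns / NSEC_PER_SEC);
    ts.tv_nsec = (long)(ns % NSEC_PER_SEC);
    return ts;
}
```

With mult = 1 << shift the scaling is the identity, which makes a convenient
sanity check: 2500000000 cycles then map to tv_sec = 2, tv_nsec = 500000000.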
>> And perhaps a
>>   /sys/devices/pnp0/XX\:YY/rtc/rtc0/nanoseconds
>> file that actually prints out the number of real-time nano-seconds since
>> the contents of the existing
>>   /sys/devices/pnp0/XX\:YY/rtc/rtc0/{time,since_epoch}
>> files using the current TSC value?
>> To read the rtc0/{date,time} files is already faster than entering the
>> kernel to call
>> clock_gettime(CLOCK_REALTIME, &ts) & convert to integer for scripts.
>>
>> I will work on developing a patch to this effect if no-one else is.
>>
>> Also, am I right in assuming that the maximum granularity of the
>> real-time clock
>> on my system is 1/64th of a second ? :
>>   $ cat /sys/devices/pnp0/00\:02/rtc/rtc0/max_user_freq
>>   64
>> This is the maximum granularity that can be stored in CMOS , not
>> returned by TSC? Couldn't we have something similar that gave an
>> accurate idea of TSC frequency and the precise formula applied to TSC
>> value to get clock_gettime(CLOCK_MONOTONIC_RAW) value ?
>>
>> Regards,
>> Jason
>>
>>
>> This code does produce good timestamps with a latency of @20ns
>> that correlate well with clock_gettime(CLOCK_MONOTONIC_RAW,&ts)
>> values, but it depends on a global variable that is initialized to
>> the 'tsc_khz' value
>> computed by running kernel parsed from objdump /proc/kcore output :
>>
>> static inline __attribute__((always_inline))
>> U64_t
>> IA64_tsc_now()
>> { if(!( _ia64_invariant_tsc_enabled
>>       ||(( _cpu0id_fd == -1) && IA64_invariant_tsc_is_enabled(NULL,NULL))
>>       )
>>     )
>>   { /* pass the line and function the format string asks for */
>>     fprintf(stderr, __FILE__":%d:(%s): must be called with invariant TSC enabled.\n",
>>             __LINE__, __func__);
>>     return 0;
>>   }
>>   U32_t tsc_hi, tsc_lo;
>>   register UL_t tsc;
>>   asm volatile
>>   ( "rdtscp\n\t"
>>     "mov %%edx, %0\n\t"
>>     "mov %%eax, %1\n\t"
>>     "mov %%ecx, %2\n\t"
>>   : "=m" (tsc_hi) ,
>>     "=m" (tsc_lo) ,
>>     "=m" (_ia64_tsc_user_cpu) :
>>   : "%eax","%ecx","%edx"
>>   );
>>   tsc=(((UL_t)tsc_hi) << 32)|((UL_t)tsc_lo);
>>   return tsc;
>> }
>>
>> __thread
>> U64_t _ia64_first_tsc = 0xffffffffffffffffUL;
>>
>> static inline __attribute__((always_inline))
>> U64_t IA64_tsc_ticks_since_start()
>> { if(_ia64_first_tsc == 0xffffffffffffffffUL)
>>   { _ia64_first_tsc = IA64_tsc_now();
>>     return 0;
>>   }
>>   return (IA64_tsc_now() - _ia64_first_tsc) ;
>> }
>>
>> static inline __attribute__((always_inline))
>> void
>> ia64_tsc_calc_mult_shift
>> ( register U32_t *mult,
>>   register U32_t *shift
>> )
>> { /* paraphrases Linux clocksource.c's clocks_calc_mult_shift() function:
>>    * calculates second + nanosecond mult + shift in same way linux does.
>>    * we want to be compatible with what linux returns in struct
>>    * timespec ts after call to
>>    * clock_gettime(CLOCK_MONOTONIC_RAW, &ts).
>>    */
>>   const U32_t scale=1000U;
>>   register U32_t from= IA64_tsc_khz();
>>   register U32_t to = NSEC_PER_SEC / scale;
>>   register U64_t sec = ( ~0UL / from ) / scale;
>>   sec = (sec > 600) ? 600 : ((sec > 0) ? sec : 1);
>>   register U64_t maxsec = sec * scale;
>>   UL_t tmp;
>>   U32_t sft, sftacc=32;
>>   /*
>>    * Calculate the shift factor which is limiting the conversion
>>    * range:
>>    */
>>   tmp = (maxsec * from) >> 32;
>>   while (tmp)
>>   { tmp >>=1;
>>     sftacc--;
>>   }
>>   /*
>>    * Find the conversion shift/mult pair which has the best
>>    * accuracy and fits the maxsec conversion range:
>>    */
>>   for (sft = 32; sft > 0; sft--)
>>   { tmp = ((UL_t) to) << sft;
>>     tmp += from / 2;
>>     tmp = tmp / from;
>>     if ((tmp >> sftacc) == 0)
>>       break;
>>   }
>>   *mult = tmp;
>>   *shift = sft;
>> }
>>
>> __thread
>> U32_t _ia64_tsc_mult = ~0U, _ia64_tsc_shift=~0U;
>>
>> static inline __attribute__((always_inline))
>> U64_t IA64_s_ns_since_start()
>> { if( ( _ia64_tsc_mult == ~0U ) || ( _ia64_tsc_shift == ~0U ) )
>>     ia64_tsc_calc_mult_shift( &_ia64_tsc_mult, &_ia64_tsc_shift);
>>   register U64_t cycles = IA64_tsc_ticks_since_start();
>>   register U64_t ns = ((cycles *((UL_t)_ia64_tsc_mult))>>_ia64_tsc_shift);
>>   return( (((ns / NSEC_PER_SEC)&0xffffffffUL) << 32)
>>         | ((ns % NSEC_PER_SEC)&0x3fffffffUL) );
>>   /* Yes, we are
>>    purposefully ignoring durations of more than 4.2
>>    billion seconds here! */
>> }
>>
>>
>> I think Linux should export the 'tsc_khz', 'mult' and 'shift' values
>> somehow,
>> then user-space libraries could have more confidence in using 'rdtsc'
>> or 'rdtscp'
>> if Linux's current_clocksource is 'tsc'.
>>
>> Regards,
>> Jason
>>
>>
>>
>> On 20/02/2017, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
>>> On Sun, 19 Feb 2017, Jason Vas Dias wrote:
>>>
>>>> CPUID:15H is available in user-space, returning the integers : ( 7,
>>>> 832, 832 ) in EAX:EBX:ECX , yet boot_cpu_data.cpuid_level is 13 , so
>>>> in detect_art() in tsc.c,
>>>
>>> By some definition of available. You can feed CPUID random leaf numbers and
>>> it will return something, usually the value of the last valid CPUID leaf,
>>> which is 13 on your CPU. A similar CPU model has
>>>
>>> 0x0000000d 0x00: eax=0x00000007 ebx=0x00000340 ecx=0x00000340 edx=0x00000000
>>>
>>> i.e. 7, 832, 832, 0
>>>
>>> Looks familiar, right?
>>>
>>> You can verify that with 'cpuid -1 -r' on your machine.
>>>
>>>> Linux does not think ART is enabled, and does not set the synthesized
>>>> CPUID +
>>>> ((3*32)+10) bit, so a program looking at /dev/cpu/0/cpuid would not
>>>> see this bit set .
>>>
>>> Rightfully so. This is a Haswell Core model.
>>>
>>>> if an e1000 NIC card had been installed, PTP would not be available.
>>>
>>> PTP is independent of the ART kernel feature . ART just provides enhanced
>>> PTP features. You are confusing things here.
>>>
>>> The ART feature as the kernel sees it is a hardware extension which feeds
>>> the ART clock to peripherals for timestamping and time correlation
>>> purposes. The ratio between ART and TSC is described by CPUID leaf 0x15 so
>>> the kernel can make use of that correlation, e.g. for enhanced PTP
>>> accuracy.
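
The point above about out-of-range leaves can be checked from user space
before trusting leaf 0x15. A minimal sketch, assuming GCC or Clang's
<cpuid.h> (__get_cpuid_max and __cpuid_count); the function names
cpuid_leaf_supported() and tsc_art_ratio() are hypothetical, and on non-x86
builds both simply report "not supported".

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Return 1 if 'leaf' lies within the range the CPU itself reports as valid. */
int cpuid_leaf_supported(uint32_t leaf)
{
#if defined(__x86_64__) || defined(__i386__)
    /* Leaf 0 reports the highest basic leaf, 0x80000000 the highest
     * extended leaf; pick the range that 'leaf' belongs to. */
    unsigned int max = __get_cpuid_max(leaf & 0x80000000u, NULL);
    return max != 0 && leaf <= max;
#else
    (void)leaf;
    return 0;
#endif
}

/*
 * Read the TSC/ART ratio from CPUID leaf 0x15 (EBX / EAX).
 * Returns 1 and fills *num and *den only if the leaf is really
 * implemented and both values are non-zero; otherwise returns 0.
 */
int tsc_art_ratio(uint32_t *num, uint32_t *den)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    if (!cpuid_leaf_supported(0x15))
        return 0;              /* e.g. a CPU whose highest basic leaf is 13 */
    __cpuid_count(0x15, 0, eax, ebx, ecx, edx);
    (void)ecx;
    (void)edx;
    if (eax == 0 || ebx == 0)
        return 0;
    *num = ebx;
    *den = eax;
    return 1;
#else
    (void)num;
    (void)den;
    return 0;
#endif
}
```

On a CPU whose highest basic leaf is 13, tsc_art_ratio() would return 0
rather than handing back whatever an out-of-range query happens to produce.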
>>>
>>> It's correct, that the NONSTOP_TSC feature depends on the availability of
>>> ART, but that has nothing to do with the feature bit, which solely
>>> describes the ratio between TSC and the ART frequency which is exposed to
>>> peripherals. That frequency is not necessarily the real ART frequency.
>>>
>>>> Also, if the MSR TSC_ADJUST has not yet been written, as it seems to be
>>>> nowhere else in Linux, the code will always think X86_FEATURE_ART is 0
>>>> because the CPU will always get a fault reading the MSR since it has
>>>> never been written.
>>>
>>> Huch? If an access to the TSC ADJUST MSR faults, then something is really
>>> wrong. And writing it unconditionally to 0 is not going to happen. 4.10 has
>>> new code which utilizes the TSC_ADJUST MSR.
>>>
>>>> It would be nice for user-space programs that want to use the TSC with
>>>> rdtsc / rdtscp instructions, such as the demo program attached to the
>>>> bug report,
>>>> could have confidence that Linux is actually generating the results of
>>>> clock_gettime(CLOCK_MONOTONIC_RAW, &timespec)
>>>> in a predictable way from the TSC by looking at the
>>>> /dev/cpu/0/cpuid[bit(((3*32)+10)] value before enabling user-space
>>>> use of TSC values, so that they can correlate TSC values with linux
>>>> clock_gettime() values.
>>>
>>> What has ART to do with correct CLOCK_MONOTONIC_RAW values?
>>>
>>> Nothing at all, really.
>>>
>>> The kernel makes use of the proper information values already.
>>>
>>> The TSC frequency is determined from:
>>>
>>> 1) CPUID(0x16) if available
>>> 2) MSRs if available
>>> 3) By calibration against a known clock
>>>
>>> If the kernel uses TSC as clocksource then the CLOCK_MONOTONIC_* values are
>>> correct whether that machine has ART exposed to peripherals or not.
>>>
>>>> has tsc: 1 constant: 1
>>>> 832 / 7 = 118 : 832 - 9.888914286E+04hz : OK:1
>>>
>>> And that voodoo math tells us what?
>>> That you found a way to correlate
>>> CPUID(0xd) to the TSC frequency on that machine.
>>>
>>> Now I'm curious how you do that on this other machine which returns for
>>> cpuid(15): 1, 1, 1
>>>
>>> You can't because all of this is completely wrong.
>>>
>>> Thanks,
>>>
>>> tglx
>>>
>>
>
--
To unsubscribe from this list: send the line "unsubscribe kernel-janitors" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html