On Thursday, June 1st, 2023 at 11:20 AM, Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:

> Here is an example where it falls flat on its nose.
>
> One of the early Ryzen laptops had a broken BIOS which came up with
> unsynchronized TSCs. I tried to fix that up, but couldn't get it to sync
> on all CPUs because for some stupid reason the TSC write got
> arbitrarily delayed (assumably by SMI/SMM).

Hah, I remember that. That was actually my laptop. A Lenovo ThinkPad A485 with a Ryzen 2700U. I've seen the problem since then occasionally on newer Ryzen laptops (and even desktops). Without the awful "tsc=directsync" patch I wrote, which I've been carrying for years now in my own kernel builds, it just falls back to HPET. It's not pleasant, but at least it's a stable clock.

> After the vendor fixed the BIOS, I tried again and the problem
> persisted.
>
> So on such a machine the 'fixup time' mechanism would simply render an
> otherwise perfectly fine TSC unusable for timekeeping.
>
> We asked both Intel and AMD to add TSC_ADJUST probably 15 years
> ago. Intel added it with some HSW variants (IIRC) and since SKL all CPUs
> have it. I don't know why AMD thought it's not required. That could have
> spared a gazillion of bugzilla entries vs. the early Ryzen machines.
>

Agreed, TSC_ADJUST is the ultimate solution for any of these kinds of issues. But last I heard from AMD, it's still several years out in silicon, and there's plenty of hardware to maintain compatibility with. Ugh.

A software solution would be preferable in the meantime, but I don't know what options are left at this point.

The trap-and-emulate via SIGSEGV approach proposed earlier in the thread is unfortunately not likely to be practical, assuming I implemented it properly.

One issue is how much overhead it has. This is an instruction that normally executes in roughly 50 clock cycles (RDTSC) to 100 clock cycles (RDTSCP) on Zen 3.
Based on a proof-of-concept I wrote, the overhead of trapping and emulating with a signal handler is roughly 100x. On my Zen 3 system, it goes up to around 10000 clock cycles per trapped read of RDTSCP.

Most Windows games that use this instruction directly are doing so under the assumption that the TSC is faster to read than any of the native Windows API clock sources. If it's suddenly ~100x slower than even the slowest-to-read Windows clocksource, those games would likely become entirely unplayable, depending on how frequently they do TSC reads. (And many do so quite often!)

Also, my proof-of-concept doesn't actually do the emulation part. It just traps the instruction and then executes that same instruction in the signal handler, putting the results in the right registers. So it's a pass-through approach, which is about the best you can do performance-wise.

Another issue is that the implementation might be tricky. In the case of Wine, you'd need to enable PR_TSC_SIGSEGV whenever entering the Windows executable and PR_TSC_ENABLE whenever leaving it. If you don't, any of the normally well-behaved clock sources implemented using the TSC (e.g. CLOCK_MONOTONIC_RAW) would also fault on the Wine side.

Also, there's some Windows-specific trickery, in that the Windows registry exposes the TSC frequency in a couple of places, so those would need to be replaced with the frequency of the emulated clocksource.

- Steven