On Fri, Jan 26, 2018 at 11:46 AM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
> On Fri, Jan 26, 2018 at 10:59 AM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>> On Fri, Jan 26, 2018 at 8:22 AM, Andy Lutomirski <luto@xxxxxxxxxx> wrote:
>>> On Fri, Jan 26, 2018 at 7:36 AM, Dan Rue <dan.rue@xxxxxxxxxx> wrote:
>>>>
>>>> We've noticed that fsgsbase_64 can fail intermittently with the
>>>> following error:
>>>>
>>>> [RUN] ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1
>>>> Before schedule, set selector to 0x1
>>>> other thread: ARCH_SET_GS(0x1) -- sel is 0x0
>>>> [FAIL] GS/BASE changed from 0x1/0x0 to 0x0/0x0
>>>>
>>>> This can be reliably reproduced by running fsgsbase_64 in a loop. i.e.
>>>>
>>>> for i in $(seq 1 10000); do ./fsgsbase_64 || break; done
>>>>
>>>> This problem isn't new - I've reproduced it on latest mainline and every
>>>> release going back to v4.12 (I did not try earlier). This was tested on
>>>> a Supermicro board with a Xeon E3-1220 as well as an Intel Nuc with an
>>>> i3-5010U.
>>>>
>>>
>>> Hmm, I can reproduce it, too. I'll look in a bit.
>>
>> I'm triggering a different error, and I think what's going on is that
>> the kernel doesn't currently re-save GSBASE when a task switches out
>> and that task has saved gsbase != 0 and in-register GS == 0. This is
>> arguably a bug, but it's not an infoleak, and fixing it could be a wee
>> bit expensive. I'm not sure what, if anything, to do about this. I
>> suppose I could add some gross perf hackery to the test to detect this
>> case and suppress the error.
>>
>> I can also trigger the problem you're seeing, and I don't know what's
>> up. It may be related to an old problem I've seen that causes signal
>> delivery to sometimes corrupt %gs. It's deterministic, but it depends
>> in some odd way on register state. I can currently reproduce that
>> issue 100% of the time, and I'm trying to see if I can figure out
>> what's happening.
>
> I think it's a CPU bug, and I'm a bit mystified.
> I can trigger the following, plausibly related issue:
>
> Write a program that writes %gs = 1.
> Run that program under gdb
> break in which %gs == 1
> display/x $gs
> si
>
> Under QEMU TCG, gs stays equal to 1. On native or KVM, on Skylake, it
> changes to 0.
>
> On KVM or native, I do not observe do_debug getting called with %gs ==
> 1. On TCG, I do. I don't think that's precisely the problem that's
> causing the test to fail, since the test doesn't use TF or ptrace, but
> I wouldn't be shocked if it's related.
>
> hpa, any insight?
>
> (NB: if you want to play with this as I've described it, you may need
> to make invalid_selector() in ptrace.c always return false. The
> current implementation is too strict and causes problems.)

Much simpler test. Run the attached program (gs1). It more or less
just sets %gs to 1 and spins until it stops being 1. Do it on a
kernel with the attached patch applied. I see stuff like this:

# ./gs1
PID = 129
[   15.703015] pid 129 saved gs = 1
[   15.703517] pid 129 loaded gs = 1
[   15.703973] pid 129 prepare_exit_to_usermode: gs = 1
ax = 0, cx = 0, dx = 0

So we're interrupting the program, switching out, switching back in,
setting %gs to 1, observing that %gs is *still* 1 in
prepare_exit_to_usermode(), returning to usermode, and observing %gs
== 0.

Presumably what's happening is that the IRET microcode matches the
SDM's pseudocode, which says:

RETURN-TO-OUTER-PRIVILEGE-LEVEL:
...
FOR each SegReg in (ES, FS, GS, and DS)
    DO
        tempDesc ← descriptor cache for SegReg (* hidden part of segment register *)
        IF tempDesc(DPL) < CPL AND tempDesc(Type) is data or non-conforming code
            THEN (* Segment register invalid *)
                SegReg ← NULL;
        FI;
    OD;

But this is very odd.
The actual permission checks (in the docs for MOV) are:

IF DS, ES, FS, or GS is loaded with non-NULL selector
THEN
    IF segment selector index is outside descriptor table limits
    or segment is not a data or readable code segment
    or ((segment is a data or nonconforming code segment)
    or ((RPL > DPL) and (CPL > DPL))
        THEN #GP(selector); FI;

^^^^ This makes no sense. This says that the data segments cannot be
loaded with MOV. Empirically, it seems like MOV works if CPL <= DPL
and RPL <= DPL, but I haven't checked that hard.

    IF segment not marked present
        THEN #NP(selector);
        ELSE
            SegmentRegister ← segment selector;
            SegmentRegister ← segment descriptor; FI;
FI;

IF DS, ES, FS, or GS is loaded with NULL selector
THEN
    SegmentRegister ← segment selector;
    SegmentRegister ← segment descriptor;

^^^^ wtf? There is no "segment descriptor". Presumably what actually
gets written to segment.DPL is nonsense.

FI;

Anyway, I think it's nonsense that user code can load a selector using
MOV that is, in turn, rejected by IRET. I don't suppose Intel would
consider fixing this going forward.

Borislav, any chance you could run the attached program on an AMD
machine to see what it does?
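
For reference, here is a small sketch of how a selector value splits
into its fields, since the whole discussion hinges on the value 1
being a NULL selector. This is an illustration only, not part of the
attached test: the names struct selector_fields, decode_selector() and
is_null_selector() are made up for this note. The field layout (RPL in
bits 0-1, TI in bit 2, index in bits 3-15) is the architectural one,
and a selector is NULL when it refers to GDT index 0, i.e. the values
0 through 3.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type: the three fields of an x86 segment selector. */
struct selector_fields {
	unsigned rpl;	/* bits 0-1: requested privilege level */
	unsigned ti;	/* bit 2: table indicator (0 = GDT, 1 = LDT) */
	unsigned index;	/* bits 3-15: descriptor table index */
};

/* Split a 16-bit selector into RPL, TI and index. */
static struct selector_fields decode_selector(uint16_t sel)
{
	struct selector_fields f;

	f.rpl = sel & 0x3;
	f.ti = (sel >> 2) & 0x1;
	f.index = sel >> 3;
	return f;
}

/*
 * A selector is NULL when it points at GDT entry 0, whatever its RPL,
 * so the values 0, 1, 2 and 3 are all NULL selectors.
 */
static bool is_null_selector(uint16_t sel)
{
	return (sel & 0xfffc) == 0;
}
```

With these helpers, decode_selector(1) gives index 0, TI 0, RPL 1, and
is_null_selector(1) is true, which is why the MOV in gs1 takes the
"loaded with NULL selector" branch of the pseudocode.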
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	unsigned short ax, cx, dx;

	printf("PID = %d\n", (int)getpid());

	/*
	 * Load %gs with selector 1, then spin reading it back three
	 * times per iteration until any read returns something else.
	 */
	asm volatile ("mov %[one], %%gs\n\t"
		      "1:\n\t"
		      "mov %%gs, %%eax\n\t"
		      "mov %%gs, %%ecx\n\t"
		      "mov %%gs, %%edx\n\t"
		      "cmpw $1, %%ax\n\tjne 2f\n\t"
		      "cmpw $1, %%cx\n\tjne 2f\n\t"
		      "cmpw $1, %%dx\n\tjne 2f\n\t"
		      "jmp 1b\n\t"
		      "2:"
		      : "=a" (ax), "=c" (cx), "=d" (dx)
		      : [one] "rm" ((unsigned short)1));

	/* Reached only once %gs has changed; print the three values read. */
	printf("ax = %hx, cx = %hx, dx = %hx\n", ax, cx, dx);

	return 0;
}
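
One way to read what gs1 reports is to model the quoted
RETURN-TO-OUTER-PRIVILEGE-LEVEL loop directly. The sketch below is a
toy model, not kernel or microcode source. The struct seg_cache
layout and the functions load_null_selector() and iret_scrub_segment()
are hypothetical names. Above all, the contents of the descriptor
cache after a NULL load (DPL 0, data type) are an assumption: the mail
only guesses that whatever ends up in DPL is nonsense. Under that
assumption, a return to CPL 3 clears the selector, which would match
the observed %gs == 0.

```c
#include <stdbool.h>
#include <stdint.h>

enum seg_type { SEG_DATA, SEG_NONCONFORMING_CODE, SEG_CONFORMING_CODE };

/* Hypothetical model of a segment register: visible selector plus cache. */
struct seg_cache {
	uint16_t selector;	/* visible part */
	unsigned dpl;		/* hidden part: cached DPL */
	enum seg_type type;	/* hidden part: cached type */
};

/*
 * Models MOV of a NULL selector (0..3) into a data segment register.
 * ASSUMPTION: the cache ends up with DPL 0 and a data type. The real
 * contents are not documented in the text quoted above.
 */
static struct seg_cache load_null_selector(uint16_t sel)
{
	struct seg_cache c = { .selector = sel, .dpl = 0, .type = SEG_DATA };

	return c;
}

/*
 * One iteration of the quoted SDM loop: if the cached DPL is below the
 * CPL being returned to and the segment is data or non-conforming
 * code, the register is nulled. Only the visible selector is modelled.
 */
static void iret_scrub_segment(struct seg_cache *seg, unsigned new_cpl)
{
	bool data_or_nonconforming = seg->type == SEG_DATA ||
				     seg->type == SEG_NONCONFORMING_CODE;

	if (seg->dpl < new_cpl && data_or_nonconforming)
		seg->selector = 0;	/* SegReg <- NULL */
}
```

In this model, calling iret_scrub_segment() with new_cpl = 3 on the
result of load_null_selector(1) leaves selector == 0, while an
ordinary user data segment with cached DPL 3 keeps its selector.
Whether real hardware behaves this way for the reason assumed here is
exactly the open question in the mail.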