On Mon, Oct 3, 2022, at 3:21 PM, Ali Raza wrote:
> If a UKL application makes a system call, it won't go through with the
> syscall assembly instruction. Instead, the application will use the call
> instruction to go to the kernel entry point. Instead of adding checks to
> the normal entry_SYSCALL_64 to see if we came here from a UKL task or a
> normal application task, we create a totally new entry point called
> ukl_entry_SYSCALL_64. This allows the normal entry point to be unchanged
> and simplifies the UKL specific code as well.
>
> ukl_entry_SYSCALL_64 is similar to entry_SYSCALL_64 except that it has to
> populate %rcx with return address manually (syscall instruction does that
> automatically for normal application tasks). This allows the pt_regs to be
> correct. Also, we have to push the flags onto the user stack, because on
> the return path, we first switch to user stack, then pop the flags and then
> return. Popping the flags would restart interrupts, so we don't want to be
> stuck on kernel stack when an interrupt hits. All this can be done with an
> iret instruction, but call/iret pair performs way slower than a call/ret
> pair.
>
> Also, on the entry path, we make sure the context flag i.e., in_user is set
> to 1 to indicate we are now in kernel context so any new interrupts don't
> have to go through kernel entry code again. This is normally done with the
> CS value on stack, but in UKL case that will always be a kernel value. On
> the way back, the in_user is switched back to 2 to indicate that now
> application context is being entered. All non-UKL tasks have the in_user
> value set to 0.
>
> The UKL application uses a slightly different value for CS, instead of
> 0x33, we use 0xC3. As most of the tests compare only the least significant
> nibble, they behave as expected. The C value in the second nibble allows us
> to distinguish between user space and UKL application code.

My intuition would be to try this the other way around.
Use an actual honest CS (specifically __KERNEL_CS) for pt_regs->cs.
Translate at the user ABI boundary instead. After all, a UKL task is
essentially just a kernel thread that happens to have a pt_regs area.

>
> Rest of the code makes sure the above mentioned in_user context tracking is
> done for all entry and exit cases i.e., for interrupts, exceptions etc. If
> it's a UKL task, if in_user value is 2, we treat it as an application task,
> and if it is 1, we treat it as coming from kernel context. We skip these
> checks if in_user is 0.

By "context tracking" are you referring to RCU? Since a UKL task is
essentially a kernel thread, what "entry" is there other than setting up
pt_regs?

>
> swapgs_restore_regs_and_return_to_usermode changes also make sure that
> in_user is correct and then we iret back.
>
> Double fault handling is a special case. Normally, if a user stack suffers a
> page fault, hardware switches to a kernel stack and pushes a frame onto the
> kernel stack. This switch only happens if the execution was in user
> privilege level when the page fault occurred. For UKL, execution is always
> in kernel level, so when the user stack suffers a page fault, no switch to
> a pinned kernel stack happens, and hardware tries to push state on the
> already faulting user stack. This generates a double fault. So we handle
> this case in the double fault handler by assuming any double fault is
> actually a user stack page fault. This can also be fixed by making all page
> faults go through a pinned stack using the IST mechanism. We have tried and
> tested that, but in the interest of touching as little code as possible, we
> chose this option instead.

Eww. I guess this is a real problem, but eww.
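
For reference, going back to the CS encoding quoted above: the arithmetic
behind "0x33 vs. 0xC3" can be sketched in a few lines of C. This is only an
illustration of the nibble comparison as described in the quoted text, not
code from the patch; the macro and helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the quoted description; the names are made up here. */
#define USER_CS_VALUE 0x33u /* CS of a normal user-space task */
#define UKL_CS_VALUE  0xC3u /* CS the patch stores for UKL application code */

/* Hypothetical helper: a test that looks only at the least significant nibble. */
static inline bool cs_low_nibble_is_user(uint64_t cs)
{
	return (cs & 0xFu) == 0x3u;
}

/* Hypothetical helper: the second nibble (0xC) marks UKL application code. */
static inline bool cs_is_ukl(uint64_t cs)
{
	return ((cs >> 4) & 0xFu) == 0xCu;
}
```

Both 0x33 and 0xC3 pass the low-nibble test, and only 0xC3 passes
cs_is_ukl(). With the alternative suggested above (pt_regs->cs holding
__KERNEL_CS, which I believe is 0x10 on x86-64), the low-nibble test would
see a kernel value and no second encoding would be needed.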