From: Sai Praneeth <sai.praneeth.prakhya@xxxxxxxxx> Problem statement: ------------------ Presently, efi_runtime_services() silently switch %cr3 from swapper_pgd to efi_pgd. As a consequence, kernel code that runs in efi_pgd (e.g., perf code via an NMI) will have incorrect user space mappings[1]. This could lead to otherwise unexpected access errors and, worse, unauthorized access to firmware code and data. Detailed discussion of problem statement: ----------------------------------------- As this switch is not propagated to other kernel subsystems; they will wrongly assume that swapper_pgd is still in use and it can lead to following issues: 1. If kernel code tries to access user space addresses while in efi_pgd, it could lead to unauthorized accesses to firmware code/data. (e.g: <__>/copy_from_user_nmi()). [This could also be disastrous if the frame pointer happens to point at MMIO in the EFI runtime mappings] - Mark Rutland. An example of a subsystem that could touch user space while in efi_pgd is perf. Assume that we are in efi_pgd, a user could use perf to profile some user data and depending on the address the user is trying to profile, two things could happen. 1. If the mappings are absent, perf fails to profile. 2. If efi_pgd does have mappings for the requested address (these mappings are erroneous), perf profiles firmware code/data. If the address is MMIO'ed, perf could have potentially changed some device state. The culprit in both the cases is, EFI subsystem swapping out pgd and not perf. Because, EFI subsystem has broken the *general assumption* that all other subsystems rely on - "user space might be valid and nobody has switched %cr3". Solutions: ---------- There are two ways to fix this issue: 1. Educate about pgd change to *all* the subsystems that could potentially access user space while in efi_pgd. On x86, AFAIK, it could happen only when some one touches user space from the back of an NMI (a quick audit on <__>/copy_from_user_nmi, showed perf and oprofile). On arm, it could happen from multiple places as arm runs efi_runtime_services() interrupts enabled (ARM folks, please comment on this as I might be wrong); whereas x86 runs efi_runtime_services() interrupts disabled. I think, this solution isn't holistic because a. Any other subsystem might well do the same, if not now, in future. b. Also, this solution looks simpler on x86 but not true if it's the same for arm (ARM folks, please comment on this as I might be wrong). c. This solution looks like a work around rather than addressing the issue. 2. Running efi_runtime_services() in kthread context. This makes sense because efi_pgd doesn't have user space and kthread by definition means that user space is not valid. Any kernel code that tries to touch user space while in kthread is buggy in itself. If so, it should be an easy fix in the other subsystem. This also take us one step closer to long awaiting proposal of Andy - Running EFI at CPL 3. What does this patch set do? ---------------------------- Introduce efi_rts_wq (EFI runtime services work queue). When a user process requests the kernel to execute any efi_runtime_service(), kernel queues the work to efi_rts_wq, a kthread comes along, switches to efi_pgd and executes efi_runtime_service() in kthread context. IOW, this patch set adds support to the EFI subsystem to handle all calls to efi_runtime_services() using a work queue (which in turn uses kthread). How running efi_runtime_services() in kthread fixes above discussed issues? --------------------------------------------------------------------------- If we run efi_runtime_services() in kthread context and if perf checks for it, we could get both the above scenarios correct by perf aborting the profiling. Not only perf, but any subsystem that tries to touch user space should first check for kthread context and if so, should abort. Q. If we still need check for kthread context in other subsystems that access user space, what does this patch set fix? A. This patch set makes sure that EFI subsystem is not at fault. Without this patch set the blame is upon EFI subsystem, because it's the one that changed pgd and hasn't communicated this change to everyone and hence broke the general assumption. Running efi_runtime_services() in kthread means explicitly communicating that user space is invalid, now it's the responsibility of other subsystem to make sure that it's running in right context. Testing: -------- Tested using LUV (Linux UEFI Validation) for x86_64, x86_32 and arm64 (qemu only). Will appreciate the effort if someone could test the patches on real ARM/ARM64 machines. LUV: https://01.org/linux-uefi-validation Credits: -------- Thanks to Ricardo, Dan, Miguel, Mark, Peter Z and Ard for reviews and suggestions. Thanks to Boris and Andy for making me think through/help on what I am addressing with this patch set. Please feel free to pour in your comments and concerns. Note: ----- Patches are based on Linus's kernel v4.17-rc7 [1] Backup: Detailing efi_pgd: ------------------------------ efi_pgd has mappings for EFI Runtime Code/Data (on x86, plus EFI Boot time Code/Data) regions. Due to the nature of these mappings, they fall in user space address ranges and they are not the same as swapper. [On arm64, the EFI mappings are in the VA range usually used for user space. The two halves of the address space are managed by separate tables, TTBR0 and TTBR1. We always map the kernel in TTBR1, and we map user space or EFI runtime mappings in TTBR0.] - Mark Rutland Changes from V4 to V5: ---------------------- 1. As suggested by Ard, don't use efi_rts_wq for non-blocking variants. Non-blocking variants are supposed to not block and using workqueue exactly does the opposite, hence refrain from using it. 2. Use non-blocking variants in efi_delete_dummy_variable(). Use of blocking variants means that we have to call efi_delete_dummy_variable() after efi_rts_wq has been created. 3. Remove in_atomic() check in set_variable<>() and query_variable_info<>(). Any caller wishing to use set_variable() and query_variable_info() in atomic context should use their non-blocking variants. Changes from V3 to V4: ---------------------- 1. As suggested by Peter, use completions instead of flush_work() as the former is cheaper 2. Call efi_delete_dummy_variable() from efisubsys_init(). Sorry! Ard, wasn't able to find a better alternative to keep this change local to arch/x86. Changes from V2 to V3: ---------------------- 1. Rewrite the cover letter to clearly state the problem. What we are fixing and what we are not fixing. 2. Make efi_delete_dummy_variable() change local to x86. 3. Avoid using BUG(), instead, print error message and exit gracefully. 4. Move struct efi_runtime_work to runtime-wrappers.c file. 5. Give enum a name (efi_rts_ids) and use it in efi_runtime_work. 6. Add Naresh (maintainer of LUV for ARM) and Miguel to the CC list. Changes from V1 to V2: ---------------------- 1. Remove unnecessary include of asm/efi.h file - Fixes build error on ia64, reported by 0-day 2. Use enum to identify efi_runtime_services() 3. Use alloc_ordered_workqueue() to create efi_rts_wq as create_workqueue() is scheduled for depreciation. 4. Make efi_call_rts() static, as it has no callers outside runtime-wrappers.c 5. Use BUG(), when we are unable to queue work or unable to identify requested efi_runtime_service() - Because these two situations should *never* happen. Sai Praneeth (3): x86/efi: Make efi_delete_dummy_variable() use set_variable_nonblocking() instead of set_variable() efi: Create efi_rts_wq and efi_queue_work() to invoke all efi_runtime_services() efi: Use efi_rts_wq to invoke EFI Runtime Services arch/x86/platform/efi/quirks.c | 11 +- drivers/firmware/efi/efi.c | 14 ++ drivers/firmware/efi/runtime-wrappers.c | 218 +++++++++++++++++++++++++++++--- include/linux/efi.h | 3 + 4 files changed, 224 insertions(+), 22 deletions(-) Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@xxxxxxxxx> Suggested-by: Andy Lutomirski <luto@xxxxxxxxxx> Cc: Lee Chun-Yi <jlee@xxxxxxxx> Cc: Borislav Petkov <bp@xxxxxxxxx> Cc: Tony Luck <tony.luck@xxxxxxxxx> Cc: Will Deacon <will.deacon@xxxxxxx> Cc: Dave Hansen <dave.hansen@xxxxxxxxx> Cc: Mark Rutland <mark.rutland@xxxxxxx> Cc: Bhupesh Sharma <bhsharma@xxxxxxxxxx> Cc: Naresh Bhat <naresh.bhat@xxxxxxxxxx> Cc: Ricardo Neri <ricardo.neri@xxxxxxxxx> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx> Cc: Ravi Shankar <ravi.v.shankar@xxxxxxxxx> Cc: Matt Fleming <matt@xxxxxxxxxxxxxxxxxxx> Cc: Dan Williams <dan.j.williams@xxxxxxxxx> Cc: Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx> Cc: Miguel Ojeda <miguel.ojeda.sandonis@xxxxxxxxx> -- 2.7.4 -- To unsubscribe from this list: send the line "unsubscribe linux-efi" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html