On 25/05/2023 00:20, Edgecombe, Rick P wrote:
On Fri, 2023-05-05 at 17:20 +0200, Mickaël Salaün wrote:
# How does it work?
This implementation mainly leverages KVM capabilities to control the
Second
Layer Address Translation (or the Two Dimensional Paging e.g.,
Intel's EPT or
AMD's RVI/NPT) and Mode Based Execution Control (Intel's MBEC)
introduced with
the Kaby Lake (7th generation) architecture. This allows to set
permissions on
memory pages in a complementary way to the guest kernel's managed
memory
permissions. Once these permissions are set, they are locked and
there is no
way back.
A first KVM_HC_LOCK_MEM_PAGE_RANGES hypercall enables the guest
kernel to lock
a set of its memory page ranges with either the HEKI_ATTR_MEM_NOWRITE
or the
HEKI_ATTR_MEM_EXEC attribute. The first one denies write access to a
specific
set of pages (allow-list approach), and the second only allows kernel
execution
for a set of pages (deny-list approach).
The current implementation sets the whole kernel's .rodata (i.e., any
const or
__ro_after_init variables, which includes critical security data such
as LSM
parameters) and .text sections as non-writable, and the .text section
is the
only one where kernel execution is allowed. This is possible thanks
to the new
MBEC support also brough by this series (otherwise the vDSO would
have to be
executable). Thanks to this hardware support (VT-x, EPT and MBEC),
the
performance impact of such guest protection is negligible.
The second KVM_HC_LOCK_CR_UPDATE hypercall enables guests to pin some
of its
CPU control register flags (e.g., X86_CR0_WP, X86_CR4_SMEP,
X86_CR4_SMAP),
which is another complementary hardening mechanism.
Heki can be enabled with the heki=1 boot command argument.
Can the guest kernel ask the host VMM's emulated devices to DMA into
the protected data? It should go through the host userspace mappings I
think, which don't care about EPT permissions. Or did I miss where you
are protecting that another way? There are a lot of easy ways to ask
the host to write to guest memory that don't involve the EPT. You
probably need to protect the host userspace mappings, and also the
places in KVM that kmap a GPA provided by the guest.
Good point, I'll check this confused deputy attack. Extended KVM
protections should indeed handle all ways to map guests' memory. I'm
wondering if current VMMs would gracefully handle such new restrictions
though.
[ snip ]
# Current limitations
The main limitation of this patch series is the statically enforced
permissions. This is not an issue for kernels without module but this
needs to
be addressed. Mechanisms that dynamically impact kernel executable
memory are
not handled for now (e.g., kernel modules, tracepoints, eBPF JIT),
and such
code will need to be authenticated. Because the hypervisor is highly
privileged and critical to the security of all the VMs, we don't want
to
implement a code authentication mechanism in the hypervisor itself
but delegate
this verification to something much less privileged. We are thinking
of two
ways to solve this: implement this verification in the VMM or spawn a
dedicated
special VM (similar to Windows's VBS). There are pros on cons to each
approach:
complexity, verification code ownership (guest's or VMM's), access to
guest
memory (i.e., confidential computing).
The kernel often creates writable aliases in order to write to
protected data (kernel text, etc). Some of this is done right as text
is being first written out (alternatives for example), and some happens
way later (jump labels, etc). So for verification, I wonder what stage
you would be verifying? If you want to verify the end state, you would
have to maintain knowledge in the verifier of all the touch-ups the
kernel does. I think it would get very tricky.
For now, in the static kernel case, all rodata and text GPA is
restricted, so aliasing such memory in a writable way before or after
the KVM enforcement would still restrict write access to this memory,
which could be an issue but not a security one. Do you have such
examples in mind?
It also seems it will be a decent ask for the guest kernel to keep
track of GPA permissions as well as normal virtual memory pemirssions,
if this thing is not widely used.
This would indeed be required to properly handle the dynamic cases.
So I wondering if you could go in two directions with this:
1. Make this a feature only for super locked down kernels (no modules,
etc). Forbid any configurations that might modify text. But eBPF is
used for seccomp, so you might be turning off some security protections
to get this.
Good idea. For "super locked down kernels" :) , we should disable all
kernel executable changes with the related kernel build configuration
(e.g. eBPF JIT, kernel module, kprobes…) to make sure there is no such
legitimate access. This looks like an acceptable initial feature.
2. Loosen the rules to allow the protections to not be so one-way
enable. Get less security, but used more widely.
This is our goal. I think both static and dynamic cases are legitimate
and have value according to the level of security sought. This should be
a build-time configuration.
There were similar dilemmas with the PV CR pinning stuff.