Hi Andy,
On 12/3/2021 11:28 AM, Andy Lutomirski wrote:
On 12/1/21 11:23, Reinette Chatre wrote:
Enclave creators declare their paging permission intent at the time
the pages are added to the enclave. These paging permissions are
vetted when pages are added to the enclave and stashed off
(in sgx_encl_page->vm_max_prot_bits) for later comparison with
enclave PTEs.
I'm a bit confused here. ENCLU[EMODPE] allows the enclave to change the
EPCM permission bits however it likes with no oversight from the kernel.
So we end up with a whole bunch of permission masks:
Before jumping to the permission masks I would like to step back and
just confirm the context. We need to consider the following three
permissions:
EPCM permissions: the enclave page permissions maintained in the SGX
hardware. The OS is constrained here in that it cannot query the current
EPCM permissions. Even so, the OS needs to ensure PTEs are installed
appropriately (we do not want a RW PTE for a read-only enclave page) and
thus the OS keeps its own record of EPCM permissions to support this.
As you note later even in current kernel the enclave can change these
permissions without OS knowing. EPCM permissions can only be relaxed
without the OS knowledge though so the OS record of EPCM permissions can
only ever be stricter than the actual EPCM permissions.
VMA permissions: Current behavior (not changed in this series) is that
the OS enforces that a new VMA should have the same or weaker
permissions than the EPCM permissions.
Page table entries: These should match the EPCM permissions without
exceeding VMA permissions.
The PTE: controlled by complex kernel policy
The VMA: with your series, this is entirely controlled by userspace. I
think I'm fine with that.
vm_max_prot_bits: populated from secinfo at setup time, unless I missed
something that changes it later. Maybe I'm confused or missed something
in one of the patches,
Yes, vm_max_prot_bits is currently and continues to be populated from
secinfo for pages added before the enclave is initialized and in a later
patch it would be hardcoded to RW for pages that are added after the
enclave is initialized. In the current implementation vm_max_prot_bits
is the OS's record of the EPCM permissions used to guide VMA and PTE
permissions.
On a higher level, the implementation decision is that vm_max_prot_bits
is the static "vetted" permissions of a page - the maximum permissions a
page is allowed to have during its entire lifetime. This matches the
current implementation. In the current implementation permissions are
only able to change via VMA and PTE ... for example a read-only VMA can
access an enclave page with vm_max_prot_bits of RW. With the SGX2
support permission changes are allowed to EPCM permissions - but in this
implementation they are not allowed to exceed the originally vetted
vm_max_prot_bits.
In this SGX2 implementation an enclave page could thus be added to an
enclave with secinfo and vm_max_prot_bits of RW that would only allow
that page to have R or RW permissions (VMA, PTE, and OS view of EPCM
permissions) in its lifetime, never RX or RWX. Yes, it may be possible
for the enclave to change the EPCM permissions from within the enclave
using ENCLU[EMODPE] but to access the page the enclave would need the OS
to install the appropriate PTE and the OS would not do so if
vm_max_prot_bits does not allow it. Neither would the OS allow an
executable VMA.
vm_run_prot_bits: populated from some combination of ioctls. I'm
entirely lost as to what this is for.
With SGX2 it is possible to change the EPCM permissions of an enclave
page after the enclave is initialized. vm_max_prot_bits would provide
guidance to what permissions a page is allowed to have while
vm_run_prot_bits contains the current view of EPCM permissions used by
the OS to guide whether requested VMA permissions are allowed and guide
what PTE permissions should be.
Consider this example how vm_max_prot_bits and vm_run_prot_bits are used:
(1) Add enclave page with secinfo of RW to uninitialized enclave
vm_max_prot_bits = RW
vm_run_prot_bits = RW
(2) User space runs SGX_IOC_PAGE_MODP to change the permissions to read-
only. This is allowed because vm_max_prot_bits = RW. Now:
vm_max_prot_bits = RW
vm_run_prot_bits = R
Now VMAs are created and PTEs installed based on value of
vm_run_prot_bits - write access will not be allowed.
(3) User space runs SGX_IOC_PAGE_MODP to change the permissions to RX.
This will be denied because vm_max_prot_bits = RW.
(3) User space runs SGX_IOC_PAGE_MODP to change the permissions to RW.
This will be allowed because vm_max_prot_bits = RW.
Now VMAs are created and PTEs installed based on value of
vm_run_prot_bits - write access will again be allowed.
EPCM bits: controlled by the guest. basically useless for any host
purpose on SGX2 hardware (with or without kernel support -- the enclave
can do ENCLU[EMODPE] whether we like it or not, even on old kernels)
Indeed - permissions can only be relaxed without the OS knowledge so the
OS's view would always be the same or stricter than the enclave.
So I guess I don't understand the purpose of this patch or of the
rules in the later patches, and I feel like this is getting more
complicated than makes sense.
Could we perhaps make vm_max_prot_bits dynamic or at least controllable
in some useful way? My initial proposal (years ago) was for
vm_max_prot_bits to be *separately* configured at initial load time
instead of being inferred from secinfo with the intent being that the
user untrusted runtime would set it appropriately. I have no problem
with allowing runtime changes as long as the security policy makes sense
and it's kept consistent with PTEs.
This SGX2 enabling indeed builds on the current implementation where
vm_max_prot_bits is inferred from secinfo. The intent is for
vm_max_prot_bits to reflect the maximum allowed vetted permissions.
At this time vm_max_prot_bits is indeed static and thus creates the need
for (dynamic) vm_run_prot_bits that reflects the current EPCM
permissions and guides VMA and PTE permissions while vm_max_prot_bits
guides new permission requests. From what I understand this
implementation follows the current security policy - permissions are
never allowed to exceed the originally vetted permissions. PTEs are kept
consistent in that they match the (vetted, OS view of) EPCM permissions.
Also, I think we need a changelog message or, even better, actual docs
in kernel, explaining the actual final set of rules and invariants for
all these masks.
I will add a section to Documentation/x86/sgx.rst.
Reinette