On Thu, 2023-06-22 at 15:01 -0700, Sean Christopherson wrote: > On Mon, Jun 19, 2023, Kai Huang wrote: > > On Fri, 2023-06-16 at 15:05 -0700, Sean Christopherson wrote: > > > But I think this is unrelated to what you really care about, e.g. a userspace that > > > tightly controls its virtual memory could hardcode enclave placement (IIRC graphene > > > did/does do that). I.e. the alignment issue is a completely different discussion. > > > > > > [*] https://lore.kernel.org/all/20190522153836.GA24833@xxxxxxxxxxxxxxx > > > > Right. For the sake of making progress of this Haitao's series, we want to > > understand where did "SGX driver requires mmap(/dev/sgx_enclave) to use > > MAP_ANONYMOUS semantics" come from. > > > > But for this topic (how to reserve ELRANGE). I am not sure the reason that we > > prefer mmap(MAP_ANONYMOUS) still stands. For instance, the discussion in your > > above link was old implementation/assumption -- e.g., MAP_FIXED wasn't even > > used/supported for mmap()ing enclave chunks. > > > > So I am wondering now whether we suggest user to use mmap(MAP_ANONYMOUS) to get > > a size-aligned address still stands? The current SGX driver's > > get_unmapped_area: > > > > static unsigned long sgx_get_unmapped_area(struct file *file, > > unsigned long addr, > > unsigned long len, > > unsigned long pgoff, > > unsigned long flags) > > { > > if ((flags & MAP_TYPE) == MAP_PRIVATE) > > return -EINVAL; > > > > if (flags & MAP_FIXED) > > return addr; > > > > return current->mm->get_unmapped_area(file, addr, len, pgoff, flags); > > } > > > > > > As you can see if we don't pass MAP_FIXED, which is the case for mmap() to > > reserve ELRANGE, looks there's no difference between mmap(MAP_ANONYMOUS) and > > mmap(/dev/sgx_enclave)? > > In practice, there's *probably* no significant difference. Using an anonymous > mapping is advantageous because there's no additional overhead, e.g. for locking > the file, it can be done in advance of actually opening /dev/sgx_enclave, helps > document (in userspace) that the code isn't actually creating an enclave, just > finding a naturally aligned area in virtual memory (which isn't SGX specific), etc. Sure, and I agree this perhaps is cleaner and simpler for SGX. But for this purpose I think we should just enforce this policy in SGX driver, i.e., we should disable mmap(/dev/sgx_enclave) before ECREATE. To me at least the SGX driver should be clear whether it supports userspace to use mmap(/dev/sgx_enclave) to reserve ELRANGE. There should be no ambiguity here. The current driver only disallows mmap(/dev/sgx_enclave) outside of enclave range after EINIT: int sgx_encl_may_map(struct sgx_encl *encl, unsigned long start, unsigned long end, unsigned long vm_flags) { ... /* Disallow mapping outside enclave's address range. */ if (test_bit(SGX_ENCL_INITIALIZED, &encl->flags) && (start < encl->base || end > encl->base + encl->size)) return -EACCES; } which has what value btw? > > There are definitely differences, e.g. an LSM could restrict access to > /dev/sgx_enclave. That particular one is obviously a "don't care", > I think you meant "RW->RX" requires EXECMOD, etc? Yeah I am not sure whether this matters to reserving ELRANGE. That part should apply to the mmap(/dev/sgx_enclave) after ECREATE. > but it's quite > difficult to say that mmap(/dev/sgx_enclave) and mmap(NULL) are equivalent due to > the amount of code that's involved in handling mmap(). Sure. I was talking about only from "allocating size-aligned" part. Arguably, in terms of reserving ELRANGE, userspace only cares about: 1) The ELRANGE is big enough to hold the enclave 2) The ELRANGE meets alignment requirement 3) Usperspace can mmap(/dev/sgx_enclave) and/or mprotect() small areas to get the correct enclave addr within ELRANGE with desired permission > > > > > Also, if I am not missing something, the current driver doesn't prevent using > > > > mmap(/dev/sgx_enclave, PROT_NONE) to reserve ELANGE. So having clear > > > > documentation is helpful for SGX users to choose how to write their apps. > > > > > > > > Go back to the "SGX driver uses MAP_ANONYMOUS semantics for mmap()", I believe > > > > this just is "SGX driver requires mmap() after ECREATE/EINIT to use MAP_SHARED | > > > > MAP_FIXED and pgoff is ignored". Or more precisely, pgoff is "not _used_ by SGX > > > > driver". > > > > > > > > In fact, I think "pgoff is ignored/not used" is technically wrong for enclave. > > > > > > Yeah, it's wrong. It works because, until now apparently, there was never a reason > > > a need to care about the file offset since ELRANGE base always provided the necessary > > > information. It wasn't a deliberate design choice, we simply overlooked that detail > > > (unless Jarkko was holding back on me all those years ago :-) ). > > > > Yeah. But I am not sure whether there are corner cases that we can have > > potential bug around here, since those VMA's are not aligned between the core-mm > > and the driver. > > > > I haven't thought carefully though. I guess from core-mm's perspective there > > are multi-VMAs mapping to the same/overlapping part of the enclave, while the > > SGX driver treats them as mapping to different non-overlapping parts. Perhaps > > it's OK as long as SGX driver doesn't use vma->pgoff to do something??? > > Ya, exactly. > > > > > IMHO we should stop saying SGX driver uses MAP_ANONYMOUS semantics, because the > > > > truth is it just takes advantage of MAP_FIXED and carelessly ignores the pgoff > > > > due to the nature of SGX w/o considering from core-MM's perspective. > > > > > > > > And IMHO there are two ways to fix: > > > > > > > > 1) From now on, we ask SGX apps to use the correct pgoff in their > > > > mmap(/dev/sgx_enclave). This shouldn't impact the existing SGX apps because SGX > > > > driver doesn't use vma->pgoff anyway. > > > > > > Heh, just "asking" won't help. And breaking userspace, i.e. requiring all apps > > > to update, will just get you yelled at :-) > > > > We can document properly and assume the new apps will follow. As I mentioned > > above, the old apps, which doesn't/shouldn't have been using fadvice() anyway, > > doesn't need to be changed, i.e., they should continue to work. :) > > > > No? > > You can't build an ABI on assumptions. E.g. even if userspace *intends* to behave, > it wouldn't take much to compute the wrong offset (math is hard). But I don't think we have an well established ABI now? Nothing is documented. No? > > > > > 2) For the sake of avoiding having to ask existing SGX apps to change their > > > > mmap()s, we _officially_ say that userspace isn't required to pass a correct > > > > pgoff to mmap() (i.e. passing 0 as did in existing apps), but the kernel should > > > > fix the vma->pgoff internally. > > > > > > I recommend you don't do this. Overwriting vm_pgoff would probably work, but it's > > > going to make a flawed API even messier. E.g. you'll have painted SGX into a > > > corner if you ever want to decouple vma->start/end from encl->base. I highly > > > doubt that will ever happen given how ELRANGE works, but I don't think a hack-a-fix > > > buys you enough to justify any more ugliness. > > > > I also found it's not feasible to cleanly fix the pgoff from userspace (I > > thought the pgoff could be fixed at very early stage of do_mmap() so everything > > later just worked, but it's not the case). In do_mmap() the pgoff from > > userspace is already used to VMA merging/splitting etc before creating the > > target VMA. > > > > Hmm.. Now I think we shouldn't silently fix pgoff in SGX driver as it may > > surprise the core-mm later because the core-mm has already done some job around > > VMAs before vma->pgoff is changed. > > > > > > > > > I do prefer option 2) because it has no harm to anyone: 1) No changes to > > > > existing SGX apps; 2) It aligns with the core-MM to so all existing mmap() > > > > operations should work as expected, meaning no surprise; 3) And this patchset > > > > from Haitao to use fadvice() to accelerate EAUG flow just works. > > > > > > I think you can have your cake and eat it too. IIUC, the goal is to get fadvise() > > > working, and to do that without creating a truly heinous uAPI, you need an accurate > > > vm_pgoff. So, use a carrot and stick approach. > > > > > > If userspace properly defines vm_pgoff during mmap() and doesn't specify MAP_ANONYMOUS, > > > then they get to use fadvise() (give 'em a carrot). But if *any* mmap() on the > > > enclave doesn't followo those rules, mark the enclave as tainted or whatever and > > > disallow fadvise() (hit 'em with a stick). > > > > If we are supposed to use mmap(MAP_ANONYMOUS) to reserve ELRANGE, then I think > > userspace will just use MAP_FIXED for all mmap()s to /dev/sgx_enclave. > > There is no "supposed" to. Using mmap(NULL) is purely a suggestion to avoid > running into overheads and checks that are ultimately unnecessary. The only > requirement is that userspace provide a compatible chunk for virtual memory, > *how* userspace finds that chunk can be done in a number of ways. mmap(NULL) > just happens to be the most convenient one (IMO). Even this, I think we should document it IMO, at least whether mmap(/dev/sgx_enclave) can be used to reserve ELRANGE. > > > To detect whether a VMA has the correct pgoff, we need to somehow mark it in > > sgx_mmap(), but currently we don't have facility to do that. Even we can do, > > the core madvice() -> vfs_fadvice() cannot be aware of such VMA either, and > > in sgx_fadvice() we may have problem locating the correct VMA to do EAUG > > (because the vfs_fadvice() uses 0 pgoff to calculate the file offset). > > > > So, now I think we should just let userspace itself to pass the correct pgoff > > when it wants to use fadvice(), which is the option 1) I mentioned above. The > > kernel doesn't/shouldn't need to fix vma->pgoff. > > > > In another words, I think: > > > > - For the old apps, doesn't matter, continue to work; > > - For new apps which want to use fadvice() to accelerate EAUG, it should pass� > > the correct pgoff in mmap(/dev/sgx_enclave) even with MAP_FIXED. > > > > And we don't say "SGX driver uses MAP_ANONYMOUS semantics for > > mmap(/dev/sgx_enclave) any more". > > > > Does this make sense? > > Yes, but as above, IMO the kernel needs to enforce that userspace has properly > set the offset, otherwise userspace will end up with completely nonsensical > behavior if it fails to set the correct offset. The proposed sgx_fadvise() already > takes mmap_lock for read, i.e. is mutually exclusive with sgx_mmap(), so encforcing > the requirement should be quite straightforward, e.g. > > mmap_read_lock(current->mm); > > vma = find_vma(current->mm, start); The "start" here may already be wrong because before vfs_fadvice() uses VMA's pgoff to calculate the start before passing it to sgx_fadvice(). > if (!vma) > goto unlock; > if (vma->vm_private_data != encl) > goto unlock; > if (encl->has_mismatched_offsets) <====== > goto unlock; Sorry I am a little bit slow, how do you set "has_mismatched_offsests" to true? > > pos = start; > if (pos < vma->vm_start || end > vma->vm_end) { > /* Don't allow any gaps */ > goto unlock; > }