On Tue, Oct 30, 2018 at 2:02 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>
>
>> On Oct 30, 2018, at 1:43 PM, Igor Stoppa <igor.stoppa@xxxxxxxxx> wrote:
>>
>>> On 30/10/2018 21:20, Matthew Wilcox wrote:
>>>> On Tue, Oct 30, 2018 at 12:28:41PM -0600, Tycho Andersen wrote:
>>>>> On Tue, Oct 30, 2018 at 10:58:14AM -0700, Matthew Wilcox wrote:
>>>>> On Tue, Oct 30, 2018 at 10:06:51AM -0700, Andy Lutomirski wrote:
>>>>>>> On Oct 30, 2018, at 9:37 AM, Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>>>>>> I support the addition of a rare-write mechanism to the upstream kernel.
>>>>>> And I think that there is only one sane way to implement it: using an
>>>>>> mm_struct. That mm_struct, just like any sane mm_struct, should only
>>>>>> differ from init_mm in that it has extra mappings in the *user* region.
>>>>>
>>>>> I'd like to understand this approach a little better. In a syscall path,
>>>>> we run with the user task's mm. What you're proposing is that when we
>>>>> want to modify rare data, we switch to rare_mm which contains a
>>>>> writable mapping to all the kernel data which is rare-write.
>>>>>
>>>>> So the API might look something like this:
>>>>>
>>>>> void *p = rare_alloc(...); /* writable pointer */
>>>>> p->a = x;
>>>>> q = rare_protect(p); /* read-only pointer */
>>
>> With pools and memory allocated from vmap_areas, I was able to say
>>
>> protect(pool)
>>
>> and that would do a swipe on all the pages currently in use.
>> In the SELinux policyDB, for example, one doesn't really want to individually protect each allocation.
>>
>> The loading phase happens usually at boot, when the system can be assumed to be sane (one might even preload a bare-bone set of rules from initramfs and then replace it later on, with the full blown set).
>>
>> There is no need to process each of these tens of thousands allocations and initialization as write-rare.
>>
>> Would it be possible to do the same here?
>
> I don’t see why not, although getting the API right will be a tad complicated.
>
>>
>>>>>
>>>>> To subsequently modify q,
>>>>>
>>>>> p = rare_modify(q);
>>>>> q->a = y;
>>>>
>>>> Do you mean
>>>>
>>>> p->a = y;
>>>>
>>>> here? I assume the intent is that q isn't writable ever, but that's
>>>> the one we have in the structure at rest.
>>> Yes, that was my intent, thanks.
>>> To handle the list case that Igor has pointed out, you might want to
>>> do something like this:
>>> list_for_each_entry(x, &xs, entry) {
>>> struct foo *writable = rare_modify(entry);
>>
>> Would this mapping be impossible to spoof by other cores?
>>
>
> Indeed. Only the core with the special mm loaded could see it.
>
> But I dislike allowing regular writes in the protected region. We really only need four write primitives:
>
> 1. Just write one value. Call at any time (except NMI).
>
> 2. Just copy some bytes. Same as (1) but any number of bytes.
>
> 3,4: Same as 1 and 2 but must be called inside a special rare write region. This is purely an optimization.
>
> Actually getting a modifiable pointer should be disallowed for two reasons:
>
> 1. Some architectures may want to use a special write-different-address-space operation. Heck, x86 could, too: make the actual offset be a secret and shove the offset into FSBASE or similar. Then %fs-prefixed writes would do the rare writes.
>
> 2. Alternatively, x86 could set the U bit. Then the actual writes would use the uaccess helpers, giving extra protection via SMAP.
>
> We don’t really want a situation where an unchecked pointer in the rare write region completely defeats the mechanism.

We still have to deal with certain structures under the write-rare window. For example, see:

https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=kspp/write-rarely&id=60430b4d3b113aae4adab66f8339074986276474

They are wrappers to non-inline functions that have the same sanity-checking.

--
Kees Cook
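
For illustration only, here is a user-space sketch of the four write primitives listed above. It is an analogy, not kernel code: mprotect() on an anonymous mapping stands in for switching to the special mm, and every name in it (rare_pool, rare_begin, rare_end, rare_write_long, rare_write_bytes and the *_in_region variants) is hypothetical and does not exist in the kernel. The point it tries to show is that callers never receive a writable pointer; they pass an offset, and each write is bounds-checked against the pool.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical pool: one anonymous mapping that is read-only at rest. */
struct rare_pool {
	void *base;
	size_t len;
};

static int rare_pool_init(struct rare_pool *pool, size_t len)
{
	long page = sysconf(_SC_PAGESIZE);
	void *base;

	assert(pool != NULL);
	if (page <= 0 || len == 0)
		return -1;
	/* Round up to whole pages, since protection is per page. */
	len = (len + (size_t)page - 1) / (size_t)page * (size_t)page;
	base = mmap(NULL, len, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return -1;
	pool->base = base;
	pool->len = len;
	return 0;
}

/* Open and close the "rare write region" (stand-in for the mm switch). */
static int rare_begin(struct rare_pool *pool)
{
	return mprotect(pool->base, pool->len, PROT_READ | PROT_WRITE);
}

static int rare_end(struct rare_pool *pool)
{
	return mprotect(pool->base, pool->len, PROT_READ);
}

/* Primitive 4: copy bytes; only valid between rare_begin() and rare_end(). */
static int rare_write_bytes_in_region(struct rare_pool *pool, size_t off,
				      const void *src, size_t n)
{
	/* Reject any write that would fall outside the pool. */
	if (off > pool->len || n > pool->len - off)
		return -1;
	memcpy((char *)pool->base + off, src, n);
	return 0;
}

/* Primitive 3: write one value; only valid inside the region. */
static int rare_write_long_in_region(struct rare_pool *pool, size_t off,
				     long val)
{
	return rare_write_bytes_in_region(pool, off, &val, sizeof(val));
}

/* Primitive 2: copy bytes; opens and closes the region itself. */
static int rare_write_bytes(struct rare_pool *pool, size_t off,
			    const void *src, size_t n)
{
	int rc;

	if (rare_begin(pool) != 0)
		return -1;
	rc = rare_write_bytes_in_region(pool, off, src, n);
	if (rare_end(pool) != 0)
		return -1;
	return rc;
}

/* Primitive 1: write one value; opens and closes the region itself. */
static int rare_write_long(struct rare_pool *pool, size_t off, long val)
{
	return rare_write_bytes(pool, off, &val, sizeof(val));
}
```

A caller would do rare_pool_init(&pool, 64) once and then rare_write_long(&pool, 0, 42) for an isolated update, or bracket several *_in_region calls with rare_begin()/rare_end() to pay for the permission change only once. Unlike the per-core mm proposal, mprotect() makes the mapping writable for every thread in the process, so this sketch does not reproduce the "only this core can see it" property discussed above.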