On Sat, Oct 6, 2018 at 10:03 AM Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>
> There's one PTI related layout asymmetry I noticed between 4-level and 5-level kernels:
>
> 47-bit:
>
> +                                                            |
> +                                                            | Kernel-space virtual memory, shared between all processes:
> +____________________________________________________________|___________________________________________________________
> +                  |           |                  |         |
> + ffff800000000000 | -128   TB | ffff87ffffffffff |    8 TB | ... guard hole, also reserved for hypervisor
> + ffff880000000000 | -120   TB | ffffc7ffffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
> + ffffc80000000000 |  -56   TB | ffffc8ffffffffff |    1 TB | ... unused hole
> + ffffc90000000000 |  -55   TB | ffffe8ffffffffff |   32 TB | vmalloc/ioremap space (vmalloc_base)
> + ffffe90000000000 |  -23   TB | ffffe9ffffffffff |    1 TB | ... unused hole
> + ffffea0000000000 |  -22   TB | ffffeaffffffffff |    1 TB | virtual memory map (vmemmap_base)
> + ffffeb0000000000 |  -21   TB | ffffebffffffffff |    1 TB | ... unused hole
> + ffffec0000000000 |  -20   TB | fffffbffffffffff |   16 TB | KASAN shadow memory
> + fffffc0000000000 |   -4   TB | fffffdffffffffff |    2 TB | ... unused hole
> +                  |           |                  |         | vaddr_end for KASLR
> + fffffe0000000000 |   -2   TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
> + fffffe8000000000 | -1.5   TB | fffffeffffffffff |  0.5 TB | LDT remap for PTI
> + ffffff0000000000 |   -1   TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
> +__________________|____________|__________________|_________|____________________________________________________________
> +                                                            |
>
> 56-bit:
>
> +                                                            |
> +                                                            | Kernel-space virtual memory, shared between all processes:
> +____________________________________________________________|___________________________________________________________
> +                  |           |                  |         |
> + ff00000000000000 |  -64   PB | ff0fffffffffffff |    4 PB | ... guard hole, also reserved for hypervisor
> + ff10000000000000 |  -60   PB | ff8fffffffffffff |   32 PB | direct mapping of all physical memory (page_offset_base)
> + ff90000000000000 |  -28   PB | ff9fffffffffffff |    4 PB | LDT remap for PTI
> + ffa0000000000000 |  -24   PB | ffd1ffffffffffff | 12.5 PB | vmalloc/ioremap space (vmalloc_base)
> + ffd2000000000000 | -11.5  PB | ffd3ffffffffffff |  0.5 PB | ... unused hole
> + ffd4000000000000 |  -11   PB | ffd5ffffffffffff |  0.5 PB | virtual memory map (vmemmap_base)
> + ffd6000000000000 | -10.5  PB | ffdeffffffffffff | 2.25 PB | ... unused hole
> + ffdf000000000000 | -8.25  PB | fffffdffffffffff |   ~8 PB | KASAN shadow memory
> + fffffc0000000000 |   -4   TB | fffffdffffffffff |    2 TB | ... unused hole
> +                  |           |                  |         | vaddr_end for KASLR
> + fffffe0000000000 |   -2   TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
> + fffffe8000000000 | -1.5   TB | fffffeffffffffff |  0.5 TB | ... unused hole
> + ffffff0000000000 |   -1   TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
>
> The two layouts are very similar beyond the shift in the offset and the region sizes, except
> for one big asymmetry: the placement of the LDT remap for PTI.
>
> Is there any fundamental reason why the LDT area is mapped into a 4 petabyte (!) area on
> 56-bit kernels, instead of being at the -1.5 TB offset like on 47-bit kernels?
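As an aside on reading the offset column in those tables: each start
address is just a negative offset sign-extended into a canonical kernel
address, e.g. ffff800000000000 == -2^47 == -128 TB. A minimal user-space
sketch, not kernel code, with a hand-picked sample of addresses from the
tables above:

        #include <stdio.h>
        #include <stdint.h>
        #include <inttypes.h>

        int main(void)
        {
                /* Start-of-region addresses from the layout tables above. */
                const uint64_t addrs[] = {
                        0xffff800000000000ull,  /* 47-bit kernel space: -2^47 */
                        0xff00000000000000ull,  /* 56-bit kernel space: -2^56 */
                        0xfffffe0000000000ull,  /* cpu_entry_area:      -2^41 */
                };

                for (size_t i = 0; i < sizeof(addrs) / sizeof(addrs[0]); i++) {
                        /* Kernel addresses are sign-extended: reinterpret as
                         * signed, then divide down to whole terabytes. */
                        int64_t off = (int64_t)addrs[i];
                        printf("%016" PRIx64 " = %" PRId64 " TB\n",
                               addrs[i], off / (INT64_C(1) << 40));
                }
                return 0;
        }

This prints -128 TB, -65536 TB (i.e. -64 PB) and -2 TB, matching the
offset column.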
>
> The only reason I can see for doing it this way is that it's currently coded at the PGD
> level only:
>
>         static void map_ldt_struct_to_user(struct mm_struct *mm)
>         {
>                 pgd_t *pgd = pgd_offset(mm, LDT_BASE_ADDR);
>
>                 if (static_cpu_has(X86_FEATURE_PTI) && !mm->context.ldt)
>                         set_pgd(kernel_to_user_pgdp(pgd), *pgd);
>         }
>
> ( BTW, the 4 petabyte size of the area is misleading: a 5-level PGD entry covers 256 TB of
>   virtual memory, i.e. 0.25 PB, not 4 PB. So in reality we have a 0.25 PB area there, used up
>   by the LDT mapping in a single PGD entry, plus a 3.75 PB hole after that. )
>
> ... but unless I'm missing something it's not really fundamental for it to be at the PGD
> level - it could be two levels lower as well, and it could move back to the same place
> where it is on the 47-bit kernel.

The subtlety is that, if the mapping is below the PGD level, there end up
being page tables that are private to each LDT-using mm but that map
things other than the LDT. Those tables cover the same address ranges as
corresponding tables in init_mm, and if those init_mm tables change after
the LDT mapping is set up, the changes won't propagate. So it probably
could be made to work, but it would take some extra care.
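To make the non-propagation concrete, here is a self-contained toy model
(plain C, deliberately not kernel code: the four-entry "tables" stand in
for real page tables, and the pointer copy stands in for what
map_ldt_struct_to_user() does when it copies a whole PGD entry):

        #include <stdio.h>
        #include <string.h>

        /* Toy "page table": four entries standing in for 512 real ones. */
        struct table { int entry[4]; };

        int main(void)
        {
                /* A lower-level table owned by init_mm, mapping the LDT
                 * region plus unrelated neighbors in the same range. */
                struct table lower = { { 1, 2, 3, 4 } };

                /* PGD-level aliasing: the other mm references the same
                 * table, so later init_mm updates stay visible. */
                struct table *shared_view = &lower;

                /* Below-PGD aliasing: the LDT-using mm gets its own copy
                 * of the table, private from this point on. */
                struct table snapshot;
                memcpy(&snapshot, &lower, sizeof(lower));

                /* init_mm later changes a neighboring mapping... */
                lower.entry[0] = 42;

                printf("shared (PGD-level) view : %d\n", shared_view->entry[0]);
                printf("private lower-level copy: %d\n", snapshot.entry[0]);
                return 0;
        }

The shared view prints 42 while the private copy still prints 1, which is
the "changes won't propagate" problem: moving the LDT remap below the PGD
level would need either nothing else sharing those table levels or some
explicit resynchronization, i.e. the extra care mentioned above.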