Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone

Alex Ghiti <alex@xxxxxxxx> · Thu, 23 Jul 2020 01:36:45 -0400

Le 7/21/20 à 7:36 PM, Palmer Dabbelt a écrit :
On Tue, 21 Jul 2020 16:11:02 PDT (-0700), benh@xxxxxxxxxxxxxxxxxxx wrote:
On Tue, 2020-07-21 at 14:36 -0400, Alex Ghiti wrote:
> > I guess I don't understand why this is necessary at all.
> > Specifically: why
> > can't we just relocate the kernel within the linear map?  That would
> > let the
> > bootloader put the kernel wherever it wants, modulo the physical
> > memory size we
> > support.  We'd need to handle the regions that are coupled to the
> > kernel's
> > execution address, but we could just put them in an explicit memory
> > region
> > which is what we should probably be doing anyway.
>
> Virtual relocation in the linear mapping requires to move the kernel
> physically too. Zong implemented this physical move in its KASLR RFC
> patchset, which is cumbersome since finding an available physical spot
> is harder than just selecting a virtual range in the vmalloc range.
>
> In addition, having the kernel mapping in the linear mapping prevents
> the use of hugepage for the linear mapping resulting in performance 
loss
> (at least for the GB that encompasses the kernel).
>
> Why do you find this "ugly" ? The vmalloc region is just a bunch of
> available virtual addresses to whatever purpose we want, and as 
noted by
> Zong, arm64 uses the same scheme.

I don't get it :-)

At least on powerpc we move the kernel in the linear mapping and it
works fine with huge pages, what is your problem there ? You rely on
punching small-page size holes in there ?

That was my original suggestion, and I'm not actually sure it's 
invalid.  It
would mean that both the kernel's physical and virtual addresses are set 
by the
bootloader, which may or may not be workable if we want to have an 
sv48+sv39
kernel.  My initial approach to sv48+sv39 kernels would be to just throw 
away
the sv39 memory on sv48 kernels, which would preserve the linear map but 
mean
that there is no single physical address that's accessible for both.  That
would require some coordination between the bootloader and the kernel as to
where it should be loaded, but maybe there's a better way to design the 
linear
map.  Right now we have a bunch of unwritten rules about where things 
need to
be loaded, which is a recipe for disaster.

We could copy the kernel around, but I'm not sure I really like that 
idea.  We
do zero the BSS right now, so it's not like we entirely rely on the 
bootloader
to set up the kernel image, but with the hart race boot scheme we have 
right
now we'd at least need to leave a stub sitting around.  Maybe we just throw
away SBI v0.1, though, that's why we called it all legacy in the first 
place.

My bigger worry is that anything that involves running the kernel at 
arbitrary
virtual addresses means we need a PIC kernel, which means every global 
symbol
needs an indirection.  That's probably not so bad for shared libraries, 
but the
kernel has a lot of global symbols.  PLT references probably aren't so 
scary,
as we have an incoherent instruction cache so the virtual function 
predictor
isn't that hard to build, but making all global data accesses GOT-relative
seems like a disaster for performance.  This fixed-VA thing really just 
exists
so we don't have to be full-on PIC.

In theory I think we could just get away with pretending that medany is 
PIC,
which I believe works as long as the data and text offset stays 
constant, you
you don't have any symbols between 2GiB and -2GiB (as those may stay fixed,
even in medany), and you deal with GP accordingly (which should work 
itself out
in the current startup code).  We rely on this for some of the early 
boot code
(and will soon for kexec), but that's a very controlled code base and we've
already had some issues.  I'd be much more comfortable adding an explicit
semi-PIC code model, as I tend to miss something when doing these sorts of
things and then we could at least add it to the GCC test runs and 
guarantee it
actually works.  Not really sure I want to deal with that, though.  It 
would,
however, be the only way to get random virtual addresses during kernel
execution.

At least in the old days, there were a number of assumptions that
the kernel text/data/bss resides in the linear mapping.

Ya, it terrified me as well.  Alex says arm64 puts the kernel in the 
vmalloc
region, so assuming that's the case it must be possible.  I didn't get that
from reading the arm64 port (I guess it's no secret that pretty much all 
I do
is copy their code)

See https://elixir.bootlin.com/linux/latest/source/arch/arm64/mm/mmu.c#L615.

If you change that you need to ensure that it's still physically
contiguous and you'll have to tweak __va and __pa, which might induce
extra overhead.

I'm operating under the assumption that we don't want to add an 
additional load
to virt2phys conversions.  arm64 bends over backwards to avoid the load, 
and
I'm assuming they have a reason for doing so.  Of course, if we're PIC then
maybe performance just doesn't matter, but I'm not sure I want to just 
give up.
Distros will probably build the sv48+sv39 kernels as soon as they show 
up, even
if there's no sv48 hardware for a while.