On 09/13/2018 11:22 AM, Thomas Gleixner wrote:
On Mon, 10 Sep 2018, Brijesh Singh wrote:
x86/kvmclock: Remove memblock dependency
introduced SEV guest regression.
The guest physical address holding the wall_clock and hv_clock_boot
are shared with the hypervisor must be mapped with C=0 when SEV
is active. To clear the C-bit we use kernel_physical_mapping_init() to
split the large pages. The above commit moved the kvmclock initialization
very early and kernel_physical_mapping_init() fails to allocate memory
while spliting the large page.
To solve it, we add a special .data..decrypted section, this section can be
used to hold the shared variables. Early boot code maps this section with
C=0. The section is pmd aligned and sized to avoid the need to split the pages.
Caller can use __decrypted attribute to add the variables in .data..decrypted
section.
I went through this patch series and to be honest this is in no way -rc3+
material.
I really have to ask why this whole change to sme_encrypt_kernel() is
required at all.
The main reason behind making the changes in sme_encrypt_kernel() was
to perverse the initialized values (if any). Since we are adding
__decrypted attribute hence I thought it would make sense to support
the usecase where in future someone may use this attribute on
initialized variables.
Now the question is, do we really need this to fix the regression? the
answer is NO. Since there is only one user of this new attribute and
lucky it does not initialize the variables hence it is safe to move the
variable in .bss..decrypted section and memset(0).
Lets look at the problem at hand:
We need exactly TWO pages to be shareable at early boot when the
kernel runs in a guest and wants to utilize kvmclock.
No, we do not need the whole NR_CPUS sized thing at that point at
all. The secondary CPUs are brought up way later and there is absolutely
no point in having the ugly decrypted..aux hack. That conditional freeing
is really horrible and prone to break sooner than later.
Further the TWO pages are BSS pages which does not require to handle them
special other than fixing up the PMD and doing a memset(0) on the
actually utilized TWO pages.
We really can enforce that such decrypted variables have to be in the BSS
for now and revisit this when there is an absolute requirement to have
statically initialized ones, which I doubt we have.
So the way simpler solution is:
1) Add a .bss..decrypted section which is PMD aligned and mark the
kvmclock hv_clock_boot and wallclock struct with __bss_decrypted.
> 2) Fixup the .bss..decrypted section PMD in the SEV case at the end of
sme_encrypt_kernel() and do a memset(0) on that. That does not require
any of the restructuring, really.
Somewhere during the discussion, I was asked to make sure that
__decrypted attribute can be used by others in future and don't tie it
with just the kvmclock.
To fix the regression we don't need to have this complexity. I am okay
to implement your proposal. I would like to fix this regression sooner
than later ;)
3) Have a function which is called after the page allocator is up which
does:
I am glad you are pointing this one. In my initial patch I was doing a
kmalloc() of page-size, during review we did discussed to do allocation
once using num_possible_cpus()[like what you have proposed]. But then
discussion moved towards saving memory and that when the secondary array
concept came into the picture. Since .data..decrypted has some unused
memory hence we were getting creative in reusing it.
static struct pvclock_vsyscall_time_info *hvclock_mem;
void __init kvmclock_init_mem(void)
{
unsigned int ncpus = num_possible_cpus() - HVC_BOOT_ARRAY_SIZE;
unsigned int order = get_order(ncpus * sizeof(*hvclock_mem));
struct page *p;
p = alloc_pages(order, GFP_KERNEL);
if (p) {
hvclock_mem = page_address(p);
set_memory_decrypted((unsigned long) hvclock_mem,
PAGE_SIZE << order);
}
}
4) In kvmclock_setup_percpu() do the following:
if (cpu < HVC_BOOT_ARRAY_SIZE)
p = &hv_clock_boot[cpu];
else if (hvclock_mem)
p = hvclock_mem + cpu - HVC_BOOT_ARRAY_SIZE;
else
return -ENOMEM;
5) Unconditionally free the pages after the used .bbs..decrypted variables,
so the unused rest of the 2M page is not lost.
That should be a halfways slim sized and non instrusive changeset.
I am okay to implement and test your recommendation, if anyone
disagree then please let me know. thanks