Re: [PATCH hyperv-next v5 03/11] Drivers: hv: Enable VTL mode for arm64

Roman Kisel <romank@xxxxxxxxxxxxxxxxxxx> · Wed, 12 Mar 2025 14:30:54 -0700

On 3/12/2025 1:31 PM, Wei Liu wrote:
On Wed, Mar 12, 2025 at 11:33:11AM -0700, Roman Kisel wrote:

On 3/10/2025 3:18 PM, Michael Kelley wrote:
From: Arnd Bergmann <arnd@xxxxxxxx> Sent: Monday, March 10, 2025 2:21 PM

On Mon, Mar 10, 2025, at 22:01, Michael Kelley wrote:
From: Arnd Bergmann <arnd@xxxxxxxx> Sent: Saturday, March 8, 2025 1:05 PM
   config HYPERV_VTL_MODE
   	bool "Enable Linux to boot in VTL context"
-	depends on X86_64 && HYPERV
+	depends on (X86_64 || ARM64)
   	depends on SMP
+	select OF_EARLY_FLATTREE
+	select OF
   	default n
   	help

Having the dependency below the top-level Kconfig entry feels a little
counterintuitive. You could flip that back as it was before by doing

        select HYPERV_VTL_MODE if !ACPI
        depends on ACPI || SMP

in the HYPERV option, leaving the dependency on HYPERV in
HYPERV_VTL_MODE.

I would argue that we don't ever want to implicitly select
HYPERV_VTL_MODE because of some other config setting or
lack thereof.  VTL mode is enough of a special case that it should
only be explicitly selected. If someone omits ACPI, then HYPERV
should not be selectable unless HYPERV_VTL_MODE is explicitly
selected.

The last line of the comment for HYPERV_VTL_MODE says
"A kernel built with this option must run at VTL2, and will not run
as a normal guest."  In other words, don't choose this unless you
100% know that VTL2 is what you want.

It sounds like the latter is the real problem: enabling a feature
should never prevent something else from working. Can you describe
what VTL context is and why it requires an exception to a rather
fundamental rule here? If you build a kernel that runs on every
single piece of arm64 hardware and every hypervisor, why can't
you add HYPERV_VTL_MODE to that as an option?

In VTL mode, we're running the kernel as secure firmware inside the
guest (one might think of VTL2 as analogous to Intel SMM or the Secure
World on ARM).

[...]

Ideally, a Linux kernel image could detect at runtime what VTL it is
running at, and "do the right thing". Unfortunately, on x86 Linux this
has proved difficult (or perhaps impossible) because the amount of
boot-time setup required to ask the question about the current VTL
is significant. The idiosyncrasies and historical baggage of x86 require
that Linux do some x86-specific initialization steps for VTL > 0
before the question can be asked. Hence the introduction of
CONFIG_HYPERV_VTL_MODE, and the behavior that when it is
selected, the kernel image won't run normally in VTL 0.

I'll go out on a limb and say that I suspect on arm64 a runtime
determination based on querying the VTL *could* be made (though
I'm not the person writing the code). But taking advantage of that
on arm64 produces an undesirable dichotomy with x86.
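To make the "detect the VTL at runtime and do the right thing" idea concrete, here is a minimal userspace sketch of the decision only. It assumes the raw VSM VP status register value has already been obtained from the hypervisor (the expensive part on x86) and that the active VTL sits in bits 3:0 of that value. All `sketch_*` names are hypothetical and are not kernel APIs.

```c
#include <stdint.h>

/* Assumption: the active VTL is reported in bits 3:0 of the VSM VP status value. */
#define SKETCH_VTL_MASK 0xfULL

/* Hypothetical boot flavors a single kernel image would choose between. */
enum sketch_boot_mode {
	SKETCH_BOOT_NORMAL_GUEST,	/* VTL 0: behave as a regular guest */
	SKETCH_BOOT_VTL_FIRMWARE,	/* VTL > 0: behave as secure firmware */
};

/* Extract the active VTL from a raw VSM VP status register value. */
static unsigned int sketch_active_vtl(uint64_t vsm_vp_status)
{
	return (unsigned int)(vsm_vp_status & SKETCH_VTL_MASK);
}

/* Choose the boot flavor at runtime instead of via a build-time option. */
static enum sketch_boot_mode sketch_pick_boot_mode(uint64_t vsm_vp_status)
{
	return sketch_active_vtl(vsm_vp_status) > 0 ?
		SKETCH_BOOT_VTL_FIRMWARE : SKETCH_BOOT_NORMAL_GUEST;
}
```

The sketch deliberately leaves out how the register is read, since that is exactly the step that needs early hypercall setup on x86.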

On arm64 that is much easier, I agree. On x86 we'd need a kludge of

static void __naked __init __aligned(4096) early_hvcall_pg(void)
{
	/*
	 * Fill the early hvcall page with `0xF1` aka `INT1` to catch
	 * programming errors. The hypervisor will overlay the page with
	 * the vendor-specific code sequences to make hypercalls on x86(_64).
	 */
	asm (".skip 4096, 0xf1");
}

static u8 __init early_hvcall_pg_input[4096] __attribute__((aligned(4096)));
static u8 __init early_hvcall_pg_output[4096]
	__attribute__((aligned(4096)));

static void __init early_connect_to_hv(void)
{
	union hv_x64_msr_hypercall_contents hypercall_msr;
	u64 guest_id;

	guest_id = hv_generate_guest_id(LINUX_VERSION_CODE);
	wrmsrl(HV_X64_MSR_GUEST_OS_ID, guest_id);
	rdmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
	hypercall_msr.enable = 1;
	hypercall_msr.guest_physical_address =
		__phys_to_pfn(virt_to_phys(early_hvcall_pg));
	wrmsrl(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
}

or variations thereof.
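For readers following the MSR update in early_connect_to_hv(), here is a small userspace model of how the hypercall MSR value is composed. It assumes the layout implied by the bit-field accesses above: enable in bit 0, reserved bits 11:1 preserved from the value read back, and the page frame number of the hypercall page from bit 12 upward. The `SKETCH_*` and `sketch_*` names are hypothetical, and explicit masks replace the kernel's union so the result does not depend on compiler bit-field layout.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT	12
#define SKETCH_PAGE_MASK	((1ULL << SKETCH_PAGE_SHIFT) - 1)
#define SKETCH_HYPERCALL_ENABLE	(1ULL << 0)

/*
 * Model of the read-modify-write done on the hypercall MSR:
 * "current" is the value read from the MSR, "page_phys" is the
 * guest physical address of the hypercall page.
 */
static uint64_t sketch_compose_hypercall_msr(uint64_t current,
					      uint64_t page_phys)
{
	uint64_t value;

	/* The hypercall page must be page-aligned (4096 bytes here). */
	assert((page_phys & SKETCH_PAGE_MASK) == 0);

	/* Keep the low 12 bits (enable + reserved), drop the old PFN. */
	value = current & SKETCH_PAGE_MASK;
	/* Set the enable bit. */
	value |= SKETCH_HYPERCALL_ENABLE;
	/* Store the PFN in bits 63:12. */
	value |= (page_phys >> SKETCH_PAGE_SHIFT) << SKETCH_PAGE_SHIFT;

	return value;
}
```

For example, starting from an MSR value of 0 and a hypercall page at guest physical address 0x12345000, the model yields 0x12345001.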

OT here but what's stopping us from doing this on x86?

At first glance, nothing, I think. For confidential-computing scenarios
like TDX and SEV-SNP, the early hvcall I/O pages above are allocated in
BSS, so we might need to mark them as decrypted and then zero them out so
they look like a proper BSS section (the page contents are scrambled after
flipping the page encryption bit, IIRC).

It seems to me there is some value in setting up the hypercall page as
early as possible. The same page can be used through the lifetime of the
partition. The early input and output pages should be reclaimed.

Wholeheartedly agree!

Also, since the hypervisor will insert an overlay page, it makes sense
to not allocate a page from Linux at all. When I ported Xen to run as
a guest on Hyper-V, I used that approach. The setup worked just fine.

All that being said, things work today, so I'm in no hurry to change things.

I'll try fleshing this out soon-ish if no one beats me to that :)

Wei.

--
Thank you,
Roman