Hello Kirill,
On 1/24/2024 6:55 AM, Kirill A. Shutemov wrote:
TDX guests allocate shared buffers to perform I/O. It is done by
allocating pages normally from the buddy allocator and converting them
to shared with set_memory_decrypted().
The second kernel has no idea what memory is converted this way. It only
sees E820_TYPE_RAM.
Accessing shared memory via private mapping is fatal. It leads to
unrecoverable TD exit.
On kexec walk direct mapping and convert all shared memory back to
private. It makes all RAM private again and second kernel may use it
normally.
The conversion occurs in two steps: stopping new conversions and
unsharing all memory. In the case of normal kexec, the stopping of
conversions takes place while scheduling is still functioning. This
allows for waiting until any ongoing conversions are finished. The
second step is carried out when all CPUs except one are inactive and
interrupts are disabled. This prevents any conflicts with code that may
access shared memory.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@xxxxxxxxx>
---
arch/x86/coco/tdx/tdx.c | 124 +++++++++++++++++++++++++++++++++++++++-
1 file changed, 122 insertions(+), 2 deletions(-)
<snip>
+static void tdx_kexec_stop_conversion(bool crash)
+{
+ /* Stop new private<->shared conversions */
+ conversion_allowed = false;
+
+ /*
+ * Make sure conversion_allowed is cleared before checking
+ * conversions_in_progress.
+ */
+ barrier();
+
+ /*
+ * Crash kernel reaches here with interrupts disabled: can't wait for
+ * conversions to finish.
+ *
+ * If race happened, just report and proceed.
+ */
+ if (!crash) {
+ unsigned long timeout;
+
+ /*
+ * Wait for in-flight conversions to complete.
+ *
+ * Do not wait more than 30 seconds.
+ */
+ timeout = 30 * USEC_PER_SEC;
+ while (atomic_read(&conversions_in_progress) && timeout--)
+ udelay(1);
+ }
+
+ if (atomic_read(&conversions_in_progress))
+ pr_warn("Failed to finish shared<->private conversions\n");
+}
+
+static void tdx_kexec_unshare_mem(void)
+{
+ unsigned long addr, end;
+ long found = 0, shared;
+
+ /*
+ * Walk direct mapping and convert all shared memory back to private,
+ */
+
+ addr = PAGE_OFFSET;
+ end = PAGE_OFFSET + get_max_mapped();
+
+ while (addr < end) {
+ unsigned long size;
+ unsigned int level;
+ pte_t *pte;
+
+ pte = lookup_address(addr, &level);
+ size = page_level_size(level);
+
+ if (pte && pte_decrypted(*pte)) {
+ int pages = size / PAGE_SIZE;
+
+ /*
+ * Touching memory with shared bit set triggers implicit
+ * conversion to shared.
+ *
+ * Make sure nobody touches the shared range from
+ * now on.
+ */
+ set_pte(pte, __pte(0));
+
+ if (!tdx_enc_status_changed(addr, pages, true)) {
+ pr_err("Failed to unshare range %#lx-%#lx\n",
+ addr, addr + size);
+ }
+
+ found += pages;
+ }
+
+ addr += size;
+ }
+
+ __flush_tlb_all();
+
+ shared = atomic_long_read(&nr_shared);
+ if (shared != found) {
+ pr_err("shared page accounting is off\n");
+ pr_err("nr_shared = %ld, nr_found = %ld\n", shared, found);
+ }
+}
In case of SNP and crash/kdump case, we need to prevent the boot_ghcb
being converted to shared (in snp_kexec_unshare_mem()) as the boot_ghcb
is required to handle all I/O for disabling IO-APIC/lapic, hpet, etc.,
as the enc_kexec_unshare_mem() callback is invoked before the apics,
hpet, etc. are disabled.
Is there any reason why enc_kexec_unshare_mem() callback is invoked in
crash case before the IO-APIC/lapic, hpet, etc. are shutdown/disabled ?
In case of kexec, enc_kexec_unshare_mem() callback is invoked after the
IO-APIC/lapic, hpet, iommu, etc. have already been disabled/shutdown,
hence, this callback can transition all guest shared memory (including
boot_ghcb) back to private.
Thanks, Ashish
_______________________________________________
kexec mailing list
kexec@xxxxxxxxxxxxxxxxxxx
http://lists.infradead.org/mailman/listinfo/kexec