On Mon, Jan 15, 2024 at 12:53:42PM +0200, Nikolay Borisov wrote:
>
>
> On 23.12.23 1:52, Kirill A. Shutemov wrote:
> > TDX guests allocate shared buffers to perform I/O. It is done by
> > allocating pages normally from the buddy allocator and converting them
> > to shared with set_memory_decrypted().
> >
> > The second kernel has no idea what memory is converted this way. It only
> > sees E820_TYPE_RAM.
> >
> > Accessing shared memory via private mapping is fatal. It leads to
> > unrecoverable TD exit.
> >
> > On kexec walk direct mapping and convert all shared memory back to
> > private. It makes all RAM private again and second kernel may use it
> > normally.
> >
> > The conversion occurs in two steps: stopping new conversions and
> > unsharing all memory. In the case of normal kexec, the stopping of
> > conversions takes place while scheduling is still functioning. This
> > allows for waiting until any ongoing conversions are finished. The
> > second step is carried out when all CPUs except one are inactive and
> > interrupts are disabled. This prevents any conflicts with code that may
> > access shared memory.
> >
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@xxxxxxxxxxxxxxx>
> > Reviewed-by: Rick Edgecombe <rick.p.edgecombe@xxxxxxxxx>
> > ---
> >  arch/x86/coco/tdx/tdx.c         | 119 +++++++++++++++++++++++++++++++-
> >  arch/x86/include/asm/x86_init.h |   2 +
> >  arch/x86/kernel/crash.c         |   6 ++
> >  arch/x86/kernel/reboot.c        |  13 ++++
> >  4 files changed, 138 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
> > index 8a49484a2917..5c64db168edd 100644
> > --- a/arch/x86/coco/tdx/tdx.c
> > +++ b/arch/x86/coco/tdx/tdx.c
> > @@ -6,8 +6,10 @@
> >  #include <linux/cpufeature.h>
> >  #include <linux/debugfs.h>
> > +#include <linux/delay.h>
> >  #include <linux/export.h>
> >  #include <linux/io.h>
> > +#include <linux/kexec.h>
> >  #include <asm/coco.h>
> >  #include <asm/tdx.h>
> >  #include <asm/vmx.h>
> > @@ -15,6 +17,7 @@
> >  #include <asm/insn.h>
> >  #include <asm/insn-eval.h>
> >  #include <asm/pgtable.h>
> > +#include <asm/set_memory.h>
> >  /* MMIO direction */
> >  #define EPT_READ	0
> > @@ -41,6 +44,9 @@
> >  static atomic_long_t nr_shared;
> > +static atomic_t conversions_in_progress;
> > +static bool conversion_allowed = true;
>
> Given the usage model of this variable, shouldn't it be simply accessed via
> READ/WRITE_ONCE macros?

What do you see it changing?

> > +
> >  static inline bool pte_decrypted(pte_t pte)
> >  {
> >  	return cc_mkdec(pte_val(pte)) == pte_val(pte);
> > @@ -726,6 +732,14 @@ static bool tdx_tlb_flush_required(bool private)
> >  static bool tdx_cache_flush_required(void)
> >  {
> > +	/*
> > +	 * Avoid issuing CLFLUSH on set_memory_decrypted() if conversions
> > +	 * stopped. Otherwise it can race with unshare_all_memory() and trigger
> > +	 * implicit conversion to shared.
> > +	 */
> > +	if (!conversion_allowed)
> > +		return false;
> > +
> >  	/*
> >  	 * AMD SME/SEV can avoid cache flushing if HW enforces cache coherence.
> >  	 * TDX doesn't have such capability.
> > @@ -809,12 +823,25 @@ static bool tdx_enc_status_changed(unsigned long vaddr, int numpages, bool enc)
> >  static int tdx_enc_status_change_prepare(unsigned long vaddr, int numpages,
> >  					 bool enc)
> >  {
> > +	atomic_inc(&conversions_in_progress);
> > +
> > +	/*
> > +	 * Check after bumping conversions_in_progress to serialize
> > +	 * against tdx_shutdown().
> > +	 */
> > +	if (!conversion_allowed) {
> > +		atomic_dec(&conversions_in_progress);
> > +		return -EBUSY;
> > +	}
>
> nit: Can you make the inc of conversions_in_progress be done here, this
> eliminates the atomic_dec in case they aren't. Somewhat simplifies the
> logic.

Okay, fair enough. Will change.

> > +
> >  	/*
> >  	 * Only handle shared->private conversion here.
> >  	 * See the comment in tdx_early_init().
> >  	 */
> > -	if (enc && !tdx_enc_status_changed(vaddr, numpages, enc))
> > +	if (enc && !tdx_enc_status_changed(vaddr, numpages, enc)) {
> > +		atomic_dec(&conversions_in_progress);
> >  		return -EIO;
> > +	}
> >  	return 0;
> >  }
> > @@ -826,17 +853,102 @@ static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
> >  	 * Only handle private->shared conversion here.
> >  	 * See the comment in tdx_early_init().
> >  	 */
> > -	if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc))
> > +	if (!enc && !tdx_enc_status_changed(vaddr, numpages, enc)) {
> > +		atomic_dec(&conversions_in_progress);
> >  		return -EIO;
> > +	}
> >  	if (enc)
> >  		atomic_long_sub(numpages, &nr_shared);
> >  	else
> >  		atomic_long_add(numpages, &nr_shared);
> > +	atomic_dec(&conversions_in_progress);
> > +
> >  	return 0;
> >  }
> > +static void tdx_kexec_stop_conversion(bool crash)
> > +{
> > +	/* Stop new private<->shared conversions */
> > +	conversion_allowed = false;
>
> What's the logic behind this compiler barrier?

It stops the compiler from moving the assignment past the atomic_read()
loop below. I'm not sure anything else prevents such reordering without
the barrier. And I don't think WRITE_ONCE() will do the trick. It only
prevents multiple writes, but doesn't prevent reordering against accesses
that are not READ_ONCE()/WRITE_ONCE().

> > +	barrier();
> > +
> > +	/*
> > +	 * Crash kernel reaches here with interrupts disabled: can't wait for
> > +	 * conversions to finish.
> > +	 *
> > +	 * If race happened, just report and proceed.
> > +	 */
> > +	if (!crash) {
> > +		unsigned long timeout;
> > +
> > +		/*
> > +		 * Wait for in-flight conversions to complete.
> > +		 *
> > +		 * Do not wait more than 30 seconds.
> > +		 */
> > +		timeout = 30 * USEC_PER_SEC;
> > +		while (atomic_read(&conversions_in_progress) && timeout--)
> > +			udelay(1);
> > +	}
> > +
> > +	if (atomic_read(&conversions_in_progress))
> > +		pr_warn("Failed to finish shared<->private conversions\n");
> > +}
> > +
>
> <snip>
>
> > diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
> > index c9503fe2d13a..3196ff20a29e 100644
> > --- a/arch/x86/include/asm/x86_init.h
> > +++ b/arch/x86/include/asm/x86_init.h
> > @@ -154,6 +154,8 @@ struct x86_guest {
> >  	int (*enc_status_change_finish)(unsigned long vaddr, int npages, bool enc);
> >  	bool (*enc_tlb_flush_required)(bool enc);
> >  	bool (*enc_cache_flush_required)(void);
> > +	void (*enc_kexec_stop_conversion)(bool crash);
> > +	void (*enc_kexec_unshare_mem)(void);
>
> These are only being initialized in the TDX case, but called in all cases
> when CC_ATTR_GUEST_MEM_ENCRYPT is true, which includes AMD. So it would
> cause a crash, no? Shouldn't you also introduce noop handlers initialized
> in the default x86_platform struct in arch/x86/kernel/x86_init.c?

kexec on AMD will not work without them, I think. But no-ops make sense
anyway. Will fix.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
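
For anyone following the thread, here is a rough userspace sketch of the
stop/drain handshake discussed above. It is only an illustration, not the
patch: it swaps the kernel's atomic_t, plain bool and barrier() for C11
sequentially consistent atomics, replaces udelay(1) with an empty poll, and
the names conversion_begin(), conversion_end(), stop_conversions() and
max_polls are made up for the example. It keeps the increment-then-check
order of the patch as posted.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-ins for the two variables added by the patch. */
static atomic_int conversions_in_progress;
static atomic_bool conversion_allowed = true;

/*
 * Hypothetical analogue of the check in tdx_enc_status_change_prepare():
 * announce the conversion first, then test the flag, and back out with
 * -EBUSY if conversions have been stopped.
 */
int conversion_begin(void)
{
	atomic_fetch_add(&conversions_in_progress, 1);

	if (!atomic_load(&conversion_allowed)) {
		atomic_fetch_sub(&conversions_in_progress, 1);
		return -EBUSY;
	}

	return 0;
}

/* Hypothetical analogue of the atomic_dec() at the end of a conversion. */
void conversion_end(void)
{
	atomic_fetch_sub(&conversions_in_progress, 1);
}

/*
 * Hypothetical analogue of tdx_kexec_stop_conversion(): forbid new
 * conversions, then poll a bounded number of times for in-flight ones.
 * Returns true if the counter drained to zero.
 */
bool stop_conversions(unsigned long max_polls)
{
	/* Sequentially consistent store: ordered before the loads below. */
	atomic_store(&conversion_allowed, false);

	while (atomic_load(&conversions_in_progress) && max_polls--)
		; /* the kernel code calls udelay(1) here */

	return atomic_load(&conversions_in_progress) == 0;
}
```

In this model a conversion that started before stop_conversions() keeps the
counter non-zero until conversion_end() runs, and any conversion_begin()
issued after the flag is cleared returns -EBUSY. The seq_cst atomics order
the flag store before the counter loads, which is the job barrier() is
meant to do against the compiler in the patch.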