On Mon Jun 5, 2023 at 4:48 PM AEST, Jordan Niethe wrote: > The new PAPR nested guest API introduces the concept of a Guest State > Buffer for communication about L2 guests between L1 and L0 hosts. > > In the new API, the L0 manages the L2 on behalf of the L1. This means > that if the L1 needs to change L2 state (e.g. GPRs, SPRs, partition > table...), it must request the L0 perform the modification. If the > nested host needs to read L2 state likewise this request must > go through the L0. > > The Guest State Buffer is a Type-Length-Value style data format defined > in the PAPR which assigns all relevant partition state a unique > identity. Unlike a typical TLV format the length is redundant as the > length of each identity is fixed but is included for checking > correctness. > > A guest state buffer consists of an element count followed by a stream > of elements, where elements are composed of an ID number, data length, > then the data: > > Header: > > <---4 bytes---> > +----------------+----- > | Element Count | Elements... > +----------------+----- > > Element: > > <----2 bytes---> <-2 bytes-> <-Length bytes-> > +----------------+-----------+----------------+ > | Guest State ID | Length | Data | > +----------------+-----------+----------------+ > > Guest State IDs have other attributes defined in the PAPR such as > whether they are per thread or per guest, or read-only. > > Introduce a library for using guest state buffers. This includes support > for actions such as creating buffers, adding elements to buffers, > reading the value of elements and parsing buffers. This will be used > later by the PAPR nested guest support. This is a tour de force in one of these things, so I hate to be the "me smash with club" guy, but what if you allocated buffers with enough room for all the state (or 99% of cases, in which case an overflow would make an hcall)? What's actually a fast-path that we don't get from the interrupt return buffer? Getting and setting a few regs for MMIO emulation? > Signed-off-by: Jordan Niethe <jpn@xxxxxxxxxxxxxxxxxx> > --- > v2: > - Add missing #ifdef CONFIG_VSXs > - Move files from lib/ to kvm/ > - Guard compilation on CONFIG_KVM_BOOK3S_HV_POSSIBLE > - Use kunit for guest state buffer tests > - Add configuration option for the tests > - Use macros for contiguous id ranges like GPRs > - Add some missing EXPORTs to functions > - HEIR element is a double word not a word > --- > arch/powerpc/Kconfig.debug | 12 + > arch/powerpc/include/asm/guest-state-buffer.h | 901 ++++++++++++++++++ > arch/powerpc/include/asm/kvm_book3s.h | 2 + > arch/powerpc/kvm/Makefile | 3 + > arch/powerpc/kvm/guest-state-buffer.c | 563 +++++++++++ > arch/powerpc/kvm/test-guest-state-buffer.c | 321 +++++++ > 6 files changed, 1802 insertions(+) > create mode 100644 arch/powerpc/include/asm/guest-state-buffer.h > create mode 100644 arch/powerpc/kvm/guest-state-buffer.c > create mode 100644 arch/powerpc/kvm/test-guest-state-buffer.c > > diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug > index 6aaf8dc60610..ed830a714720 100644 > --- a/arch/powerpc/Kconfig.debug > +++ b/arch/powerpc/Kconfig.debug > @@ -82,6 +82,18 @@ config MSI_BITMAP_SELFTEST > bool "Run self-tests of the MSI bitmap code" > depends on DEBUG_KERNEL > > +config GUEST_STATE_BUFFER_TEST > + def_tristate n > + prompt "Enable Guest State Buffer unit tests" > + depends on KUNIT > + depends on KVM_BOOK3S_HV_POSSIBLE > + default KUNIT_ALL_TESTS > + help > + The Guest State Buffer is a data format specified in the PAPR. > + It is by hcalls to communicate the state of L2 guests between > + the L1 and L0 hypervisors. Enable unit tests for the library > + used to create and use guest state buffers. > + > config PPC_IRQ_SOFT_MASK_DEBUG > bool "Include extra checks for powerpc irq soft masking" > depends on PPC64 > diff --git a/arch/powerpc/include/asm/guest-state-buffer.h b/arch/powerpc/include/asm/guest-state-buffer.h > new file mode 100644 > index 000000000000..65a840abf1bb > --- /dev/null > +++ b/arch/powerpc/include/asm/guest-state-buffer.h > @@ -0,0 +1,901 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +/* > + * Interface based on include/net/netlink.h > + */ > +#ifndef _ASM_POWERPC_GUEST_STATE_BUFFER_H > +#define _ASM_POWERPC_GUEST_STATE_BUFFER_H > + > +#include <linux/gfp.h> > +#include <linux/bitmap.h> > +#include <asm/plpar_wrappers.h> > + > +/************************************************************************** > + * Guest State Buffer Constants > + **************************************************************************/ > +#define GSID_BLANK 0x0000 The namespaces are a little abbreviated. KVM_PAPR_ might be nice if you're calling the API that. > + > +#define GSID_HOST_STATE_SIZE 0x0001 /* Size of Hypervisor Internal Format VCPU state */ > +#define GSID_RUN_OUTPUT_MIN_SIZE 0x0002 /* Minimum size of the Run VCPU output buffer */ > +#define GSID_LOGICAL_PVR 0x0003 /* Logical PVR */ > +#define GSID_TB_OFFSET 0x0004 /* Timebase Offset */ > +#define GSID_PARTITION_TABLE 0x0005 /* Partition Scoped Page Table */ > +#define GSID_PROCESS_TABLE 0x0006 /* Process Table */ > + > +#define GSID_RUN_INPUT 0x0C00 /* Run VCPU Input Buffer */ > +#define GSID_RUN_OUTPUT 0x0C01 /* Run VCPU Out Buffer */ > +#define GSID_VPA 0x0C02 /* HRA to Guest VCPU VPA */ > + > +#define GSID_GPR(x) (0x1000 + (x)) > +#define GSID_HDEC_EXPIRY_TB 0x1020 > +#define GSID_NIA 0x1021 > +#define GSID_MSR 0x1022 > +#define GSID_LR 0x1023 > +#define GSID_XER 0x1024 > +#define GSID_CTR 0x1025 > +#define GSID_CFAR 0x1026 > +#define GSID_SRR0 0x1027 > +#define GSID_SRR1 0x1028 > +#define GSID_DAR 0x1029 It's a shame you have to rip up all your wrapper functions now to shoehorn these in. If you included names analogous to the reg field names in the kvm structures, the wrappers could do macro expansions that get them. #define __GSID_WRAPPER_dar GSID_DAR Or similar. And since of course you have to explicitly enumerate all these, I wouldn't mind defining the types and lengths up-front rather than down in the type function. You'd like to be able to go through the spec and eyeball type, number, size. [snip] > +/** > + * gsb_paddress() - the physical address of buffer > + * @gsb: guest state buffer > + * > + * Returns the physical address of the buffer. > + */ > +static inline u64 gsb_paddress(struct gs_buff *gsb) > +{ > + return __pa(gsb_header(gsb)); > +} > +/** > + * __gse_put_reg() - add a register type guest state element to a buffer > + * @gsb: guest state buffer to add element to > + * @iden: guest state ID > + * @val: host endian value > + * > + * Adds a register type guest state element. Uses the guest state ID for > + * determining the length of the guest element. If the guest state ID has > + * bits that can not be set they will be cleared. > + */ > +static inline int __gse_put_reg(struct gs_buff *gsb, u16 iden, u64 val) > +{ > + val &= gsid_mask(iden); > + if (gsid_size(iden) == sizeof(u64)) > + return gse_put_u64(gsb, iden, val); > + > + if (gsid_size(iden) == sizeof(u32)) { > + u32 tmp; > + > + tmp = (u32)val; > + if (tmp != val) > + return -EINVAL; > + > + return gse_put_u32(gsb, iden, tmp); > + } > + return -EINVAL; > +} There is a clever accessor that derives the length from the type, but then you fall back to this. > +/** > + * gse_put - add a guest state element to a buffer > + * @gsb: guest state buffer to add to > + * @iden: guest state identity > + * @v: generic value > + */ > +#define gse_put(gsb, iden, v) \ > + (_Generic((v), \ > + u64 : __gse_put_reg, \ > + long unsigned int : __gse_put_reg, \ > + u32 : __gse_put_reg, \ > + struct gs_buff_info : gse_put_buff_info, \ > + struct gs_proc_table : gse_put_proc_table, \ > + struct gs_part_table : gse_put_part_table, \ > + vector128 : gse_put_vector128)(gsb, iden, v)) > + > +/** > + * gse_get - return the data of a guest state element > + * @gsb: guest state element to add to > + * @v: generic value pointer to return in > + */ > +#define gse_get(gse, v) \ > + (*v = (_Generic((v), \ > + u64 * : __gse_get_reg, \ > + unsigned long * : __gse_get_reg, \ > + u32 * : __gse_get_reg, \ > + vector128 * : gse_get_vector128)(gse))) I don't see the benefit of this. Caller always knows the type doesn't it? It seems like the right function could be called directly. It makes the calling convention a bit clunky too. I know there's similar precedent for uaccess functions, but not sure I like it for this. > +struct gs_buff *gsb_new(size_t size, unsigned long guest_id, > + unsigned long vcpu_id, gfp_t flags) > +{ > + struct gs_buff *gsb; > + > + gsb = kzalloc(sizeof(*gsb), flags); > + if (!gsb) > + return NULL; > + > + size = roundup_pow_of_two(size); > + gsb->hdr = kzalloc(size, GFP_KERNEL); > + if (!gsb->hdr) > + goto free; > + > + gsb->capacity = size; > + gsb->len = sizeof(struct gs_header); > + gsb->vcpu_id = vcpu_id; > + gsb->guest_id = guest_id; > + > + gsb->hdr->nelems = cpu_to_be32(0); > + > + return gsb; > + > +free: > + kfree(gsb); > + return NULL; > +} > +EXPORT_SYMBOL(gsb_new); Should all be GPL exports. Needs more namespace too, I reckon (not just exports but any kernel-wide name this short and non-descriptive needs a kvmppc or kvm_papr or something). Thanks, Nick