[PATCH v2 0/7] KSTATE: a mechanism to migrate some part of the kernel state across kexec

Andrey Ryabinin <arbn@xxxxxxxxxxxxxxx> · Mon, 10 Mar 2025 13:03:11 +0100

 Main changes from v1 [1]:
  - Stop abusing crashkernel and implement a proper way to pass memory to the new kernel
  - Lots of misc cleanups/refactorings.

kstate (kernel state) is a mechanism to describe some part of the internal
kernel state, save it into memory and restore that state in the new kernel
after kexec.

The end goal and the main use case for this is to be able to
update the host kernel under VMs with VFIO pass-through devices running
on that host. Since we are still pretty far from that goal, this series
only establishes some basic infrastructure to describe and migrate complex
in-kernel states.

The idea behind KSTATE resembles QEMU's migration framework [2], which
solves a quite similar problem - migrating the state of a VM/emulated devices
across different versions of QEMU.

This is an alternative to Kexec Hand Over (KHO [3]).

So, why not KHO?

 - The main reason is that KHO doesn't provide a simple and convenient internal
    API for drivers/subsystems to preserve internal data.
    E.g. let's say we have some variable of type 'struct a'
    that needs to be preserved:
	struct a {
		int i;
		unsigned long *p_ulong;
		char s[10];
		struct page *page;
	};

     The KHO way requires the driver/subsystem to carry a bunch of code
     dealing with FDT stuff, something like:

     a_kho_write()
     {
	     ...
	     err = fdt_property(fdt, "i", &a.i, sizeof(a.i));
	     err |= fdt_property(fdt, "ulong", a.p_ulong, sizeof(*a.p_ulong));
	     err |= fdt_property(fdt, "s", &a.s, sizeof(a.s));
	     if (err)
	     ...
     }

     a_kho_restore()
     {
	     ...
	     /* fdt_getprop() returns a pointer to the property value */
	     prop = fdt_getprop(fdt, offset, "i", &len);
	     if (!prop || len != sizeof(a.i))
		     goto err;
	     a.i = *(const int *)prop;
	     *a.p_ulong = fdt_getprop....
     }

    Each driver/subsystem has to solve this problem in its own way.
    Also, using FDT properties for individual fields can be wasteful
    in terms of memory, since these properties use strings as keys.

    KSTATE solves the same problem in a more elegant way:
	struct kstate_description a_state = {
		.name = "a_struct",
		.version_id = 1,
		.id = KSTATE_TEST_ID,
		.state_list = LIST_HEAD_INIT(a_state.state_list),
		.fields = (const struct kstate_field[]) {
			KSTATE_BASE_TYPE(i, struct a, int),
			KSTATE_BASE_TYPE(s, struct a, char [10]),
			KSTATE_POINTER(p_ulong, struct a),
			KSTATE_PAGE(page, struct a),
			KSTATE_END_OF_LIST()
		},
	};

	{
		static unsigned long ulong;
		static struct a a_data = { .p_ulong = &ulong };

		kstate_register(&a_state, &a_data);
	}

    The driver only needs a proper 'kstate_description' and a call to
    kstate_register() to save/restore a_data.
    Basically, 'struct kstate_description' provides instructions on how to
    save/restore 'struct a', and kstate_register() does all the save/restore
    work under the hood.

 - Another bonus point - kstate can preserve migratable memory, which is required
    to preserve guest memory.

Now to how this works.

The state of kernel data (usually some struct) is described by a
'struct kstate_description' containing an array of individual
field descriptions - 'struct kstate_field'. Each field
has a set of bits in ->flags which tell how to save/restore
that particular field of the struct. E.g.:
  - KS_BASE_TYPE - the field can simply be copied by value.

  - KS_POINTER - the struct member is a pointer to the actual
     data, so it needs to be dereferenced before saving/restoring data
     to/from the kstate data stream.

  - KS_STRUCT - the field contains another struct, field->ksd must point to
      another 'struct kstate_description'

  - KS_CUSTOM - Some non-trivial field that requires custom kstate_field->save()
               ->restore() callbacks to save/restore data.

  - KS_ARRAY_OF_POINTER - array of pointers, the size of the array is
                         determined by the field->count() callback

  - KS_ADDRESS - field is a pointer to either the vmemmap area (struct page) or
     a linear address. The offset is stored rather than the raw address.

  - KS_END - special flag indicating the end of migration stream data.
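
To illustrate how flags can drive save/restore, here is a minimal userspace
sketch. It is not the code from this series: 'struct field_model', the helper
names and the numeric flag values are all made up for illustration, and only
the KS_BASE_TYPE, KS_POINTER and KS_END cases are modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Flag names follow the cover letter; the numeric values are assumed. */
enum { KS_BASE_TYPE = 1 << 0, KS_POINTER = 1 << 1, KS_END = 1 << 2 };

/* Hypothetical, simplified stand-in for 'struct kstate_field'. */
struct field_model {
	unsigned int flags;
	size_t offset;	/* offset of the member inside the object */
	size_t size;	/* number of bytes to copy */
};

/* Locate the bytes a field refers to, dereferencing KS_POINTER members. */
static void *field_data(const struct field_model *f, void *obj)
{
	char *member = (char *)obj + f->offset;
	void *target;

	if (!(f->flags & KS_POINTER))
		return member;
	/* The member itself is a pointer: load it and follow it. */
	memcpy(&target, member, sizeof(target));
	return target;
}

/* Copy every described field into 'out'; returns the bytes written. */
size_t save_fields(const struct field_model *f, void *obj, unsigned char *out)
{
	size_t pos = 0;

	assert(obj && out);
	for (; !(f->flags & KS_END); f++) {
		memcpy(out + pos, field_data(f, obj), f->size);
		pos += f->size;
	}
	return pos;
}

/* Mirror of save_fields(): copy bytes from 'in' back into the object. */
size_t restore_fields(const struct field_model *f, void *obj,
		      const unsigned char *in)
{
	size_t pos = 0;

	assert(obj && in);
	for (; !(f->flags & KS_END); f++) {
		memcpy(field_data(f, obj), in + pos, f->size);
		pos += f->size;
	}
	return pos;
}
```

In this model a KS_POINTER field is saved as the pointed-to data rather than as
the pointer value, so on restore the object must already hold a valid pointer
to write through.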

The kstate_register() call accepts a kstate_description along with an instance
of an object and registers it in the global 'states' list.

During the kexec reboot phase we go through the list of 'kstate_description's,
and each instance of kstate_description forms a 'struct kstate_entry'
which is saved into kstate's data stream.

The 'kstate_entry' contains information like the ID of the kstate_description,
its version, the size of the migration data and the data itself. The ->data is
formed in accordance with the kstate_fields of the corresponding kstate_description.
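
As a rough picture of such an entry, the sketch below appends a header plus
payload to a byte stream. The header layout, the field widths and the name
entry_append() are assumptions made for illustration; the real
'struct kstate_entry' in the patches may look different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header; the real 'struct kstate_entry' layout may differ. */
struct entry_hdr_model {
	uint32_t id;		/* which kstate_description produced this */
	uint32_t version;	/* version_id of that description */
	uint32_t size;		/* number of payload bytes that follow */
};

/*
 * Append one entry (header + payload) at offset 'pos' of a stream that is
 * 'cap' bytes long. Returns the offset just past the new entry.
 */
size_t entry_append(unsigned char *stream, size_t cap, size_t pos,
		    uint32_t id, uint32_t version,
		    const void *data, uint32_t size)
{
	struct entry_hdr_model hdr = {
		.id = id, .version = version, .size = size,
	};

	/* The caller is expected to have sized the stream correctly. */
	assert(pos <= cap && sizeof(hdr) + size <= cap - pos);

	memcpy(stream + pos, &hdr, sizeof(hdr));
	memcpy(stream + pos + sizeof(hdr), data, size);
	return pos + sizeof(hdr) + size;
}
```

Storing the size in the header lets a reader skip entries it does not
recognize without understanding their payload.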

After the reboot, when kstate_register() is called, it parses the migration
stream, finds the appropriate 'kstate_entry' and restores the contents of
the object in accordance with the kstate_description and its ->fields.
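
The lookup step on the restore side can be pictured with the following
userspace sketch. It assumes a stream of back-to-back entries, each prefixed by
a small header; that header layout and the name entry_find() are hypothetical
and not taken from the patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical header, assumed to precede each entry's payload. */
struct entry_hdr_model {
	uint32_t id;
	uint32_t version;
	uint32_t size;
};

/*
 * Walk the stream and return a pointer to the payload of the first entry
 * whose id matches, or NULL if there is none. On success the header of
 * the matching entry is copied to '*out' so the caller can check the
 * version and size before restoring fields.
 */
const unsigned char *entry_find(const unsigned char *stream, size_t len,
				uint32_t id, struct entry_hdr_model *out)
{
	struct entry_hdr_model hdr;
	size_t pos = 0;

	assert(stream && out);
	while (len - pos >= sizeof(hdr)) {
		memcpy(&hdr, stream + pos, sizeof(hdr));
		pos += sizeof(hdr);
		if (hdr.size > len - pos)
			return NULL;	/* truncated or corrupt stream */
		if (hdr.id == id) {
			*out = hdr;
			return stream + pos;
		}
		pos += hdr.size;	/* skip this entry's payload */
	}
	return NULL;
}
```

A caller would then compare the returned version against the version_id of its
own description and walk ->fields to copy the payload back into the object.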

 [1] https://lkml.kernel.org/r/20241002160722.20025-1-arbn@xxxxxxxxxxxxxxx
 [2] https://www.qemu.org/docs/master/devel/migration/main.html#vmstate
 [3] https://lkml.kernel.org/r/20250206132754.2596694-1-rppt@xxxxxxxxxx

Andrey Ryabinin (7):
  kstate: Add kstate - a mechanism to describe and migrate kernel state
    across kexec
  kstate, kexec, x86: transfer kstate data across kexec
  kexec: exclude control pages from the destination addresses
  kexec, kstate: delay loading of kexec segments
  x86, kstate: Add the ability to preserve memory pages across kexec.
  kexec, kstate: save kstate data before kexec'ing
  kstate, test: add test module for testing kstate subsystem.

 arch/x86/Kconfig                  |   1 +
 arch/x86/kernel/kexec-bzimage64.c |   4 +
 arch/x86/kernel/setup.c           |   2 +
 include/linux/kexec.h             |   3 +
 include/linux/kstate.h            | 216 ++++++++++++++
 kernel/Kconfig.kexec              |  13 +
 kernel/Makefile                   |   1 +
 kernel/kexec_core.c               |  30 ++
 kernel/kexec_file.c               | 159 +++++++----
 kernel/kexec_internal.h           |   9 +
 kernel/kstate.c                   | 458 ++++++++++++++++++++++++++++++
 lib/Makefile                      |   2 +
 lib/test_kstate.c                 |  86 ++++++
 13 files changed, 925 insertions(+), 59 deletions(-)
 create mode 100644 include/linux/kstate.h
 create mode 100644 kernel/kstate.c
 create mode 100644 lib/test_kstate.c

-- 
2.45.3