Re: [PATCH v4 05/14] kexec: Add Kexec HandOver (KHO) generation helpers

Jason Gunthorpe <jgg@xxxxxxxxxx> · Mon, 10 Feb 2025 16:22:20 -0400

On Thu, Feb 06, 2025 at 03:27:45PM +0200, Mike Rapoport wrote:
> diff --git a/Documentation/ABI/testing/sysfs-kernel-kho b/Documentation/ABI/testing/sysfs-kernel-kho
> new file mode 100644
> index 000000000000..f13b252bc303
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-kernel-kho
> @@ -0,0 +1,53 @@
> +What:		/sys/kernel/kho/active
> +Date:		December 2023
> +Contact:	Alexander Graf <graf@xxxxxxxxxx>
> +Description:
> +		Kexec HandOver (KHO) allows Linux to transition the state of
> +		compatible drivers into the next kexec'ed kernel. To do so,
> +		device drivers will serialize their current state into a DT.
> +		While the state is serialized, they are unable to perform
> +		any modifications to state that was serialized, such as
> +		handed over memory allocations.
> +
> +		When this file contains "1", the system is in the transition
> +		state. When contains "0", it is not. To switch between the
> +		two states, echo the respective number into this file.

I don't think this is a great interface for the actual state machine..

> +What:		/sys/kernel/kho/dt_max
> +Date:		December 2023
> +Contact:	Alexander Graf <graf@xxxxxxxxxx>
> +Description:
> +		KHO needs to allocate a buffer for the DT that gets
> +		generated before it knows the final size. By default, it
> +		will allocate 10 MiB for it. You can write to this file
> +		to modify the size of that allocation.

Seems gross, why can't it use a non-contiguous page list to generate
the FDT? :\

See below for a suggestion..

> +static int kho_serialize(void)
> +{
> +	void *fdt = NULL;
> +	int err = -ENOMEM;
> +
> +	fdt = kvmalloc(kho_out.dt_max, GFP_KERNEL);
> +	if (!fdt)
> +		goto out;
> +
> +	if (fdt_create(fdt, kho_out.dt_max)) {
> +		err = -EINVAL;
> +		goto out;
> +	}
> +
> +	err = fdt_finish_reservemap(fdt);
> +	if (err)
> +		goto out;
> +
> +	err = fdt_begin_node(fdt, "");
> +	if (err)
> +		goto out;
> +
> +	err = fdt_property_string(fdt, "compatible", "kho-v1");
> +	if (err)
> +		goto out;
> +
> +	/* Loop through all kho dump functions */
> +	err = blocking_notifier_call_chain(&kho_out.chain_head, KEXEC_KHO_DUMP, fdt);
> +	err = notifier_to_errno(err);

I don't see this really working long term. I think we'd like each
component to be able to serialize at its own pace under userspace
control.

This design requires that the whole thing be wrapped in a notifier
callback just so we can make use of the fdt APIs.

It seems like a poor fit to me.

IMHO if you want to keep using FDT I suggest that each serializing
component (i.e. driver, ftrace, whatever) allocate its own FDT fragment
from scratch, and the main KHO one just links to the memory that holds
those fragments.

I.e. the driver experience would be more like:

 kho = kho_start_storage("my_compatible_string,v1", some_kind_of_instance_key);

 fdt...(kho->fdt..)

 kho_finish_storage(kho);

Where this ends up creating a standalone FDT fragment:

/dts-v1/;
/ {
  compatible = "linux-kho,my_compatible_string,v1";
  instance = some_kind_of_instance_key;
  key-value-1 = <..>;
  key-value-2 = <..>;
};

And then kho_finish_storage() would remember the phys/length until the
kexec fdt is produced as the very last step.
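
To make the bookkeeping concrete, here is a rough userspace sketch of
what "remember the phys/length" could look like. Everything in it is an
assumption for illustration only: kho_store(), kho_records,
kho_total_bytes() and the two size constants are made-up names, a heap
pointer stands in for the physical address, and raw bytes stand in for
the FDT calls. It is not a proposed kernel implementation.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define KHO_MAX_FRAGMENTS 16	/* hypothetical limit */
#define KHO_FRAGMENT_SIZE 4096	/* one page per fragment, as suggested */

/* Per-component storage handle (hypothetical layout). */
struct kho_storage {
	const char *compatible;
	uint64_t instance;
	size_t used;
	unsigned char buf[KHO_FRAGMENT_SIZE];
};

/* What the core remembers about each finished fragment. */
struct kho_record {
	const char *compatible;
	uint64_t instance;
	const void *addr;	/* stands in for the physical address */
	size_t len;
};

struct kho_record kho_records[KHO_MAX_FRAGMENTS];
size_t kho_nr_records;

/* Start a standalone fragment; no notifier chain involved. */
struct kho_storage *kho_start_storage(const char *compatible,
				      uint64_t instance)
{
	struct kho_storage *kho = calloc(1, sizeof(*kho));

	if (!kho)
		return NULL;
	kho->compatible = compatible;
	kho->instance = instance;
	return kho;
}

/* Append payload bytes; fails instead of growing past one page. */
int kho_store(struct kho_storage *kho, const void *data, size_t len)
{
	if (len > KHO_FRAGMENT_SIZE - kho->used)
		return -1;
	memcpy(kho->buf + kho->used, data, len);
	kho->used += len;
	return 0;
}

/* Record address/length; the fragment stays allocated for handover. */
int kho_finish_storage(struct kho_storage *kho)
{
	struct kho_record *rec;

	if (kho_nr_records == KHO_MAX_FRAGMENTS)
		return -1;
	rec = &kho_records[kho_nr_records++];
	rec->compatible = kho->compatible;
	rec->instance = kho->instance;
	rec->addr = kho->buf;
	rec->len = kho->used;
	return 0;
}

/* Very last step: walk the records, e.g. to size the top-level index. */
size_t kho_total_bytes(void)
{
	size_t i, total = 0;

	for (i = 0; i < kho_nr_records; i++)
		total += kho_records[i].len;
	return total;
}
```

The point of the shape is that each component can call
kho_start_storage()/kho_finish_storage() whenever it likes, and only the
final walk over kho_records needs to happen at kexec time.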

This way we could do things like fdbox an iommufd and create the above
FDT fragment completely separately from any notifier chain and,
crucially, disconnected from the fdt_create() for the kexec payload.

Further, if you split things like this (it will waste some small
amount of memory) you can probably get to a point where no single FDT
is more than 4k. That looks like it would simplify/robustify a lot of
stuff?

Jason