Re: [RFC] kexec: Use bpf to allow kexec to load PE format boot image

Philipp Rudo <prudo@xxxxxxxxxx> · Wed, 19 Feb 2025 17:24:32 +0100

Hi Pingfan,

sorry for the late reply.

On Thu, 6 Feb 2025 14:03:40 +0800
Pingfan Liu <piliu@xxxxxxxxxx> wrote:

> Hi Philipp,
> 
> Thanks for your feedback. Please see my answers below.
> 
> I'm also reaching out to the BPF maintainers with two concerns: how to
> ensure the integrity of BPF programs and whether introducing some
> additional BPF helpers for the kexec subsystem would be acceptable.
> Those helpers are used to exchange the data between BPF and the kexec
> kernel part.
> 
> 
> On Sat, Feb 1, 2025 at 1:46 AM Philipp Rudo <prudo@xxxxxxxxxx> wrote:
> >
> > Hi Pingfan,
> >
> > thanks for sharing your thoughts. Please see my comments below.
> >
> > On Tue, 14 Jan 2025 09:28:25 +0800
> > Pingfan Liu <piliu@xxxxxxxxxx> wrote:
> >  
> > > UEFI PE bootable images are becoming more and more popular in distributions.
> > > But loading that kind of image via kexec with IMA enabled is still an open issue.
> > >
> > > *** A brief review of the history ***
> > > There are two categories of methods to handle this issue.
> > >   -1. UEFI service emulator for UEFI stub
> > >   -2. PE format parser
> > >
> > > For the first one, I have tried a purgatory-style emulator [1]. But it
> > > confronts the hardware scaling trouble.  For the second one, there are two
> > > choices, one is to implement it inside the kernel, the other is inside the user
> > > space.  Both zboot-format [2] and UKI-format [3] parsers are rejected due to
> > > the concern that the variant format parsers will inflate the kernel code.  And
> > > finally, we have these kinds of parsers in the user space 'kexec-tools'.
> > >
> > >
> > > From the beginning, it has been perceived that the user space parser can not
> > > satisfy the requirement of secure boot without an extra embedded signature.
> > > This issue was suspended at that time.
> > >
> > > But now, more and more users expect the security feature and want the
> > > kexec_file_load to guarantee it by IMA.  I tried to fix that issue by the extra
> > > embedded signature method. But it is also disliked.
> > >
> > > Inspired by Philipp's suggestion in the discussion of [1] to implement systemd-stub in bpf opcode,
> > > I turned to bpf and hope that parsers in a bpf-program can resolve this issue.
> > >
> > > [1]: https://lore.kernel.org/lkml/20240819145417.23367-1-piliu@xxxxxxxxxx/T/
> > > [2]: https://lore.kernel.org/kexec/20230306030305.15595-1-kernelfans@xxxxxxxxx/
> > > [3]: https://lore.kernel.org/lkml/20230911052535.335770-1-kernel@xxxxxxxx/
> > > [4]: https://lore.kernel.org/linux-arm-kernel/20230921133703.39042-2-kernelfans@xxxxxxxxx/T/
> > >
> > >
> > >
> > >
> > > *** Reflect the problem and a new proposal ***
> > >
> > > The UEFI emulator is anchored at the UEFI spec. That will incur lots of work
> > > due to various hardware support.  For example, to support TPM, the emulator
> > > should implement PCI/I2C bus protocol.
> > >
> > > But if the problem is confined to the original linux kernel boot protocol, it will be simple.
> > > Only three things should be considered: the kernel image, the initrd and the command line.
> > > If we can get them in a secure way, we can tackle the problem.
> > >
> > > The integrity of the file is ensured under the protection of the signature
> > > envelope.  If the kexeced files are parsed in the user space, the envelopes are
> > > opened and invalid.  So they should sink into the kernel space, be verified and
> > > be manipulated there.  And to manipulate files of various formats, we need
> > > a bpf-program, which knows their format.
> > >
> > > There are three parties in this solution
> > > -1. The kexec-tools itself is protected by IMA, and it creates a bpf-map and
> > > updates UKI addon file names into the map. Later, the bpf-program will call
> > > a bpf-helper to pad these files into initrd
> > >
> > > -2. The bpf-program is contained in a dedicated '.bpf' section in the PE file. When
> > > kexec_file_load loads a PE image, it extracts the '.bpf' section and reflects it to
> > > user space through procfs. And kexec-tools starts the program.  This way,
> > > the bpf-program itself is free from tampering.
> > >
> > > The bpf-program completes two things:
> > >       -1.parse the image format
> > >       -2.call bpf kexec helpers to manipulate signed files
> > >
> > > -3. The bpf helpers. There will be three helpers introduced.
> > > The first one for the data exchange between the bpf-program and the kernel.
> > > The second one for the decompressor.
> > > The third one for the manipulation of the cpio  
> >
> > I find this design very complicated. Especially I don't like that the
> > bpf program is exported back to user space to be loaded separately.  
> 
> I wish I could avoid the complexity, but unfortunately, I couldn't.
> I'll explain everything later, all at once.
> 
> 
> > This not only requires us to protect kexec-tools by IMA but also
> > all the tools and libraries that are involved in running kexec-tools  
> 
> Yes, that is true. But this may not be so important since all files,
> which are passed in by kexec-tools are protected by signature.

True, but that only means that the files are protected when read from
disk. But the bpf program will change/patch those files. Take zBoot for
example. The bpf program will parse the image, decompress it, and pass
the decompressed image back to kexec. For kexec there is no way to
verify if the decompressed image is genuine. So a malicious bpf program
could inject code into the image without kexec noticing. That's why we
need to make sure that the bpf program handling the image is actually
the one that is parsed from the image.
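
To illustrate what "the one that is parsed from the image" would mean
if the program did take a detour through user space, here is a minimal
sketch in plain user-space C. The function name and the idea of
comparing the raw bytes of the '.bpf' section are assumptions made for
illustration only; nothing like this exists in the kernel or in the
RFC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical check: does the program that is about to be loaded match
 * the bytes of the '.bpf' section taken from the already verified image?
 * Returns 1 on match, 0 otherwise.
 */
int bpf_prog_matches_section(const uint8_t *section, size_t section_len,
                             const uint8_t *prog, size_t prog_len)
{
        uint8_t diff = 0;
        size_t i;

        assert(section && prog);

        if (section_len != prog_len)
                return 0;

        /* Accumulate differences instead of returning at the first mismatch. */
        for (i = 0; i < section_len; i++)
                diff |= (uint8_t)(section[i] ^ prog[i]);

        return diff == 0;
}
```

A check like this only works if the kernel keeps its own copy of the
section, at which point the round trip through user space no longer
buys anything.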

> > (libc, ld, etc.). But even that will probably not be enough when you
> > look at all the different ways user space programs can interact with
> > each other and change each others behavior (see the xz-backdoor for  
> 
> IIUC, xz-backdoor is not a proper example, which is tampered at the
> source code level.

I disagree, IMHO the xz-backdoor is a good example here. But I should
have explained it better. You are right, that the malicious code was
added to the xz sources. But xz was only the vehicle to ship the
malicious code. The actual target of it was ssh. What makes it worse is
that ssh has no dependency on xz. The attack only worked because
some distributions patch sshd to link against libsystemd, which in
turn pulls in liblzma from xz. Similar attacks could also target
kexec-tools and inject a malicious bpf program in your design.

> > example). So we would probably need to protect all of user space
> > if we want to use this design in a secure boot environment, which is
> > out of scope for the feature.
> >
> > Alternatively, we would need to verify in the kernel that the bpf
> > program loaded by the kexec-tools is identical to the one included in
> > the kernel image. But then what's the point in exporting it in the
> > first place? Especially as already today there is the
> > kernel/bpf/syscall.c:kern_sys_bpf function that allows to run the
> > bpf syscall from within the kernel (with some limitations, but it
> > allows to load a program).
> >  
> 
> Here, let me explain why I cannot avoid the complexity and how this
> design ensures the integrity of the BPF program:
> 
> I'm not entirely sure, but the function kern_sys_bpf(int cmd, union
> bpf_attr *attr, unsigned int size) requires a bpf_attr, which is
> typically prepared by libbpf. If the BPF program is invoked directly
> from the kernel, there must be some code to handle what libbpf
> normally does. That would be quite complex.

Yes, that's unfortunately true. But we will only need to support a
small subset of the functionality from libbpf. So the complexity should
(hopefully) be manageable. At least when you compare it with the
complexity that securing the user space kexec-tools would involve.

> On the other hand, the built-in BPF program is protected by the
> signature of the UKI. When it is loaded by kexec-tools through the BPF
> API, it still undergoes integrity checking (refer to
> kernel/bpf/core.c: bpf_prog_calc_tag()). Based on this, I argue that
> the IMA on kexec-tools is not as critical.

Yes, the contained bpf program is protected. But as explained above the
problem is that we cannot be sure that the bpf program that is loaded
by kexec-tools is the one contained in the image. And the bpf verifier
won't help here either as it only guarantees that the program doesn't
harm the currently running kernel. But it still could harm the kernel
that is being loaded...

In addition I don't think that the prog->tag will help here. IIUC it is
used to give each program a unique id but not to protect against
loading non-desired programs. And I don't think it would make sense to
use it this way. Especially as the sha1 hash used for the tag is no
longer considered secure.

> > All in all I think it is better to keep the current design, i.e.
> > kexec-tools only makes one system call and the rest is done in the
> > kernel.
> >  
> As explained above, the main obstruction is to implement the libbpf
> logic inside the kernel.

I can fully understand that. But as described above I believe it is the
easier and better solution. Especially when we only implement the
subset that is needed for kexec to work.

> > In addition, while I agree that ideally we include the new feature
> > into kexec_file_load, I think it's better to define a new system call
> > for images containing bpf in the beginning. With that we have a blank
> > slate we can mess with without the need to take care of keeping the old
> > code working. Plus, it leaves us a fallback to load a dump kernel when
> > we mess up. Once we have a working prototype and a better understanding
> > on what is needed we can still merge it back into kexec_file_load.
> >  
> 
> A good suggestion, I will try it.
> 
> > > ***  Overview of the design in Pseudocode ***
> > >
> > >
> > > ThreadA: kexec thread which invokes kexec_file_load
> > > ThreadB: the dedicated thread in kexec-tools to load bpf-prog
> > > ------
> > > Diag 1. the interaction between bpf-prog loader and its executor
> > >
> > >
> > > ThreadA                                               ThreadB
> > >
> > >                                               wait on eventfd_A
> > >
> > >
> > > expose bpf-prog through procfs
> > > & signal eventfd_A
> > > & wait on eventfd_B
> > >
> > >                                               read the bpf-prog from procfs
> > >                                               & initialize the bpf and install it to the fentry
> > >                                               & signal eventfd_B
> > >                                               & wait on eventfd_A again
> > >
> > > fentry executes bpf-prog to parse image
> > > & generate output for the next step
> > >
> > >
> > > -------------------
> > > Diag 2. bpf-prog
> > >
> > > SEC("fentry/kexec_pe_parser_hook")
> > > int BPF_PROG(pe_parser, struct kimage *image, ...)
> > > {
> > >
> > >       buf = bpf_ringbuf_reserve(rb, size);
> > >       buf_result = bpf_ringbuf_reserve(rb, res_sz);
> > >       /* Ask kernel to copy the resource content to here */
> > >       bpf_helper_carrier(resource_name, buf, size, in);
> > >
> > >       /* Parse the format laying on buf */
> > >       ...
> > >       /* call extra bpf-helpers */
> > >       ...
> > >
> > >       /* Ask kernel to copy the resource content from here */
> > >       bpf_helper_carrier(resource_name, buf_result, res_sz, out);
> > >
> > > }
> > >
> > > At present, bpf map functions provide the mechanism to exchange data between user space and the bpf-prog.
> > > But for bpf-prog and the kernel, there is no good choice. So I introduce a bpf helper function
> > >       bpf_helper_carrier(resource_name, buf, size, in)
> > >
> > > The above code implements the data exchange between the kernel and bpf-prog.
> > > By this way, the data parsing process is not exposed to the user space any longer.
> > >
> > >
> > >
> > > extra bpf-helpers:
> > >
> > >       /* Decompress the compressed kernel image */
> > >       bpf_helper_decompress(src, src_size, dst, dst_sz)
> > >
> > >       /*
> > >        * Verify the signature of @addon_filename, padding it to initrd's dir @dst_dir
> > >        */
> > >       bpf_helper_supplement_initrd(dst_dir, addon_filename)  
> >
> > UKI addons can also append entries to the kernel command line. IMHO it
> > will be easiest when we maintain the initrd and command line in the
> > kernel, i.e. the syscall "prepopulates" the initrd and cmdline either
> > from the UKI or what kexec-tools provides. The bpf program then only
> > updates them. That's not ideal but it keeps the bpf program simple in
> > the beginning so we (hopefully) don't run into the limitations bpf
> > programs have. Once we have a working prototype we can still move
> > functionality over to the bpf program.
> >  
> 
> Sorry, but I'm not sure whether I understand you clearly. Are you
> suggesting to pass UKI add-ons through a new kexec syscall API?

Yes, that's my idea. IMHO the "ideal" design should look something like
this

1. user space prepares the required data (e.g. kernel command line,
   which initrd/addons to use etc.) into a private data structure, e.g.
   a bpf_map.
2. user space calls kexec and passes the file descriptor for the image
   and the "private data".
3. kexec verifies the image and parses the bpf program from it
4. kexec loads and runs the bpf program passing the private data to it.
5. the bpf program does its job and returns the prepared image, initrd,
   and kernel cmdline back to kexec.
6. kexec then simply continues with what it already does today.

What I like about this design is that the kernel doesn't need to
know the exact structure of the "private data". It's something user
space and the stub need to agree on and thus can be different for each
image/stub. That gives a lot of flexibility for future changes.
In a way it's a step back to the legacy kexec_load. With the difference
that the image isn't prepared in user space but by a bpf program that
is protected by the image signature and thus works with secureboot.
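
The flow above can be sketched in plain user-space C. Everything in it
is an assumption made for illustration: the struct and function names
are hypothetical, a function pointer stands in for the bpf program
parsed from the image, and a flag stands in for the signature check.
The point it tries to show is that the kernel side only forwards the
"private data" and never interprets it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Everything below is hypothetical; none of these names exist in the kernel. */

struct kexec_result {
        const char *kernel;     /* prepared kernel image */
        const char *initrd;     /* prepared initrd */
        char cmdline[128];      /* final kernel command line */
};

/* Stand-in for the bpf program embedded in the image (the "stub"). */
typedef int (*stub_fn)(const void *priv, size_t priv_len,
                       struct kexec_result *out);

struct image {
        int signature_ok;       /* result of the signature check (step 3) */
        stub_fn stub;           /* program parsed from the image (step 3) */
};

/* Steps 3 to 6: the kernel never interprets priv, it only forwards it. */
int kexec_bpf_load(const struct image *img, const void *priv, size_t priv_len,
                   struct kexec_result *out)
{
        assert(img && out);

        if (!img->signature_ok || !img->stub)
                return -1;      /* refuse unverified images */

        memset(out, 0, sizeof(*out));
        if (img->stub(priv, priv_len, out))
                return -1;      /* stub failed */

        /* step 6: continue with the existing kexec code path */
        return (out->kernel && out->initrd) ? 0 : -1;
}

/* Example stub: here the private data is just a command line string. */
int example_stub(const void *priv, size_t priv_len, struct kexec_result *out)
{
        if (!priv || priv_len == 0 || priv_len > sizeof(out->cmdline))
                return -1;
        /* the format of priv is a contract between user space and this stub */
        if (((const char *)priv)[priv_len - 1] != '\0')
                return -1;
        memcpy(out->cmdline, priv, priv_len);
        out->kernel = "vmlinux";
        out->initrd = "initrd";
        return 0;
}
```

A different image could ship a stub that expects a completely different
layout for the private data without any change to kexec_bpf_load().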

The downside is that the bpf program would be quite complex. That's why
I believe it's better to start with an approach where kexec implements a
big portion of the needed functionality. Once that is working we can
move the functionality over to the bpf program piece by piece and see
if/where we run into problems with the limitations of bpf programs.
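
As an example of the kind of functionality kexec could implement
itself in the beginning, here is a minimal sketch of a command line
append operation in plain user-space C. The name, the signature and the
choice of -E2BIG as error code are assumptions for illustration, not a
proposed helper API.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: append @extra to the NUL-terminated command line in
 * @cmdline (capacity @cap bytes), separated by a single space.
 * Returns 0 on success, -E2BIG if the result would not fit.
 */
int kexec_cmdline_append(char *cmdline, size_t cap, const char *extra)
{
        size_t cur, add, sep;

        assert(cmdline && extra && cap > 0);

        cur = strlen(cmdline);
        add = strlen(extra);
        sep = (cur > 0 && add > 0) ? 1 : 0;

        /* +1 for the terminating NUL */
        if (cur + sep + add + 1 > cap)
                return -E2BIG;

        if (sep)
                cmdline[cur++] = ' ';
        memcpy(cmdline + cur, extra, add + 1);
        return 0;
}
```

With the buffer and the bounds check on the kexec side, the bpf program
only has to decide what to append, not how to manage the memory.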

> > The way I see it this should work with three helper functions.
> > One to read+verify a file and one each to append data to the initrd or
> > command line.
> >  
> Yes, there should be another helper to manipulate kernel cmdline addons.
> 
> > >       Note: Due to the UEFI environment (such as edk2) only providing basic
> > >         file operations for FAT filesystems, any UEFI-stub PE image (like systemd-stub)
> > >         is restricted to these basic operation services.  As a result, the
> > >         functionality of such bpf-kexec helpers is inherently limited.  
> >
> > Is this limitation really necessary? The way I see it this is a
> > limitation to keep the UEFI environment simple. But when we run kexec
> > the kernel is fully booted. So we can make use of all the file systems
> > included in the kernel.
> >  
> Yes, as for the capability, we can utilize all file systems. My main
> concern is the stability of the BPF helpers. I believe people are
> reluctant to update them frequently.

Yeah, bpf-helpers need to be stable. But is there a reason not to add a
bpf-helper that operates on file descriptors? kexec_file_load uses them
and is part of the uapi. So it has the same (if not higher)
requirement for stability.
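
For reference, kexec_file_load already takes a kernel_fd and an
initrd_fd. A helper that works on file descriptors could look roughly
like the user-space sketch below. The name and the return convention
are assumptions for illustration, and the signature verification that a
real helper would need is left out entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical fd-based helper: read the file behind @fd into @buf.
 * Returns the number of bytes read, or -1 on error or if the file does not
 * fit into @cap bytes.
 */
ssize_t kexec_read_fd(int fd, void *buf, size_t cap)
{
        size_t off = 0;
        char extra;
        ssize_t n;

        assert(buf);

        while (off < cap) {
                n = read(fd, (char *)buf + off, cap - off);
                if (n < 0) {
                        if (errno == EINTR)
                                continue;       /* interrupted, try again */
                        return -1;
                }
                if (n == 0)
                        return (ssize_t)off;    /* end of file */
                off += (size_t)n;
        }

        /* Buffer is full: make sure nothing is left, retrying on EINTR. */
        do {
                n = read(fd, &extra, 1);
        } while (n < 0 && errno == EINTR);
        return n == 0 ? (ssize_t)off : -1;
}
```

The caller decides which file is opened and how; the helper never sees
a path, so it does not depend on any particular file system.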

Thanks
Philipp

> Thanks,
> 
> Pingfan
> 
> > Thanks
> > Philipp
> >  
> > > *** Thoughts about the basic operation ***
> > >
> > > The basic operations have influence on the stability of bpf-kexec-helpers.
> > >
> > > The kexec_file_load faces three kinds of elements: linux-kernel, initrd and cmdline.
> > >
> > > For the kernel, on arm64 or riscv, in order to get the bootable image from the compressed data,
> > > there should be a bpf-helper function as a wrapper of __decompress()
> > >
> > > For initrd, systemd-sysext may require padding extra file into initrd
> > >
> > > For cmdline, it may require some string trim or conjoin.
> > >
> > > Overall, these user requirements are foreseeable and straightforward,
> > > suggesting that bpf-kexec-helpers will likely remain stable without significant
> > > changes.
> > >
> > >
> > > Cc: Alexei Starovoitov <ast@xxxxxxxxxx>
> > > Cc: Daniel Borkmann <daniel@xxxxxxxxxxxxx>
> > > Cc: John Fastabend <john.fastabend@xxxxxxxxx>
> > > Cc: Jeremy Linton <jeremy.linton@xxxxxxx>
> > > Cc: Catalin Marinas <catalin.marinas@xxxxxxx>
> > > Cc: Will Deacon <will@xxxxxxxxxx>
> > > Cc: Mark Rutland <mark.rutland@xxxxxxx>
> > > Cc: Simon Horman <horms@xxxxxxxxxx>
> > > Cc: Gerd Hoffmann <kraxel@xxxxxxxxxx>
> > > Cc: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>
> > > Cc: Philipp Rudo <prudo@xxxxxxxxxx>
> > > Cc: Jan Hendrik Farr <kernel@xxxxxxxx>
> > > Cc: Baoquan He <bhe@xxxxxxxxxx>
> > > Cc: Dave Young <dyoung@xxxxxxxxxx>
> > > Cc: Eric Biederman <ebiederm@xxxxxxxxxxxx>
> > > Cc: Pingfan Liu <piliu@xxxxxxxxxx>
> > > To: kexec@xxxxxxxxxxxxxxxxxxx
> > > To: bpf@xxxxxxxxxxxxxxx
> > >  
> >  
>