Re: [RFC PATCH bpf-next RESEND 00/16] bpf: Checkpoint/Restore In eBPF (CRIB)

Juntong Deng <juntong.deng@xxxxxxxxxxx> · Wed, 31 Jul 2024 14:09:02 +0100

On 2024/7/23 00:47, Alexei Starovoitov wrote:
On Thu, Jul 11, 2024 at 12:10:17PM +0100, Juntong Deng wrote:

In restore_udp_socket I had to add a struct bpf_crib_skb_info for
restoring packets, this is because there is currently no BPF_CORE_WRITE.

I am not sure what the current attitude of the kernel community
towards BPF_CORE_WRITE is, personally I think it is well worth adding,
as we need a portable way to change the value in the kernel.

This not only allows more complexity in the CRIB restoring part to
be transferred from CRIB kfuncs to CRIB ebpf programs, but also allows
ebpf to unlock more possible application scenarios.

There are lots of interesting ideas in this patch set, but it seems they are
doing the 'C-checkpoint' part of CRIx and something like BPF_CORE_WRITE
is necessary for 'R-restore'.
I'm afraid BPF_CORE_WRITE cannot be introduced without breaking all safety nets.
It will make bpf just as unsafe as any kernel module if bpf progs can start
writing into arbitrary kernel data structures. So it's a show stopper.
If you think there is a value in adding all these iterators for 'checkpoint'
part alone we can discuss and generalize individual patches.

Thanks for your review!

I agree, BPF_CORE_WRITE will compromise the safety of ebpf programs,
which may be a Pandora's box.

But without BPF_CORE_WRITE, CRIB cannot achieve portable restoration,
so the restoration part is put on hold for now.

In the next version of the patch set, I will focus on implementing
checkpointing (dumping) via CRIB for better dumping performance and more
extensibility (which still has a lot of benefits).

High level feedback:

- no need for BPF_PROG_TYPE_CRIB program type. Existing syscall type should fit.

- If we use BPF_PROG_TYPE_SYSCALL for CRIB programs, will it cause
confusion in the functionality of bpf program types?
(BPF_PROG_TYPE_SYSCALL was originally designed to execute syscalls)

- Is it good to expose all kfuncs needed for checkpointing to
BPF_PROG_TYPE_SYSCALL? (Maybe we need a separate ebpf program type to
restrict the kfuncs that can be used)

- Maybe CRIB needs more specific ebpf program running restrictions?
(for example, not allowed to modify the context, dumped data can only
be returned via ringbuf, context is only used to identify the process
that needs to dump and the part of the data that needs to be dumped)

The above three points were my considerations when I originally added
BPF_PROG_TYPE_CRIB, maybe we can have more discussion?

- proposed file/socket iterators are somewhat unnecessary in this open coded form.
   there is already file/socket iterator. From the selftests it looks like it
   can be used to do 'checkpoint' part already.

If you mean iterators like iter/task_file, iter/tcp, etc., then I think
that is not flexible enough for checkpointing.

This is because the context of bpf iterators is fixed and bpf iterators
cannot be nested. This means that a bpf iterator program can only
complete a specific small iterative dump task, and cannot dump
multi-level data.

An example, when we need to dump all the sockets of a process, we need
to iterate over all the files (sockets) of the process, and iterate over
the all packets in the queue of each socket, and iterate over all data
in each packet.

If we use bpf iterator, since the iterator can not be nested, we need to
use socket iterator program to get all the basic information of all
sockets (pass pid as filter), and then use packet iterator program to
get the basic information of all packets of a specific socket (pass pid,
fd as filter), and then use packet data iterator program to get all the
data of a specific packet (pass pid, fd, packet index as filter).

This would be complicated and require a lot of (each iteration)
bpf program startup and exit (leading to poor performance).

By comparison, open coded iterator is much more flexible, we can iterate
in any context, at any time, and iteration can be nested, so we can
achieve more flexible and more elegant dumping through open coded
iterators.

With open coded iterators, all of the above can be done in a single
bpf program, and with nested iterators, everything becomes compact
and simple.

Also, bpf iterators transmit data to user space through seq_file,
which involves a lot of open (bpf_iter_create), read, close syscalls,
context switching, memory copying, and cannot achieve the performance
of using ringbuf.

The bpf iterator is more like an advanced procfs, but still not CRIB.

- KF_ITER_GETTER is a good addition, but we should be able to do it without these flags.
   kfunc-s should be able to accept iterator as an argument. Some __suffix annotation
   may be necessary to help verifier if BTF type alone of the argument won't be enough.

I agree, kfuncs can accept iterators as arguments and we can
use __suffix.

But here is a question, should we consider these kfuncs as iter kfuncs?

That is, should we impose specific constraints on these functions?
For example, specific naming patterns (bpf_iter_<type>_ prefix),
GETTER methods cannot take extra arguments (like next methods), etc.

Currently the verifier applies these constraints based on flags.

- KF_OBTAIN looks like a broken hammer to bypass safety. Like:

   > Currently we cannot pass the pointer returned by the iterator next
   > method as argument to the KF_TRUSTED_ARGS kfuncs, because the pointer
   > returned by the iterator next method is not "valid".

   It's true, but should be fixable directly. Make return pointer of iter_next() to be trusted.

I agree that KF_OBTAIN currently is not a good solution.

For case 1, I tried the ref_obj_id method mentioned by Kumar and
it worked, solving the ownership and lifetime problems. I will include
it in the next version of the patch.

For case 2, Kumar mentioned that it had been fixed by Matt, but I found
there are still some problems.

More details can be found in my reply to Kumar (in the same email thread)

For iter_next(), I currently have an idea to add new flags to allow
iter_next() to decide whether the return value is trusted or not,
such as KF_RET_TRUSTED.

What do you think?

Also, for these improvements to the chain of trust, do you think I
should send them out as separate patches? (rather than as part of
the CRIB patch set)

- iterators for skb data don't feel right. bpf_dynptr_from_skb() should do the trick already.

I agree that using bpf_dynptr would be better, but probably would
not change much...

This is because, we cannot guarantee that the user provided a large
enough buffer, the buffer provided by the user may not be able to hold
all the data of the packet (but we need to dump the whole packet, the
packet may be very large, which is different from the case of reading
only a fixed size protocol header for filtering), which means we need to
read the data in batches (iteratively), so we still need an iterator.

(Back to the BPF_PROG_TYPE_CRIB discussion, BPF_PROG_TYPE_SYSCALL cannot
use bpf_dynptr_from_skb, but should we expose bpf_dynptr_from_skb to
BPF_PROG_TYPE_SYSCALL? Maybe we need a separate program type...)

- start with a small patch set.
   30 files changed, 3080 insertions(+), 12 deletions(-)
   isn't really reviewable.

Sorry, I will reduce the size of the patch set in the next version.

I will remove the proof of concept part, keep only the real tests,
and start trying to integrate CRIB with CRIU.