Re: [RFC PATCH bpf-next RESEND 00/16] bpf: Checkpoint/Restore In eBPF (CRIB)

Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> · Wed, 31 Jul 2024 18:16:12 +0200

On Wed, 31 Jul 2024 at 15:10, Juntong Deng <juntong.deng@xxxxxxxxxxx> wrote:
>
> On 2024/7/23 00:47, Alexei Starovoitov wrote:
> > On Thu, Jul 11, 2024 at 12:10:17PM +0100, Juntong Deng wrote:
> >>
> >> In restore_udp_socket I had to add a struct bpf_crib_skb_info for
> >> restoring packets, this is because there is currently no BPF_CORE_WRITE.
> >>
> >> I am not sure what the current attitude of the kernel community
> >> towards BPF_CORE_WRITE is, personally I think it is well worth adding,
> >> as we need a portable way to change the value in the kernel.
> >>
> >> This not only allows more complexity in the CRIB restoring part to
> >> be transferred from CRIB kfuncs to CRIB ebpf programs, but also allows
> >> ebpf to unlock more possible application scenarios.
> >
> > There are lots of interesting ideas in this patch set, but it seems they are
> > doing the 'C-checkpoint' part of CRIx and something like BPF_CORE_WRITE
> > is necessary for 'R-restore'.
> > I'm afraid BPF_CORE_WRITE cannot be introduced without breaking all safety nets.
> > It will make bpf just as unsafe as any kernel module if bpf progs can start
> > writing into arbitrary kernel data structures. So it's a show stopper.
> > If you think there is a value in adding all these iterators for 'checkpoint'
> > part alone we can discuss and generalize individual patches.
> >
>
> Thanks for your review!
>
> I agree, BPF_CORE_WRITE will compromise the safety of ebpf programs,
> which may be a Pandora's box.
>
> But without BPF_CORE_WRITE, CRIB cannot achieve portable restoration,
> so the restoration part is put on hold for now.

I think then you should rethink how to do restoration in CRIB.
It is better to anticipate some solution, even if you don't implement
it right away.
It will be necessary for an end-to-end solution.

I think BPF_CORE_WRITE will be a non-starter.
The best that can be done is teaching the kernel specific fields which
are writable, but that doesn't scale.

The conventional method with CRIU was to replicate the same steps that
led to a certain state for a kernel object.
The way you propose is more akin to direct serialization/deserialization.

It's better to rely on restoration helpers if it's necessary.

>
> In the next version of the patch set, I will focus on implementing
> checkpointing (dumping) via CRIB for better dumping performance and more
> extensibility (which still has a lot of benefits).
>
> > High level feedback:
> >
> > - no need for BPF_PROG_TYPE_CRIB program type. Existing syscall type should fit.
> >
>
> - If we use BPF_PROG_TYPE_SYSCALL for CRIB programs, will it cause
> confusion in the functionality of bpf program types?
> (BPF_PROG_TYPE_SYSCALL was originally designed to execute syscalls)
>
> - Is it good to expose all kfuncs needed for checkpointing to
> BPF_PROG_TYPE_SYSCALL? (Maybe we need a separate ebpf program type to
> restrict the kfuncs that can be used)
>

I think it's become a more generic program type to invoke kfuncs from
task context.
E.g. the upcoming sched-ext uses it to invoke kfuncs to gracefully
exit a scheduler etc.
We didn't have a separate BPF_PROG_TYPE_SCX_TASK_CTX.

Also, you should not think about the kfuncs as being specific to
checkpointing or CRIB.
Try to keep them generic so they could be useful beyond use cases that
you have thought of right now.
Place reasonable constraints that are limited to enforcing safe use
from BPF programs.
Just as an illustration, all of the existing BPF infra/iterators you
used to implement CRIB were never designed with checkpointing in mind.

> - Maybe CRIB needs more specific ebpf program running restrictions?
> (for example, not allowed to modify the context, dumped data can only
> be returned via ringbuf, context is only used to identify the process
> that needs to dump and the part of the data that needs to be dumped)
>

Why would you want to do that?
Maybe someone needs a different way or comes up with a better way of
communicating with userspace?

There are different ways of doing checkpointing.
Some checkpointing systems propose ideas of keeping histories of a
process's state.
This allows possibly reverting process state to some point in the past
(as a way to rollback execution).
There, when you checkpoint stuff, you would store all state as part of
your BPF program in maps or arenas, and may not even send it to user
space.
Then it can be finally dumped when necessary to a user space agent, or
discarded.

> The above three points were my considerations when I originally added
> BPF_PROG_TYPE_CRIB, maybe we can have more discussion?
>

I think it feels unnecessary. So far I don't see a strong reason.
Adding a new program type means remembering everywhere in the verifier
how it needs to be handled.
Or whether it requires a special consideration for some check (esp.
since you want to restrict a lot of stuff).
It's better to decompose CRIB into individual useful parts that are
useful beyond checkpointing.
It's better to think in terms of providing useful basic building
blocks, how people use them shouldn't strongly be dictated by us.

> > - proposed file/socket iterators are somewhat unnecessary in this open coded form.
> >    there is already file/socket iterator. From the selftests it looks like it
> >    can be used to do 'checkpoint' part already.
> >
>
> If you mean iterators like iter/task_file, iter/tcp, etc., then I think
> that is not flexible enough for checkpointing.
>
> This is because the context of bpf iterators is fixed and bpf iterators
> cannot be nested. This means that a bpf iterator program can only
> complete a specific small iterative dump task, and cannot dump
> multi-level data.
>
> An example, when we need to dump all the sockets of a process, we need
> to iterate over all the files (sockets) of the process, and iterate over
> the all packets in the queue of each socket, and iterate over all data
> in each packet.
>
> If we use bpf iterator, since the iterator can not be nested, we need to
> use socket iterator program to get all the basic information of all
> sockets (pass pid as filter), and then use packet iterator program to
> get the basic information of all packets of a specific socket (pass pid,
> fd as filter), and then use packet data iterator program to get all the
> data of a specific packet (pass pid, fd, packet index as filter).
>
> This would be complicated and require a lot of (each iteration)
> bpf program startup and exit (leading to poor performance).
>
> By comparison, open coded iterator is much more flexible, we can iterate
> in any context, at any time, and iteration can be nested, so we can
> achieve more flexible and more elegant dumping through open coded
> iterators.
>
> With open coded iterators, all of the above can be done in a single
> bpf program, and with nested iterators, everything becomes compact
> and simple.

This does make sense to me, but I'd still try to limit what you
introduce to only what you need initially, for the next version, and
let's go from there.

>
> Also, bpf iterators transmit data to user space through seq_file,
> which involves a lot of open (bpf_iter_create), read, close syscalls,
> context switching, memory copying, and cannot achieve the performance
> of using ringbuf.
>
> The bpf iterator is more like an advanced procfs, but still not CRIB.
>
> > - KF_ITER_GETTER is a good addition, but we should be able to do it without these flags.
> >    kfunc-s should be able to accept iterator as an argument. Some __suffix annotation
> >    may be necessary to help verifier if BTF type alone of the argument won't be enough.
> >
>
> I agree, kfuncs can accept iterators as arguments and we can
> use __suffix.
>
> But here is a question, should we consider these kfuncs as iter kfuncs?
>
> That is, should we impose specific constraints on these functions?
> For example, specific naming patterns (bpf_iter_<type>_ prefix),
> GETTER methods cannot take extra arguments (like next methods), etc.
>
> Currently the verifier applies these constraints based on flags.
>
> > - KF_OBTAIN looks like a broken hammer to bypass safety. Like:
> >
> >    > Currently we cannot pass the pointer returned by the iterator next
> >    > method as argument to the KF_TRUSTED_ARGS kfuncs, because the pointer
> >    > returned by the iterator next method is not "valid".
> >
> >    It's true, but should be fixable directly. Make return pointer of iter_next() to be trusted.
> >
>
> I agree that KF_OBTAIN currently is not a good solution.
>
> For case 1, I tried the ref_obj_id method mentioned by Kumar and
> it worked, solving the ownership and lifetime problems. I will include
> it in the next version of the patch.
>
> For case 2, Kumar mentioned that it had been fixed by Matt, but I found
> there are still some problems.
>
> More details can be found in my reply to Kumar (in the same email thread)
>
> For iter_next(), I currently have an idea to add new flags to allow
> iter_next() to decide whether the return value is trusted or not,
> such as KF_RET_TRUSTED.
>
> What do you think?

Why shouldn't the return value always be trusted?
We eventually want to switch over to trusted by default everywhere.
It would be nice not to go further in the opposite direction (i.e.
having to manually annotate trusted) if we can avoid it.

>
> Also, for these improvements to the chain of trust, do you think I
> should send them out as separate patches? (rather than as part of
> the CRIB patch set)
>
> > - iterators for skb data don't feel right. bpf_dynptr_from_skb() should do the trick already.
> >
>
> I agree that using bpf_dynptr would be better, but probably would
> not change much...
>
> This is because, we cannot guarantee that the user provided a large
> enough buffer, the buffer provided by the user may not be able to hold
> all the data of the packet (but we need to dump the whole packet, the
> packet may be very large, which is different from the case of reading
> only a fixed size protocol header for filtering), which means we need to
> read the data in batches (iteratively), so we still need an iterator.
>
> (Back to the BPF_PROG_TYPE_CRIB discussion, BPF_PROG_TYPE_SYSCALL cannot
> use bpf_dynptr_from_skb, but should we expose bpf_dynptr_from_skb to
> BPF_PROG_TYPE_SYSCALL? Maybe we need a separate program type...)

It can be exposed, we'd just have to ensure syscall programs get
access to an skb that can be passed into the kfunc that is
well-formed.
Someone who wants to call that kfunc on an skb will instead use
prog_type_crib from user space, if we separate them. So it adds no
value.