Re: [RFC PATCH bpf-next 05/16] bpf: create file or anonymous dumpers

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





On 4/15/20 6:48 PM, Andrii Nakryiko wrote:
On Wed, Apr 15, 2020 at 9:46 AM Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:

On Tue, Apr 14, 2020 at 09:45:08PM -0700, Andrii Nakryiko wrote:

FD is closed, dumper program is detached and dumper is destroyed
(unless pinned in bpffs, just like with any other bpf_link.
3. At this point bpf_dumper_link can be treated like a factory of
seq_files. We can add a new BPF_DUMPER_OPEN_FILE (all names are for
illustration purposes) command, that accepts dumper link FD and
returns a new seq_file FD, which can be read() normally (or, e.g.,
cat'ed from shell).

In this case, link_query may not be accurate if a bpf_dumper_link
is created but no corresponding bpf_dumper_open_file. What we really
need to iterate through all dumper seq_file FDs.

If the goal is to iterate all the open seq_files (i.e., bpfdump active
sessions), then bpf_link is clearly not the right approach. But I
thought we are talking about iterating all the bpfdump programs
attachments, not **sessions**, in which case bpf_link is exactly the
right approach.

That's an important point. What is the pinned /sys/kernel/bpfdump/tasks/foo ?

Assuming it's not a rhetorical question, foo is a pinned bpf_dumper
link (in my interpretation of all this).

Every time 'cat' opens it a new seq_file is created with new FD, right ?

yes

Reading of that file can take infinite amount of time, since 'cat' can be
paused in the middle.

yep, correct (though most use case probably going to be very short-lived)

I think we're dealing with several different kinds of objects here.
1. "template" of seq_file that is seen with 'ls' in /sys/kernel/bpfdump/

Let's clarify here again, because this can be interpreted differently.

Are you talking about, e.g., /sys/fs/bpfdump/task directory that
defines what class of items should be iterated? Or you are talking
about named dumper: /sys/fs/bpfdump/task/my_dumper?

If the former, I agree that it's not a link. If the latter, then
that's what we've been so far calling "a named bpfdumper". Which is
what I argue is a link, pinned in bpfdumpfs (*not bpffs*).

UPD: reading further, seems like it's some third interpretation, so
please clarify.

2. given instance of seq_file after "template" was open

Right, corresponding to "bpfdump session" (has its own unique session_id).

3. bpfdumper program

Yep, BPF_PROG_LOAD returns FD to verified bpfdumper program.

4. and now links. One bpf_link from seq_file template to bpf prog and

So I guess "seq_file template" is /sys/kernel/bpfdump/tasks direntry
itself, which has to be specified as FD during BPF_PROG_LOAD, is that
right? If yes, I agree, "seq_file template" + attached bpf_prog is a
link.

   many other bpf_links from actual seq_file kernel object to bpf prog.

I think this one is not a link at all. It's a bpfdumper session. For
me this is equivalent of a single BPF program invocation on cgroup due
to a single packet. I understand that in this case it's multiple BPF
program invocations, so it's not exactly 1:1, but if we had an easy
way to do iteration from inside BPF program over all, say, tasks, that
would be one BPF program invocation with a loop inside. So to me one
seq_file session is analogous to a single BPF program execution (or,
say one hardware event triggering one execution of perf_event BPF
program).

   I think both kinds of links need to be iteratable via get_next_id.

At the same time I don't think 1 and 2 are links.
read-ing link FD should not trigger program execution. link is the connecting
abstraction. It shouldn't be used to trigger anything. It's static.
Otherwise read-ing cgroup-bpf link would need to trigger cgroup bpf prog too.
FD that points to actual seq_file is the one that should be triggering
iteration of kernel objects and corresponding execution of linked prog.

Yep, I agree totally, reading bpf_link FD directly as if it was
seq_file seems weird and would support only a single time to read.

That FD can be anon_inode returned from raw_tp_open (or something else)

raw_tp_open currently always returns bpf_link FDs, so if this suddenly
returns readable seq_file instead, that would be weird, IMO.


or FD from open("/sys/kernel/bpfdump/foo").

Agreed.


The more I think about all the objects involved the more it feels that the
whole process should consist of three steps (instead of two).
1. load bpfdump prog
2. create seq_file-template in /sys/kernel/bpfdump/
    (not sure which api should do that)

Hm... ok, I think seq_file-template means something else entirely.
It's not an attached BPF program, but also not a /sys/fs/bpfdump/task
"provider". What is it and what is its purpose? Also, how is it
cleaned up if application crashes between creating "seq_file-template"
and attaching BPF program to it?

3. use bpf_link_create api to attach bpfdumper prog to that seq_file-template

Then when the file is opened a new bpf_link is created for that reading session.
At the same time both kinds of links (to teamplte and to seq_file) should be
iteratable for observability reasons, but get_fd_from_id on them should probably
be disallowed, since holding such FD to these special links by other process
has odd semantics.

This special get_fd_from_id handling for bpfdumper links (in your
interpretation) looks like a sign that using bpf_link to represent a
specific bpfdumper session is not the right design.

As for obserabilitiy of bpfdumper sessions, I think using bpfdump
program + task/file provider will give a good way to do this,
actually, with no need to maintain a separate IDR just for bpfdumper
sessions.


Similarly for anon seq_file it should be three step process as well:
1. load bpfdump prog
2. create anon seq_file (api is tbd) that returns FD
3. use bpf_link_create to attach prog to seq_file FD

May be it's all overkill. These are just my thoughts so far.

Just to contrast, in a condensed form, what I was proposing:

For named dumper:
1. load bpfdump prog
2. attach prog to bpfdump "provider" (/sys/fs/bpfdump/task), get
bpf_link anon FD back

I actually tried to prototype earlier today.
for existing tracing program (non-raw-tracepoint, e.g., fentry/fexit),
when raw_tracepoint_open() is called,
bpf_trampoline_link_prog() is called, trampoline is actually
updated with bpf_program and program start running. You can hold
this link_fd at user application or pin it to /sys/fs/bpf.

That is what I refers to in my previous email whether
we can have 'cat'-able link or not. But looks it is pretty hard.

Alternatively, we could still return a bpf_link.
The only thing bpf_link did is to hold a reference count for bpf_prog
and nothing else. Later on, we can use this bpf_link to pin dumper
or open anonymous seq_file.

But since bpf_link just holds a reference for prog and nothing more
and that is why I mentioned not 100% sure whether bpf_link is needed
as I could achieve the same thing with bpf_prog. Further,
it does not provide ability to query open files (a bpf program
for task/file target should be able to do it.)

But if for API consistency, we prefer raw_tracepoint_open() to
return a link fd. I can still do it, I guess. Or maybe link_query
still useful in some way.


3. pin link in bpfdumpfs (e.g., /sys/fs/bpfdump/task/my_dumper)
4. each open() of /sys/fs/bpfdump/task/my_dumper produces new
bpfdumper session/seq_file

For anon dumper:
1. load bpfdump prog
2. attach prog to bpfdump "provider" (/sys/fs/bpfdump/task), get
bpf_link anon FD back
3. give bpf_link FD to some new API (say, BPF_DUMP_NEW_SESSION or
whatever name) to create seq_file/bpfdumper session, which will create
FD that can be read(). One can do that many times, each time getting
its own bpfdumper session.

First two steps are exactly the same, as it should be, because
named/anon dumper is still the same dumper. Note also that we can use
bpf_link FD of named dumper and BPF_DUMP_NEW_SESSION command to also
create sessions, which further underlines that the only difference
between named and anon dumper is this bpfdumpfs direntry that allows
to create new seq_file/session by doing normal open(), instead of
BPF's BPF_DUMP_NEW_SESSION.

Named vs anon dumper is like "named" vs "anon" bpf_link -- we don't
even talk in those terms about bpf_link, because the only difference
is pinned direntry in a special FS, really.




[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux