On 4/9/20 8:33 PM, Alexei Starovoitov wrote:
On Wed, Apr 08, 2020 at 04:25:38PM -0700, Yonghong Song wrote:
For task/file, the dumper prints out:
$ cat /sys/kernel/bpfdump/task/file/my1
    tgid      gid       fd      file
       1        1        0 ffffffff95c97600
       1        1        1 ffffffff95c97600
       1        1        2 ffffffff95c97600
....
    1895     1895      255 ffffffff95c8fe00
    1932     1932        0 ffffffff95c8fe00
    1932     1932        1 ffffffff95c8fe00
    1932     1932        2 ffffffff95c8fe00
    1932     1932        3 ffffffff95c185c0
...
+SEC("dump//sys/kernel/bpfdump/task/file")
+int BPF_PROG(dump_tasks, struct task_struct *task, __u32 fd, struct file *file,
+	     struct seq_file *seq, u64 seq_num)
+{
+	static char const banner[] = "    tgid      gid       fd      file\n";
+	static char const fmt1[] = "%8d %8d";
+	static char const fmt2[] = " %8d %lx\n";
+
+	if (seq_num == 0)
+		bpf_seq_printf(seq, banner, sizeof(banner));
+
+	bpf_seq_printf(seq, fmt1, sizeof(fmt1), task->tgid, task->pid);
+	bpf_seq_printf(seq, fmt2, sizeof(fmt2), fd, (long)file->f_op);
+	return 0;
+}
I wonder what the speed is of walking all files in all tasks with an empty
program? If it's fast, I can imagine a million use cases for such a searching
bpf prog, like finding which task owns a particular socket. This could be a
massive feature.
With one redundant spin_lock removed, it seems there will be one spin_lock per
prog invocation? Maybe eventually it can be amortized within the seq_file
iterating logic. It would be really awesome if the cost were just refcnt ++/--
per call plus rcu_read_lock.
The main seq_read() loop is below:
	while (1) {
		size_t offs = m->count;
		loff_t pos = m->index;

		p = m->op->next(m, p, &m->index);
		if (pos == m->index)
			/* Buggy ->next function */
			m->index++;
		if (!p || IS_ERR(p)) {
			err = PTR_ERR(p);
			break;
		}
		if (m->count >= size)
			break;
		err = m->op->show(m, p);
		if (seq_has_overflowed(m) || err) {
			m->count = offs;
			if (likely(err <= 0))
				break;
		}
	}
If we remove the spin_lock() as suggested in the other email comment,
we won't have any spin_lock() in the seq_ops->next() function, only
refcnt ++/-- and rcu_read_{lock, unlock}s. The seq_ops->show() function
does not take any spin_lock() either.
I have not had time to do a perf measurement yet.
I will do one in the next revision.
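
As a rough illustration of the per-element cost discussed above, here is a
userspace model of the quoted loop. It is only a sketch, not kernel code: the
model_* names, N_ITEMS and REC_SIZE are hypothetical, refcounts are plain
counters, rcu is not modeled, and the overflow check is simplified. It counts
how many ref get/put pairs and spin locks a full walk would take if ->next()
only did refcnt ++/--.

```c
#include <assert.h>
#include <stddef.h>

#define N_ITEMS  5   /* hypothetical number of (task, fd) pairs */
#define REC_SIZE 16  /* hypothetical bytes emitted per show() */

/* Userspace stand-in for the few seq_file fields the loop touches. */
struct model_seq {
	size_t count;  /* bytes produced so far, like m->count */
	size_t size;   /* buffer capacity */
	long index;    /* iterator position, like m->index */
};

/* Counters for the per-element work discussed in the thread. */
struct model_stats {
	long next_calls, show_calls, ref_gets, ref_puts, spin_locks;
};

static int items[N_ITEMS];

/* Model of ->next(): drop the ref on the previous element and take one on
 * the new element. No spin lock is taken, matching the proposed change. */
static int *model_next(struct model_stats *st, int *prev, long *pos)
{
	st->next_calls++;
	if (prev)
		st->ref_puts++;
	++*pos;
	if (*pos >= N_ITEMS)
		return NULL;
	st->ref_gets++;
	return &items[*pos];
}

/* Model of ->show(): this is where the bpf program would run. */
static int model_show(struct model_stats *st, struct model_seq *m)
{
	st->show_calls++;
	m->count += REC_SIZE;
	return 0;
}

/* Walk the elements with the control flow of the quoted loop.
 * Returns the number of show() calls made. */
static long model_walk(struct model_stats *st, struct model_seq *m)
{
	int *p;
	int err;

	/* Model of ->start() followed by the first ->show(). */
	m->index = 0;
	st->ref_gets++;
	p = &items[0];
	model_show(st, m);

	while (1) {
		size_t offs = m->count;
		long pos = m->index;

		p = model_next(st, p, &m->index);
		if (pos == m->index)
			m->index++;  /* guard against a buggy ->next() */
		if (!p)
			break;
		if (m->count >= m->size)
			break;
		err = model_show(st, m);
		if (m->count > m->size || err) {  /* simplified overflow check */
			m->count = offs;
			if (err <= 0)
				break;
		}
	}
	/* Model of ->stop(): release an element still held on early exit. */
	if (p)
		st->ref_puts++;
	assert(st->ref_gets == st->ref_puts);
	return st->show_calls;
}
```

With a buffer large enough for every record, model_walk() reports one
ref get/put pair per element and zero spin locks. It says nothing about real
timing, which still needs the perf measurement.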