Re: [PATCH 1/1 v3 bpf-next] bpf: increment and use correct thread iterator

Yonghong Song <yhs@xxxxxx> · Fri, 18 Dec 2020 10:21:41 -0800

On 12/18/20 10:06 AM, Jonathan Lemon wrote:
On Fri, Dec 18, 2020 at 08:53:22AM -0800, Yonghong Song wrote:


On 12/11/20 9:11 AM, Jonathan Lemon wrote:
From: Jonathan Lemon <bsd@xxxxxx>

On some systems, some variant of the following splat is
repeatedly seen.  The common factor in all traces seems
to be the entry point to task_file_seq_next().  With the
patch, all warnings go away.

      rcu: INFO: rcu_sched self-detected stall on CPU
      rcu: \x0926-....: (20992 ticks this GP) idle=d7e/1/0x4000000000000002 softirq=81556231/81556231 fqs=4876
      \x09(t=21033 jiffies g=159148529 q=223125)
      NMI backtrace for cpu 26
      CPU: 26 PID: 2015853 Comm: bpftool Kdump: loaded Not tainted 5.6.13-0_fbk4_3876_gd8d1f9bf80bb #1
      Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A12 10/08/2018
      Call Trace:
       <IRQ>
       dump_stack+0x50/0x70
       nmi_cpu_backtrace.cold.6+0x13/0x50
       ? lapic_can_unplug_cpu.cold.30+0x40/0x40
       nmi_trigger_cpumask_backtrace+0xba/0xca
       rcu_dump_cpu_stacks+0x99/0xc7
       rcu_sched_clock_irq.cold.90+0x1b4/0x3aa
       ? tick_sched_do_timer+0x60/0x60
       update_process_times+0x24/0x50
       tick_sched_timer+0x37/0x70
       __hrtimer_run_queues+0xfe/0x270
       hrtimer_interrupt+0xf4/0x210
       smp_apic_timer_interrupt+0x5e/0x120
       apic_timer_interrupt+0xf/0x20
       </IRQ>
      RIP: 0010:get_pid_task+0x38/0x80
      Code: 89 f6 48 8d 44 f7 08 48 8b 00 48 85 c0 74 2b 48 83 c6 55 48 c1 e6 04 48 29 f0 74 19 48 8d 78 20 ba 01 00 00 00 f0 0f c1 50 20 <85> d2 74 27 78 11 83 c2 01 78 0c 48 83 c4 08 c3 31 c0 48 83 c4 08
      RSP: 0018:ffffc9000d293dc8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
      RAX: ffff888637c05600 RBX: ffffc9000d293e0c RCX: 0000000000000000
      RDX: 0000000000000001 RSI: 0000000000000550 RDI: ffff888637c05620
      RBP: ffffffff8284eb80 R08: ffff88831341d300 R09: ffff88822ffd8248
      R10: ffff88822ffd82d0 R11: 00000000003a93c0 R12: 0000000000000001
      R13: 00000000ffffffff R14: ffff88831341d300 R15: 0000000000000000
       ? find_ge_pid+0x1b/0x20
       task_seq_get_next+0x52/0xc0
       task_file_seq_get_next+0x159/0x220
       task_file_seq_next+0x4f/0xa0
       bpf_seq_read+0x159/0x390
       vfs_read+0x8a/0x140
       ksys_read+0x59/0xd0
       do_syscall_64+0x42/0x110
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      RIP: 0033:0x7f95ae73e76e
      Code: Bad RIP value.
      RSP: 002b:00007ffc02c1dbf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
      RAX: ffffffffffffffda RBX: 000000000170faa0 RCX: 00007f95ae73e76e
      RDX: 0000000000001000 RSI: 00007ffc02c1dc30 RDI: 0000000000000007
      RBP: 00007ffc02c1ec70 R08: 0000000000000005 R09: 0000000000000006
      R10: fffffffffffff20b R11: 0000000000000246 R12: 00000000019112a0
      R13: 0000000000000000 R14: 0000000000000007 R15: 00000000004283c0

The attached patch does 3 things:

1) If unable to obtain the file structure for the current task,
     proceed to the next task number after the one returned from
     task_seq_get_next(), instead of the next task number from the
     original iterator.

Looks like this fix is the real fix for the above warnings.
Basically, say we have
    info->tid = 10 and returned curr_tid = 3000 and tid 3000 has no files.
the current logic will go through
    - set curr_tid = 11 (info->tid++) and returned curr_tid = 3000
    - set curr_tid = 12 and returned curr_tid = 3000
    ...
    - set curr_tid = 3000 and returned curr_tid = 3000
    - set curr_tid = 3001 and return curr_tid >= 3001

All the above works are redundant work, and it may cause issues
for non preemptable kernel.

I suggest you factor out this change plus the following change
which suggested by Andrii early to a separate patch carried with
the below Fixes tag.

diff --git a/kernel/bpf/task_iter.c b/kernel/bpf/task_iter.c
index 0458a40edf10..56bcaef72e36 100644
--- a/kernel/bpf/task_iter.c
+++ b/kernel/bpf/task_iter.c
@@ -158,6 +158,7 @@ task_file_seq_get_next(struct
bpf_iter_seq_task_file_info *info)
                 if (!curr_task) {
                         info->task = NULL;
                         info->files = NULL;
+                       info->tid = curr_tid + 1;
                         return NULL;
                 }

Sure this isn't supposed to be 'curr_tid'?  task_seq_get_next() stops
when there are no more threads found.  This increments the thread id
past the search point, and would seem to introduce a potential off-by-one
error.

This is for the case where read() syscall return length 0, but user
space still keep read(). You are right, we may skip one newly created 
one indeed.

Although such a usecase is not common, but info->tid = curr_tid
certainly more correct than info->tid = curr_tid + 1.

So thanks for suggestion, LGTM.


That is:
    curr_tid = 3000.
    call task_seq_get_next() --> return NULL, curr_tid = 3000.
       (so there is no tid >= 3000)
    set curr_tid = 3001.

    next restart (if there is one) skips a newly created 3000.