Re: [PATCH] writeback: call writeback tracepoints without holding list_lock in wb_writeback()


On 2/25/2016 3:31 PM, Steven Rostedt wrote:
On Thu, 25 Feb 2016 15:16:54 -0800
"Shi, Yang" <yang.shi@xxxxxxxxxx> wrote:


Actually, regardless of whether this is the right fix for the splat, it
makes me wonder whether the spin lock that protects the whole for loop
is really necessary. It seems feasible to move it into the for loop and
protect just the areas that need it.

That's a separate issue, which may have its own merits that should be
decided by the writeback folks.

Yes, definitely. I will rework my commit log for this part.
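[For reference, a minimal sketch of that idea against the fs/fs-writeback.c
loop quoted further down. This is a hypothetical restructuring, not the
posted patch; it assumes writeback_sb_inodes()/__writeback_inodes_wb()
keep their convention of being entered with wb->list_lock held:

	blk_start_plug(&plug);
	for (;;) {
		if (work->nr_pages <= 0)
			break;

		/* Tracepoint now fires without wb->list_lock held. */
		trace_writeback_start(wb, work);

		spin_lock(&wb->list_lock);
		if (list_empty(&wb->b_io))
			queue_io(wb, work);
		if (work->sb)
			progress = writeback_sb_inodes(work->sb, wb, work);
		else
			progress = __writeback_inodes_wb(wb, work);
		spin_unlock(&wb->list_lock);

		/* Likewise for the post-writeback tracepoint. */
		trace_writeback_written(wb, work);
	}
	blk_finish_plug(&plug);
]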







INFO: lockdep is turned off.
Preemption disabled at:[<ffffffc000374a5c>] wb_writeback+0xec/0x830

Can you disassemble the vmlinux file to see exactly where that call is?
I use gdb to find the right locations.

   gdb> li *0xffffffc000374a5c
   gdb> disass 0xffffffc000374a5c

I use gdb to get the code too.

It does point to the spin_lock.

(gdb) list *0xffffffc000374a5c
0xffffffc000374a5c is in wb_writeback (fs/fs-writeback.c:1621).
1616
1617            oldest_jif = jiffies;
1618            work->older_than_this = &oldest_jif;
1619
1620            blk_start_plug(&plug);
1621            spin_lock(&wb->list_lock);
1622            for (;;) {
1623                    /*
1624                     * Stop writeback when nr_pages has been consumed
1625                     */


The disassembly:
     0xffffffc000374a58 <+232>:   bl      0xffffffc0001300b0 <migrate_disable>

The above is the place it recorded. But I just realized, this isn't the
issue. I know where the problem is.

     0xffffffc000374a5c <+236>:   mov     x0, x22
     0xffffffc000374a60 <+240>:   bl      0xffffffc000d5d518 <rt_spin_lock>





DECLARE_EVENT_CLASS(writeback_work_class,
           TP_PROTO(struct bdi_writeback *wb, struct wb_writeback_work *work),
           TP_ARGS(wb, work),
           TP_STRUCT__entry(
                   __array(char, name, 32)
                   __field(long, nr_pages)
                   __field(dev_t, sb_dev)
                   __field(int, sync_mode)
                   __field(int, for_kupdate)
                   __field(int, range_cyclic)
                   __field(int, for_background)
                   __field(int, reason)
                   __dynamic_array(char, cgroup, __trace_wb_cgroup_size(wb))
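[For context: __trace_wb_cgroup_size() is what pulls kernfs into the
tracepoint. In kernels of that era it is roughly a thin wrapper over
kernfs_path_len(); exact details vary by version:

	static inline size_t __trace_wb_cgroup_size(struct bdi_writeback *wb)
	{
		return kernfs_path_len(wb->memcg_css->cgroup->kn) + 1;
	}

kernfs_path_len() takes kernfs_rename_lock, a spinlock_t, which on
PREEMPT_RT is a sleeping lock.]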


Ah, thanks for pointing that out. I missed that.

That doesn't look correct if tracepoints don't allow sleeping.

I considered changing the sleeping lock in the kernfs_* functions to a
raw lock, but that doesn't seem reasonable since they are used heavily
by cgroup.

It is the kernfs_* calls that must not sleep here. Tracepoints use
rcu_read_lock_sched_notrace(), which disables preemption, and not only
that, hides itself from lockdep as the last place preemption was
disabled.
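[Concretely: every probe attached to a tracepoint runs between
rcu_read_lock_sched_notrace() and rcu_read_unlock_sched_notrace(), i.e.
with preemption disabled, so nothing on the TP_STRUCT__entry()/
TP_fast_assign() path may take a sleeping lock on -rt. A simplified
sketch of the __DO_TRACE() expansion from include/linux/tracepoint.h;
details vary by version:

	rcu_read_lock_sched_notrace();		/* preempt_disable_notrace() */
	it_func_ptr = rcu_dereference_sched((tp)->funcs);
	if (it_func_ptr) {
		do {
			it_func = (it_func_ptr)->func;
			__data = (it_func_ptr)->data;
			/*
			 * The probe runs preempt-disabled; this is where
			 * __trace_wb_cgroup_size() -> kernfs_path_len()
			 * ends up being called.
			 */
			((void (*)(proto))(it_func))(args);
		} while ((++it_func_ptr)->func);
	}
	rcu_read_unlock_sched_notrace();	/* preempt_enable_notrace() */
]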

Ah, thanks for pointing this out.


Is there a way to not use the kernfs_* functions? At least for -rt?

I'm not quite sure there is a straightforward replacement. However, I'm
wondering whether a lock-free version could be used by tracing.

For example, create a __kernfs_path_len() that doesn't acquire any lock
and use it for the writeback tracing, as long as that doesn't introduce
a race condition.

At least that would rule out the preemption issue.
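[A minimal sketch of what that could look like. This is hypothetical:
__kernfs_path_len() is the name proposed above, not an existing kernel
function, and it assumes the kn->parent chain can be walked safely
without kernfs_rename_lock:

	/*
	 * Lockless variant of kernfs_path_len() for use from tracepoints.
	 * Unlike kernfs_path_len(), it does not take kernfs_rename_lock,
	 * so a concurrent rename can race with the walk.
	 */
	static size_t __kernfs_path_len(struct kernfs_node *kn)
	{
		size_t len = 0;

		do {
			len += strlen(kn->name) + 1;
			kn = kn->parent;
		} while (kn && kn->parent);

		return len;
	}
]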

Thanks,
Yang


-- Steve




