Re: [PATCH RFC 4/5] block: add a statistic table for io latency

On 7/12/20 10:44 PM, Jens Axboe wrote:
On 7/12/20 2:39 PM, Guoqing Jiang wrote:
On 7/11/20 3:32 AM, Ming Lei wrote:
On Fri, Jul 10, 2020 at 12:29:28PM +0200, Guoqing Jiang wrote:
On 7/10/20 12:00 PM, Ming Lei wrote:
On Fri, Jul 10, 2020 at 10:55:24AM +0200, Guoqing Jiang wrote:
Hi Ming,

On 7/10/20 2:53 AM, Ming Lei wrote:
Hi Guoqing,

On Thu, Jul 09, 2020 at 08:48:08PM +0200, Guoqing Jiang wrote:
Hi Ming,

On 7/8/20 4:06 PM, Guoqing Jiang wrote:
On 7/8/20 4:02 PM, Guoqing Jiang wrote:
Hi Guoqing,

I believe it isn't hard to write an ebpf-based script (bcc or bpftrace)
to collect this kind of performance data, so it doesn't look necessary
to do it in the kernel.
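
A minimal bcc sketch of such a script could look roughly like the
following; the blk_account_io_start/blk_account_io_done probe points and
all names are illustrative and vary by kernel version, and this is not
the actual biolatency tool:

from bcc import BPF
import time

# Sketch only: time each request from blk_account_io_start() to
# blk_account_io_done() and keep a log2 latency histogram in a BPF map.
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>

BPF_HASH(start, struct request *);
BPF_HISTOGRAM(dist);

int trace_start(struct pt_regs *ctx, struct request *req)
{
    u64 ts = bpf_ktime_get_ns();
    start.update(&req, &ts);
    return 0;
}

int trace_done(struct pt_regs *ctx, struct request *req)
{
    u64 *tsp = start.lookup(&req);
    if (tsp != 0) {
        u64 delta = bpf_ktime_get_ns() - *tsp;
        dist.increment(bpf_log2l(delta / 1000));   /* usecs */
        start.delete(&req);
    }
    return 0;
}
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event="blk_account_io_start", fn_name="trace_start")
b.attach_kprobe(event="blk_account_io_done", fn_name="trace_done")

print("Tracing block I/O latency... Ctrl-C to dump the histogram.")
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    pass
b["dist"].print_log2_hist("usecs")
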
Hi Ming,

Sorry, I don't know bcc or bpftrace well, but I assume they need to read
the latency value from somewhere inside the kernel. Could you point out
how I can get the latency value? Thanks in advance!
Hmm, I suppose biolatency is suitable for tracking latency; I will look
into it.
I think biolatency can't collect data while it is not running,
Yeah, the ebpf prog is only injected when the trace is started.

and it also seems that nothing inside the kernel records such information
for ebpf to read; correct me if my understanding is wrong.
Just record the info by starting the bcc script whenever you need it; is
there anything wrong with that usage? Always doing such work in the kernel
isn't fair to users who don't care about or need this info.
That is why we add a Kconfig option and set it to N by default. And I
suppose that, with a modern CPU, the cost of a few extra instructions
would not be that expensive even when the option is enabled; just my
$0.02.

And as a cloud provider, we would like to see the data when necessary
instead of collecting it by keeping a script running, because that is more
expensive than just reading a sysfs node IMHO.
It shouldn't be expensive. It might be a bit slow to inject the ebpf prog
because the code has to be verified, but once it is loaded into the kernel
it should be efficient enough. The kernel-side prog only updates and
stores the latency summary data in a BPF map, and the stored summary data
can be read out at any time by userspace.
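
Put differently, once something like the sketch above is loaded, userspace
can snapshot the map whenever it wants without stopping the tracing;
continuing that hypothetical script:

# Continuing the sketch above: the kernel-side prog keeps updating the
# "dist" map, and userspace can read it out at any time, e.g. once a
# minute, instead of streaming per-I/O events.
while True:
    time.sleep(60)
    b["dist"].print_log2_hist("usecs")   # snapshot of the summary data
    b["dist"].clear()                    # optional: reset between reads
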

Could you explain a bit why it is expensive, for example with biolatency?
I was comparing reading a sysfs node plus a few extra instructions in the
kernel against launching a dedicated monitoring process, which needs more
resources (memory) and context switches. And for biolatency, it calls
bpf_ktime_get_ns to calculate the latency of each IO, which I assume ends
up calling ktime_get_ns, and that is not cheap, as you said.
You can replace one of the timestamp reads with rq->start_time_ns too,
just as this patch does. You can write your own bcc/bpftrace script, which
is quite easy to get started with. Once you learn its power, maybe you
will love it.
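
As a rough illustration of that idea, the start-side probe and the hash
map can be dropped entirely by reading the request's own start_time_ns at
completion; the field and probe names below are assumptions and depend on
the kernel version:

from bcc import BPF

# Sketch: a single kprobe at completion; the block layer already stamped
# rq->start_time_ns, so only one bpf_ktime_get_ns() call is left per I/O.
bpf_text = """
#include <uapi/linux/ptrace.h>
#include <linux/blkdev.h>

BPF_HISTOGRAM(dist);

int trace_done(struct pt_regs *ctx, struct request *req)
{
    u64 now = bpf_ktime_get_ns();
    u64 start = req->start_time_ns;      /* filled in by the block layer */

    if (start != 0 && now > start)
        dist.increment(bpf_log2l((now - start) / 1000));
    return 0;
}
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event="blk_account_io_done", fn_name="trace_done")
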
Yes, I definitely need to learn more about it :-). But even with that
change, I still believe reading a node is cheaper than running a script.

And it seems biolatency can't trace bio-based drivers, per the snippet
below, while by collecting the data in-tree we can trace all block
drivers.

# load BPF program
b = BPF(text=bpf_text)
if args.queued:
      b.attach_kprobe(event="blk_account_io_start", fn_name="trace_req_start")
else:
      b.attach_kprobe(event="blk_start_request", fn_name="trace_req_start")
      b.attach_kprobe(event="blk_mq_start_request", fn_name="trace_req_start")
b.attach_kprobe(event="blk_account_io_completion",
      fn_name="trace_req_completion")

Would it be possible to extend it to trace both requests and bios?
Otherwise we have to run another script to trace md raid.
It is pretty easy to extend it to support bio: just add kprobes on
submit_bio() and bio_endio().
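
In the script quoted above that would be roughly two more attach lines;
the trace_bio_start/trace_bio_done handlers are hypothetical and would
have to key their map on the struct bio * argument instead of the
request:

# Hypothetical extension for bio-based drivers such as md raid: time each
# bio from submission to completion as well.
b.attach_kprobe(event="submit_bio", fn_name="trace_bio_start")
b.attach_kprobe(event="bio_endio", fn_name="trace_bio_done")
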

The thing is that we don't like the cost of an ebpf-based solution. And
FWIW, two years ago we compared an ebpf solution with recording the data
in the kernel:

A. in-kernel monitor: 1~5% performance drop
B. ebpf monitor: 10~15% performance drop

Note that we even copied each bio in approach A, which means the numbers
could be even better now since we no longer clone the bio.

And I think the major concern is the additional Kconfig option; since Jens
doesn't like it, I guess there is no need to make this change in the
upstream kernel.
No, my main concern is trying to justify it with having a Kconfig option
to turn it off. Fact is, distros will likely turn it on, and then
everybody gets that overhead. There's a temptation to hide features like
that behind a Kconfig option with this exact justification, and it just
doesn't work like that in practice.

Good to know that, :-).

I might be amenable to the change if:

1) it doesn't add any (real) overhead to the normal fast path

It is possible to remove ktime_get_ns from the current change, but I admit
we can't avoid a small overhead (a few instructions per IO, I think), and
that is not fair to people who don't care about this data.

2) probably needs to be opt-in: not via Kconfig, but as a sysfs
    attribute, like we have 'iostats' today.

Thanks for the suggestion, will investigate a better way.
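
For comparison, the existing 'iostats' switch Jens refers to is a
per-queue sysfs attribute that can be flipped at runtime; a new attribute
for these latency stats would presumably be toggled the same way (the
device name below is only an example, and no new attribute name is
implied):

import pathlib

# The existing runtime opt-in: per-device 'iostats' accounting is a sysfs
# attribute, not a Kconfig option. A latency-stats switch would presumably
# look similar.
iostats = pathlib.Path("/sys/block/sda/queue/iostats")
print("iostats currently:", iostats.read_text().strip())   # "0" or "1"
iostats.write_text("1")                                     # enable at runtime
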

Something I've mentioned in the past has been a blktrace flight recorder
mode, but that's largely been superseded by having bpf available. But
the point is that something like blktrace generally doesn't add ANY
overhead at all if blktrace isn't being run. Your solution ends up
collecting stats all the time, regardless of whether or not anyone is
actually looking at the data. That's a bad idea, and I'd be much happier
with a solution that only adds overhead when actually used, not all the
time.

I appreciate your input, which helps me understand your concerns better.

Thanks,
Guoqing



