On Wed, Apr 21, 2021 at 1:55 PM Pavel Begunkov <asml.silence@xxxxxxxxx> wrote:
>
> On 4/21/21 12:50 PM, Jinpu Wang wrote:
> [snip]
> >>> Hi Jens,
> >>> The problem is that using blktrace in production may cause a
> >>> performance drop of ~30%, while with the block stats here we only
> >>> see ~3% when enabled.
> >>
> >> It probably was asked before, but let's refresh as the discussion
> >> erupted again.
> >>
> >> I get your problem with blktrace(8); IIRC it definitely can deteriorate
> >> performance if run constantly, but did you try to write a bpf program
> >> that does smarter accumulation in the kernel? Like making bpf collect
> >> a latency table (just as your patches do) and flushing it to disk
> >> periodically?
> > Hi Pavel,
> >
> > Thanks for the suggestion.
> >
> > We did test eBPF with kprobes in the past (~kernel 4.4/4.14) and saw a
> > 10% performance drop; that's the reason we developed these stats
> > patches.
> >
> > But I just did another test with bpftrace on kernel 5.10.30, and I do
> > not see any performance loss.
> > eBPF must have improved a lot since then.
> >
> > So to summarize, we can use bpftrace to do the job on the latest
> > kernel; there is no need to have it built into the kernel.
> Perfect, and I'm sure it will be even more convenient for you, for
> instance to gather other stats or do it somehow differently.
Yeah, agree. Thanks again!
>
> > Thanks!
> >
> > The bpftrace script I used during testing:
> > root@x4-left:~# cat /usr/sbin/biolatency.bt
> > #!/usr/bin/env bpftrace
> > /*
> >  * biolatency.bt    Block I/O latency as a histogram.
> >  *                  For Linux, uses bpftrace, eBPF.
> >  *
> >  * This is a bpftrace version of the bcc tool of the same name.
> >  *
> >  * Copyright 2018 Netflix, Inc.
> >  * Licensed under the Apache License, Version 2.0 (the "License")
> >  *
> >  * 13-Sep-2018  Brendan Gregg  Created this.
> >  */
> >
> > BEGIN
> > {
> >         printf("Tracing block device I/O... Hit Ctrl-C to end.\n");
> > }
> >
> > kprobe:blk_account_io_start
> > {
> >         @start[arg0] = nsecs;
> > }
> >
> > kprobe:blk_account_io_done
> > /@start[arg0]/
> > {
> >         @usecs = hist((nsecs - @start[arg0]) / 1000);
> >         delete(@start[arg0]);
> > }
> >
> >
> >>
> >>
> >>> We did a benchmark with rnbd.
> >>>
> >>> Fio config:
> >>> [global]
> >>> description=Emulation of Storage Server Access Pattern
> >>> bssplit=512/20:1k/16:2k/9:4k/12:8k/19:16k/10:32k/8:64k/4
> >>> fadvise_hint=0
> >>> rw=randrw:2
> >>> direct=1
> >>> random_distribution=zipf:1.2
> >>> #size=1G
> >>> time_based=1
> >>> runtime=10
> >>> ramp_time=1
> >>> ioengine=libaio
> >>> iodepth=128
> >>> iodepth_batch_submit=128
> >>> iodepth_batch_complete=128
> >>> numjobs=1
> >>> #gtod_reduce=1
> >>> group_reporting
> >>>
> >>> [job1]
> >>> filename=/dev/rnbd0
> >>>
> >>> blktrace command:
> >>> # blktrace -a read -a write -d /dev/rnbd0
> >>>
> >>> Read IOPS drop by 35%; similar for write IOPS.
> >>>
> >>> RNBD-No-blktrace      RNBD-With-blktrace
> >>> 102056.894311         -35.5%
> >>>
> >>> The tests were done with v5.4.30.
> >>> Test hardware:
> >>> root@x4-left:~/haris/sds-perf# uname -a
> >>> Linux x4-left 5.10.30-pserver
> >>> #5.10.30-1+feature+linux+5.10.y+20210414.1233+e3dd267~deb10 SMP
> >>> x86_64 GNU/Linux
> >>> root@x4-left:~/haris/sds-perf# lscpu
> >>> Architecture:          x86_64
> >>> CPU op-mode(s):        32-bit, 64-bit
> >>> Byte Order:            Little Endian
> >>> Address sizes:         46 bits physical, 48 bits virtual
> >>> CPU(s):                40
> >>> On-line CPU(s) list:   0-39
> >>> Thread(s) per core:    2
> >>> Core(s) per socket:    10
> >>> Socket(s):             2
> >>> NUMA node(s):          2
> >>> Vendor ID:             GenuineIntel
> >>> CPU family:            6
> >>> Model:                 85
> >>> Model name:            Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz
> >>> Stepping:              4
> >>> CPU MHz:               800.571
> >>> CPU max MHz:           3000.0000
> >>> CPU min MHz:           800.0000
> >>> BogoMIPS:              4400.00
> >>> Virtualization:        VT-x
> >>> L1d cache:             32K
> >>> L1i cache:             32K
> >>> L2 cache:              1024K
> >>> L3 cache:              14080K
> >>> NUMA node0 CPU(s):     0-9,20-29
> >>> NUMA node1 CPU(s):     10-19,30-39
> >>> Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
> >>> pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
> >>> syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts
> >>> rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq
> >>> dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm
> >>> pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes
> >>> xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3
> >>> cdp_l3 invpcid_single intel_ppin mba ibrs ibpb stibp tpr_shadow vnmi
> >>> flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep
> >>> bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap
> >>> clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec
> >>> xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local
> >>> dtherm ida arat pln pts pku ospke
> >>> root@x4-left:~/haris/sds-perf# ibstat
> >>> CA 'mlx5_0'
> >>>         CA type: MT4115
> >>>         Number of ports: 1
> >>>         Firmware version: 12.26.4012
> >>>         Hardware version: 0
> >>>         Node GUID: 0xec0d9a0300c5fffc
> >>>         System image GUID: 0xec0d9a0300c5fffc
> >>>         Port 1:
> >>>                 State: Active
> >>>                 Physical state: LinkUp
> >>>                 Rate: 100
> >>>                 Base lid: 3
> >>>                 LMC: 0
> >>>                 SM lid: 3
> >>>                 Capability mask: 0x2651e84a
> >>>                 Port GUID: 0xec0d9a0300c5fffc
> >>>                 Link layer: InfiniBand
> >>> CA 'mlx5_1'
> >>>         CA type: MT4115
> >>>         Number of ports: 1
> >>>         Firmware version: 12.26.4012
> >>>         Hardware version: 0
> >>>         Node GUID: 0xec0d9a0300c5fffd
> >>>         System image GUID: 0xec0d9a0300c5fffc
> >>>         Port 1:
> >>>                 State: Active
> >>>                 Physical state: LinkUp
> >>>                 Rate: 100
> >>>                 Base lid: 1
> >>>                 LMC: 0
> >>>                 SM lid: 1
> >>>                 Capability mask: 0x2651e84a
> >>>                 Port GUID: 0xec0d9a0300c5fffd
> >>>                 Link layer: InfiniBand
> >>>
> >>> Thanks!
>
> --
> Pavel Begunkov
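
The biolatency.bt script quoted above only prints its histogram when the trace
is stopped with Ctrl-C. Pavel's suggestion of flushing the latency table
periodically maps onto bpftrace's interval probe; the following is an untested
sketch of such a variant, not something used in the benchmarks above. The
10-second period is only an example, and it assumes the same
blk_account_io_start/blk_account_io_done kprobes are attachable on the running
kernel; on kernels where they are not, the block:block_rq_issue and
block:block_rq_complete tracepoints would be the alternative attach points.

#!/usr/bin/env bpftrace
/*
 * Sketch: biolatency.bt with periodic flushing instead of a single
 * dump at exit.  The interval length is only an example value.
 */

kprobe:blk_account_io_start
{
        /* Timestamp each request, keyed by the struct request pointer. */
        @start[arg0] = nsecs;
}

kprobe:blk_account_io_done
/@start[arg0]/
{
        /* Latency in microseconds, accumulated into a log2 histogram. */
        @usecs = hist((nsecs - @start[arg0]) / 1000);
        delete(@start[arg0]);
}

interval:s:10
{
        /* Print and reset the histogram every 10 seconds. */
        time("%H:%M:%S block I/O latency (us):\n");
        print(@usecs);
        clear(@usecs);
}

END
{
        /* Avoid dumping the maps again on exit. */
        clear(@start);
        clear(@usecs);
}

Run long-term, its output could be redirected to a file, e.g.
"bpftrace biolatency-interval.bt >> /var/log/biolatency.log" (both names are
only examples), giving a timestamped latency table every interval without
blktrace's per-event overhead.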