On 2019/11/11 10:10 AM, Christian Balzer wrote:
>
> Hello,
>
> When researching the issues below and finding out about the PDC changes
> since 4.9, this also provided a good explanation for the load spikes we see
> with 4.9, as the default writeback is way too slow to empty the dirty
> pages and thus there is never much of a buffer for sudden write spikes,
> causing the PDC to overshoot when trying to flush things out to the
> backing device.
>
> With Debian Buster things obviously changed, and with the current kernel
> ---
> Linux version 4.19.0-6-amd64 (debian-kernel@xxxxxxxxxxxxxxxx) (gcc version
> 8.3.0 (Debian 8.3.0-6)) #1 SMP Debian 4.19.67-2+deb10u1 (2019-09-20)
> ---
> we get writeback_rate_minimum (undocumented, value in 512-byte blocks).
> That looked promising and indeed it helps, but there are major gotchas.
> For the tests below I did set this to 8192, aka 4MB/s, which is something
> the backing Areca RAID (4GB cache, 16 disks) handles at 0% utilization.
>
> 1. Quiescent insanity
>
> When running fio (see full command line and results below) all looks/seems
> fine, aside from issue #2 of course.
> However if one stops fio and the system is fully quiescent (no writes),
> then the new PDC goes berserk, most likely a division-by-zero type bug.
>
> writeback_rate_debug goes from (just after stopping fio):
> ---
> rate:           4.0M/sec
> dirty:          954.7M
> target:         36.7G
> proportional:   -920.7M
> integral:       -17.1M
> change:         0.0k/sec
> next io:        -7969ms
> ---
> to:
> ---
> rate:           0.9T/sec
> dirty:          496.4M
> target:         36.7G
> proportional:   0.0k
> integral:       0.0k
> change:         0.0k/sec
> next io:        -2000ms
> ---
> completely overwhelming the backing device and causing (again) massive
> load spikes. Very unintuitive and unexpected.
>
> Any IO (like a fio run with a 1 IOPS target) will prevent this, and the
> preset writeback_rate_minimum will be honored until the cache is clean.
>

This is a feature indeed. When there is no I/O for a reasonably long time,
the writeback rate limit is set to 1TB/s, to permit the writeback I/O to
run as fast as possible. And as you observed, once new I/O arrives, the
maximized writeback rate is canceled and writeback falls back under the
control of the PDC code. Is there any inconvenience with this behavior in
your environment?
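By the way, a minimal sketch of how to exercise this from userspace,
assuming the usual sysfs layout where writeback_rate_minimum sits next to
the writeback_rate_debug file you already quoted (the value should be in
512-byte sectors per second, so 8192 matches your 4MB/s):

---
# set the minimum writeback rate: 8192 sectors * 512 bytes = 4 MB/s
echo 8192 > /sys/block/bcache0/bcache/writeback_rate_minimum

# sample the PD controller state every few seconds; with no foreground I/O
# the "rate:" line should jump from the configured minimum towards 1T/sec,
# and drop back as soon as new writes arrive
while true; do
        date
        cat /sys/block/bcache0/bcache/writeback_rate_debug
        echo
        sleep 5
done
---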
> 2. Initial and intermittent 5 second drops
>
> When starting the fio job there is a pronounced pause of about 5 seconds
> before things proceed.
> Then during the run we get this:
> ---
> Starting 1 process
> Jobs: 1 (f=1), 0-1000 IOPS: [w(1)][2.4%][w=4000KiB/s][w=1000 IOPS][eta 04m:39s]
> (repeats nicely for a while, then we get the first slowdown)
> ...
> Jobs: 1 (f=1), 0-1000 IOPS: [w(1)][14.9%][w=2192KiB/s][w=548 IOPS][eta 03m:49s]
> Jobs: 1 (f=1), 0-1000 IOPS: [w(1)][14.9%][eta 03m:55s]
> Jobs: 1 (f=1), 0-1000 IOPS: [w(1)][17.0%][w=21.3MiB/s][w=5451 IOPS][eta 03m:40s]
> ...
> (last slowdown)
> Jobs: 1 (f=1), 0-1000 IOPS: [w(1)][91.3%][w=3332KiB/s][w=833 IOPS][eta 00m:23s]
> Jobs: 1 (f=1), 0-1000 IOPS: [w(1)][91.3%][eta 00m:23s]
> Jobs: 1 (f=1), 0-1000 IOPS: [w(1)][92.1%][w=6858KiB/s][w=1714 IOPS][eta 00m:21s]
> ---
>
> These slowdowns happened 7 times during the run, alas not at particularly
> regular intervals. A re-run with just a 10 IOPS rate shows a clearer
> pattern: the pauses are separated by 30 seconds of normal operation and
> take about 10(!!!) seconds.
> It's also quite visible in the latencies of the fio job results.
> Neither the initial nor the intermittent pauses are present with the 4.9
> kernel bcache version.
>
> From a usability perspective, this very much counters the reason to use
> bcache in the first place.
>

Can you run top in another terminal and check what is running during the
I/O slowdown? And what is the output of
/sys/block/bcache0/bcache/writeback_rate_debug (read the file every 30
seconds, not too frequently) when you feel the I/O is slow?

I have encountered a similar situation when the GC thread is running, or
when there is so much dirty data that regular I/O requests get throttled.
I am not sure whether we are in a similar situation here.

[snip]

> fio command line and results:
> ---
> fio --size=1G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1
> --rw=randwrite --name=fiojob --blocksize=4K --iodepth=32 --rate_iops=1000
> ---
> fiojob: (groupid=0, jobs=1): err= 0: pid=18933: Mon Nov 11 08:09:51 2019
>   write: IOPS=1000, BW=4000KiB/s (4096kB/s)(1024MiB/262144msec); 0 zone resets
>     slat (usec): min=6, max=5227.2k, avg=153.88, stdev=22260.74
>     clat (usec): min=50, max=5207.0k, avg=569.68, stdev=20016.99
>      lat (usec): min=101, max=5228.0k, avg=724.32, stdev=30068.91
>     clat percentiles (usec):
>      |  1.00th=[  176],  5.00th=[  227], 10.00th=[  273], 20.00th=[  318],
>      | 30.00th=[  343], 40.00th=[  363], 50.00th=[  375], 60.00th=[  383],
>      | 70.00th=[  396], 80.00th=[  453], 90.00th=[  709], 95.00th=[  955],
>      | 99.00th=[ 3032], 99.50th=[ 4555], 99.90th=[ 8717], 99.95th=[10552],
>      | 99.99th=[13042]
>    bw (  KiB/s): min=  159, max=44472, per=100.00%, avg=4657.40, stdev=4975.82, samples=450
>    iops        : min=   39, max=11118, avg=1164.35, stdev=1243.96, samples=450
>   lat (usec)   : 100=0.07%, 250=7.42%, 500=75.19%, 750=8.81%, 1000=3.82%
>   lat (msec)   : 2=2.95%, 4=1.04%, 10=0.63%, 20=0.07%, 100=0.01%
>   lat (msec)   : 250=0.01%
>   cpu          : usr=1.63%, sys=7.48%, ctx=444428, majf=0, minf=10
>   IO depths    : 1=83.0%, 2=0.5%, 4=0.2%, 8=0.1%, 16=0.1%, 32=16.2%, >=64=0.0%
>      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
>      issued rwts: total=0,262144,0,0 short=0,0,0,0 dropped=0,0,0,0
>      latency   : target=0, window=0, percentile=100.00%, depth=32
>
> Run status group 0 (all jobs):
>   WRITE: bw=4000KiB/s (4096kB/s), 4000KiB/s-4000KiB/s (4096kB/s-4096kB/s),
>   io=1024MiB (1074MB), run=262144-262144msec
> ---
> The cache device is a Samsung 960EVO 500GB,

But how are the backing devices attached? Is each hard drive running as a
single bcache device, with multiple bcache devices attached to the 500GB
SSD? And could you post the fio job file? Then let me try to set up a
similar configuration and check what happens exactly.

Thanks.

--
Coly Li
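P.S. In case a job file is easier to pass around than the long command
line, here is a sketch of what I believe is an equivalent job file for the
fio invocation quoted above (assuming nothing beyond the options shown on
that command line; your actual job file may differ):

---
# write a job file equivalent to the quoted command line, then run it
cat > fiojob.fio <<'EOF'
[fiojob]
size=1G
ioengine=libaio
invalidate=1
direct=1
numjobs=1
rw=randwrite
blocksize=4K
iodepth=32
rate_iops=1000
EOF

fio fiojob.fio
---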