Re: [PATCH 4/5] bcache: writeback: collapse contiguous IO better

On 2017/10/7 2:36 AM, Michael Lyle wrote:
> Sorry I missed this question:
> 
>> Is it the time from writeback starts to dirty reaches dirty target, or
>> the time from writeback starts to dirty reaches 0 ?
> 
> Not quite either.  I monitor the machine with zabbix; it's the time to
> when the backing disk reaches its background rate of activity / when
> writeback hits its minimum rate (writeback with the PI controller
> writes a little past the target because of the integral term).
> 
> Viewed one way: 5/80 is just a few percent of difference (6%).  But:
> I'm hopeful that further improvement can be made along this patch
> series, and in any event it's 5 minutes earlier that I/O will have an
> unencumbered response time after a pulse of load.

Hi Mike,

Finally, I have finished all my tests in the ideal situation you
suggested, where dirty blocks are laid out as contiguously as possible.

And it has become clear why we had such a big difference in opinion: we
were simply looking at different parts of the same cow; you looked at
the tail, I looked at the head.

Here are the configurations I covered:
fio blocksize: 8kB, 16kB, 32kB, 64kB, 128kB, 256kB, 512kB, 1024kB
dirty data size: 110GB, 500GB, 700GB
cache device size: 220GB, 1TB, 1.5TB, 1.8TB
cached device size: 1.8TB, 4TB, 7.2TB

(md linear is used to combine multiple hard drives into one large
device, so a large bio won't be split unless it crosses a member drive's
size boundary.)
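
As a reference, a minimal sketch of how such a linear array could be
assembled; the member drive names and the drive count here are only
placeholders, not the actual test setup:

# hypothetical example only: combine four hard drives into one md linear device
mdadm --create /dev/md0 --level=linear --raid-devices=4 \
      /dev/sda /dev/sdb /dev/sdc /dev/sdd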

I didn't test all of the above combinations; my test cases are quite
limited (they still took me several days), but the most important ones
are covered.

I used the following metrics to measure writeback performance (a rough
sampling sketch follows the list):
- rate at which the amount of dirty data decreases (i.e. throughput)
- writeback write requests per second
- writeback write request merges per second
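
This is roughly how the numbers could be sampled, assuming bcache0 is
the bcache device and sda..sdd are the member drives of the md linear
backing device; it is not the exact script I used:

# remaining dirty data, sampled once a minute
while true; do
        date
        cat /sys/block/bcache0/bcache/dirty_data
        sleep 60
done &
# w/s (write requests) and wrqm/s (write request merges) on the backing drives
iostat -d -x 60 /dev/sda /dev/sdb /dev/sdc /dev/sdd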

It turns out that, in this ideal situation, performance with the bio
reorder patches drops when: 1) the write I/O size increases, and 2) the
amount of dirty data increases.

I do observe the strong side of your patches: when the I/O blocksize is
<=256kB, I see a great writeback performance advantage with bio
reordering. Especially when the blocksize is 8kB, writeback performance
is at least 3x better (without your patches the writeback was so slow
that I gave up after hours).

The performance regression shows up when the fio blocksize increases to
512kB and the dirty data increases to 900GB. When the fio blocksize
increases to 1MB and the dirty data on cache increases to 900GB, the
writeback performance regression becomes easy to recognize.

An interesting behavior I observed is that, for large blocksizes and
large amounts of dirty data, writeback performance without the bio
reorder patches is much higher than with them. One example:
http://blog.coly.li/wp-content/uploads/2017/10/writeback_throughput_on_linear_900_1800G_cache_half.png

In the first 15 minutes, bcache without bio reordering performs much
better than with it. After 15 minutes, all the writeback rates decrease
to a similar level. That is to say, most of the performance regression
happens at the beginning, when writeback starts.

All the tests were run under ideal conditions: no writeback happens
while fio writes dirty data onto the cache device. In more general
situations, where the dirty blocks on the cache are less LBA-contiguous,
I guess the writeback regression might be even more obvious.

When dirty blocks are not LBA-contiguous on the cache device, I don't
worry about small dirty block sizes, because, as you explained clearly
in previous emails, the worst case is performance falling back to the
numbers without the bio reorder patches. But large blocksizes and large
amounts of dirty data are the common case when bcache is used in
distributed computing/storage systems like Ceph or Hadoop (multiple hard
drives attached to a large SSD cache, with object file sizes normally
from 4MB to 128MB).

Here are the command lines I used to initialize the bcache device:
make-bcache -B /dev/md0 -C /dev/nvme1n1p1
echo /dev/nvme1n1p1 > /sys/fs/bcache/register
echo /dev/md0 > /sys/fs/bcache/register
sleep 1
echo 0 > /sys/block/bcache0/bcache/cache/congested_read_threshold_us
echo 0 > /sys/block/bcache0/bcache/cache/congested_write_threshold_us
echo writeback > /sys/block/bcache0/bcache/cache_mode
echo 0 > /sys/block/bcache0/bcache/writeback_running

After fio has written enough dirty data onto the cache device, I write
1 into writeback_running.
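That is:
echo 1 > /sys/block/bcache0/bcache/writeback_running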

Here is the fio job file I used:
[global]
direct=1
thread=1
ioengine=libaio

[job]
filename=/dev/bcache0
readwrite=randwrite
numjobs=1
blocksize=<test block size>
iodepth=256
size=<dirty data amount>
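
For one of the regression cases above, the two placeholders would be
filled in as, for example,
blocksize=512k
size=900G
and the job is then run with 'fio <job file>'.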

I do see the writeback performance advantage in the ideal situation,
and it is desirable :-) But I also worry about the performance
regression for large dirty block sizes and large amounts of dirty data.

I raise a green flag to your bio reorder patches :-) I only wish I had
seen such performance data the first time around :P

*One last question*: Could you please consider adding a sysfs option to
enable/disable the bio reorder code? It could be enabled by default;
people who care about writing back large amounts of dirty data could
then choose to disable the bio reorder policy.
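
From the user's point of view I imagine something like the following
(writeback_reorder is only a placeholder name I made up, not an existing
sysfs file):

# hypothetical knob, name is a placeholder: disable the bio reorder policy
echo 0 > /sys/block/bcache0/bcache/writeback_reorder
# hypothetical knob: re-enable it (the default)
echo 1 > /sys/block/bcache0/bcache/writeback_reorder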

I hope we can reach an agreement and move this patch series forward.

Thanks for your patience, and for continuously following up on the
discussion.

-- 
Coly Li


