On 2019/10/21 4:37 PM, Teodor Milkov wrote:
> On 20.10.19 at 9:34, Coly Li wrote:
>> On 2019/10/17 11:21 PM, Teodor Milkov wrote:
>>> Hello,
>>>
>>> I've tried using bcache with a large 6.4TB NVMe device, but found it
>>> takes a long time to register after a clean reboot -- around 10 minutes.
>>> That's even when rebooting an idle machine.
>>>
>>> Things look like this soon after reboot:
>>>
>>> root@node420:~# ps axuww |grep md12
>>> root      9768 88.1  0.0   2268   744 pts/0  D+  16:20  0:25 /lib/udev/bcache-register /dev/md12
>>>
>>> Device       r/s    w/s   rMB/s   wMB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm   %util
>>> nvme0n1   420.00   0.00   52.50    0.00    0.00    0.00   0.00   0.00     0.30     0.00    1.04    128.00      0.00   2.38   99.87
>>> nvme1n1   417.67   0.00   52.21    0.00    0.00    0.00   0.00   0.00     0.30     0.00    1.03    128.00      0.00   2.39  100.00
>>> md12      838.00   0.00  104.75    0.00    0.00    0.00   0.00   0.00     0.00     0.00    0.00    128.00      0.00   0.00    0.00
>>>
>>> As you can see, nvme1n1, which is a Micron 9200, is reading at a humble
>>> 52MB/s (417 r/s), which is far below its capabilities of 3500MB/s &
>>> 840K IOPS.
>>>
>>> At the same time it seems like the bcache-register process is saturating
>>> the CPU core it's running on, so maybe that's the bottleneck?
>>>
>>> Tested with kernels 4.9 and 4.19.
>>>
>>> 1. Is this the current state of affairs -- i.e. is this known/expected
>>> behaviour with such a large cache?
>>>
>> The CPU is busy checking the checksums of all btree nodes. This is
>> expected, but it definitely should be improved.
>
> Thank you for your quick and detailed response, Coly Li!
>
> I didn't think of checksum calculation, because in my mind it is
> usually very fast nowadays.
>
> For example, I tried the perl Digest::CRC implementation of crc64 on a
> very modest 7" laptop with a mobile processor, and it crunches data at
> 228MB/s (see below).
>
> There are reports of speeds up to 1600MB/s, like the one at
> https://matt.sh/redis-crcspeed
>
> At the same time, my experience was bcache reading from NVMe at only
> 52MB/s on a quite powerful Intel(R) Xeon(R) Gold 6140 CPU, which caught
> me unprepared.
>
> $ yes $(strings /dev/urandom |dd bs=1M count=1) |pv -s 1000M -S |perl -ne 'use Digest::CRC qw(crc64); $crc = crc64($_);'
> 0+1 records in
> 0+1 records out
> 4096 bytes (4,1 kB, 4,0 KiB) copied, 0,00504587 s, 812 kB/s
> 1000MiB 0:00:04 [ 228MiB/s] [=======================>] 100%
>
> $ grep "model name" /proc/cpuinfo
> model name      : Intel(R) Core(TM) m3-7Y30 CPU @ 1.00GHz

See drivers/md/bcache/btree.c:bch_btree_check(); this function is called
from run_cache_set() when a cache set is started after a reboot.

The bottleneck is not only crc64 itself: bch_btree_check() iterates over
all internal B+tree nodes in linear order, that is,

    read bnode -> check csum -> read next bnode -> check csum -> ...

This is why the I/O is slow. And there is only one thread performing the
csum checking, which is why only a single CPU core is busy.

When the cache is small, this is not a problem. But now that NVMe SSDs
are cheaper and bigger, checking the B+tree nodes at start-up becomes a
performance problem.
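To make the serial pattern above concrete, here is a minimal stand-alone
sketch. It is not bcache code: NODE_SIZE, the FNV-style toy checksum and
the file argument are illustrative assumptions only. It issues exactly
one read at a time and checksums each chunk before reading the next, the
same shape as the single-threaded btree walk:

/*
 * serial_check.c -- minimal sketch of the serial "read node, check
 * checksum, read next node" pattern described above.
 * NOT bcache code; NODE_SIZE, the toy checksum and the file argument
 * are assumptions for illustration only.
 *
 * Build: cc -O2 -o serial_check serial_check.c
 * Run:   ./serial_check /path/to/large-file-or-device
 */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define NODE_SIZE (256 * 1024)  /* assumed btree-node-sized read unit */

/* Toy 64-bit checksum (FNV-1a), a stand-in for crc64 to keep one core busy. */
static uint64_t toy_csum(const unsigned char *buf, size_t len)
{
	uint64_t h = 0xcbf29ce484222325ULL;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= buf[i];
		h *= 0x100000001b3ULL;
	}
	return h;
}

int main(int argc, char **argv)
{
	unsigned char *buf;
	FILE *f;
	size_t n;
	uint64_t csum = 0;
	unsigned long chunks = 0;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <device-or-file>\n", argv[0]);
		return 1;
	}

	buf = malloc(NODE_SIZE);
	if (!buf) {
		perror("malloc");
		return 1;
	}

	f = fopen(argv[1], "rb");
	if (!f) {
		perror("fopen");
		free(buf);
		return 1;
	}

	/*
	 * Strictly serial, single-threaded: the next read is not issued
	 * until the previous checksum has finished, so the device never
	 * sees more than one outstanding request.
	 */
	while ((n = fread(buf, 1, NODE_SIZE, f)) > 0) {
		csum ^= toy_csum(buf, n);
		chunks++;
	}

	printf("checked %lu chunks, combined csum %016llx\n",
	       chunks, (unsigned long long)csum);

	fclose(f);
	free(buf);
	return 0;
}

Because each read waits for the previous checksum to finish, the device
sees a queue depth of about 1, which matches the aqu-sz of ~1.0 and the
single saturated CPU core in the output above.

--
Coly Li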