Re: Very slow bcache-register: 6.4TB takes 10+ minutes

On 2019/10/17 11:21 PM, Teodor Milkov wrote:
> Hello,
> 
> I've tried using bcache with a large 6.4TB NVMe device, but found it
> takes a long time to register after a clean reboot -- around 10 minutes.
> That's even when rebooting an idle machine.
> 
> Things look like this soon after reboot:
> 
> root@node420:~# ps axuww |grep md12
> root      9768 88.1  0.0   2268   744 pts/0    D+   16:20 0:25
> /lib/udev/bcache-register /dev/md12
> 
> 
> Device       r/s    w/s   rMB/s  wMB/s  rrqm/s  wrqm/s  %rrqm  %wrqm  r_await  w_await  aqu-sz  rareq-sz  wareq-sz  svctm   %util
> nvme0n1   420.00   0.00   52.50   0.00    0.00    0.00   0.00   0.00     0.30     0.00    1.04    128.00      0.00   2.38   99.87
> nvme1n1   417.67   0.00   52.21   0.00    0.00    0.00   0.00   0.00     0.30     0.00    1.03    128.00      0.00   2.39  100.00
> md12      838.00   0.00  104.75   0.00    0.00    0.00   0.00   0.00     0.00     0.00    0.00    128.00      0.00   0.00    0.00
> 
> As you can see, nvme1n1, which is a Micron 9200, is reading at a humble
> 52MB/s (417 r/s), which is far below its capabilities of 3500MB/s and
> 840K IOPS.
> 
> At the same time it seems like the bcache-register process is saturating
> the CPU core it's running on, so maybe that's the bottleneck?
> 
> Tested with kernels 4.9 and 4.19.
> 
> 1. Is this the current state of affairs -- i.e. is this known/expected
> behaviour with such a large cache?
> 

The CPU is busy checking the checksums of all btree nodes. This is
expected, but it definitely should be improved.

When the btree is very large, checking the checksum of each btree node
with crc64 on a single thread is very slow. On my machine it can take
around 20 minutes.

So far there is no easy way to make crc64 itself faster, but it is
possible to check the checksums with multiple threads. It just needs
time to work on.

> 2. If this isn't expected -- any ideas how to debug or fix it?
> 

As I mentioned for question 1, we need multiple threads to check the
checksum of each btree node. Since this is read-only access at boot time
with no lock contention, it should be possible to speed it up a lot by
involving more CPU cores in the crc64 calculation in parallel.
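The idea above can be sketched in user-space Python. This is not bcache
code: zlib.crc32 stands in for bcache's crc64, and the "btree nodes" are
simulated in-memory buffers paired with a stored checksum. The point is
only that read-only verification has no shared state, so per-node checks
parallelize trivially across cores.

```python
# Sketch of parallel checksum verification (NOT bcache code).
# zlib.crc32 is a stand-in for bcache's crc64; nodes are simulated buffers.
import os
import zlib
from concurrent.futures import ThreadPoolExecutor


def make_node(data: bytes) -> tuple[int, bytes]:
    """Pair a payload with its stored checksum, like an on-disk btree node."""
    return (zlib.crc32(data), data)


def verify_node(node: tuple[int, bytes]) -> bool:
    """Recompute the checksum and compare against the stored value."""
    stored, data = node
    return zlib.crc32(data) == stored


def verify_all_parallel(nodes, workers: int = os.cpu_count() or 1) -> bool:
    # Each node is checked independently -- no locks, no shared writes --
    # so the work spreads across worker threads (in CPython, zlib.crc32
    # releases the GIL for large buffers; in the kernel these would be
    # kernel threads).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return all(pool.map(verify_node, nodes))


if __name__ == "__main__":
    nodes = [make_node(os.urandom(4096)) for _ in range(256)]
    print(verify_all_parallel(nodes))  # True for uncorrupted nodes
```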

> 3. What is max recommended cache size?
> 

So far we only have a single B+tree to contain and index all bkeys. If
the cached data is large, this can be slow. So I suggest creating more
partitions and making an individual cache set on each partition. Based
on my personal testing, I suggest a maximum cache set size of 2-4TB.
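A rough sketch of that suggestion, assuming the 6.4TB cache device is
/dev/nvme0n1 and the backing device is /dev/md12 (device names are
hypothetical; these commands are destructive and must only be run on an
empty cache device):

```shell
# Split the cache device into two ~3.2TB partitions,
# so each stays within the suggested 2-4TB cache set size.
parted --script /dev/nvme0n1 \
    mklabel gpt \
    mkpart primary 0% 50% \
    mkpart primary 50% 100%

# Create an independent cache set (and its own B+tree) on each partition.
make-bcache -C /dev/nvme0n1p1
make-bcache -C /dev/nvme0n1p2

# Format the backing device, then attach it to one of the cache sets by
# writing that set's UUID (printed by make-bcache -C, or found under
# /sys/fs/bcache/) to the attach file.
make-bcache -B /dev/md12
echo <cset-uuid> > /sys/block/bcache0/bcache/attach
```

`<cset-uuid>` is a placeholder for the actual cache set UUID; each
backing device attaches to exactly one cache set.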

Multiple B+trees are on my to-do list, but I need to finish other
higher-priority tasks first. At the moment I am still working on
big-endian machine support.

Thanks.

-- 

Coly Li


