On 2019/10/17 11:21 PM, Teodor Milkov wrote:
> Hello,
>
> I've tried using bcache with a large 6.4TB NVMe device, but found it
> takes a long time to register after a clean reboot -- around 10
> minutes. That's even with an idle machine reboot.
>
> Things look like this soon after reboot:
>
> root@node420:~# ps axuww | grep md12
> root      9768 88.1  0.0   2268   744 pts/0  D+  16:20  0:25 /lib/udev/bcache-register /dev/md12
>
> Device    r/s    w/s  rMB/s  wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm  %util
> nvme0n1  420.00  0.00  52.50   0.00   0.00   0.00  0.00  0.00    0.30    0.00   1.04   128.00     0.00  2.38  99.87
> nvme1n1  417.67  0.00  52.21   0.00   0.00   0.00  0.00  0.00    0.30    0.00   1.03   128.00     0.00  2.39 100.00
> md12     838.00  0.00 104.75   0.00   0.00   0.00  0.00  0.00    0.00    0.00   0.00   128.00     0.00  0.00   0.00
>
> As you can see, nvme1n1, which is a Micron 9200, is reading at a humble
> 52MB/s (417 r/s), far below its capability of 3500MB/s and 840K IOPS.
>
> At the same time it seems like the bcache-register process is
> saturating the CPU core it's running on, so maybe that's the
> bottleneck?
>
> Tested with kernels 4.9 and 4.19.
>
> 1. Is this the current state of affairs -- i.e. is this known/expected
> behaviour with such a large cache?

The CPU is busy checking the checksums of all btree nodes. This is
expected behaviour, but it definitely should be improved. When the btree
is very large, checking the checksum of each btree node with crc64 on a
single thread is very slow; on my machine it can take around 20 minutes.
So far there is no easy way to make crc64 itself faster, but it is
possible to check the checksums with multiple threads. It just needs
time to work on.

> 2. If this isn't expected -- any ideas how to debug or fix it?

As I mentioned for question 1, we need multiple threads to check the
checksums of the btree nodes. Since this is read-only access at boot
time, with no lock contention, it can be sped up considerably by having
more CPU cores do the crc64 calculation in parallel. A rough sketch of
the idea is at the end of this mail.

> 3. What is the max recommended cache size?

So far we only have a single B+tree to contain and index all bkeys. If
the cached data is large, this can be slow. So I suggest creating more
partitions and making an individual cache set on each partition (see the
example at the end of this mail). Based on my personal testing, I
suggest a maximum cache set size of 2-4TB.

Multiple B+trees are on my to-do list, but I need to finish other
higher-priority tasks first; currently I am still working on big-endian
machine support.

Thanks.

-- 
Coly Li
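
Here is the sketch mentioned in answer 2. It is a minimal userspace
illustration of the idea, not the actual bcache code: struct node, the
nodes[] array, and the trivial crc64() below are all hypothetical
stand-ins for the real on-disk structures and the kernel's table-driven
crc64 routine. Each worker pulls the next node index from a shared
atomic cursor, so the read-only verification needs no locking.

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct node {
	const void *data;	/* node payload read from the cache device */
	size_t len;		/* payload length in bytes */
	uint64_t csum;		/* expected checksum from the node header */
};

/* Placeholder only; bcache uses a proper table-driven crc64. */
static uint64_t crc64(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint64_t crc = 0;

	while (len--)
		crc = (crc << 8) ^ (crc >> 56) ^ *p++;
	return crc;
}

static struct node *nodes;	/* all btree nodes to verify */
static size_t nr_nodes;
static atomic_size_t next_idx;	/* shared work cursor */
static atomic_int bad;		/* set if any node fails verification */

static void *worker(void *arg)
{
	size_t i;

	(void)arg;
	/* Read-only data: threads only contend on the cursor. */
	while ((i = atomic_fetch_add(&next_idx, 1)) < nr_nodes)
		if (crc64(nodes[i].data, nodes[i].len) != nodes[i].csum)
			atomic_store(&bad, 1);
	return NULL;
}

static int verify_all_nodes(int nr_threads)
{
	pthread_t *tids = calloc(nr_threads, sizeof(*tids));
	int i;

	for (i = 0; i < nr_threads; i++)
		pthread_create(&tids[i], NULL, worker, NULL);
	for (i = 0; i < nr_threads; i++)
		pthread_join(tids[i], NULL);
	free(tids);
	return atomic_load(&bad) ? -1 : 0;
}

int main(void)
{
	static const char buf[4096];	/* fake node payload */
	static struct node fake[4];
	size_t i;

	nodes = fake;
	nr_nodes = 4;
	for (i = 0; i < nr_nodes; i++) {
		fake[i].data = buf;
		fake[i].len = sizeof(buf);
		fake[i].csum = crc64(buf, sizeof(buf));
	}
	printf("verify: %d\n", verify_all_nodes(4));	/* expect 0 */
	return 0;
}

Since the work is pure CPU and the node buffers are independent, N
workers on otherwise idle cores should cut the wall-clock time of this
phase roughly by a factor of N.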
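
And here is the partitioning example mentioned in answer 3, assuming a
hypothetical /dev/nvme0n1 split into two halves (adjust the device
names and sizes to your setup):

# split the cache device into two partitions
parted /dev/nvme0n1 mklabel gpt
parted /dev/nvme0n1 mkpart primary 0% 50%
parted /dev/nvme0n1 mkpart primary 50% 100%

# create an individual cache set on each partition
make-bcache -C /dev/nvme0n1p1
make-bcache -C /dev/nvme0n1p2

# find each cache set's UUID, then attach each backing
# device to one of the cache sets
bcache-super-show /dev/nvme0n1p1 | grep cset.uuid
echo <cset uuid> > /sys/block/bcache0/bcache/attach

Each cache set then has its own, smaller B+tree, so the boot-time
checksum pass per cache set covers much less data.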