Hi Jens,

Please take a look at this patchset. This is V3 for CBD (CXL Block
Device). CBD supports both single-host and multi-host scenarios, and
allows the use of pmem devices as a block device cache, providing low
latency and high-concurrency performance.

(1) Better latency: 1 iodepth, 1 numjobs, randwrite bs=4K:
    cbd 7.72us vs bcache 25.30us, about 3.3x improvement
(2) Better iops: 1 iodepth, 32 numjobs, randwrite bs=4K:
    cbd 1,400K IOPS vs bcache 210K IOPS, about 6.7x improvement
(3) Better stdev: 1 iodepth, 1 numjobs, randwrite bs=4K:
    cbd stdev=36.45 vs bcache stdev=937.81, about 26x improvement

V3 of the code can be found at:
    https://github.com/DataTravelGuide/linux.git (tag: cbd-v3)

Changelog from V2:
- Refactored cbd_cache.c and cbd_internal.h by splitting them into
  multiple files, making the structure clearer and more readable.
- Added CRC verification for all data and metadata. With all CRC
  verification options enabled in Kconfig, everything written to pmem,
  including data and metadata, is protected by a CRC check.
- Fixed some minor bugs discovered during long-term runs of xfstests.
- Added the cbd-utils project (https://github.com/DataTravelGuide/cbd-utils)
  as the user-space tooling, providing the cbdctrl command for
  cbd-related management operations.

You can create a cbd device with the following commands:

# cbdctrl tp-reg --path /dev/pmem0 --host node0 --format --force
# cbdctrl backend-start --path /dev/sda --start-dev --cache-size 1G
/dev/cbd0
# cbdctrl backend-list
[
   {
      "backend_id": 0,
      "host_id": 0,
      "backend_path": "/dev/sda",
      "alive": true,
      "cache_segs": 64,
      "cache_gc_percent": 70,
      "cache_used_segs": 1,
      "blkdevs": [
         {
            "blkdev_id": 0,
            "host_id": 0,
            "backend_id": 0,
            "dev_name": "/dev/cbd0",
            "alive": true
         }
      ]
   }
]

Additional information about CBD cache:

(1) What is CBD Cache

cbd cache is a *lightweight* solution that uses persistent memory as a
block device cache. It works similarly to bcache, but while bcache uses
block devices as the cache device, cbd cache only supports persistent
memory devices for caching. It accesses the cache device through DAX
and is designed with features specific to persistent memory, such as
multiple cache index trees and synchronous insertion of cached data.

+-----------------------------------------------------------------+
|                           single-host                           |
+-----------------------------------------------------------------+
|                                                                 |
|                        +-----------+     +------------+         |
|                        | /dev/cbd0 |     | /dev/cbd1  |         |
|                        |           |     |            |         |
|  +---------------------|-----------|-----|------------|-------+ |
|  |                     |           |     |            |       | |
|  |    /dev/pmem0       | cbd0 cache|     | cbd1 cache |       | |
|  |                     |           |     |            |       | |
|  +---------------------|-----------|-----|------------|-------+ |
|                        |+---------+|     |+----------+|         |
|                        ||/dev/sda ||     || /dev/sdb ||         |
|                        |+---------+|     |+----------+|         |
|                        +-----------+     +------------+         |
+-----------------------------------------------------------------+

Note: cbd cache is not intended to replace bcache. Instead, it offers
an alternative solution for scenarios where you want to use persistent
memory devices as a block device cache. Another caching technique that
accesses persistent memory through DAX is dm-writecache, but it is
designed for device-mapper based scenarios, while cbd cache and bcache
are caching solutions for block device scenarios. Therefore, I did not
do a comparative analysis between cbd cache and dm-writecache.
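To make the DAX access model above more concrete, here is a minimal
user-space sketch (not the cbd kernel code): it maps persistent memory
and writes cache data with a plain memcpy() followed by a flush, with
no submit_bio round trip through the block layer. The file path,
mapping size and names below are invented for illustration; it assumes
an existing file of at least CACHE_SIZE bytes on an fsdax (-o dax)
mount.

/* Illustrative only: with DAX, "writing to the cache" is a memory
 * copy plus a persistence barrier.  A kernel implementation would use
 * the DAX infrastructure and CPU cache-flush instructions instead of
 * mmap()/msync(), but the shape of the operation is the same.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define CACHE_FILE "/mnt/pmem/cache"   /* made-up file on a -o dax mount */
#define CACHE_SIZE (16UL << 20)        /* 16 MiB mapping for the example */

int main(void)
{
        int fd = open(CACHE_FILE, O_RDWR);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        void *cache = mmap(NULL, CACHE_SIZE, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
        if (cache == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        char buf[4096] = "hello cbd cache";

        /* The cache write is just a memcpy into the mapping ... */
        memcpy(cache, buf, sizeof(buf));

        /* ... followed by a flush to make the data persistent. */
        msync(cache, sizeof(buf), MS_SYNC);

        munmap(cache, CACHE_SIZE);
        close(fd);
        return 0;
}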
(2) light software overhead on the cache write path (low latency)

Handling a cache write request typically involves the following steps:

    (1) allocate cache space -> (2) write data into the cache ->
    (3) record cache index metadata -> (4) complete the request

In cache modules that use block devices as the cache (e.g., bcache),
steps (2) and (3) are asynchronous: during step (2), a bio is submitted
to the cache block device via submit_bio, and only after the bi_end_io
callback completes does another context continue with step (3). This
incurs significant overhead for a persistent memory cache.

cbd cache, being designed for persistent memory, does not need these
asynchronous operations: it can proceed directly with steps (3) and (4)
as soon as the memcpy through DAX completes (a rough sketch of such a
synchronous write path is included at the end of this mail). This makes
a significant difference for small IO. For 4K random writes, cbd cache
achieves a latency of only 7.72us, compared to 25.30us for bcache in
the same test, roughly a 3.3x improvement. Further comparative results
for various block sizes are shown in the table below.

+------------+-------------------------+--------------------------+
| numjobs=1  |        randwrite        |         randread         |
| iodepth=1  +------------+------------+-------------+------------+
| (latency)  | cbd cache  | bcache     | cbd cache   | bcache     |
+------------+------------+------------+-------------+------------+
| bs=512     | 6.10us     | 23.08us    | 4.82us      | 5.57us     |
+------------+------------+------------+-------------+------------+
| bs=1K      | 6.35us     | 21.68us    | 5.38us      | 6.05us     |
+------------+------------+------------+-------------+------------+
| bs=4K      | 7.72us     | 25.30us    | 6.06us      | 6.00us     |
+------------+------------+------------+-------------+------------+
| bs=8K      | 8.92us     | 27.73us    | 7.24us      | 7.35us     |
+------------+------------+------------+-------------+------------+
| bs=16K     | 12.25us    | 34.04us    | 9.06us      | 9.10us     |
+------------+------------+------------+-------------+------------+
| bs=32K     | 16.77us    | 49.24us    | 14.10us     | 16.18us    |
+------------+------------+------------+-------------+------------+
| bs=64K     | 30.52us    | 63.72us    | 30.69us     | 30.38us    |
+------------+------------+------------+-------------+------------+
| bs=128K    | 51.66us    | 114.69us   | 38.47us     | 39.10us    |
+------------+------------+------------+-------------+------------+
| bs=256K    | 110.16us   | 204.41us   | 79.64us     | 99.98us    |
+------------+------------+------------+-------------+------------+
| bs=512K    | 205.52us   | 398.70us   | 122.15us    | 131.97us   |
+------------+------------+------------+-------------+------------+
| bs=1M      | 493.57us   | 623.31us   | 233.48us    | 246.56us   |
+------------+------------+------------+-------------+------------+

(3) multi-queue and multiple cache trees (high iops)

Persistent memory offers very high hardware concurrency. If a single
indexing tree were used to manage all space indexing, that tree would
become a concurrency bottleneck. cbd cache therefore maintains an
independent indexing tree for each backend, and the indexing tree of
each backend's cache is further divided into multiple RB trees based on
the logical address space; every IO finds its indexing tree by offset
(see the sketch below). This design increases concurrency while keeping
the depth of each indexing tree from growing too large.
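To illustrate the multi-tree idea from section (3), here is a small
user-space sketch. It is not the cbd code: the names and sizes are
invented, and glibc's tsearch()/tfind() (which happen to be implemented
with red-black trees) stand in for the kernel rbtree API. One search
tree is kept per slice of the logical address space, and the tree is
picked by offset, so IOs to different ranges never touch the same tree:

/* Minimal sketch of sharding a cache index across several search
 * trees by logical offset.  Names (cache_key, TREE_SHIFT, ...) are
 * illustrative, not taken from the cbd code.
 */
#include <search.h>
#include <stdio.h>
#include <stdlib.h>

#define N_TREES     16          /* number of index trees per backend  */
#define TREE_SHIFT  30          /* each tree covers a 1 GiB LBA range */

struct cache_key {
        unsigned long long off;        /* logical offset on the backend */
        unsigned long long cache_pos;  /* where the data sits in cache  */
};

static void *tree_roots[N_TREES];      /* one search tree per LBA range */

static int key_cmp(const void *a, const void *b)
{
        const struct cache_key *ka = a, *kb = b;

        return (ka->off > kb->off) - (ka->off < kb->off);
}

/* Pick the tree responsible for a given logical offset. */
static void **tree_for_offset(unsigned long long off)
{
        return &tree_roots[(off >> TREE_SHIFT) % N_TREES];
}

int main(void)
{
        struct cache_key *k = malloc(sizeof(*k));

        k->off = 4ULL << 30;           /* 4 GiB into the backend device */
        k->cache_pos = 0;

        /* Insert and look up in the shard owning this offset; other
         * offsets land in other trees and never contend with this one. */
        tsearch(k, tree_for_offset(k->off), key_cmp);
        printf("lookup hit: %d\n",
               tfind(k, tree_for_offset(k->off), key_cmp) != NULL);
        return 0;
}

Sharding by offset both spreads contention across trees and keeps each
individual tree shallow, which is the property section (3) relies on.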
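Finally, the synchronous write path described in section (2) roughly
looks like the sketch below. It is illustrative only: the structures,
names and the static buffer standing in for the DAX mapping are all
invented. The point is simply that steps (1)-(4) run back to back in
the submitting context, with no submit_bio/bi_end_io hand-off:

/* Sketch of a synchronous cache write: (1) allocate cache space,
 * (2) copy the data, (3) record the index entry, (4) complete the
 * request -- all in one context.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CACHE_SIZE  (1UL << 20)
#define MAX_ENTRIES 1024

static char cache_area[CACHE_SIZE];    /* stands in for the DAX mapping */
static size_t cache_head;              /* next free byte in the cache   */

struct index_entry {                   /* where a logical range lives   */
        uint64_t lba_off;
        uint64_t cache_off;
        uint32_t len;
};

static struct index_entry index_log[MAX_ENTRIES];
static unsigned int index_count;

/* Handle one write request synchronously: no submit_bio, no bi_end_io. */
static int cache_write(uint64_t lba_off, const void *data, uint32_t len)
{
        if (cache_head + len > CACHE_SIZE || index_count >= MAX_ENTRIES)
                return -1;             /* would trigger GC / writeback  */

        size_t pos = cache_head;                 /* (1) allocate space  */
        cache_head += len;

        memcpy(cache_area + pos, data, len);     /* (2) copy via "DAX"  */

        index_log[index_count++] = (struct index_entry){  /* (3) index  */
                .lba_off = lba_off, .cache_off = pos, .len = len,
        };

        return 0;                                /* (4) complete        */
}

int main(void)
{
        char buf[4096] = "payload";

        if (cache_write(0, buf, sizeof(buf)) == 0)
                printf("cached %u extents\n", index_count);
        return 0;
}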