Hi Jens,

Please take a look at this patchset. This is V3 for CBD (CXL Block
Device). CBD supports both single-host and multi-host scenarios, and
allows the use of pmem devices as a block device cache, providing low
latency and high-concurrency performance.

(1) Better latency: 1 iodepth, 1 numjobs, randwrite bs=4K:
    cbd 7.72us vs bcache 25.30us, about 3.3x improvement
(2) Better iops: 1 iodepth, 32 numjobs, randwrite bs=4K:
    cbd 1,400K IOPS vs bcache 210K IOPS, about 6.7x improvement
(3) Better stdev: 1 iodepth, 1 numjobs, randwrite bs=4K:
    cbd stdev=36.45 vs bcache stdev=937.81, about 26x improvement

V3 of the code can be found at:
    https://github.com/DataTravelGuide/linux.git (tag: cbd-v3)

Changelog from V2:
- Refactored cbd_cache.c and cbd_internal.h by splitting them into
  multiple files, making the structure clearer and more readable.
- Added CRC verification for all data and metadata. With all CRC
  verification options enabled in Kconfig, everything written to pmem,
  including data and metadata, is protected by a CRC check.
- Fixed some minor bugs discovered during long-term runs of xfstests.
- Added the cbd-utils project (https://github.com/DataTravelGuide/cbd-utils)
  as the user-space tooling, providing the cbdctrl command for
  cbd-related management operations.

You can create a cbd device with the following commands:

# cbdctrl tp-reg --path /dev/pmem0 --host node0 --format --force
# cbdctrl backend-start --path /dev/sda --start-dev --cache-size 1G
/dev/cbd0
# cbdctrl backend-list
[
   {
      "backend_id": 0,
      "host_id": 0,
      "backend_path": "/dev/sda",
      "alive": true,
      "cache_segs": 64,
      "cache_gc_percent": 70,
      "cache_used_segs": 1,
      "blkdevs": [
         {
            "blkdev_id": 0,
            "host_id": 0,
            "backend_id": 0,
            "dev_name": "/dev/cbd0",
            "alive": true
         }
      ]
   }
]

Additional information about CBD cache:

(1) What is CBD Cache

cbd cache is a *lightweight* solution that uses persistent memory as a
block device cache. It works similarly to bcache, but while bcache uses
block devices as the cache device, cbd cache only supports persistent
memory devices for caching. It accesses the cache device through DAX
and is designed with features specific to persistent memory, such as
multiple cache index trees and synchronous insertion of cached data.

+-----------------------------------------------------------------+
|                           single-host                           |
+-----------------------------------------------------------------+
|                                                                 |
|                        +-----------+     +------------+         |
|                        | /dev/cbd0 |     | /dev/cbd1  |         |
|                        |           |     |            |         |
|  +---------------------|-----------|-----|------------|-------+ |
|  |                     |           |     |            |       | |
|  |    /dev/pmem0       | cbd0 cache|     | cbd1 cache |       | |
|  |                     |           |     |            |       | |
|  +---------------------|-----------|-----|------------|-------+ |
|                        |+---------+|     |+----------+|         |
|                        ||/dev/sda ||     || /dev/sdb ||         |
|                        |+---------+|     |+----------+|         |
|                        +-----------+     +------------+         |
+-----------------------------------------------------------------+

Note: cbd cache is not intended to replace bcache. Instead, it offers
an alternative solution for scenarios where you want to use persistent
memory devices as a block device cache. Another caching technique that
accesses persistent memory through DAX is dm-writecache, but it is
designed for device-mapper based scenarios, while cbd cache and bcache
are caching solutions for block device scenarios. Therefore, I did not
do a comparative analysis between cbd cache and dm-writecache.
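To make the DAX access model above more concrete, here is a minimal
user-space sketch (not the cbd kernel code): it maps persistent memory
and writes cache data with a plain memcpy() followed by a flush, with
no submit_bio round trip through the block layer. The file path,
mapping size and names below are invented for illustration; it assumes
an existing file of at least CACHE_SIZE bytes on an fsdax (-o dax)
mount.

/* Illustrative only: with DAX, "writing to the cache" is a memory
 * copy plus a persistence barrier.  A kernel implementation would use
 * the DAX infrastructure and CPU cache-flush instructions instead of
 * mmap()/msync(), but the shape of the operation is the same.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define CACHE_FILE "/mnt/pmem/cache"   /* made-up file on a -o dax mount */
#define CACHE_SIZE (16UL << 20)        /* 16 MiB mapping for the example */

int main(void)
{
        int fd = open(CACHE_FILE, O_RDWR);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        void *cache = mmap(NULL, CACHE_SIZE, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
        if (cache == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        char buf[4096] = "hello cbd cache";

        /* The cache write is just a memcpy into the mapping ... */
        memcpy(cache, buf, sizeof(buf));

        /* ... followed by a flush to make the data persistent. */
        msync(cache, sizeof(buf), MS_SYNC);

        munmap(cache, CACHE_SIZE);
        close(fd);
        return 0;
}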
(2) light software overhead on the cache write path (low latency)

Handling a cache write request typically involves the following steps:

    (1) allocate cache space -> (2) write data into the cache ->
    (3) record cache index metadata -> (4) complete the request

In cache modules that use block devices as the cache (e.g., bcache),
steps (2) and (3) are asynchronous: during step (2), a bio is submitted
to the cache block device via submit_bio, and only after the bi_end_io
callback completes does another context continue with step (3). This
incurs significant overhead for a persistent memory cache.

cbd cache, being designed for persistent memory, does not need these
asynchronous operations: it can proceed directly with steps (3) and (4)
as soon as the memcpy through DAX completes (a rough sketch of such a
synchronous write path is included at the end of this mail). This makes
a significant difference for small IO. For 4K random writes, cbd cache
achieves a latency of only 7.72us, compared to 25.30us for bcache in
the same test, roughly a 3.3x improvement. Further comparative results
for various block sizes are shown in the table below.

+------------+-------------------------+--------------------------+
| numjobs=1  |        randwrite        |         randread         |
| iodepth=1  +------------+------------+-------------+------------+
| (latency)  | cbd cache  | bcache     | cbd cache   | bcache     |
+------------+------------+------------+-------------+------------+
| bs=512     | 6.10us     | 23.08us    | 4.82us      | 5.57us     |
+------------+------------+------------+-------------+------------+
| bs=1K      | 6.35us     | 21.68us    | 5.38us      | 6.05us     |
+------------+------------+------------+-------------+------------+
| bs=4K      | 7.72us     | 25.30us    | 6.06us      | 6.00us     |
+------------+------------+------------+-------------+------------+
| bs=8K      | 8.92us     | 27.73us    | 7.24us      | 7.35us     |
+------------+------------+------------+-------------+------------+
| bs=16K     | 12.25us    | 34.04us    | 9.06us      | 9.10us     |
+------------+------------+------------+-------------+------------+
| bs=32K     | 16.77us    | 49.24us    | 14.10us     | 16.18us    |
+------------+------------+------------+-------------+------------+
| bs=64K     | 30.52us    | 63.72us    | 30.69us     | 30.38us    |
+------------+------------+------------+-------------+------------+
| bs=128K    | 51.66us    | 114.69us   | 38.47us     | 39.10us    |
+------------+------------+------------+-------------+------------+
| bs=256K    | 110.16us   | 204.41us   | 79.64us     | 99.98us    |
+------------+------------+------------+-------------+------------+
| bs=512K    | 205.52us   | 398.70us   | 122.15us    | 131.97us   |
+------------+------------+------------+-------------+------------+
| bs=1M      | 493.57us   | 623.31us   | 233.48us    | 246.56us   |
+------------+------------+------------+-------------+------------+

(3) multi-queue and multiple cache trees (high iops)

Persistent memory offers very high hardware concurrency. If a single
indexing tree were used to manage all space indexing, that tree would
become a concurrency bottleneck. cbd cache therefore maintains an
independent indexing tree for each backend, and the indexing tree of
each backend's cache is further divided into multiple RB trees based on
the logical address space; every IO finds its indexing tree by offset
(see the sketch below). This design increases concurrency while keeping
the depth of each indexing tree from growing too large.
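To illustrate the multi-tree idea from section (3), here is a small
user-space sketch. It is not the cbd code: the names and sizes are
invented, and glibc's tsearch()/tfind() (which happen to be implemented
with red-black trees) stand in for the kernel rbtree API. One search
tree is kept per slice of the logical address space, and the tree is
picked by offset, so IOs to different ranges never touch the same tree:

/* Minimal sketch of sharding a cache index across several search
 * trees by logical offset.  Names (cache_key, TREE_SHIFT, ...) are
 * illustrative, not taken from the cbd code.
 */
#include <search.h>
#include <stdio.h>
#include <stdlib.h>

#define N_TREES     16          /* number of index trees per backend  */
#define TREE_SHIFT  30          /* each tree covers a 1 GiB LBA range */

struct cache_key {
        unsigned long long off;        /* logical offset on the backend */
        unsigned long long cache_pos;  /* where the data sits in cache  */
};

static void *tree_roots[N_TREES];      /* one search tree per LBA range */

static int key_cmp(const void *a, const void *b)
{
        const struct cache_key *ka = a, *kb = b;

        return (ka->off > kb->off) - (ka->off < kb->off);
}

/* Pick the tree responsible for a given logical offset. */
static void **tree_for_offset(unsigned long long off)
{
        return &tree_roots[(off >> TREE_SHIFT) % N_TREES];
}

int main(void)
{
        struct cache_key *k = malloc(sizeof(*k));

        k->off = 4ULL << 30;           /* 4 GiB into the backend device */
        k->cache_pos = 0;

        /* Insert and look up in the shard owning this offset; other
         * offsets land in other trees and never contend with this one. */
        tsearch(k, tree_for_offset(k->off), key_cmp);
        printf("lookup hit: %d\n",
               tfind(k, tree_for_offset(k->off), key_cmp) != NULL);
        return 0;
}

Sharding by offset both spreads contention across trees and keeps each
individual tree shallow, which is the property section (3) relies on.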
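Finally, the synchronous write path described in section (2) roughly
looks like the sketch below. It is illustrative only: the structures,
names and the static buffer standing in for the DAX mapping are all
invented. The point is simply that steps (1)-(4) run back to back in
the submitting context, with no submit_bio/bi_end_io hand-off:

/* Sketch of a synchronous cache write: (1) allocate cache space,
 * (2) copy the data, (3) record the index entry, (4) complete the
 * request -- all in one context.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CACHE_SIZE  (1UL << 20)
#define MAX_ENTRIES 1024

static char cache_area[CACHE_SIZE];    /* stands in for the DAX mapping */
static size_t cache_head;              /* next free byte in the cache   */

struct index_entry {                   /* where a logical range lives   */
        uint64_t lba_off;
        uint64_t cache_off;
        uint32_t len;
};

static struct index_entry index_log[MAX_ENTRIES];
static unsigned int index_count;

/* Handle one write request synchronously: no submit_bio, no bi_end_io. */
static int cache_write(uint64_t lba_off, const void *data, uint32_t len)
{
        if (cache_head + len > CACHE_SIZE || index_count >= MAX_ENTRIES)
                return -1;             /* would trigger GC / writeback  */

        size_t pos = cache_head;                 /* (1) allocate space  */
        cache_head += len;

        memcpy(cache_area + pos, data, len);     /* (2) copy via "DAX"  */

        index_log[index_count++] = (struct index_entry){  /* (3) index  */
                .lba_off = lba_off, .cache_off = pos, .len = len,
        };

        return 0;                                /* (4) complete        */
}

int main(void)
{
        char buf[4096] = "payload";

        if (cache_write(0, buf, sizeof(buf)) == 0)
                printf("cached %u extents\n", index_count);
        return 0;
}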