Hi all,

This is V2 for CBD (CXL Block Device). The most important feature
mentioned in the RFC, cbd cache, is finally ready for review. The
introduction of cbd cache makes cbd highly useful in single-host
scenarios. At the same time, since the hardware features are not yet
ready, this version makes multi-host scenarios an optional feature in
Kconfig, with a note indicating that multi-host mode requires shared
memory with hardware-supported cache coherency.

V2 of the code can be found at:
	https://github.com/DataTravelGuide/linux.git branch cbd

(1) What is cbd cache

cbd cache is a *lightweight* solution that uses persistent memory as a
block device cache. It works similarly to bcache, but while bcache uses
block devices as the cache device, cbd cache only supports persistent
memory devices for caching. It accesses the cache device through DAX
and is designed with features specific to persistent memory scenarios,
such as multiple cache trees and synchronous insertion of cached data.

+------------------------------------------------------------------+
|                           single-host                             |
+------------------------------------------------------------------+
|                                                                  |
|                                                                  |
|                                                                  |
|                                                                  |
|                                                                  |
|                        +-----------+     +------------+          |
|                        | /dev/cbd0 |     | /dev/cbd1  |          |
|                        |           |     |            |          |
|  +---------------------|-----------|-----|------------|-------+  |
|  |                     |           |     |            |       |  |
|  |    /dev/pmem0       | cbd0 cache|     | cbd1 cache |       |  |
|  |                     |           |     |            |       |  |
|  +---------------------|-----------|-----|------------|-------+  |
|                        |+---------+|     |+----------+|          |
|                        ||/dev/sda ||     || /dev/sdb ||          |
|                        |+---------+|     |+----------+|          |
|                        +-----------+     +------------+          |
+------------------------------------------------------------------+

Note: cbd cache is not intended to replace bcache. Instead, it offers
an alternative solution specifically suited for scenarios where you
want to use persistent memory devices as a block device cache. Another
caching technique that accesses persistent memory through DAX is
dm-writecache, but it is designed for device-mapper scenarios, while
cbd cache and bcache are caching solutions for block device scenarios.
Therefore, I did not do a comparative analysis between cbd cache and
dm-writecache.

(2) Light software overhead for cache writes (low latency)

Handling a cache write request typically involves the following steps:
(1) allocate cache space -> (2) write data to the cache -> (3) record
cache index metadata -> (4) return the result.

In cache modules that use block devices as the cache device (e.g.,
bcache), steps (2) and (3) are asynchronous: step (2) issues a
submit_bio to the cache block device, and only after the bi_end_io
callback completes does another context continue with step (3). This
incurs significant overhead for a persistent memory cache. cbd cache,
which is designed for persistent memory, does not need this
asynchronous hand-off: it can proceed with steps (3) and (4) directly
after completing the memcpy through DAX.

This makes a significant difference for small IO. For 4K random
writes, cbd cache achieves a latency of only 7.72us, compared with
25.30us for bcache in the same test, about 3x lower latency.
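To make the difference concrete, below is a minimal, hypothetical
sketch of such a synchronous DAX write path. The names (struct
pmem_cache, pmem_cache_write) and the bump allocator are made up for
illustration; they are not the actual cbd cache interfaces.

/*
 * Hypothetical sketch only: struct pmem_cache and pmem_cache_write()
 * are made-up names, and the bump allocator stands in for cbd's real
 * cache space management.
 */
#include <linux/errno.h>
#include <linux/string.h>	/* memcpy_flushcache() */
#include <linux/types.h>

struct pmem_cache {
	void	*dax_addr;	/* DAX mapping of the persistent memory area */
	u64	next_free;	/* naive bump allocator for cache space */
	u64	size;
};

static int pmem_cache_write(struct pmem_cache *cache, const void *data,
			    u64 len, u64 *out_pos)
{
	u64 pos;

	/* (1) allocate cache space (grossly simplified) */
	if (cache->next_free + len > cache->size)
		return -ENOSPC;
	pos = cache->next_free;
	cache->next_free += len;

	/*
	 * (2) write the data through DAX: a plain memcpy-style copy,
	 * with no submit_bio() and no bi_end_io completion to wait for.
	 */
	memcpy_flushcache(cache->dax_addr + pos, data, len);

	/*
	 * (3) recording the cache index metadata and (4) returning the
	 * result can now happen immediately, in the same context.
	 */
	*out_pos = pos;
	return 0;
}

Because the copy completes synchronously through DAX, steps (3) and
(4) can run in the same context, with no bio submission or completion
callback involved.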
Further comparative results for various scenarios are shown in the
table below.

+------------+-------------------------+--------------------------+
| numjobs=1  |        randwrite        |         randread         |
| iodepth=1  +------------+------------+-------------+------------+
| (latency)  | cbd cache  |   bcache   |  cbd cache  |   bcache   |
+------------+------------+------------+-------------+------------+
|   bs=512   |   6.10us   |  23.08us   |   4.82us    |   5.57us   |
+------------+------------+------------+-------------+------------+
|   bs=1K    |   6.35us   |  21.68us   |   5.38us    |   6.05us   |
+------------+------------+------------+-------------+------------+
|   bs=4K    |   7.72us   |  25.30us   |   6.06us    |   6.00us   |
+------------+------------+------------+-------------+------------+
|   bs=8K    |   8.92us   |  27.73us   |   7.24us    |   7.35us   |
+------------+------------+------------+-------------+------------+
|   bs=16K   |  12.25us   |  34.04us   |   9.06us    |   9.10us   |
+------------+------------+------------+-------------+------------+
|   bs=32K   |  16.77us   |  49.24us   |  14.10us    |  16.18us   |
+------------+------------+------------+-------------+------------+
|   bs=64K   |  30.52us   |  63.72us   |  30.69us    |  30.38us   |
+------------+------------+------------+-------------+------------+
|   bs=128K  |  51.66us   | 114.69us   |  38.47us    |  39.10us   |
+------------+------------+------------+-------------+------------+
|   bs=256K  | 110.16us   | 204.41us   |  79.64us    |  99.98us   |
+------------+------------+------------+-------------+------------+
|   bs=512K  | 205.52us   | 398.70us   | 122.15us    | 131.97us   |
+------------+------------+------------+-------------+------------+
|   bs=1M    | 493.57us   | 623.31us   | 233.48us    | 246.56us   |
+------------+------------+------------+-------------+------------+

(3) Multi-queue and multiple cache trees (high IOPS)

Persistent memory offers very high hardware concurrency. If a single
indexing tree were used to manage the cache space, that tree would
become a concurrency bottleneck. cbd cache therefore manages an
independent indexing tree for each backend, and the indexing tree for
each backend's cache is further divided into multiple RB trees based
on the logical address space. Every IO finds its corresponding RB tree
based on its offset, as sketched below. This design increases
concurrency while ensuring that the depth of any single indexing tree
does not grow too large.
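As a rough illustration of this layout (again with made-up names:
SUBTREE_SHIFT, NR_SUBTREES, struct backend_cache_tree and
cache_subtree_of are examples, not the actual cbd structures),
splitting one backend's index into fixed-range RB subtrees, each with
its own lock, could look like this:

/*
 * Hypothetical sketch only: the names and constants below are invented
 * for illustration and do not match the actual cbd cache structures.
 */
#include <linux/rbtree.h>
#include <linux/spinlock.h>
#include <linux/types.h>

#define SUBTREE_SHIFT	22	/* each RB tree covers 4MiB of logical space (example) */
#define NR_SUBTREES	64	/* example number of RB trees per backend cache */

struct cache_subtree {
	struct rb_root	root;	/* index entries for one logical range */
	spinlock_t	lock;	/* protects only this subtree */
};

struct backend_cache_tree {
	struct cache_subtree	subtrees[NR_SUBTREES];
};

/* Route an IO to the RB tree responsible for its logical offset. */
static struct cache_subtree *
cache_subtree_of(struct backend_cache_tree *tree, u64 off)
{
	return &tree->subtrees[(off >> SUBTREE_SHIFT) % NR_SUBTREES];
}

With a layout like this, IOs to different logical ranges take
different locks and walk different, shallower trees, which is what
lets the index keep up with the concurrency of persistent memory.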