Hi all,

This is V2 for CBD (CXL Block Device). The most important feature
mentioned in the RFC, cbd cache, is finally ready for review. The
introduction of cbd cache makes cbd highly useful in single-host
scenarios. At the same time, since the hardware features are not yet
ready, this version makes multi-host scenarios an optional feature in
Kconfig, with a note indicating that multi-host mode requires shared
memory with hardware-supported cache coherency.

V2 of the code can be found at:
	https://github.com/DataTravelGuide/linux.git branch cbd

(1) What is cbd cache

cbd cache is a *lightweight* solution that uses persistent memory as a
block device cache. It works similarly to bcache, but while bcache uses
block devices as the cache device, cbd cache only supports persistent
memory devices for caching. It accesses the cache device through DAX
and is designed with features specific to persistent memory scenarios,
such as multiple cache trees and synchronous insertion of cached data.

+------------------------------------------------------------------+
|                           single-host                             |
+------------------------------------------------------------------+
|                                                                  |
|                                                                  |
|                                                                  |
|                                                                  |
|                                                                  |
|                        +-----------+     +------------+          |
|                        | /dev/cbd0 |     | /dev/cbd1  |          |
|                        |           |     |            |          |
|  +---------------------|-----------|-----|------------|-------+  |
|  |                     |           |     |            |       |  |
|  |    /dev/pmem0       | cbd0 cache|     | cbd1 cache |       |  |
|  |                     |           |     |            |       |  |
|  +---------------------|-----------|-----|------------|-------+  |
|                        |+---------+|     |+----------+|          |
|                        ||/dev/sda ||     || /dev/sdb ||          |
|                        |+---------+|     |+----------+|          |
|                        +-----------+     +------------+          |
+------------------------------------------------------------------+

Note: cbd cache is not intended to replace bcache. Instead, it offers
an alternative solution specifically suited for scenarios where you
want to use persistent memory devices as a block device cache. Another
caching technique that accesses persistent memory through DAX is
dm-writecache, but it is designed for device-mapper scenarios, while
cbd cache and bcache are caching solutions for block device scenarios.
Therefore, I did not do a comparative analysis between cbd cache and
dm-writecache.

(2) Light software overhead for cache writes (low latency)

Handling a cache write request typically involves the following steps:
(1) allocate cache space -> (2) write data to the cache -> (3) record
cache index metadata -> (4) return the result.

In cache modules that use block devices as the cache device (e.g.,
bcache), steps (2) and (3) are asynchronous: step (2) issues a
submit_bio to the cache block device, and only after the bi_end_io
callback completes does another context continue with step (3). This
incurs significant overhead for a persistent memory cache. cbd cache,
which is designed for persistent memory, does not need this
asynchronous hand-off: it can proceed with steps (3) and (4) directly
after completing the memcpy through DAX.

This makes a significant difference for small IO. For 4K random
writes, cbd cache achieves a latency of only 7.72us, compared with
25.30us for bcache in the same test, about 3x lower latency.
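To make the difference concrete, below is a minimal, hypothetical
sketch of such a synchronous DAX write path. The names (struct
pmem_cache, pmem_cache_write) and the bump allocator are made up for
illustration; they are not the actual cbd cache interfaces.

/*
 * Hypothetical sketch only: struct pmem_cache and pmem_cache_write()
 * are made-up names, and the bump allocator stands in for cbd's real
 * cache space management.
 */
#include <linux/errno.h>
#include <linux/string.h>	/* memcpy_flushcache() */
#include <linux/types.h>

struct pmem_cache {
	void	*dax_addr;	/* DAX mapping of the persistent memory area */
	u64	next_free;	/* naive bump allocator for cache space */
	u64	size;
};

static int pmem_cache_write(struct pmem_cache *cache, const void *data,
			    u64 len, u64 *out_pos)
{
	u64 pos;

	/* (1) allocate cache space (grossly simplified) */
	if (cache->next_free + len > cache->size)
		return -ENOSPC;
	pos = cache->next_free;
	cache->next_free += len;

	/*
	 * (2) write the data through DAX: a plain memcpy-style copy,
	 * with no submit_bio() and no bi_end_io completion to wait for.
	 */
	memcpy_flushcache(cache->dax_addr + pos, data, len);

	/*
	 * (3) recording the cache index metadata and (4) returning the
	 * result can now happen immediately, in the same context.
	 */
	*out_pos = pos;
	return 0;
}

Because the copy completes synchronously through DAX, steps (3) and
(4) can run in the same context, with no bio submission or completion
callback involved.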
Further comparative results for various scenarios are shown in the
table below.

+------------+-------------------------+--------------------------+
| numjobs=1  |        randwrite        |         randread         |
| iodepth=1  +------------+------------+-------------+------------+
| (latency)  | cbd cache  |   bcache   |  cbd cache  |   bcache   |
+------------+------------+------------+-------------+------------+
|   bs=512   |   6.10us   |  23.08us   |   4.82us    |   5.57us   |
+------------+------------+------------+-------------+------------+
|   bs=1K    |   6.35us   |  21.68us   |   5.38us    |   6.05us   |
+------------+------------+------------+-------------+------------+
|   bs=4K    |   7.72us   |  25.30us   |   6.06us    |   6.00us   |
+------------+------------+------------+-------------+------------+
|   bs=8K    |   8.92us   |  27.73us   |   7.24us    |   7.35us   |
+------------+------------+------------+-------------+------------+
|   bs=16K   |  12.25us   |  34.04us   |   9.06us    |   9.10us   |
+------------+------------+------------+-------------+------------+
|   bs=32K   |  16.77us   |  49.24us   |  14.10us    |  16.18us   |
+------------+------------+------------+-------------+------------+
|   bs=64K   |  30.52us   |  63.72us   |  30.69us    |  30.38us   |
+------------+------------+------------+-------------+------------+
|   bs=128K  |  51.66us   | 114.69us   |  38.47us    |  39.10us   |
+------------+------------+------------+-------------+------------+
|   bs=256K  | 110.16us   | 204.41us   |  79.64us    |  99.98us   |
+------------+------------+------------+-------------+------------+
|   bs=512K  | 205.52us   | 398.70us   | 122.15us    | 131.97us   |
+------------+------------+------------+-------------+------------+
|   bs=1M    | 493.57us   | 623.31us   | 233.48us    | 246.56us   |
+------------+------------+------------+-------------+------------+

(3) Multi-queue and multiple cache trees (high IOPS)

Persistent memory offers very high hardware concurrency. If a single
indexing tree were used to manage the cache space, that tree would
become a concurrency bottleneck. cbd cache therefore manages an
independent indexing tree for each backend, and the indexing tree for
each backend's cache is further divided into multiple RB trees based
on the logical address space. Every IO finds its corresponding RB tree
based on its offset, as sketched below. This design increases
concurrency while ensuring that the depth of any single indexing tree
does not grow too large.
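As a rough illustration of this layout (again with made-up names:
SUBTREE_SHIFT, NR_SUBTREES, struct backend_cache_tree and
cache_subtree_of are examples, not the actual cbd structures),
splitting one backend's index into fixed-range RB subtrees, each with
its own lock, could look like this:

/*
 * Hypothetical sketch only: the names and constants below are invented
 * for illustration and do not match the actual cbd cache structures.
 */
#include <linux/rbtree.h>
#include <linux/spinlock.h>
#include <linux/types.h>

#define SUBTREE_SHIFT	22	/* each RB tree covers 4MiB of logical space (example) */
#define NR_SUBTREES	64	/* example number of RB trees per backend cache */

struct cache_subtree {
	struct rb_root	root;	/* index entries for one logical range */
	spinlock_t	lock;	/* protects only this subtree */
};

struct backend_cache_tree {
	struct cache_subtree	subtrees[NR_SUBTREES];
};

/* Route an IO to the RB tree responsible for its logical offset. */
static struct cache_subtree *
cache_subtree_of(struct backend_cache_tree *tree, u64 off)
{
	return &tree->subtrees[(off >> SUBTREE_SHIFT) % NR_SUBTREES];
}

With a layout like this, IOs to different logical ranges take
different locks and walk different, shallower trees, which is what
lets the index keep up with the concurrency of persistent memory.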