Bcache:
=================================================
  write: IOPS=39.1k, BW=153MiB/s (160MB/s)(5120MiB/33479msec); 0 zone resets
    slat (usec): min=4, max=157364, avg=12.47, stdev=138.93
    clat (nsec): min=1168, max=474615k, avg=11808.80, stdev=927287.74
     lat (usec): min=11, max=474622, avg=24.28, stdev=937.81
    clat percentiles (nsec):
     |  1.00th=[   1256],  5.00th=[   1304], 10.00th=[   1320],
     | 20.00th=[   1400], 30.00th=[   1448], 40.00th=[   1672],
     | 50.00th=[   8640], 60.00th=[   9152], 70.00th=[   9664],
     | 80.00th=[  10048], 90.00th=[  11328], 95.00th=[  19072],
     | 99.00th=[  27776], 99.50th=[  36608], 99.90th=[ 173056],
     | 99.95th=[ 856064], 99.99th=[2039808]
   bw (  KiB/s): min=28032, max=214664, per=99.69%, avg=156122.03, stdev=51649.87, samples=66
   iops        : min= 7008, max=53666, avg=39030.53, stdev=12912.50, samples=66
  lat (usec)   : 2=41.55%, 4=4.59%, 10=32.70%, 20=16.37%, 50=4.45%
  lat (usec)   : 100=0.10%, 250=0.17%, 500=0.02%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.03%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%
  lat (msec)   : 100=0.01%, 250=0.01%, 500=0.01%
  cpu          : usr=11.93%, sys=38.61%, ctx=1311384, majf=0, minf=382
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1310718,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=153MiB/s (160MB/s), 153MiB/s-153MiB/s (160MB/s-160MB/s), io=5120MiB (5369MB), run=33479-33479msec

Disk stats (read/write):
    bcache0: ios=0/1305444, sectors=0/10443552, merge=0/0, ticks=0/21789, in_queue=21789, util=65.13%, aggrios=0/0, aggsectors=0/0, aggrmerge=0/0, aggrticks=0/0, aggrin_queue=0, aggrutil=0.00%
  ram0: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%
  pmem0: ios=0/0, sectors=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%

CBD cache:
==============================================
  write: IOPS=133k, BW=520MiB/s (545MB/s)(5120MiB/9848msec); 0 zone resets
    slat (usec): min=3, max=2786, avg= 5.84, stdev=36.41
    clat (nsec): min=852, max=132404, avg=959.09, stdev=436.60
     lat (usec): min=4, max=2794, avg= 6.80, stdev=36.45
    clat percentiles (nsec):
     |  1.00th=[  884],  5.00th=[  900], 10.00th=[  908], 20.00th=[  916],
     | 30.00th=[  924], 40.00th=[  924], 50.00th=[  932], 60.00th=[  940],
     | 70.00th=[  948], 80.00th=[  964], 90.00th=[ 1004], 95.00th=[ 1064],
     | 99.00th=[ 1192], 99.50th=[ 1432], 99.90th=[ 6688], 99.95th=[ 7712],
     | 99.99th=[12480]
   bw (  KiB/s): min=487088, max=552928, per=99.96%, avg=532154.95, stdev=18228.92, samples=19
   iops        : min=121772, max=138232, avg=133038.84, stdev=4557.32, samples=19
  lat (nsec)   : 1000=89.09%
  lat (usec)   : 2=10.76%, 4=0.03%, 10=0.09%, 20=0.03%, 50=0.01%
  lat (usec)   : 100=0.01%, 250=0.01%
  cpu          : usr=23.93%, sys=76.03%, ctx=61, majf=0, minf=16
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1310720,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=520MiB/s (545MB/s), 520MiB/s-520MiB/s (545MB/s-545MB/s), io=5120MiB (5369MB), run=9848-9848msec

Disk stats (read/write):
  cbd0: ios=0/1280334, sectors=0/10242672, merge=0/0, ticks=0/0, in_queue=0, util=43.07%

(5) no need to format your existing disk

As a lightweight block storage
caching technology, cbd cache does not require storing any metadata on the
backend disk. This allows users to easily add caching to existing disks
without any formatting operations or data migration. They can also stop
using the cbd cache at any time without complications; the backend disk can
then be used independently as a raw disk.

(6) backend device is crash-consistent

The writeback mechanism of cbd cache strictly follows a log-structured
approach when writing data back. Even if dirty cache data is overwritten by
new data (e.g., the old data A at 0-4K is overwritten with new data B), the
old data A is written back first, and only then is the new data B written
back to overwrite it on the backend disk. This ensures that the backend disk
maintains crash consistency: in the event of a pmem device failure, the data
in the cache is lost, but the data on the backend disk remains usable and
crash-consistent. This feature is particularly useful in cloud storage for
disaster recovery scenarios. (A minimal user-space sketch of this writeback
ordering is included below, after point (8).)

Note that this approach may lead to cache space utilization issues if there
are many overwrite operations. However, modern file systems such as Btrfs
and F2FS take wear leveling of the disk into account and tend to avoid
writing repeatedly to the same area, so they do not generate a large number
of overwrite writes. Likewise, modern databases, especially those using LSM
engines, rarely perform overwrite operations. Additionally, there is an
entry on the TODO list to provide a backend_consistency=false parameter that
would let users trade this guarantee for better cache space utilization;
whether it gets implemented depends on how urgent the requirement turns out
to be.

(7) cache space for each disk is configurable

When enabling caching for a backend, we can specify the cache space size for
that backend. This is different from bcache, where all backing devices
dynamically share the cache space within a single cache device. Dynamic
sharing can improve cache utilization through time-sharing, but it also
makes cache behavior unpredictable. In enterprise applications, it is
important to have a precise understanding of each disk's performance; when
multiple disks dynamically share the cache, the exact amount of cache each
disk receives becomes uncertain. cbd cache assigns a dedicated cache space
to each disk, ensuring that the cache is exclusive and not affected by other
disks, which makes cache behavior more predictable. (A small sketch of this
exclusive reservation is also included below.)

(8) Finally, all the performance test results above were obtained using the
`memmap=20G!4G` kernel option to simulate the `/dev/pmem0` device.
Additionally, the cbd code is run through cbd-tests by default, which
includes the xfstests suite, and it passes xfstests. If anyone has a real
CXL memory device, it would be great if you could help with the testing.
Thanks!
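To make the crash-consistency argument in point (6) concrete, here is a
minimal user-space sketch of log-ordered writeback. It is not the cbd_cache.c
code; all structures and names below are hypothetical and only illustrate why
flushing cache entries strictly in log order leaves the backend holding a
complete, older version of any overwritten range rather than a mix of old and
new data.

/*
 * Minimal user-space sketch of log-ordered writeback (NOT the cbd_cache.c
 * implementation; all names here are made up). It only shows the ordering
 * argument: because entries are replayed in the order they were logged,
 * any interruption leaves the backend reflecting a prefix of the write
 * history, never a mix of old and new data for the same range.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BACKEND_SIZE	(16 * 4096)
#define BLOCK		4096

/* One dirty extent recorded in the (append-only) cache log. */
struct log_entry {
	uint64_t off;		/* offset on the backend disk */
	uint32_t len;		/* length of the extent */
	char	 data[BLOCK];	/* cached payload */
};

static char backend[BACKEND_SIZE];	/* stands in for the backend disk */

/*
 * Write back entries [0, n) strictly in the order they were logged.
 * Even if A (old) and B (new) cover the same 0-4K range, A is flushed
 * before B, so an interruption between the two still leaves a complete,
 * older version of that range on the backend.
 */
static void writeback_in_log_order(const struct log_entry *log, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		/* a real implementation would also flush each write */
		memcpy(backend + log[i].off, log[i].data, log[i].len);
	}
}

int main(void)
{
	struct log_entry log[2] = {
		{ .off = 0, .len = BLOCK },	/* A: first write to 0-4K */
		{ .off = 0, .len = BLOCK },	/* B: overwrite of 0-4K  */
	};

	memset(log[0].data, 'A', BLOCK);
	memset(log[1].data, 'B', BLOCK);

	writeback_in_log_order(log, 2);
	printf("backend[0] after writeback: %c\n", backend[0]);	/* 'B' */
	return 0;
}

Running the sketch prints 'B' for the final backend contents; the interesting
case is an interruption between the two writeback steps, after which the
backend still holds a complete copy of A rather than torn data.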
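Similarly, for point (7), a tiny sketch of exclusive, per-backend cache space
(again with made-up names, sizes, and a deliberately trivial allocator, not
the real cbd segment code): each backend is handed a fixed slice of the cache
when caching is enabled, so one backend's workload cannot consume another
backend's cache and per-disk behavior stays predictable.

/*
 * Hypothetical sketch of per-backend, exclusive cache space reservation.
 * Each backend gets a fixed range of cache segments at enable time instead
 * of dynamically sharing one pool, trading some utilization for
 * predictability.
 */
#include <stddef.h>
#include <stdio.h>

#define TOTAL_SEGS	512	/* cache segments available on the transport */

struct backend_cache {
	const char *name;
	size_t seg_start;	/* first segment owned by this backend */
	size_t seg_count;	/* number of segments, fixed at enable time */
};

static size_t next_free_seg;	/* simple bump allocator over the segment space */

/* Reserve an exclusive, fixed-size slice of the cache for one backend. */
static int cache_enable(struct backend_cache *bc, const char *name, size_t segs)
{
	if (next_free_seg + segs > TOTAL_SEGS)
		return -1;	/* not enough cache space left */

	bc->name = name;
	bc->seg_start = next_free_seg;
	bc->seg_count = segs;
	next_free_seg += segs;
	return 0;
}

int main(void)
{
	struct backend_cache sda, sdb;

	/* e.g. 512MiB for sda and 1GiB for sdb if a segment were 4MiB */
	if (cache_enable(&sda, "sda", 128) || cache_enable(&sdb, "sdb", 256))
		return 1;

	printf("%s: segs [%zu, %zu)\n", sda.name, sda.seg_start,
	       sda.seg_start + sda.seg_count);
	printf("%s: segs [%zu, %zu)\n", sdb.name, sdb.seg_start,
	       sdb.seg_start + sdb.seg_count);
	return 0;
}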
Dongsheng Yang (8):
  cbd: introduce cbd_transport
  cbd: introduce cbd_host
  cbd: introduce cbd_segment
  cbd: introduce cbd_channel
  cbd: introduce cbd_cache
  cbd: introduce cbd_blkdev
  cbd: introduce cbd_backend
  block: Init for CBD(CXL Block Device) module

 drivers/block/Kconfig             |    2 +
 drivers/block/Makefile            |    2 +
 drivers/block/cbd/Kconfig         |   45 +
 drivers/block/cbd/Makefile        |    3 +
 drivers/block/cbd/cbd_backend.c   |  395 +++++
 drivers/block/cbd/cbd_blkdev.c    |  433 ++++++
 drivers/block/cbd/cbd_cache.c     | 2410 +++++++++++++++++++++++++++++
 drivers/block/cbd/cbd_channel.c   |   96 ++
 drivers/block/cbd/cbd_handler.c   |  242 +++
 drivers/block/cbd/cbd_host.c      |  129 ++
 drivers/block/cbd/cbd_internal.h  | 1193 ++++++++++++++
 drivers/block/cbd/cbd_main.c      |  224 +++
 drivers/block/cbd/cbd_queue.c     |  574 +++++++
 drivers/block/cbd/cbd_segment.c   |  349 +++++
 drivers/block/cbd/cbd_transport.c |  957 ++++++++++++
 15 files changed, 7054 insertions(+)
 create mode 100644 drivers/block/cbd/Kconfig
 create mode 100644 drivers/block/cbd/Makefile
 create mode 100644 drivers/block/cbd/cbd_backend.c
 create mode 100644 drivers/block/cbd/cbd_blkdev.c
 create mode 100644 drivers/block/cbd/cbd_cache.c
 create mode 100644 drivers/block/cbd/cbd_channel.c
 create mode 100644 drivers/block/cbd/cbd_handler.c
 create mode 100644 drivers/block/cbd/cbd_host.c
 create mode 100644 drivers/block/cbd/cbd_internal.h
 create mode 100644 drivers/block/cbd/cbd_main.c
 create mode 100644 drivers/block/cbd/cbd_queue.c
 create mode 100644 drivers/block/cbd/cbd_segment.c
 create mode 100644 drivers/block/cbd/cbd_transport.c

--
2.34.1