Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)

On Fri, Apr 26, 2024 at 9:48 PM, Gregory Price wrote:
On Fri, Apr 26, 2024 at 09:25:53AM +0800, Dongsheng Yang wrote:


On Wed, Apr 24, 2024 at 11:14 PM, Gregory Price wrote:
On Wed, Apr 24, 2024 at 02:33:28PM +0800, Dongsheng Yang wrote:


On Wed, Apr 24, 2024 at 12:29 PM, Dan Williams wrote:
Dongsheng Yang wrote:
From: Dongsheng Yang <dongsheng.yang.linux@xxxxxxxxx>

Hi all,
	This patchset introduce cbd (CXL block device). It's based on linux 6.8, and available at:
	https://github.com/DataTravelGuide/linux

[..]
(4) dax is not supported yet:
	Same as famfs, dax devices are not supported here, because dax devices do not support
dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support DAX mode.

I am glad that famfs is mentioned here; it demonstrates you know about
it. However, unfortunately this cover letter does not offer any analysis
of *why* the Linux project should consider this additional approach to
the inter-host shared-memory enabling problem.

To be clear I am neutral at best on some of the initiatives around CXL
memory sharing vs pooling, but famfs at least jettisons block-devices
and gets closer to a purpose-built memory semantic.

So my primary question is why would Linux need both famfs and cbd? I am
sure famfs would love feedback and help vs developing competing efforts.

Hi,
	Thanks for your reply. IIUC, in famfs the data is stored in shared
memory, and the related nodes can share the data inside that file system;
whereas cbd does not store data in shared memory. It uses shared memory as a
channel for data transmission, and the actual data is stored in the backend
block device on the remote node. In cbd, shared memory works more like a
network connecting different hosts.
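Conceptually, the channel in shared memory can be pictured as something like
the following. This is an illustrative sketch only, not the actual cbd
on-channel layout: cmd_head, compr_head and cbd_se come from the discussion
below, while the tail fields and ring sizes are made-up placeholders.

#include <linux/types.h>

#define EXAMPLE_CMD_RING_SIZE   (16 * 1024)	/* made-up size */
#define EXAMPLE_COMPR_RING_SIZE (16 * 1024)	/* made-up size */

/* The blkdev side (compute node) and the backend handler (storage node)
 * exchange requests and completions through this region; the actual I/O
 * data lives on the backend block device. */
struct example_channel {
	__u64 cmd_head;		/* advanced by blkdev after queuing a cbd_se          */
	__u64 cmd_tail;		/* advanced by the backend after consuming a cbd_se    */
	__u64 compr_head;	/* advanced by the backend after posting a completion  */
	__u64 compr_tail;	/* advanced by blkdev after reaping a completion       */
	__u8  cmd_ring[EXAMPLE_CMD_RING_SIZE];		/* submission entries  */
	__u8  compr_ring[EXAMPLE_COMPR_RING_SIZE];	/* completion entries  */
	__u8  data[];					/* I/O payload staging */
};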


Couldn't you basically just allocate a file for use as a uni-directional
buffer on top of FAMFS and achieve the same thing without the need for
additional kernel support? Similar in a sense to allocating a file on
network storage and pinging the remote host when it's ready (except now
it's fast!)

I'm not entirely sure I follow your suggestion. I guess it means that cbd
would no longer directly manage the pmem device, but would allocate files on
famfs to transfer data. I didn't do it this way for a few reasons. One is
that cbd_transport only needs a DAX device to access shared memory, and
cbd's space-management requirements are very simple, so there is no need to
rely on a file system layer, which would add architectural complexity.

However, we would still need cbd_blkdev to provide a block device, so it
wouldn't "achieve the same thing without the need for additional kernel support".

Could you please provide more specific details about your suggestion?

Fundamentally you're shuffling bits from one place to another; the
ultimate target is storage located on another device, as opposed to
the memory itself.  So you're using CXL as a transport medium.

Could you not do the same thing with a file in FAMFS, and put all of
the transport logic in userland? Then you'd just have what looks like
a kernel bypass transport mechanism built on top of a file backed by
shared memory.

Basically it's unclear to me why this must be done in the kernel.
Performance? Explicit bypass? Some technical reason I'm missing?
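
For illustration, such a userland bypass could look roughly like this. It is a
hypothetical sketch: it only assumes the famfs file can be mmap()ed, and it
uses a single "ready" flag instead of a real ring protocol, with no error
handling or explicit cache management.

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

struct msg {
	volatile uint64_t ready;	/* set by the producer, polled by the consumer */
	uint64_t len;
	char payload[4096];
};

/* Write one payload into a file on a shared-memory filesystem, then
 * publish it by flipping the ready flag. */
int send_one(const char *path, const void *buf, size_t len)
{
	int fd = open(path, O_RDWR);
	if (fd < 0)
		return -1;

	struct msg *m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
			     MAP_SHARED, fd, 0);
	close(fd);
	if (m == MAP_FAILED || len > sizeof(m->payload))
		return -1;

	memcpy(m->payload, buf, len);
	m->len = len;
	__sync_synchronize();		/* order the payload before the ready flag */
	m->ready = 1;

	munmap(m, sizeof(*m));
	return 0;
}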


Transferring data via famfs files in user space poses no problem, but how do we present this data to users? We cannot expect users to revamp all of their existing I/O paths.

For example, suppose a user needs to run a database on a compute node. As the cloud infrastructure department, we need to allocate block storage on the storage node and provide it to the database on the compute node through some transmission protocol (such as iSCSI, NVMe over Fabrics, or our current solution, cbd). Users can then create any file system they like on the block device and run the database on it. We aim to enhance the performance of this block device with cbd, rather than requiring the business department to adapt their database to storage-node disks exposed through shared memory.

This is why we need to provide users with a block device. If it were only about data transmission, we wouldn't need a block device. But when it comes to actually running business operations, we need a block storage interface for the upper layer. Additionally, the block device layer offers many other rich features, such as RAID.

If accessing shared memory in user space is mandatory, there's another option: using user space block storage technologies like ublk. However, this would lead to performance issues as data would need to traverse back to the kernel space block device from the user space process.

In summary, we need a block device sharing mechanism, similar to what is provided by NBD, iSCSI, or NVMe over Fabrics, because user businesses rely on the block device interface and ecosystem.


Also, on a tangential note, you're using pmem/qemu to emulate the
behavior of shared CXL memory.  You should probably explain the
coherence implications of the system more explicitly.

The emulated system implements what amounts to hardware-coherent
memory (i.e. the two QEMU machines run on the same physical machine,
so coherency is managed within the same coherence domain).

If there is no explicit coherence control in software, then it is
important to state that this system relies on hardware that implements
snoop back-invalidate (which is not a requirement of a CXL 3.x device,
just a feature described by the spec that may be implemented).

In (5) of the cover letter, I mentioned that cbd addresses cache coherence at the software level:

(5) How do blkdev and backend interact through the channel?
a) On the reader side, before reading the data, if the data in this channel may have been modified by the other party, then I need to flush the cache before reading to ensure that I get the latest data. For example, the blkdev needs to flush the cache before obtaining compr_head, because compr_head will be updated by the backend handler.
b) On the writer side, if the written information will be read by the other party, then after writing I need to flush the cache so that the other party sees it immediately. For example, after blkdev submits a cbd_se, it needs to update cmd_head to let the handler see the new cbd_se. Therefore, after updating cmd_head, I need to flush the cache so the backend sees it.
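
On x86, the pattern in (a)/(b) boils down to something like the following.
This is a userspace sketch with compiler intrinsics for illustration only;
the in-kernel driver would use its own cache writeback/invalidate helpers,
and the function names here are made up.

#include <stdint.h>
#include <immintrin.h>

/* Writer side: publish an update (e.g. cmd_head), then flush the line so
 * the peer host sees it without relying on hardware back-invalidation. */
static void publish_u64(volatile uint64_t *p, uint64_t val)
{
	*p = val;
	_mm_clflush((const void *)p);	/* write back and invalidate the line */
	_mm_sfence();			/* order the flush before later stores */
}

/* Reader side: drop any stale copy of the line before reading a field the
 * peer may have updated (e.g. compr_head). */
static uint64_t read_fresh_u64(volatile uint64_t *p)
{
	_mm_clflush((const void *)p);
	_mm_mfence();
	return *p;
}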


This part of the code is indeed implemented; however, as you pointed out, since I am currently using qemu/pmem for emulation, the effect of this code cannot be observed.

Thanx

~Gregory