Re: [PATCH RFC 0/7] block: Introduce CBD (CXL Block Device)

On Fri, Apr 26, 2024 at 9:48 PM, Gregory Price wrote:
On Fri, Apr 26, 2024 at 09:25:53AM +0800, Dongsheng Yang wrote:


On Wed, Apr 24, 2024 at 11:14 PM, Gregory Price wrote:
On Wed, Apr 24, 2024 at 02:33:28PM +0800, Dongsheng Yang wrote:


On Wed, Apr 24, 2024 at 12:29 PM, Dan Williams wrote:
Dongsheng Yang wrote:
From: Dongsheng Yang <dongsheng.yang.linux@xxxxxxxxx>

Hi all,
	This patchset introduce cbd (CXL block device). It's based on linux 6.8, and available at:
	https://github.com/DataTravelGuide/linux

[..]
(4) dax is not supported yet:
	Same as famfs, dax devices are not supported here, because dax devices do not support
dev_dax_iomap so far. Once dev_dax_iomap is supported, CBD can easily support DAX mode.

I am glad that famfs is mentioned here; it demonstrates you know about
it. However, unfortunately this cover letter does not offer any analysis
of *why* the Linux project should consider this additional approach to
the inter-host shared-memory enabling problem.

To be clear I am neutral at best on some of the initiatives around CXL
memory sharing vs pooling, but famfs at least jettisons block-devices
and gets closer to a purpose-built memory semantic.

So my primary question is why would Linux need both famfs and cbd? I am
sure famfs would love feedback and help vs developing competing efforts.

Hi,
	Thanks for your reply. IIUC, in famfs the data is stored in shared
memory, and the related nodes can share the data inside that file system;
whereas cbd does not store data in shared memory. It uses shared memory as a
channel for data transmission, and the actual data is stored in the backend
block device on the remote node. In cbd, shared memory works more like a
network connecting different hosts.
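Conceptually, the channel in shared memory can be pictured as something like
the following. This is an illustrative sketch only, not the actual cbd
on-channel layout: cmd_head, compr_head and cbd_se come from the discussion
below, while the tail fields and ring sizes are made-up placeholders.

#include <linux/types.h>

#define EXAMPLE_CMD_RING_SIZE   (16 * 1024)	/* made-up size */
#define EXAMPLE_COMPR_RING_SIZE (16 * 1024)	/* made-up size */

/* The blkdev side (compute node) and the backend handler (storage node)
 * exchange requests and completions through this region; the actual I/O
 * data lives on the backend block device. */
struct example_channel {
	__u64 cmd_head;		/* advanced by blkdev after queuing a cbd_se          */
	__u64 cmd_tail;		/* advanced by the backend after consuming a cbd_se    */
	__u64 compr_head;	/* advanced by the backend after posting a completion  */
	__u64 compr_tail;	/* advanced by blkdev after reaping a completion       */
	__u8  cmd_ring[EXAMPLE_CMD_RING_SIZE];		/* submission entries  */
	__u8  compr_ring[EXAMPLE_COMPR_RING_SIZE];	/* completion entries  */
	__u8  data[];					/* I/O payload staging */
};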


Couldn't you basically just allocate a file for use as a uni-directional
buffer on top of FAMFS and achieve the same thing without the need for
additional kernel support? Similar in a sense to allocating a file on
network storage and pinging the remote host when it's ready (except now
it's fast!)

I'm not entirely sure I follow your suggestion. I guess it means that cbd
would no longer directly manage the pmem device, but would allocate files on
famfs to transfer data. I didn't do it this way for a few reasons. One is
that cbd_transport only needs a DAX device to access shared memory, and
cbd's space-management requirements are very simple, so there is no need to
rely on a file system layer, which would add architectural complexity.

However, we would still need cbd_blkdev to provide a block device, so it
wouldn't "achieve the same thing without the need for additional kernel support".

Could you please provide more specific details about your suggestion?

Fundamentally you're shuffling bits from one place to another; the
ultimate target is storage located on another device, as opposed to
the memory itself.  So you're using CXL as a transport medium.

Could you not do the same thing with a file in FAMFS, and put all of
the transport logic in userland? Then you'd just have what looks like
a kernel bypass transport mechanism built on top of a file backed by
shared memory.

Basically it's unclear to me why this must be done in the kernel.
Performance? Explicit bypass? Some technical reason I'm missing?
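
For illustration, such a userland bypass could look roughly like this. It is a
hypothetical sketch: it only assumes the famfs file can be mmap()ed, and it
uses a single "ready" flag instead of a real ring protocol, with no error
handling or explicit cache management.

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

struct msg {
	volatile uint64_t ready;	/* set by the producer, polled by the consumer */
	uint64_t len;
	char payload[4096];
};

/* Write one payload into a file on a shared-memory filesystem, then
 * publish it by flipping the ready flag. */
int send_one(const char *path, const void *buf, size_t len)
{
	int fd = open(path, O_RDWR);
	if (fd < 0)
		return -1;

	struct msg *m = mmap(NULL, sizeof(*m), PROT_READ | PROT_WRITE,
			     MAP_SHARED, fd, 0);
	close(fd);
	if (m == MAP_FAILED || len > sizeof(m->payload))
		return -1;

	memcpy(m->payload, buf, len);
	m->len = len;
	__sync_synchronize();		/* order the payload before the ready flag */
	m->ready = 1;

	munmap(m, sizeof(*m));
	return 0;
}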


Transferring data via famfs files in user space poses no problem, but how do we present this data to users? We cannot expect users to revamp all of their existing I/O paths.

For example, suppose a user needs to run a database on a compute node. As the cloud infrastructure department, we need to allocate block storage on the storage node and provide it to the database on the compute node through some transmission protocol (such as iSCSI, NVMe over Fabrics, or our current solution, cbd). Users can then create any file system they like on the block device and run the database on it. We aim to enhance the performance of this block device with cbd, rather than requiring the business department to adapt their database to storage-node disks exposed through shared memory.

This is why we need to provide users with a block device. If it were only about data transmission, we wouldn't need a block device. But when it comes to actually running business operations, we need a block storage interface for the upper layer. Additionally, the block device layer offers many other rich features, such as RAID.

If accessing shared memory in user space is mandatory, there's another option: using user space block storage technologies like ublk. However, this would lead to performance issues as data would need to traverse back to the kernel space block device from the user space process.

In summary, we need a block device sharing mechanism, similar to what is provided by NBD, iSCSI, or NVMe over Fabrics, because user businesses rely on the block device interface and ecosystem.


Also, on a tangential note, you're using pmem/qemu to emulate the
behavior of shared CXL memory.  You should probably explain the
coherence implications of the system more explicitly.

The emulated system implements what amounts to hardware-coherent
memory (i.e. the two QEMU machines run on the same physical machine,
so coherency is managed within the same coherence domain).

If there is no explicit coherence control in software, then it is
important to state that this system relies on hardware that implements
snoop back-invalidate (which is not a requirement of a CXL 3.x device,
just a feature described by the spec that may be implemented).

In (5) of the cover letter, I mentioned that cbd addresses cache coherence at the software level:

(5) How do blkdev and backend interact through the channel?
a) On the reader side, before reading the data, if the data in this channel may have been modified by the other party, then I need to flush the cache before reading to ensure that I get the latest data. For example, the blkdev needs to flush the cache before obtaining compr_head, because compr_head will be updated by the backend handler.
b) On the writer side, if the written information will be read by the other party, then after writing I need to flush the cache so that the other party sees it immediately. For example, after blkdev submits a cbd_se, it needs to update cmd_head to let the handler see the new cbd_se. Therefore, after updating cmd_head, I need to flush the cache so the backend sees it.
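
On x86, the pattern in (a)/(b) boils down to something like the following.
This is a userspace sketch with compiler intrinsics for illustration only;
the in-kernel driver would use its own cache writeback/invalidate helpers,
and the function names here are made up.

#include <stdint.h>
#include <immintrin.h>

/* Writer side: publish an update (e.g. cmd_head), then flush the line so
 * the peer host sees it without relying on hardware back-invalidation. */
static void publish_u64(volatile uint64_t *p, uint64_t val)
{
	*p = val;
	_mm_clflush((const void *)p);	/* write back and invalidate the line */
	_mm_sfence();			/* order the flush before later stores */
}

/* Reader side: drop any stale copy of the line before reading a field the
 * peer may have updated (e.g. compr_head). */
static uint64_t read_fresh_u64(volatile uint64_t *p)
{
	_mm_clflush((const void *)p);
	_mm_mfence();
	return *p;
}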


This part of the code is indeed implemented; however, as you pointed out, since I am currently using qemu/pmem for emulation, the effect of this code cannot be observed.

Thanx

~Gregory