Hello,

I am a first-year Master's student at Nanjing University, China. My research area is distributed storage, and our team has been working on a Ceph project for half a year. I am very interested in your idea of implementing a cache layer on top of the userspace NVMe driver. Here is my understanding of the idea.

Ceph's OSDs used to rely on FileStore, which sits on a local POSIX file system such as XFS, Btrfs, or ext4. That approach performs poorly for Ceph, so BlueStore was designed to replace it. One major difference from FileStore is that BlueStore uses RocksDB to manage metadata. RocksDB has a suitable key/value interface: it supports transactions, provides ordered enumeration, and commits quickly to its log/journal. Its interface is also generic, so we can always swap in another KV DB if we want. In BlueStore, RocksDB stores object metadata, the write-ahead log, Ceph key/value omap data, and allocator metadata.

At the bottom of the BlueStore architecture is the disk, which can be an HDD or an SSD. On top of the disks is the driver layer, which includes the kernel driver and the userspace driver, SPDK. With these drivers, the host provides the block device layer on which BlueStore is built. SPDK is used to get the most out of NVMe SSDs. NVMe SSDs are very fast, with IOPS that can reach into the millions. With the kernel driver, the system must switch between kernel mode and user mode frequently, and the CPU cost of interrupts can be very high. As a result, the kernel driver limits the IOPS the disk can deliver. SPDK runs in userspace and uses polling instead of interrupts, which avoids these costs.

A metadata cache is critical for file system throughput, because metadata operations can account for more than half of all I/O. Currently, with the kernel driver, BlueStore benefits from the page cache to accelerate metadata access, while the userspace NVMe driver lacks any sort of memory cache to help performance. So I am applying for this project to implement the userspace NVMe cache. The implementation will draw on knowledge of SPDK, RocksDB, and the page cache.

The goal is to cache the metadata stored in RocksDB using the API provided by SPDK. Because RocksDB is a key-value store that behaves much like a database, we could borrow the cache strategies used by databases such as SQLite. The following aspects are the most important ones to consider:

1) Consistency, which relates to the correctness of the data. We could use a dirty list to mark the pages in the cache that have been modified, and write those pages back at suitable points to keep the cache and the disk consistent.

2) Hit rate, which is critical to efficiency. A higher hit rate gives higher efficiency, and the hit rate depends on the page replacement policy.

3) Valid rate, which concerns how effectively the cache memory is used. It also depends, to some extent, on the page replacement policy.

4) What to cache. I think the RocksDB memtable could serve as the cache for write operations, so the cache should focus on metadata query operations.
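
To make the consistency and hit-rate points a bit more concrete, here is a minimal sketch in Python of a write-back page cache with a dirty list. It is only an illustration under my own assumptions: the class name WriteBackLRUCache, the use of LRU as the replacement policy, and the plain dict standing in for the NVMe device are all hypothetical, and none of it uses the real SPDK or RocksDB APIs.

```python
from collections import OrderedDict


class WriteBackLRUCache:
    """Toy page cache: LRU replacement plus a dirty list for write-back.

    Hypothetical sketch; `backing` is a dict standing in for the device.
    """

    def __init__(self, capacity, backing):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity    # maximum number of cached pages
        self.backing = backing      # stand-in for the block device
        self.pages = OrderedDict()  # page_id -> data, least recent first
        self.dirty = set()          # pages modified since last write-back
        self.hits = 0
        self.misses = 0

    def read(self, page_id):
        if page_id in self.pages:
            self.hits += 1
            self.pages.move_to_end(page_id)  # mark as most recently used
            return self.pages[page_id]
        self.misses += 1
        data = self.backing[page_id]         # fetch from the "device"
        self._insert(page_id, data)
        return data

    def write(self, page_id, data):
        # Record the page as dirty first, so that an eviction triggered
        # by the insert below can never drop unwritten data.
        self.dirty.add(page_id)
        self._insert(page_id, data)

    def flush(self):
        # Write every dirty page back, then clear the dirty list.
        for page_id in list(self.dirty):
            self.backing[page_id] = self.pages[page_id]
        self.dirty.clear()

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    def _insert(self, page_id, data):
        self.pages[page_id] = data
        self.pages.move_to_end(page_id)
        while len(self.pages) > self.capacity:
            victim, victim_data = self.pages.popitem(last=False)
            if victim in self.dirty:
                # A dirty victim must reach the device before it is dropped.
                self.backing[victim] = victim_data
                self.dirty.discard(victim)
```

In this sketch a write only touches memory and the dirty list; the device sees the new data when flush() is called or when a dirty page is evicted. A real implementation would also need to decide when to flush relative to RocksDB's own commits, which this toy version does not address.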