On Thursday, October 15, 2020 2:03 AM, Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:

> I don't think this belongs into the kernel. It is a classic case for
> infrastructure that should be built in userspace. If anything is
> missing to implement it in userspace with equivalent performance we
> need to improve our interfaces, although io_uring should cover pretty
> much everything you need.

Hi Christoph,

We previously considered moving the mpool object store code to
user-space. However, by implementing mpool as a device driver, we gain
several benefits in scalability, performance, and functionality. In
doing so, we relied only on standard interfaces and made no changes to
the kernel.

(1) mpool's "mcache map" facility allows us to memory-map (and later
unmap) a collection of logically related objects with a single system
call. The objects in such a collection are created at different times,
are physically disparate, and may even reside on different media class
volumes. For our HSE storage engine application, there are commonly
tens to hundreds of objects in a given mcache map, and 75,000 total
objects mapped at a given time. Compared to memory-mapping objects
individually, the mcache map facility scales well because it requires
only a single system call and a single vm_area_struct to memory-map a
complete collection of objects. (A hypothetical sketch of this
contrast follows the sign-off.)

(2) The mcache map reaper mechanism proactively evicts object data
from the page cache based on object-level metrics. This provides a
significant performance benefit for many workloads. For example, we
ran YCSB workloads B (95/5 read/write mix) and C (100% read) against
our HSE storage engine using the mpool driver in a 5.9 kernel. For
each workload, we ran with the reaper turned on and turned off. For
workload B, the reaper increased throughput 1.77x while reducing the
99.99% tail latency for reads by 39% and for updates by 99%. For
workload C, the reaper increased throughput by 1.84x while reducing
the 99.99% read tail latency by 63%. These improvements are even more
dramatic with earlier kernels. (The second sketch below shows the
closest userspace analogue to the reaper.)

(3) The mcache map facility can memory-map objects on NVMe ZNS drives
that were created using the Zone Append command. This patch set does
not support ZNS, but that work is in progress and we will be
demonstrating our HSE storage engine running on mpool with ZNS drives
at FMS 2020.

(4) mpool's immutable object model allows the driver to support
concurrent reading of object data both directly and memory-mapped,
without incurring a performance penalty to verify coherence. This
allows background operations, such as LSM-tree compaction, to operate
efficiently and without polluting the page cache. (The third sketch
below illustrates this access pattern.)

(5) Representing an mpool as a /dev/mpool/<mpool-name> device file
provides a convenient mechanism for controlling access to and managing
the multiple storage volumes, and in the future pmem devices, that may
comprise a logical mpool.

Thanks,
Nabeel
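
To make the scaling contrast in point (1) concrete, here is a minimal
C sketch. The ioctl name MPIOC_MCACHE_MAP and struct mcache_map_req
are placeholders invented for illustration; they are not mpool's
actual interface.

/*
 * Hypothetical sketch, not mpool's actual interface: the ioctl name
 * and request structure below are placeholders illustrating the
 * contrast described in point (1).
 */
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

/* Per-object mapping: one mmap() call and one vm_area_struct per
 * object, i.e. ~75,000 system calls and VMAs at the scale cited. */
static void *map_one_object(int fd, size_t len)
{
        return mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
}

/* Collection mapping: one system call and one VMA for the whole set
 * of logically related objects, wherever they physically reside. */
struct mcache_map_req {                 /* placeholder */
        unsigned int         obj_count; /* number of objects to map */
        const unsigned long *obj_ids;   /* object identifiers */
        void                *addr;      /* mapped address (out) */
};

#define MPIOC_MCACHE_MAP _IOWR('m', 1, struct mcache_map_req) /* placeholder */

static int map_collection(int mpool_fd, struct mcache_map_req *req)
{
        return ioctl(mpool_fd, MPIOC_MCACHE_MAP, req);
}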
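
For point (2), the closest userspace analogue to the reaper is
posix_fadvise(POSIX_FADV_DONTNEED); the decision of when an object is
cold, which the driver derives from object-level metrics, is assumed
to be made elsewhere in this sketch.

/*
 * Userspace analogue of the reaper in point (2), sketched with the
 * closest standard primitive; the eviction policy is assumed.
 */
#include <fcntl.h>

/* Evict one cold object's pages from the page cache. Because mpool
 * objects are immutable, their cached pages are typically clean and
 * can be dropped without writeback. */
static int evict_object(int fd, off_t off, off_t len)
{
        return posix_fadvise(fd, off, len, POSIX_FADV_DONTNEED);
}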
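
Finally, a sketch of the access pattern point (4) enables, with a
plain file standing in for object data (the block size is an assumed
value): a background compaction pass reads with O_DIRECT, bypassing
the page cache, while foreground readers keep using the memory-mapped
view of the same object. Immutability is what keeps the direct and
mapped views coherent without locking or revalidation.

/*
 * Illustrative only: direct reads for background compaction,
 * concurrent with memory-mapped foreground reads of the same
 * immutable object.
 */
#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <unistd.h>

#define BLKSZ 4096              /* assumed logical-block size */

/* Read one aligned block directly, never touching the page cache.
 * off and buf must be BLKSZ-aligned (e.g. buf from posix_memalign). */
static ssize_t read_block_direct(const char *path, off_t off, void *buf)
{
        ssize_t n;
        int fd = open(path, O_RDONLY | O_DIRECT);

        if (fd < 0)
                return -1;
        n = pread(fd, buf, BLKSZ, off);
        close(fd);
        return n;
}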