On Mon, Oct 19, 2020 at 3:30 PM Nabeel Meeramohideen Mohamed
(nmeeramohide) <nmeeramohide@xxxxxxxxxx> wrote:
>
> Hi Dan,
>
> On Friday, October 16, 2020 4:12 PM, Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> >
> > On Fri, Oct 16, 2020 at 2:59 PM Nabeel Meeramohideen Mohamed
> > (nmeeramohide) <nmeeramohide@xxxxxxxxxx> wrote:
> > >
> > > On Thursday, October 15, 2020 2:03 AM, Christoph Hellwig
> > > <hch@xxxxxxxxxxxxx> wrote:
> > > > I don't think this belongs into the kernel. It is a classic case for
> > > > infrastructure that should be built in userspace. If anything is
> > > > missing to implement it in userspace with equivalent performance we
> > > > need to improve our interfaces, although io_uring should cover pretty
> > > > much everything you need.
> > >
> > > Hi Christoph,
> > >
> > > We previously considered moving the mpool object store code to user-space.
> > > However, by implementing mpool as a device driver, we get several benefits
> > > in terms of scalability, performance, and functionality. In doing so, we relied
> > > only on standard interfaces and did not make any changes to the kernel.
> > >
> > > (1) mpool's "mcache map" facility allows us to memory-map (and later unmap)
> > > a collection of logically related objects with a single system call. The objects in
> > > such a collection are created at different times, physically disparate, and may
> > > even reside on different media class volumes.
> > >
> > > For our HSE storage engine application, there are commonly 10's to 100's of
> > > objects in a given mcache map, and 75,000 total objects mapped at a given
> > > time.
> > >
> > > Compared to memory-mapping objects individually, the mcache map facility
> > > scales well because it requires only a single system call and single
> > > vm_area_struct to memory-map a complete collection of objects.
> >
> > Why can't that be a batch of mmap calls on io_uring?
>
> Agreed, we could add the capability to invoke mmap via io_uring to help mitigate the
> system call overhead of memory-mapping individual objects, versus our mcache map
> mechanism. However, there is still the scalability issue of having a vm_area_struct
> for each object (versus one for each mcache map).
>
> We ran YCSB workload C in two different configurations -
> Config 1: memory-mapping each individual object
> Config 2: memory-mapping a collection of related objects using mcache map
>
> - Config 1 incurred ~3.3x additional kernel memory for the vm_area_struct slab -
> 24.8 MB (127188 objects) for config 1, versus 7.3 MB (37482 objects) for config 2.
>
> - Workload C exhibited around 10-25% better tail latencies (4-nines) for config 2,
> not sure if it's due to the reduced complexity of searching VMAs during page faults.

So this gets to the meta question that is giving me pause on this
whole proposal:

    What does Linux get from merging mpool?

What you have above is a decent scalability bug report. That type of
pressure to meet new workload needs is how Linux interfaces evolve.
However, rather than evolve those interfaces, mpool is a revolutionary
replacement that leaves the bugs intact for everyone that does not
switch over to mpool.

Consider io_uring as an example where the kernel resisted trends
towards userspace I/O engines and instead evolved a solution that
maintained kernel control while also achieving similar performance
levels.

The exercise is useful to identify places where Linux has
deficiencies, but wholesale replacing an entire I/O submission model
is a direction that leaves the old APIs to rot.