On Mon, 9 Nov 2015 11:44:54 -0800 Shaohua Li <shli@xxxxxx> wrote: > In jemalloc, a free(3) doesn't immediately free the memory to OS even > the memory is page aligned/size, and hope the memory can be reused soon. > Later the virtual address becomes fragmented, and more and more free > memory are aggregated. If the free memory size is large, jemalloc uses > madvise(DONT_NEED) to actually free the memory back to OS. > > The madvise has significantly overhead paritcularly because of TLB > flush. jemalloc does madvise for several virtual address space ranges > one time. Instead of calling madvise for each of the ranges, we > introduce a new syscall to purge memory for several ranges one time. In > this way, we can merge several TLB flush for the ranges to one big TLB > flush. This also reduce mmap_sem locking and kernel/userspace switching. > > I'm running a simple memory allocation benchmark. 32 threads do random > malloc/free/realloc. Corresponding jemalloc patch to utilize this API is > attached. > Without patch: > real 0m18.923s > user 1m11.819s > sys 7m44.626s > each cpu gets around 3000K/s TLB flush interrupt. Perf shows TLB flush > is hotest functions. mmap_sem read locking (because of page fault) is > also heavy. > > with patch: > real 0m15.026s > user 0m48.548s > sys 6m41.153s > each cpu gets around 140k/s TLB flush interrupt. TLB flush isn't hot at > all. mmap_sem read locking (still because of page fault) becomes the > sole hot spot. > > Another test malloc a bunch of memory in 48 threads, then all threads > free the memory. I measure the time of the memory free. > Without patch: 34.332s > With patch: 17.429s > > Current implementation only supports MADV_DONTNEED. Should be trival to > support MADV_FREE if necessary later. I'd like to see a full description of the proposed userspace interface: arguments, data structures, return values, etc. A propotype manpage, basically. I'd also like to see an analysis of which other userspace allocators will benefit from this. glibc? tcmalloc? > > ... > > +/* > + * The vector madvise(). Like madvise except running for a vector of virtual > + * address ranges > + */ > +SYSCALL_DEFINE3(madvisev, const struct iovec __user *, uvector, > + unsigned long, nr_segs, int, behavior) > +{ > + struct iovec iovstack[UIO_FASTIOV]; > + struct iovec *iov = NULL; > + unsigned long start, end = 0; > + int unmapped_error = 0; > + size_t len; > + struct mmu_gather tlb; > + int error; > + int i; > + > + if (behavior != MADV_DONTNEED) > + return -EINVAL; > + > + error = rw_copy_check_uvector(CHECK_IOVEC_ONLY, uvector, nr_segs, > + UIO_FASTIOV, iovstack, &iov); > + if (error <= 0) > + goto out; > + /* Make sure address in ascend order */ > + sort(iov, nr_segs, sizeof(struct iovec), iov_cmp_func, NULL); Do we really need to sort the addresses? That's something which can be done in userspace and we can easily add a check-for-sortedness to the below loop. It depends on whether userspace can easily generate a sorted array. If basically all userspace will always need to run sort() then it doesn't matter much whether it's done in the kernel or in userspace. But if *some* userspace can naturally generate its array in sorted form then neither userspace nor the kernel needs to run sort() and we should take this out. -- To unsubscribe from this list: send the line "unsubscribe linux-api" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html